Failure and the Chaos Monkey
Part 1: Failure Isn't an Option
Most hyperscalers seem to live in a different world when it comes to failure. The best-known case, of course, is Chaos Monkey, the invention of Netflix, which is designed to intentionally break things in the production environment in unpredictable ways. Chaos Monkey serves several purposes. For instance, randomly breaking things exercises the humans who stand between network outages and complete operational failure. Practice makes perfect, the old saying goes (although really it should be “perfect practice makes perfect performance,” because imperfect practice never has a good outcome in the real world). This can be stated more formally through virtue ethics, an ancient way of looking at the development of moral and intellectual skills that goes all the way back to Aristotle. Randomly induced failures also exercise the machines that stand between a single failure and a complete network meltdown, exposing otherwise unseen holes that can open up, and fragile things that can break, at just the wrong moment. The interaction surfaces between systems must often be seen in action to be understood.
But, as I said in the first sentence, hyperscalers seem to live in a different world when it comes to failure. As I have been told many times, Chaos Monkey-style exercises simply cannot be applied to the average “enterprise grade” network. For instance, “if you think the hospital administration is going to allow me to break things randomly to see how these systems work, you are crazy; people’s lives depend on this stuff!” Or, “this is all fine when you are dealing with maybe losing your photographs of that last hamburger you ate; you’d think twice if you were losing your checking account.”
My response to such statements is normally silence; I’m not the brash youth who likes to get into arguments any longer. Okay, maybe that’s not entirely true, but still… My normal response, when I respond, is something like this: “The first bank, the first hospital, that pushes Chaos Monkey into their network will get my business.”
Why would I say that? The answer begins farther back, and yet closer, than you might think. My life as an engineer runs back through a time when I was working on airfield equipment, including things like VHF Omnidirectional Range (VOR) radios, Tactical Air Navigation (TACAN) systems, weather radar, and airfield landing systems. Human lives literally depended on these pieces of equipment on a daily basis. I learned to live with pressure by being called up at 2 a.m. to fix an instrument landing system while watching the lights of an airplane full of people circle the airfield. At some point they were going to run out of fuel, and then things would get really ugly.
We had a regular maintenance schedule, of course, much as many of the networks I worked on early in my career had regularly scheduled downtime. The idea was to test each system while it was not being used, and to repair anything that appeared to be drifting close to the edge of its tolerances, or close to “that time” when it might break. The networking world has moved past this stage, of course. Most networks today simply do not have regularly scheduled downtime during which tests can be run and regular maintenance performed.
This lack of downtime is, honestly, a testament to modern engineering methods. That we can build a network of thousands of devices that never really fails is an amazing feat, especially considering the reality that network design and engineering are still as much art as they are “real” engineering.