Failure and the Chaos Monkey
Part 2: Why Failure Is a Critical Requirement
The cost of trying to prevent failures, rather than living with them, is something I saw in my time in electronics, as well. World events conspired, at one point, to push our airfield equipment into operation 24 hours a day, 7 days a week, for a period of about 2 years. For the first 6 months or so, the shop was a very lazy place; we almost took to not coming in to work. We painted a lot of walls, kept the floor really shiny, reorganized all the tools (many times!), and even raised a few shops and sheds at various people's houses. After this initial stretch, however, the problems started. Every failure turned into a cascading failure, taking many hours to troubleshoot and repair, and many replaced parts along the way.
The lesson? Not breaking things on a regular basis was causing us to miss components that were on the edge of failing. Not troubleshooting our own breakages was costing us our "touch" with these complex pieces of equipment. We were losing ground, and it was only a matter of time before something truly catastrophic happened. In time, the crisis passed, we went back to a regular maintenance schedule, and things returned to normal.
Now we cannot change the world of networks today. We are stuck in a 24x7 world regardless of whether or not we like it. The ideas of regular maintenance and scheduled downtime are lost in the mists of history. But this doesn’t mean we are just stuck with the consequences. Time to reconsider the Chaos Monkey.
The idea behind the Chaos Monkey is often misunderstood. The impression is of large-scale, self-induced failures, triggered randomly and without warning. The truth is that you do not need to implement the concept this way. To start, follow a process that closely mirrors your change control process:
- Start with the safest failure, the one you are most confident will not cause a general outage
- Think through what you think will happen
- Think through where you should measure, how you should measure, and what you would like to collect for permanent future reference
- Think through what you think you can learn from inducing this outage
- Think through when you could do this with the least chance of collateral damage
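The checklist above can even be written down as a record you fill in before pulling any plug. A minimal sketch of what such a record might look like (the structure and field names here are hypothetical, not a real tool):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChaosExperiment:
    """One planned, controlled failure -- a hypothetical plan record."""
    target: str                # the safest failure, the one you trust most
    hypothesis: str            # what you think will happen
    measurements: List[str]    # where/how to measure, what to keep on record
    expected_lessons: str      # what you think you can learn
    window: str                # when collateral damage is least likely
    approved: bool = False     # someone signs off before anything breaks

    def ready(self) -> bool:
        # refuse to run until every step of the plan is filled in and approved
        return self.approved and all(
            [self.target, self.hypothesis, self.measurements,
             self.expected_lessons, self.window]
        )

plan = ChaosExperiment(
    target="shut one of two parallel links between A and B",
    hypothesis="traffic shifts to the remaining link; no user-visible loss",
    measurements=["interface counters before/after", "application latency"],
    expected_lessons="which flows actually use this link",
    window="Sunday 02:00-04:00 local",
)
print(plan.ready())   # False: no one has signed off yet
plan.approved = True
print(plan.ready())   # True: every field is filled in and approved
```

The point is not the code; it is that an induced failure with an empty hypothesis or no maintenance window simply does not run.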
And then you choose, at first (at least), some very simple things to try. Some things you might be able to do:
- Block a traffic flow along one of two parallel links, and see what traffic shifts where in the network. This can not only help you understand what applications are sending which traffic, but also what paths traffic will take in the case of a link failure.
- Unplug some piece of equipment that you know is redundant. Watch what the network does, and how applications react.
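Before you block that link in the real network, you can rehearse the first experiment on paper (or in a few lines of code). This sketch models a toy topology as an adjacency dict and recomputes the shortest path after a link is removed; the topology and costs are invented for illustration, not taken from any real network:

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over an adjacency dict {node: {neighbor: cost}}."""
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

# toy topology: two parallel paths from A to D, one cheaper than the other
net = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 1},
    "D": {"B": 1, "C": 1},
}
print(shortest_path(net, "A", "D"))   # ['A', 'B', 'D'] -- the preferred path

# induce the failure: take down the A-B link and watch traffic shift
del net["A"]["B"]
del net["B"]["A"]
print(shortest_path(net, "A", "D"))   # ['A', 'C', 'D'] -- the backup path
```

If the predicted backup path and the one the real network chooses disagree, you have learned exactly the kind of thing these small experiments are for.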
Induced outages should be planned, rather than unplanned. Begin small, and possibly remain small. There is no reason to tear the entire network down in order to probe and learn. But we do need to change the way we work, fundamentally, before the entire house of cards we have built comes crashing down around our ears.
We work on networks that are so complex no human can really understand all the interaction surfaces, and in which not every possible combination of failures can be tested. The way we build networks is like the way we build a car: we test each component, then we carefully put the car together, making certain all the bits fit correctly. What we never, ever, do is take the network out and slam it into a wall at a high rate of speed. Which is a shame, because we might learn a lot by running this sort of test every now and again.
In fact, I do not think there is a single reader, right now, who is thinking, “cars carry people, so we should not slam cars into walls to see if they work right.” I am certain we are all happy car manufacturers have implemented a form of Chaos Monkey that makes sense in their world.
And this, fundamentally, is the reason I would trust a network that implements an appropriate form of Chaos Monkey-like system over one that tries to keep all the bolts shiny, new, and properly tightened, but never tests for failure.
If we want to survive as a mission critical part of real businesses, we had better get used to failure and plan for it, rather than trying to stave it off as long as possible with our old fashioned duct tape, baling wire, and bubble gum, because failure, in complex systems, is not an option.
Instead, failure is a requirement.