I am sure by now you have read about the recent large systems failure (see https://bit.ly/3DtSSDt - Ed.), and I wondered if you would share your thoughts on how such a large company, famous for having so many smart people at its disposal, can fail so miserably at infrastructure. I am probably lobbing a softball question, but how is it possible these large and pervasive failures happen?
Making Popcorn While Waiting for Your Reply
Some people would say it is unfair to pick on any company after such a spectacular failure, and that it is not nice to kick people—or companies—when they are down. KV, of course, is not one of those people.
Like the rest of the world, I watched in amusement as one of the wealthiest companies on earth seemingly shot itself in the foot with a configuration error. Some watched in horror, and yes, KV watched in amusement. It goes without saying I know only what I have read in the news and on various "feeds," but some of the failure seems fairly well externally documented.
The lessons here are broader than just a simple "Don't do that," and there are numerous examples of companies doing similar things to themselves before this most recent incident. The real root cause that made this all so catastrophic is not just the pushing of a bad configuration; the actual cause is something that concerns nearly all modern computing infrastructure, and it has to do with cake.
Modern computing is no longer done on just one or even one small set of computers but is carried out on thousands of machines spread across the globe. The way this infrastructure has been built up, both logically and physically, often resembles a layer cake, but one where the icing is not so sweet. In fact, in the best case, not only is the icing bitter, but often the icing between the layers becomes rancid. Furthermore, each layer has been baked by a different chef and then slapped on top of whatever rancid icing is currently at the top of the cake. Oh, and the chefs do not communicate because that would violate cake layering or something.
The pièce de résistance in this most recent catastrophe was the notably rookie-level error of seemingly hooking everything to the same network, such that a single failure brought down not only the externally visible site, but also its internal tools, and even locked personnel out of their conference rooms and datacenters. Supposedly, the only way to start resetting the systems was to use an angle grinder to get access to equipment in a locked rack.
Two intertwined issues made the failure far worse than it had to be. The first is the untracked coupling of systems without sufficient forethought to what happens when one of the icing layers is rancid. The other is putting all the layers on one cake.
Putting all the layers on one cake is simply foolish, and honestly, probably the more shocking revelation. KV cannot think of anyone who would knowingly put the control network for any sort of physical infrastructure on the same network as the one that serves pictures of cats. All distributed systems must be constructed while bearing in mind the concept of separation of concerns, which may result in making several cakes instead of one.
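To make the one-cake problem concrete, here is a minimal sketch of how one might compute what a single failure takes down. Everything in it is an assumption made for illustration: the system names, the dependency map, and the simplifying rule that a system fails as soon as anything it depends on fails. None of it describes the actual company's topology.

```python
from collections import deque

# Hypothetical dependency map: each system lists what it needs in order to work.
# All names are invented for illustration.
DEPENDS_ON = {
    "public_site": ["prod_network"],
    "internal_tools": ["prod_network"],
    "door_badges": ["internal_tools"],
    "rack_access": ["door_badges"],
    "oob_console": ["mgmt_network"],  # out-of-band access on its own cake
}


def blast_radius(failed, depends_on):
    """Return every system that stops working, directly or transitively,
    when `failed` goes down. Assumes no redundancy: one lost dependency
    is enough to take a system out."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in affected:
                affected.add(system)
                queue.append(system)
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius("prod_network", DEPENDS_ON)))
```

In this made-up layout, losing `prod_network` takes the badge readers and rack access with it, while the out-of-band console on `mgmt_network` survives. If the tools needed for recovery show up in the blast radius of the thing being recovered, the layers are all on one cake.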
The other failure, not tracking how the cake is layered, is, alas, all too common. Modern systems seem to be less designed and more like an accretion of systems and functionality over time. Given how often people in technology change jobs, retaining clear institutional knowledge of how the cake was made is a nontrivial exercise. Documentation, like code, rots if it is not maintained, and this is the biggest risk in building up a large system. Quarterly system reviews that bring together people from multiple disciplines within a company (DevOps, NetOps, BizOps, SecOps, FooOps, or whatever the group name du jour happens to be) are probably one of the best ways to ensure the icing has not gone off and the chefs all know where their layers are supposed to go.
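One way to keep a quarterly review honest is to record, for each layer, who owns it and when it was last looked at, and then flag whatever has gone stale. The sketch below assumes a hand-maintained inventory. The layer names, owners, dates, and the 92-day definition of a quarter are all hypothetical.

```python
from datetime import date, timedelta

# Assumed review interval: roughly one quarter.
REVIEW_INTERVAL = timedelta(days=92)

# Hypothetical inventory of layers; every entry is invented for illustration.
LAYERS = [
    {"name": "prod_network", "owner": "NetOps", "last_reviewed": date(2021, 9, 15)},
    {"name": "internal_tools", "owner": "DevOps", "last_reviewed": date(2021, 3, 1)},
    {"name": "door_badges", "owner": None, "last_reviewed": date(2021, 10, 1)},
]


def stale_layers(layers, today, max_age=REVIEW_INTERVAL):
    """Return names of layers with no owner or with a review older than `max_age`."""
    return [
        layer["name"]
        for layer in layers
        if layer["owner"] is None or today - layer["last_reviewed"] > max_age
    ]


if __name__ == "__main__":
    print(stale_layers(LAYERS, today=date(2021, 11, 1)))
```

A list like this does not replace getting the chefs in one room, but it does give the meeting an agenda: the layers nobody owns and the layers nobody has looked at lately.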
These types of failures are always failures of communication: first at the human layer and then, eventually, at the technological layer.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.