There has been a lot of discussion about the power outage at the Super Bowl. Power failure is a subject near and dear to most computer users, especially those in the Data Center. A lot has been written about the outage, including finger pointing in all different directions. I especially like the statements that the “faulty device was manufactured in Chicago”. As if readers will conclude, “Oh, that explains everything!”
One piece of commentary I've found particularly interesting is the personal blog post of Amazon Distinguished Engineer James Hamilton. I'll admit that my first reaction on hearing about this blog post was “I'm not sure Amazon is one to be talking about preventing blackouts”, coming as it did shortly after the 17-hour Christmas Eve outage of Netflix (which runs on Amazon's infrastructure) and an hour-long outage of Amazon's own site. Particularly as it seems unlikely that an Amazon engineer has any direct knowledge of the specific inner workings of the venue where the outage occurred.
The blog entry is worth a read and does offer some good suggestions. However, it also offers some more problematic suggestions.
The good suggestions are:
However, I would call into question the recommendation of installing backup generators. Admittedly, I'm making this call a week after the blog posting, and we now have more information about the cause of the failure. I did make this call a week ago, I just haven't had time to write about it until now. That's my story and I'm sticking to it.
My primary concern here is that you have to be extremely careful when adding complexity to a system. As the saying goes, “For every difficult problem, there's a simple, obvious solution that is completely wrong.” Adding UPSs and generators to a system that has insufficient testing is almost certainly a bad thing.
At this point it looks like the fault was caused by an incorrectly configured breaker designed to protect equipment at the stadium. It sounds likely that the breaker was configured with too low a set-point, which caused it to trip prematurely.
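The failure mode described above is easy to see in miniature. Here's a minimal sketch of a protective breaker that opens whenever measured load exceeds its configured set-point; all the numeric values are invented for illustration, not actual figures from the stadium:

```python
# Hypothetical illustration of a breaker set-point misconfiguration.
# Every number below is made up for the example.

def breaker_trips(load_amps: float, set_point_amps: float) -> bool:
    """Return True if a breaker with this set-point would open at this load."""
    return load_amps > set_point_amps

normal_load = 2400.0          # hypothetical steady-state stadium load, in amps
correct_set_point = 3000.0    # a set-point safely above normal load
too_low_set_point = 2000.0    # a set-point below normal operating load

# Correctly configured: the breaker rides through normal operation.
assert not breaker_trips(normal_load, correct_set_point)

# Misconfigured too low: the breaker trips under a perfectly normal load.
assert breaker_trips(normal_load, too_low_set_point)
```

The point isn't the arithmetic, it's that a set-point below the normal operating load guarantees a premature trip, and only testing against realistic loads exposes it.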
The biggest reliability gain here would come from better testing procedures, which likely would have caught the breaker's configuration issue before it impacted the event. However, improved testing isn't “sexy”. The sound-bite we are hearing is “Amazon engineer says spending $10M (worth 60 seconds of advertising) on generators would have prevented the blackout”.
The next priority I would look at is decreasing the lighting startup time. Any power outage causing 20 minutes of darkness seems like a big red flag.
One thing the blog doesn't go into is how you select among the results of the brainstorming on possible countermeasures. That is the more important part of the Service Outage Analysis process: you brainstorm solutions, then you analyze them and pick the appropriate ones, usually on a cost/benefit basis.
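That selection step can be sketched in a few lines: score each brainstormed countermeasure by expected benefit minus cost, keep the ones that come out ahead, and rank them. The countermeasure names and every dollar figure below are placeholders I've invented for illustration, not data from the actual incident:

```python
# Hypothetical cost/benefit selection of brainstormed countermeasures.
# All names and figures are invented placeholders.

countermeasures = [
    # (name, annualized cost, expected annual benefit in avoided outage cost)
    ("improved breaker testing", 50_000, 900_000),
    ("faster-restart lighting", 200_000, 600_000),
    ("backup generators", 1_000_000, 300_000),
]

def net_benefit(item):
    _name, cost, benefit = item
    return benefit - cost

# Keep only countermeasures whose benefit exceeds their cost,
# ranked best-first.
selected = sorted(
    (c for c in countermeasures if net_benefit(c) > 0),
    key=net_benefit,
    reverse=True,
)

for name, cost, benefit in selected:
    print(f"{name}: net {benefit - cost:,}")
```

With these made-up numbers the generators drop out of the list entirely, which is the shape of the argument in prose: the flashy countermeasure isn't automatically the appropriate one.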
In a Data Center environment, power outages tend to be pretty rare. In our facility, we haven't had a single power outage to any of our cabinets since we moved in in 2004.
Most outages I know of in the data center are human-caused, typically smaller-scale issues: for example, cabinets accidentally or intentionally being run too close to their limit, or whole-room incidents like the LiveJournal outages years ago, where an EPO (Emergency Power Off) button was pushed by someone working in the room.
Twice over a couple of years.
I can only imagine that pressing the EPO button generates a RG interrupt. You know: Resume Generation…
Many of these issues can be addressed by high-availability clusters running with no shared single point of failure. For example, several years ago we had a transfer switch that, instead of being a single 30A unit, was actually two separate 15A sides. But all monitoring for it was on the 30A side… The total load was well under rated capacity, but one half was just a bit too high, so that half tripped a circuit breaker.
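The monitoring blind spot is easy to show with numbers. Here's a tiny sketch using invented load figures (the real values weren't recorded in this post); the aggregate reading looks healthy even as one side exceeds its own rating:

```python
# Hypothetical illustration of the split transfer switch: monitoring
# watched the 30A aggregate, but each side had its own 15A breaker.
# Load values are invented for the example.

SIDE_RATING_AMPS = 15.0    # per-side breaker rating
TOTAL_RATING_AMPS = 30.0   # the only level the monitoring watched

side_a_load = 16.0         # just over one side's rating
side_b_load = 9.0
total_load = side_a_load + side_b_load

# The aggregate looks fine, so monitoring raises no alarm...
assert total_load < TOTAL_RATING_AMPS

# ...yet one side is over its individual rating, so its breaker trips.
assert side_a_load > SIDE_RATING_AMPS
```

Which is exactly why the fix was per-breaker monitoring rather than more capacity: the capacity was never the problem.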
In our case, redundant machines are always in other cabinets, so those gracefully took over after a few seconds.
In that case, the appropriate countermeasure was to replace this gear with gear that had monitoring at the individual-breaker level. That's been the only significant power outage we've had at our facility in seven years.
And that's the name of the game: taking appropriate countermeasures.