
There has been a lot of discussion about the power outage at the Super Bowl. Power failure is a subject near and dear to most computer users, especially those in the Data Center. A lot has been written about the outage, including finger-pointing in all different directions. I especially like the statements that the “faulty device was manufactured in Chicago”, as if readers will conclude, “Oh, that explains everything!”

Backup Power FTW?

One piece of commentary I've found particularly interesting is the personal blog post of Amazon Distinguished Engineer James Hamilton. I'll admit that my first reaction on hearing about this blog post was “I'm not sure Amazon is one to be talking about preventing blackouts”, coming as it did shortly after the 17-hour Christmas Eve outage of Netflix (which uses Amazon's infrastructure) and an hour-long outage of Amazon's own site. It also seems unlikely that an Amazon engineer has any direct knowledge of the inner workings of the specific venue where the outage occurred.

The blog entry is worth a read and does offer some good suggestions. However, it also offers some suggestions that are more problematic.

The good suggestions are:

  • Replacing the lighting with equipment that doesn't take 20 minutes to come online; half of the outage occurred after power had been restored.
  • Splitting the power into smaller zones and interleaving the lights. Instead of half the Superdome going black, it could perhaps have been designed so that only every 4th or 6th light went out (a small sketch of that interleaving idea follows this list).
  • Automated recovery could have shortened the power outage itself from 15 minutes to seconds, though the lights still would have taken 20 minutes to come back on.
  • Improved testing procedures would likely have found this fault, had testing been done at maximum expected utilization and at overload.
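Here is that interleaving sketch: a minimal, purely hypothetical Python illustration (the fixture and zone counts are made up) of assigning fixtures to power zones round-robin, so that losing any single zone dims only every Nth light rather than blacking out a contiguous half of the building.

    # Hypothetical sketch: round-robin assignment of light fixtures to power
    # zones, so that losing one zone dims every Nth fixture instead of a
    # contiguous half of the building. All numbers are made up.

    NUM_ZONES = 6          # e.g. six independently fed power zones
    NUM_FIXTURES = 312     # made-up fixture count

    def zone_for_fixture(fixture_id: int) -> int:
        """Interleave: consecutive fixtures land in different zones."""
        return fixture_id % NUM_ZONES

    def fixtures_lost(failed_zone: int) -> list[int]:
        """Which fixtures go dark if a single zone loses power."""
        return [f for f in range(NUM_FIXTURES) if zone_for_fixture(f) == failed_zone]

    if __name__ == "__main__":
        dark = fixtures_lost(failed_zone=2)
        # With interleaving, a single-zone failure takes out roughly 1/6 of
        # the fixtures, spread evenly, rather than one whole side of the venue.
        print(f"{len(dark)} of {NUM_FIXTURES} fixtures dark "
              f"({len(dark) / NUM_FIXTURES:.0%}), spread across the building")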

However, I would call into question the recommendation of installing backup generators. Admittedly, I'm making this call a week after the blog posting, and we now have more information about the cause of the failure. I did make this call a week ago; I just haven't had time to write about it until now. That's my story and I'm sticking to it.

Why Not Backup Power?

My primary concern here is that you have to be extremely careful when adding complexity to a system. As the saying goes, “For every difficult problem, there's a simple, obvious solution that is completely wrong.” Adding UPSs and generators to a system that has insufficient testing is almost certainly a bad thing.

At this point it looks like the fault was caused by an incorrectly configured breaker designed to protect equipment at the stadium. It sounds likely that the breaker's set-point was too low, which caused it to trip prematurely.

My Conclusions

The biggest reliability gain here would come from better testing procedures, which likely would have found the breaker configuration issue before it impacted the event. However, improved testing isn't “sexy”. The sound-bite we are hearing instead is “Amazon engineer says spending $10M (the equivalent of about 60 seconds of advertising) on generators would have prevented the blackout”.

The next priority I would look at is decreasing the lighting startup time. Any power outage causing 20 minutes of darkness seems like a big red flag.

One thing the blog doesn't go into is how you select among the countermeasures that brainstorming produces. That is the more important part of the Service Outage Analysis process: you brainstorm solutions, then analyze them and pick the appropriate ones, usually in terms of cost/benefit.
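To make that selection step concrete, here is a minimal sketch of ranking brainstormed countermeasures by a simple expected-benefit-to-cost ratio. The countermeasure names echo the discussion above, but every cost and probability figure below is invented purely for illustration; this is not a real analysis of the Superdome incident.

    # Hypothetical sketch of the "pick the appropriate countermeasures" step:
    # rank each brainstormed option by expected annual benefit per dollar spent.
    # All figures are invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class Countermeasure:
        name: str
        cost: float                 # one-time cost to implement ($)
        outage_cost_avoided: float  # cost of the outage it would prevent ($)
        annual_probability: float   # chance per year that it actually matters

        def benefit_ratio(self) -> float:
            """Expected annual benefit divided by cost; higher is better."""
            return (self.outage_cost_avoided * self.annual_probability) / self.cost

    options = [
        Countermeasure("Better load/overload testing",   250_000, 50_000_000, 0.10),
        Countermeasure("Faster-restrike lighting",      5_000_000, 25_000_000, 0.10),
        Countermeasure("Backup generators",            10_000_000, 50_000_000, 0.02),
    ]

    for cm in sorted(options, key=lambda c: c.benefit_ratio(), reverse=True):
        print(f"{cm.name:30s} benefit/cost = {cm.benefit_ratio():.2f}")

With numbers like these, better testing comes out far ahead of generators, which is exactly the kind of result the cost/benefit step is meant to surface before anyone writes a $10M check.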

Our Experience with Power Outages

In a Data Center environment, power outages tend to be pretty rare. In our facility, we haven't had a single power outage to any of our cabinets since we moved in back in 2004.

Most of the data center outages I know of are human-caused and typically smaller-scale issues: cabinets accidentally or intentionally being run too close to their limits, for example. There are also whole-room incidents like the LiveJournal outages years ago, where an EPO (Emergency Power Off) button was pushed by someone working in the room. Twice, over a couple of years.

I can only imagine that pressing the EPO button generates an RG interrupt. You know: Resume Generation…

Many of these issues can be addressed by high-availability clusters running with no shared single point of failure. For example, several years ago we had a transfer switch that, instead of being a single 30A unit, was really two separate 15A sides. All of the monitoring for it, however, was at the 30A level… The total load was well under the rated capacity, but one half was just a bit too high, so that half tripped its circuit breaker.

In our case, redundant machines are always in other cabinets, so those gracefully took over after a few seconds.

In that case, it was decided that the appropriate countermeasure was to replace the unit with gear that had monitoring at the individual-breaker level. That's been the only significant power outage we've had at our facility in 7 years.
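As a toy illustration of why breaker-level monitoring matters, here is a minimal sketch; the thresholds and readings are invented, and this is not our actual monitoring stack. The aggregate number looks comfortable while one leg is quietly about to trip.

    # Hypothetical sketch: per-breaker monitoring vs. aggregate-only monitoring.
    # Readings and thresholds are invented; not our actual monitoring stack.

    WARN_FRACTION = 0.80  # warn when a leg exceeds 80% of its breaker rating

    # Two 15A legs behind what looks like a single "30A" transfer switch.
    legs = {
        "leg-A": {"rating_amps": 15.0, "load_amps": 13.8},  # quietly near its limit
        "leg-B": {"rating_amps": 15.0, "load_amps": 6.2},
    }

    total_load = sum(leg["load_amps"] for leg in legs.values())
    total_rating = sum(leg["rating_amps"] for leg in legs.values())

    # Aggregate view: 20.0A of 30A looks fine, about 67% of capacity...
    print(f"aggregate: {total_load:.1f}A of {total_rating:.0f}A")

    # ...but per-leg monitoring catches the side that is about to trip.
    for name, leg in legs.items():
        fraction = leg["load_amps"] / leg["rating_amps"]
        if fraction >= WARN_FRACTION:
            print(f"WARNING: {name} at {fraction:.0%} of its {leg['rating_amps']:.0f}A breaker")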

And that's the name of the game: taking appropriate countermeasures.

