
The Problem of Redundant Power Supplies

By Kyle Anderson, February 3, 2011

Redundant power supplies are good things. All power supplies fail eventually; in my book they rank right up there with hard drives. You have to plan for them to fail. This means that on critical servers, you put two in there! It's like RAID-1 for power supplies, if that works for you.

Ok great, so you have two now. Let's say that you do this because you want to protect yourself from a PDU failure in the rack. Good call, PDUs and circuits fail all the time too. Let's say you have two circuits and two PDUs, A and B. You load up the rack with all of your redundantly equipped servers, and plug one power supply into A and one into B. Let's say you load each PDU up to a safe and cool 80% of its capacity. We are totally safe now, right?

Err. Wrong. Think about it: what happens when PDU A, or the circuit feeding it, fails? All the power that was evenly distributed is now shoved onto the other one, B. What is B running at now? 160%? Guess what, B is going to overload and fail as well. So much for your redundancy. Hopefully you only have to learn this lesson once, and hopefully you can learn it without causing a rack to fail.
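The arithmetic is simple enough to write down. Here is a minimal Python sketch of it, assuming both PDUs have the same capacity and every server shifts its full draw onto the surviving feed. The function names and the 80% ceiling are only illustrative choices of mine.

```python
def load_after_failover(load_a_pct, load_b_pct):
    """Load on the surviving PDU, as a percent of its capacity, when its
    partner fails. Assumes equal-capacity PDUs and that every server
    moves its whole draw onto the surviving feed."""
    return load_a_pct + load_b_pct


def max_safe_load_per_pdu(ceiling_pct=80.0):
    """Highest per-PDU load that keeps the survivor at or under
    ceiling_pct after a failover."""
    return ceiling_pct / 2


print(load_after_failover(80, 80))   # 160: B is far past its rating
print(load_after_failover(40, 40))   # 80: B absorbs A's load and stays at the ceiling
print(max_safe_load_per_pdu())       # 40.0
```

In other words, if you want to survive losing a feed, each PDU can only carry half of whatever load you consider safe.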

But barring this problem, dual power supplies are great for simple setups on single circuits too. You don't have to worry about cascading, and you are protected in the inevitable case that one of your power supplies fails.

Ok, so let's think this out. One of your power supplies fails. What will happen? The other one will pick up the load and the server will stay up. Great. But if a power supply fails, and no server goes down, does it make a sound?

It does! Usually servers will beep when a power supply fails. Good thing you live in the datacenter, so you will hear it right away. Oh, you don't? Maybe you should monitor it then. :) Otherwise one will fail, not get replaced, and then the other will eventually fail too (it does have more load on it now, by the way). Then you will get the server down alert and say to yourself, a power supply failure? But we have two! We are totally protected! By the way, do you get alerts when your RAID fails? (I often hear an analogous story about a RAID-5 array that was not monitored; when it finally failed, it was discovered that two drives were dead. Maybe they died at the same time? Maybe it just wasn't monitored.)

The Solution: Monitoring

The solution here is to monitor your redundant power supplies. You cannot rely on someone hearing the beep. How do you do this? On Supermicro machines you can set up email alerts. That is kinda good, but email servers change, spam filters get in the way, packets get dropped... do you sleep well at night?

No, the real solution is an active checking system. You need active checks to know that it is good and working, and when the check fails, someone needs to know. A silently failing email alert is not good enough. At tummy.com we use the open source staple, Nagios.

IPMItool to the Rescue

IPMItool is an open source utility to work with the IPMI management cards in some servers. Depending on your particular Linux distribution, you can probably "apt-get install ipmitool" or "yum install ipmitool" to get it. It is basically a command line tool that can be used instead of the IPMI web interface.

Get the Plugin

The plugin for checking Supermicro power supplies can be found on the tummy.com FTP site. This plugin is written for the X8 class motherboards, and may need changes in the IPMI raw commands to work with other boards.

You can drop this in your Nagios plugins directory, usually /usr/lib/nagios/plugins. As with any script I use, I suggest at least looking at it to get an idea of how it works. With no arguments it will prompt you with the needed command line format:

# ./check_ipmi_powersupply 
USAGE: -H host -U ipmi_username -P ipmi_password

Not too complicated, and it looks like most any other Nagios plugin. Here is some Nagios command glue to help you use it:

define command{
	command_name	check_ipmi_powersupply
	command_line	$USER1$/check_ipmi_powersupply -H $HOSTADDRESS$ -U ADMIN -P $ARG1$
}

And to use it as a service for some host:

define service{
        use                             generic-service
        host_name                       My-Really-Important-Server
        service_description             POWERSUPPLY
        contact_groups                  admin
        check_command                   check_ipmi_powersupply!supersecretpassword
}

You can see that this way I have the password as the first argument, allowing me to use the same command definition on multiple different hosts. I found that the ADMIN account was the only account that had the privilege of sending the raw commands necessary to check the power supply in this way.

The IPMI Raw Command

So a Nagios plugin that checks power supplies, no big deal right? Maybe, but if you want to get the job done right, you have to monitor the server completely, from the health of the power supply all the way up to the status code of the Apache page. The real magic in this thing comes from the raw IPMI command that IPMItool sends. This raw command does a very low level query on the data bus that the power supply is connected to. Here is the explanation from the Supermicro engineer I worked with to make this check:

# ipmitool -H <IP Address> -U <User ID> -P <User Password> raw 0x06 0x52 0x07 0x78 0x01 0x78
>>
>> NetFn:  0x06
>> Cmd  :  0x52
>> Data  : 0x07  // bus 3 for X8 motherboard
>>         0x78  // slave address of PS (it can be 0x78, 0x7a, 0x7c for 3 redundant PS)
>>         0x01  // read 1 byte
>>         0x78  // where 78 is offset of the PS, 0-bad, 1-good
>>
>> If the power supply is installed but failed, it will return value 0.
>> If the power supply totally lose the power, it will reply an error message.

And this is the main reason for this blog post: to get this IPMI raw command out in the open. A special thanks goes out to the Supermicro engineer who was able to pass down these special commands from deep within the bowels of their documentation.

It is worth noting that this particular command will only work on X8 class motherboards. The raw bytes for other motherboard types will need to be looked up. If you are deploying this on a Supermicro 4-Node 6026TT, then only the blade in the A slot has access to this data bus.
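To make the moving parts concrete, here is a rough Python sketch of how a check built on that raw command could be put together. This is not the tummy.com plugin, and the function names are made up for illustration. It assumes ipmitool prints the returned byte as hex (for example " 01") and exits non-zero on error. Treating an error as CRITICAL is also my assumption, based on the engineer's note that a supply with no power answers with an error message.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_command(host, user, password, slave_addr="0x78"):
    """Build the ipmitool call from the engineer's note (X8 boards only).
    slave_addr is 0x78, 0x7a or 0x7c depending on which supply is queried."""
    return ["ipmitool", "-H", host, "-U", user, "-P", password,
            "raw", "0x06", "0x52", "0x07", slave_addr, "0x01", "0x78"]


def interpret(returncode, stdout):
    """Map ipmitool's exit status and output to (nagios_code, message)."""
    if returncode != 0:
        # Assumption: an error reply means the supply lost power entirely.
        return CRITICAL, "CRITICAL: no answer from power supply"
    fields = stdout.split()
    if not fields:
        return UNKNOWN, "UNKNOWN: empty reply from ipmitool"
    try:
        status = int(fields[0], 16)
    except ValueError:
        return UNKNOWN, "UNKNOWN: unexpected reply %r" % stdout.strip()
    if status == 1:
        return OK, "OK: power supply reports good"
    if status == 0:
        return CRITICAL, "CRITICAL: power supply installed but failed"
    return UNKNOWN, "UNKNOWN: unexpected status byte 0x%02x" % status


def check(host, user, password, slave_addr="0x78"):
    """Run the query against one supply and return (nagios_code, message)."""
    try:
        result = subprocess.run(
            build_command(host, user, password, slave_addr),
            capture_output=True, text=True)
    except OSError as err:
        # ipmitool is not installed or could not be started.
        return UNKNOWN, "UNKNOWN: could not run ipmitool: %s" % err
    return interpret(result.returncode, result.stdout)
```

A wrapper script would call check() with the -H, -U and -P values, print the message, and exit with the returned code so Nagios can pick up the state. Looping over the three slave addresses would cover machines with more than one supply on the bus.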
