Heartbeat 2.0.2 with ipmilan STONITH. (tummy.com, ltd. Journal Entry)
tummy.com: we do linux

Thursday August 09, 2007 at 01:59
Subject: Heartbeat 2.0.2 with ipmilan STONITH.
Keywords: Heartbeat, IPMI, STONITH, Technical
Posted by: Sean Reifschneider

I spent most of the day today trying to get IPMI STONITH working with Heartbeat. IPMI is a system management protocol, usually implemented via an auxiliary controller, for doing various management functions including getting sensor data (fan speed, temp) and turning a server on and off. The IPMI controller is on even if the system is otherwise powered off. However, the ipmilan STONITH plugin is in pretty rough shape.

If you've gotten here via Google and are hoping this will help you get IPMI set up on your cluster, let me cut to the chase: the ipmilan STONITH driver appears to be completely unusable. You will probably have to do what I'm doing and implement a STONITH external script that uses ipmitool to do the job.
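As a rough sketch of what the ipmitool side of such a plugin boils down to, the C fragment below builds the argument list for an "ipmitool ... chassis power" call. This is not the script mentioned above. The function names, the fixed "-I lan" interface, and the set of accepted actions are assumptions for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: accept only the power actions this sketch supports.
 * Returns the ipmitool "chassis power" subcommand, or NULL if unknown. */
const char *ipmi_power_subcommand(const char *action)
{
    if (strcmp(action, "on") == 0)     return "on";
    if (strcmp(action, "off") == 0)    return "off";
    if (strcmp(action, "reset") == 0)  return "reset";
    if (strcmp(action, "status") == 0) return "status";
    return NULL;
}

/* Hypothetical helper: fill argv for
 *   ipmitool -I lan -H host -U user -P password chassis power <action>
 * argv needs room for 13 entries (12 arguments plus the NULL terminator).
 * Returns the number of arguments, or -1 on a bad action or short buffer. */
int build_ipmitool_argv(const char *argv[], size_t capacity,
                        const char *host, const char *user,
                        const char *password, const char *action)
{
    const char *sub = ipmi_power_subcommand(action);
    size_t n = 0;

    if (sub == NULL || capacity < 13)
        return -1;

    argv[n++] = "ipmitool";
    argv[n++] = "-I";
    argv[n++] = "lan";
    argv[n++] = "-H";
    argv[n++] = host;
    argv[n++] = "-U";
    argv[n++] = user;
    argv[n++] = "-P";
    argv[n++] = password;
    argv[n++] = "chassis";
    argv[n++] = "power";
    argv[n++] = sub;
    argv[n] = NULL;   /* terminator expected by execvp() */
    return (int)n;
}
```

A caller would pass the filled array to something like execvp("ipmitool", (char *const *)argv). A real external plugin would more likely be a short shell script, but the command line it runs would be the same.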

The first problem I ran into was that when I set up STONITH with ipmilan as described in the README, it would report:

CRITICAL **: Unable to setup connection: 16

Google wasn't very helpful. It just turned up someone else asking about this error 18 months ago, with no response...

I dug into the code and found that the "auth" and "priv" fields, which the documentation says accept values like "none", "md5", and "admin", are passed through the "atoi()" C library call to convert them into integers. Since none of the documented values are actually integer strings, they all silently get converted to 0.
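To see the failure mode in isolation, here is a minimal sketch. The function names are made up for this example and are not taken from the plugin source. The first function does what atoi() does with a symbolic name, and the second shows one way a parser could reject such input instead of quietly returning 0.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Mirrors the behaviour described above: atoi() yields 0 for any string
 * that does not begin with a number, and reports no error. */
int parse_field_atoi(const char *text)
{
    return atoi(text);
}

/* Stricter alternative (an assumption, not the plugin's code): accept only
 * a complete decimal integer. Returns 0 and stores the value on success,
 * -1 if the text is not a number or is out of range for int. */
int parse_field_strict(const char *text, int *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE
        || value < INT_MIN || value > INT_MAX)
        return -1;

    *out = (int)value;
    return 0;
}
```

With the first function, "admin", "md5", and "none" all come back as 0, while "4" comes back as 4. The strict version returns -1 for the three names, so a caller could at least print a useful error.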

That is the core of the problem causing the error above. The "priv" field needs to be the integer 4 for "admin" in my case, but is instead 0. If you change the "priv" field to "4" and the "auth" field to "2" (for "md5"), it stops reporting the above error.
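One way the documented names could be honoured is a small lookup table in front of the numeric parse. The sketch below is hypothetical and is not the patch mentioned in this post. It only knows the two pairs given above ("admin" = 4, "md5" = 2) plus "none" = 0, which is my assumption for the auth type; any other name would need its value checked against the IPMI headers before being added.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct name_value {
    const char *name;
    int value;
};

/* "md5" = 2 is from the text above; "none" = 0 is an assumption. */
static const struct name_value auth_names[] = {
    { "none", 0 },
    { "md5",  2 },
};

/* "admin" = 4 is from the text above. */
static const struct name_value priv_names[] = {
    { "admin", 4 },
};

/* Try the symbolic names first, then fall back to a plain number such
 * as "4". Returns -1 if the text is neither a known name nor a number. */
static int lookup(const struct name_value *table, size_t count,
                  const char *text)
{
    char *end = NULL;
    long value;
    size_t i;

    for (i = 0; i < count; i++)
        if (strcmp(table[i].name, text) == 0)
            return table[i].value;

    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || value < 0 || value > 255)
        return -1;
    return (int)value;
}

int auth_from_config(const char *text)
{
    return lookup(auth_names, sizeof auth_names / sizeof auth_names[0], text);
}

int priv_from_config(const char *text)
{
    return lookup(priv_names, sizeof priv_names / sizeof priv_names[0], text);
}
```

Accepting both forms means existing configurations that already use the numeric workaround would keep working.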

However, it then starts core dumping due to an invalid pointer dereference.

The IPMI library is incredibly poorly documented, and to make it worse the STONITH ipmilan plugin is using a deprecated function.

My opinion is that ipmilan needs to be scrapped and re-written, hopefully by someone who knows the OpenIPMI API or at least someone who can find some documentation on it.

I was able to get ipmilan to reboot the remote machine, right before it seg faults, and I corrected the argument-passing problems described above. I've sent that patch to the Heartbeat maintainers, but I've also recommended that they either remove the IPMI plugin completely or at least disable it in the default build.

I just wanted to get this up there where Google could find it so that other people could give up earlier than I did. :-(

Comment
Author: Sean Reifschneider
Subject: Oops, it was actually 2.1.2...
Kevin pointed out that I said 2.0.2, when actually it was the latest heartbeat, version 2.1.2. However, 2.0.2 is almost certainly similarly impacted.

Sean

Comment
Author: Steve Webb
Subject: heartbeat with port monitor?
Got any tips for using heartbeat with a port monitor? I'm trying to get a mysqld monitor working that tells heartbeat to switch on a failure. Ever done this?

- Steve

Comment
Author: Fredrik Carlsson
Subject: Stonith
Is there any chance that you will post the script you created?

Comment
Author: Sean Reifschneider
Subject: Sorry...
Sorry, I just don't have the time to package, release, and maintain it. Between my other released software and our normal client work, I'm just swamped.

Sean

Comment
Author: Allon Herman
Subject: Thanks
Thanks Sean,
I was just about to do an ltrace myself after having no success with strace. Anyway, using numeric values instead of symbolic names for auth and priv seems to have solved my problem!

Comment
Author: Allon Herman
Subject: more about strange behavior
Another strange thing about ipmilan's behavior is that stonith with -T on turns the system off, and with -T off turns the system on... It seg faults in all cases, but only after the good deed is done, so for the time being, I'll live with it.