
Last month I wanted to get some real monitoring in place for a bunch of my personal systems. I spent weeks dabbling with setting up a serious monitoring system (I tried Zabbix, Zenoss, nagios, and Opsview), but I was making little if any progress with any of them. Largely that came down to running a RHEL 6 derivative and hitting compatibility problems…

After spending around a day of effort on them, I decided to roll my own. My needs were modest: I wanted to run some nagios check scripts and get an e-mail if a check failed continually for 15 minutes. I did not want an e-mail every 15 minutes while something stayed down, and I only wanted an alert once a check had failed several times in a row.

I was very reluctant to invent something new, but the path I was going down, using existing tools, wasn't getting me anywhere. nagios was the closest to being functional, but the config files make my eyes bleed and my head hurt.

So I bit the bullet and put together a bit under 200 lines of Python which did what I was looking for. I call it “nanomon”, and it involves a single Python program, and a config file. The scheduler is “cron”, to keep it simple, and it runs every minute.

It's available on the wonderful github under linsomniac/nanomon.

The checks are external scripts; in my case, everything I wanted to check was already available as nagios scripts. A script's success or failure can be determined by its exit code, by looking for a string in its output, or by any Python function which takes a string and returns non-False on success, such as a regular expression match.

For example:

command('/usr/lib/nagios/plugins/', success = 'OK:')
command('/path/to/customescript', success = 0)
command('/usr/lib/nagios/plugins/ -p data',
      success = re.compile(r'^OK:').match)
command('/usr/lib/nagios/plugins/check_disk -c 10% -p /',
      success = re.compile(r'^DISK OK ').match)

The first runs a nagios plugin and looks for the string “OK:” in its output. The next runs a custom script and checks that the exit code is 0. The last two use a regex to match the output of other nagios checks.
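
To make the three kinds of success criteria concrete, here is a minimal sketch of how they might be evaluated. This is not nanomon's actual code: `is_success` and `run_check` are hypothetical names, and the sketch assumes that "non-False" means any truthy return value (a failed regex match returns None, which counts as failure).

```python
import re
import shlex
import subprocess


def is_success(success, exit_code, output):
    """Decide whether a check passed (hypothetical helper, not nanomon's API).

    `success` may be:
      - a callable taking the output string (e.g. a compiled regex's .match),
      - a string that must appear somewhere in the output,
      - an integer that the exit code must equal.
    """
    if callable(success):
        # Assumption: any truthy return value means the check passed.
        return bool(success(output))
    if isinstance(success, str):
        return success in output
    return exit_code == success


def run_check(cmdline, success=0):
    """Run an external check command and apply the success criterion."""
    proc = subprocess.run(shlex.split(cmdline), capture_output=True, text=True)
    return is_success(success, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Sample plugin output, made up for illustration.
    sample = "DISK OK - free space: / 1200 MB (40%)"
    print(is_success(re.compile(r"^DISK OK ").match, 0, sample))
    print(is_success("OK:", 0, sample))
    print(is_success(0, 2, sample))
```

Checking for a callable first means a compiled regex's bound `match` method and a plain function are treated the same way, which is what lets one `success` argument cover all three cases.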

The above would alert if any of those checks produce failures 15 times in succession.
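
The "15 times in succession" rule, combined with not wanting repeat e-mails while something stays down, amounts to a small state machine per check. The following is a sketch under assumptions, not nanomon's implementation: `FailureTracker` is a hypothetical name, and state is kept in memory here, whereas a program run from cron every minute would have to persist it between runs.

```python
class FailureTracker:
    """Track consecutive failures for one check (hypothetical sketch)."""

    def __init__(self, threshold=15):
        self.threshold = threshold  # consecutive failures before alerting
        self.failures = 0           # current run of consecutive failures
        self.alerted = False        # True once a "down" alert has been sent

    def record(self, ok):
        """Record one check result.

        Returns "down" the first time the threshold is reached, "up" when
        a check that had alerted passes again, and None otherwise.
        """
        if ok:
            event = "up" if self.alerted else None
            self.failures = 0
            self.alerted = False
            return event
        self.failures += 1
        if self.failures >= self.threshold and not self.alerted:
            self.alerted = True
            return "down"
        return None


if __name__ == "__main__":
    tracker = FailureTracker(threshold=3)  # small threshold for the demo
    results = [False, False, False, False, True, True]
    print([tracker.record(ok) for ok in results])
    # prints [None, None, 'down', None, 'up', None]
```

With one run per minute and a threshold of 15, a check has to fail for roughly 15 minutes before the single "down" message goes out, and nothing more is sent until it recovers.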

It's been running for almost a month now, and performed spectacularly when I recently had a big power outage at my house. I need to bring up ZFS manually on one machine, so it alerted as down and then as back up. I'm extremely pleased with the results.

Let me know if you take a look at it.
