Last month I wanted to get some real monitoring in place for a bunch of my personal systems. I spent weeks dabbling with serious monitoring systems (I tried Zabbix, Zenoss, Nagios, and Opsview), but I made little if any progress with them. Largely that was because I run a RHEL 6 derivative and kept hitting compatibility problems…
After spending around a day of effort on them, I decided to roll my own. My needs were modest: I wanted to run some Nagios check scripts and get an e-mail if a check failed continually for 15 minutes. I did not want an e-mail every 15 minutes while something stayed down, and I only wanted an alert after a check had failed several times in a row.
Read after the fold if you're interested in my solution…
I was very reluctant to invent something new, but the path I was going down, using existing tools, wasn't making any progress. Nagios came closest to being functional, but the config files make my eyes bleed and my head hurt.
So I bit the bullet and put together a bit under 200 lines of Python that did what I was looking for. I call it “nanomon”, and it consists of a single Python program and a config file. To keep it simple, the scheduler is cron, which runs it every minute.
It's available on the wonderful GitHub under linsomniac/nanomon.
The checks are external scripts; in my case, everything I wanted to check was already available as Nagios plugins. A check's success or failure can be judged by its exit code, by looking for a string in its output, or by any Python function that takes a string and returns a non-false value on success, such as a regular expression match.
command('/usr/lib/nagios/plugins/check_mdstat.sh', success = 'OK:')
command('/path/to/customescript', success = 0)
command('/usr/lib/nagios/plugins/check_zfs.sh -p data', success = re.compile(r'^OK:').match)
command('/usr/lib/nagios/plugins/check_disk -c 10% -p /', success = re.compile(r'^DISK OK ').match)
The first runs a Nagios plugin and looks for the string “OK:” in its output. The next runs a custom script and checks that the exit code is 0. The last two use a regex to match the output of other Nagios checks.
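To make the three kinds of success test concrete, here is a minimal sketch of how such a criterion could be evaluated. This is not nanomon's actual code: the names `is_success` and `run_check` are made up for illustration, and I'm assuming Python 3.7+ and that only the check's standard output is inspected.

```python
import re
import shlex
import subprocess


def is_success(exit_code, output, success):
    """Decide whether a check passed, given its exit code and output.

    `success` may be a callable, an int exit code, or a substring,
    mirroring the three styles in the config examples above.
    """
    if callable(success):
        # e.g. re.compile(r'^OK:').match, which returns None on failure
        return bool(success(output))
    if isinstance(success, int):
        return exit_code == success
    if isinstance(success, str):
        return success in output
    raise TypeError("unsupported success criterion: %r" % (success,))


def run_check(command_line, success):
    """Run an external check script and evaluate its result (hypothetical helper)."""
    result = subprocess.run(
        shlex.split(command_line), capture_output=True, text=True
    )
    return is_success(result.returncode, result.stdout, success)
```

With this, `is_success(0, "OK: md0 clean", "OK:")` passes on the substring, while `is_success(2, "", 0)` fails on the exit code.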
Since cron runs nanomon every minute, the above would alert if any of those checks failed 15 times in succession, that is, for 15 minutes straight.
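The alerting behaviour I wanted (one e-mail after 15 failures in a row, no repeats while the check stays down, and a note when it comes back) boils down to a small state machine. The following is a rough sketch of that idea, not the code in nanomon; `CheckState`, `update`, and `ALERT_THRESHOLD` are hypothetical names, and a real cron-driven tool would also have to save this state between runs, which is left out here.

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 15  # consecutive failures before alerting (assumed default)


@dataclass
class CheckState:
    consecutive_failures: int = 0
    alerted: bool = False  # True once a "down" alert has been sent


def update(state, succeeded, threshold=ALERT_THRESHOLD):
    """Record one check result; return "down", "up", or None.

    "down" is returned only once, when the threshold is first reached.
    "up" is returned only when a check recovers after a "down" alert.
    """
    if succeeded:
        was_alerted = state.alerted
        state.consecutive_failures = 0
        state.alerted = False
        return "up" if was_alerted else None

    state.consecutive_failures += 1
    if state.consecutive_failures >= threshold and not state.alerted:
        state.alerted = True
        return "down"
    return None
```

Calling `update` once per cron run and sending mail only when it returns something other than `None` gives one message when a check goes down and one when it recovers, no matter how long the outage lasts.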
It's been running for almost a month now, and it performed spectacularly when I recently had a big power outage at my house: I need to bring up ZFS manually on one machine, so it alerted as down and then back up. I'm extremely pleased with the results.
Let me know if you take a look at it.