munin smart plugin: ignore error in the past

As a hard drive in my server failed, my hosting provider exchanged the drive with another one which obviously had some sort of error in its past, but now seems to be fully ok again. I would have wished to receive a drive without any problems but as my server is RAID 1, I can live with that.

I do my monitoring with Munin and for monitoring my hard drives I use the smart plugin. Now this plugin also monitors the exit code of smartctl, where smartctl sets bit no 6 if there was an error in the past, so now while everything is alright, the exit code is always numeric 64.

Now the smart plugin reports this as an error, if the exit code is > 0, i.e. now it always reports a problem.

I could set the threshold to 65, but then I wouldn't be notified of other errors which essentially makes the plugin useless.

I asked at Serverfault but no one seems to have a solution for that.

So I attacked the problem on my own and patched the plugin. In the source code the important line is here:


if exit_status!=None :
# smartctl exit code is a bitmask, check man page.
num_exit_status=int(exit_status/256)

which I have modified to look like this:

if exit_status!=None :
# smartctl exit code is a bitmask, check man page.
num_exit_status=int(exit_status/256)
# filter out bit 6
num_exit_status &= 191
if num_exit_status<=2 : exit_status=None if exit_status!=None :

Now it doesn't bug me anymore when bit 6 is set, but if any other bit goes on again, I will still be notified. The most interesting part is the line where there is a bitwise operation with 191: this is 0x11011111 in binary, so doing an AND operation with the current value it will just set bit no 6 to 0 while letting the other values untouched.

Therefore a value of 64 (as mine does) will be reported as 0 while a value of 8 would remain at 8. But also, very importantly, a value of 72 (bit 6 set as always and bit 3 set because the disk is failing) it would also report 8.

And there we have another reason, why it is good to be firm with knowledge about how bits and bytes behave in a computer. Saved me from a warning message every 5 minutes :-)