Rusty's Blog

Thoughts and musings of someone who's not sure what 'normal' is…

Wednesday, March 10, 2010

Poor Man’s UPS monitor.

There are essentially three different types of UPS devices: unmonitored, serial port monitored, and Ethernet monitored. That is also the order of cost, from cheapest to most expensive, so there are a few people out there who are using either serial port or unmonitored UPSs. If you are paying for it out of pocket, and don’t have a lot of cash coming in, that’s probably OK, but in the long term you very definitely want to go to an Ethernet monitored platform, as by and large they also give you significantly more information about the health of the UPS than the others.

But for now we are probably interested in knowing whether we have power or are running off the battery.

Even with a serial port monitored device, you may not be able to safely use a platform like UPSd to monitor the UPS. If you can, then go for it. However, there are a lot of people running servers where the only serial port on the server is already in use, as a console for remote access or as a place to log errors to. If that’s the case, then UPSd may not be able to monitor even a serial port based UPS. What to do?

Well, since I presume you are not into spending money, or you probably would have gone with the more expensive solution to begin with, I’m going to presume you haven’t exactly splurged on the switches in your network either. So what happens when you plug an unmanaged switch or hub straight into mains power rather than into the UPS, and the local power drops?

The simple answer is that you can’t ping anything but localhost. You may not be able to ping the Ethernet interface of your server, but that depends on the network stack, and some will spoof the up interface for you.

The problem with Ping is not that it takes too long to get a response, or that it may give you unreliable results. The problem is that it requires the use of the network stack, which perhaps you are not using, or which you may be using in a different way from the standard IPv4 stack.

So, what to do? A very simple way to see if your switch is up and attached is to check to see if you have a link up state on the interface. Presuming you are doing standard syslog, there is a way to monitor for loss of network link, just by monitoring syslog.

If you unplug your network cable from your computer, you should see a message in the syslog from the kernel that reads: “tg3: eth0: Link is down”. I’ll leave out the tg3 and eth0 information, as they may not be correct for your setup. So what can we do with that?

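Well, if you run ‘tail -F /var/log/syslog | grep --line-buffered "Link is down" >> monitor_ups.txt’ you will capture all the link down messages. (The --line-buffered option matters here: when GNU grep writes to a file instead of a terminal it buffers its output, so without it the messages can show up in the file long after they were logged.) But we also want to know if the link has come back up along the way, so let’s change that to ‘tail -F /var/log/syslog | grep --line-buffered "Link is" >> monitor_ups.txt’ and now we will get both ‘Link is down’ and ‘Link is up’ messages.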
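Once the messages are landing in a file, something has to read them back. Here is a minimal sketch of that, assuming the capture file is the monitor_ups.txt from the commands above and that your driver uses the same ‘Link is down’ / ‘Link is up’ wording as the tg3 example. The function names are my own invention, and other drivers may word their messages differently, so check your own syslog first.

```shell
#!/bin/sh
# Sketch only: classify one syslog line as "up", "down", or "other".
# The "Link is ..." wording is taken from the tg3 example; verify it
# against what your own network driver actually logs.
classify_link_line() {
    case "$1" in
        *"Link is down"*) echo "down" ;;
        *"Link is up"*)   echo "up" ;;
        *)                echo "other" ;;
    esac
}

# Print the most recent link state recorded in a capture file.
# Defaults to monitor_ups.txt; prints "other" if nothing has been captured.
latest_link_state() {
    file="${1:-monitor_ups.txt}"
    last_line=$(grep "Link is" "$file" 2>/dev/null | tail -n 1)
    classify_link_line "$last_line"
}
```

Calling ‘latest_link_state monitor_ups.txt’ then answers the only question we care about for now: what was the last thing the kernel said about the link.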

OK, we’re capturing the link state messages. Anything else we can do? Well, the only thing I can think of that’s worse to deal with than a server that didn’t shut down properly is one that shut down unnecessarily. The UPS will most likely bridge brief outages. Not all of them of course, but when it doesn’t, that’s usually because the battery needed to be replaced anyway, or you vastly overloaded the UPS. In either case you really don’t want to be running in that situation. If you are, go get a better UPS, or at least new batteries.

A reasonable expectation or design is to set up UPS loading to give you about 15 min of power off the UPS. You should plan for the possibility that an outage may last more than 15 min, but if possible don’t have the computer providing a load for all of that 15 min. If the network connection is gone for more than 5 min, then it’s reasonable to assume that it will be off longer. Many servers take several minutes to do a clean shutdown, as they close local files and clear any write caches. So the process should monitor for ‘Link is down’ messages, and once one is received, start a 300 second countdown. At the end of 300 seconds we check again to see if the most recent message is ‘Link is up’, and if it isn’t we initiate a system halt.
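To put numbers on that budget, here is a small sketch of the arithmetic. The 15 min of battery and the 300 second countdown come from the paragraph above; the 3 min for a clean shutdown is only my assumption for ‘several minutes’, so measure your own server and plug that in instead.

```shell
#!/bin/sh
# Sketch only: how much battery time is left once we have waited out the
# grace period and then shut down cleanly. All arguments are in seconds.
battery_margin() {
    runtime=$1; grace=$2; shutdown_time=$3
    echo $(( runtime - grace - shutdown_time ))
}

# 15 min of battery, 5 min countdown, and an assumed 3 min clean shutdown.
battery_margin 900 300 180    # prints 420
```

A positive result is the safety margin in seconds. A negative result means the server would still be shutting down when the battery gives out, so either the countdown or the load on the UPS has to come down.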

But what if we had an up event in between? Shouldn’t you reset your timer? Actually in that case I’m even more interested in shutting things down. If power has been going up and down, then the UPS is in an unknown charge state. It may have initiated a discharge timer to discharge the batteries so that they can take a full charge. The batteries may actually only be available for 4 min, or less. I don’t know. And resetting the timer isn’t going to help matters any. Now is a good time to shut down.

If the latest link state is ‘up’ then the shutdown process should exit gracefully.
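The countdown and the final check can be sketched roughly like this. It assumes the capture file written by the tail and grep commands earlier, and every variable name, function name, and the flag file path is hypothetical. Note that the timer is never reset by an up event; only the most recent message at the end of the countdown decides the outcome. The sketch does not halt anything itself, it just leaves a flag file behind for something with root privileges to act on.

```shell
#!/bin/sh
# Sketch only: wait out the grace period, then look at the newest message.
CAPTURE_FILE="${CAPTURE_FILE:-monitor_ups.txt}"        # written by tail | grep
GRACE_SECONDS="${GRACE_SECONDS:-300}"                 # the 300 second countdown
FLAG_FILE="${FLAG_FILE:-/var/run/ups_shutdown.flag}"  # hypothetical path

# Echo "stay_up" only if the newest captured message says the link is up.
# Anything else, including an empty file, counts as "halt".
decide_after_grace() {
    last_line=$(grep "Link is" "$1" 2>/dev/null | tail -n 1)
    case "$last_line" in
        *"Link is up"*) echo "stay_up" ;;
        *)              echo "halt" ;;
    esac
}

# Call this when a 'Link is down' message arrives.
on_link_down() {
    sleep "$GRACE_SECONDS"
    if [ "$(decide_after_grace "$CAPTURE_FILE")" = "halt" ]; then
        : > "$FLAG_FILE"    # request a shutdown; the halt itself happens elsewhere
    fi
}
```

Whatever writes the flag needs permission to create it, and the flag should live somewhere ordinary users can’t write, or anybody on the box can shut your server down for you.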

If you have installed upsd, then you can very likely find a library call you can use to initiate the shutdowns. Otherwise you can have a root level cron job execute every minute, checking to see if you have set a shutdown flag, and if you have, then it shuts the system down for you.
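For the cron route, a sketch along these lines would do. The script path, the flag file location, and the ‘--run’ argument are all made up for illustration; ‘shutdown -h now’ is the usual halt command, but substitute whatever your system prefers.

```shell
#!/bin/sh
# Sketch only: meant to run from root's crontab once a minute, e.g.
#   * * * * * /usr/local/sbin/check_ups_flag.sh --run    (hypothetical path)
FLAG_FILE="${FLAG_FILE:-/var/run/ups_shutdown.flag}"  # hypothetical flag path
HALT_CMD="${HALT_CMD:-shutdown -h now}"               # override for dry runs

check_shutdown_flag() {
    if [ -e "$FLAG_FILE" ]; then
        rm -f "$FLAG_FILE"    # clear it so the next boot doesn't halt again
        $HALT_CMD
    fi
}

# Only act when asked to, so the file can be sourced or tested safely.
if [ "${1:-}" = "--run" ]; then
    check_shutdown_flag
fi
```

Setting HALT_CMD to something harmless like ‘echo HALT’ lets you try the whole chain end to end without actually taking the server down.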

I’ll leave the rest to you. Nothing described above is particularly difficult. It’s not as nice as using upsd as an overall package with a managed UPS, or even with a single managed UPS in the environment and a bunch of more capable dumb UPSs, where upsd shares the state of the managed UPS with the other systems. But that’s the nature of saving money. Sometimes you have to make the best of a bad situation.

posted by Rusty at 10:21 pm  
