Every day I deal with dozens of critically important servers. Database servers, web servers, mail servers – pretty much any machine used in a live setup is important, which makes checking the health of the server critical too. Every decent application produces logs, but turning these logs into something that you actually want to check daily is the key to making sure you know as much as possible about your servers.
I want to give you two examples taken from live servers to demonstrate the usefulness of monitoring servers, and in particular of graphing their stats to show problems and illustrate long-term trends which may need addressing in future.
If you’re processing hundreds of thousands of emails a day it’s hard, if not impossible, to spot trends in your activity. If one day you send 12,000 messages instead of 8,000, how can you easily notice and, more importantly, tell whether it’s extraordinary?
Firstly, doesn’t that look pretty? OK, maybe in quite a geeky way, but it shows you some important things which let you make some presumptions.
On the whole the mail service seems pretty healthy, and shows a steady weekly pattern. It’s worth pointing out this server isn’t under huge load so the numbers aren’t massive. However, it demonstrates the point well.
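The "is 12,000 instead of 8,000 extraordinary?" question can also be put to the numbers directly, as a complement to eyeballing a graph. The following is only a rough sketch in Python: the function name `is_extraordinary`, the threshold of three standard deviations and the daily counts are all assumptions made up for illustration, not figures taken from the server above.

```python
from statistics import mean, stdev


def is_extraordinary(history, today, threshold=3.0):
    """Return True if today's count is more than `threshold` standard
    deviations away from the mean of the previous days in `history`."""
    if len(history) < 2:
        # Not enough data to estimate a spread, so flag nothing.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every previous day was identical; any change at all stands out.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily message counts for the previous week.
previous_days = [7900, 8100, 8050, 7950, 8000, 8200, 7800]

print(is_extraordinary(previous_days, 12000))  # a 12,000-message day
print(is_extraordinary(previous_days, 8150))   # a fairly ordinary day
```

A check like this ignores the weekly pattern mentioned above, so in practice you might compare a Monday only against previous Mondays. The graph still tells you far more at a glance.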
Now for a graph which shows how serious, extraordinary events can be easily identified over a longer time period. Take a plot from another server over the last year, this time of its load average.
Again, a pretty looking graph, with three key events:
Let’s address these points in order. The gap in graphing could represent the server going down (a power outage, hardware failure, etc.). In reality it is actually due to the graphing system itself being upgraded, but for this article let’s call it an outage to demonstrate what it would look like if it really had happened. We can see that after the outage the machine returned to around its normal load for that period.
The second point, the massive spike in load, was due to a DDoS attack against one of the hosted websites. It didn’t bring the server down (thanks to a well-configured Apache and quick action by the administrators), but it made the server work a lot harder than at any other point in the year. The attack prompted us to look at the general load levels of the server, and with a little more tweaking afterwards you can see the load average settled at a more even level.
Four months later, after trying to reduce the average load and memory usage further, we decided to upgrade the RAM in the server. Other graphs (not shown here) indicated that swap usage was increasing, so a physical memory upgrade was on the cards. The upgrade (which took so little time that you can’t see it on the graph) dropped the average load to a fraction of its previous level.
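For anyone curious where numbers like load average and swap usage come from, here is a small Python sketch that parses them out of text in the format Linux exposes in /proc/loadavg and /proc/meminfo. It is only an illustration under assumptions: the function names are invented for this example, the sample values are made up, and it is not how the graphs in this article were collected.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from
    /proc/loadavg-style text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_swap_used_kb(meminfo_text):
    """Return the amount of swap in use (in kB) from
    /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # The first field after the colon is the numeric value.
            values[key.strip()] = int(fields[0])
    return values["SwapTotal"] - values["SwapFree"]


# Sample text in the Linux format; the numbers are made up.
sample_loadavg = "0.42 0.36 0.30 1/123 4567"
sample_meminfo = (
    "MemTotal:        4046848 kB\n"
    "SwapTotal:       2097148 kB\n"
    "SwapFree:        1572860 kB\n"
)

print(parse_loadavg(sample_loadavg))
print(parse_swap_used_kb(sample_meminfo))
```

On a live Linux machine you could pass in the contents of the real files instead of the samples, for example `parse_loadavg(open("/proc/loadavg").read())`, and record the results every few minutes. A graphing system does that collection and storage for you.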
Graphing your stats provides a long-term record of health and performance, and gives an interesting, interactive method of keeping track of your servers. I certainly wouldn’t pore over pages of numbers to check the server daily, but I can see at a glance that things are normal. For those wanting to try it themselves, I would recommend the powerful (but a little complex) Cacti graphing suite, which is based on SNMP and rrdtool. There are simpler systems such as Munin too, and all of them run well on LAMP systems.