Welcome to End Point’s blog

Ongoing observations by End Point people

Monitoring with Purpose

If you work on Internet systems all day like we do, there's a good chance you use some sort of monitoring software. Almost every business knows it needs monitoring. If you're a small company or organization, you probably started out with something free like Nagios. Or maybe you're a really small company and prefer to outsource your alerts to a web service like Pingdom. Either way, you understand that it's important to know when your websites and mail servers are down. But do you monitor with purpose?

All too often I encounter installations where the Systems Administrator has spent countless hours setting up their checks, making sure their thresholds and notifications work as designed, without really considering what their response might be in the face of disaster (or an inconvenient page at 3am). Operations folk have been trained to make sure their systems are pingable, their CPU temperature is running cool and the system load is at a reasonable level. But what do you do when that alert comes in because the website load has been running at 10 for the last 15 minutes? Is that bad? How can you be certain?

The art of monitoring isn't simply reactive in nature. A good SysAdmin will understand that real monitoring takes an active presence. Talk to your DBAs, software engineers and architects. Learn how the various components of your system(s) interact and relate, both in good times and bad. Review your performance trends (graphs) to see how each metric evolves over time. Without understanding the functional scope of your systems, you can't expect to set meaningful thresholds on them.
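As a rough illustration of setting a threshold from observed trends rather than from a guess, here is a minimal Python sketch. Everything in it is an assumption made for the example: the sample load values are invented, the three-standard-deviation rule is just one common convention, and the function names are hypothetical. It is not how any particular monitoring tool computes thresholds.

```python
from statistics import mean, stdev


def baseline_threshold(samples, sigmas=3.0):
    """Return a threshold `sigmas` standard deviations above the historical mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return mean(samples) + sigmas * stdev(samples)


def sustained_breach(recent, threshold):
    """True only if every recent sample exceeds the threshold (sustained, not a spike)."""
    return bool(recent) and all(value > threshold for value in recent)


# Hypothetical load averages collected during normal operation.
history = [1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.1]
threshold = baseline_threshold(history)

# Hypothetical readings taken over the last 15 minutes.
last_15_minutes = [10.2, 9.8, 10.1]

print(f"threshold={threshold:.2f}")
print(f"sustained breach: {sustained_breach(last_15_minutes, threshold)}")
```

With these made-up numbers the baseline threshold comes out at roughly 2.7, so a load of 10 held for 15 minutes is clearly abnormal for this system. On a machine whose normal load sits near 9, the same reading would barely register, which is why the threshold has to come from your own graphs.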

Last but not least, every alert should be actionable. Getting paged because your application server is down is useless unless you have the proper remediation path documented and tested. Know what actions are needed, who should perform them, and what parties to escalate to in case the remediation fails. Focusing your energies on purposeful monitoring results in fewer false alarms, faster recovery from failures and regression, and an acute understanding of your entire application stack.
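One way to hold yourself to this is to refuse to enable any alert that lacks a remediation path. The Python sketch below is only an illustration under assumed names: the Alert fields, the alert names and the runbook path are hypothetical and do not correspond to any real monitoring product's configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    runbook: str = ""   # where the documented, tested remediation steps live
    owner: str = ""     # who performs the remediation
    escalation: list = field(default_factory=list)  # who to contact if it fails


def missing_fields(alert):
    """Return the names of the fields that keep this alert from being actionable."""
    missing = []
    if not alert.runbook.strip():
        missing.append("runbook")
    if not alert.owner.strip():
        missing.append("owner")
    if not alert.escalation:
        missing.append("escalation")
    return missing


# Hypothetical alert definitions, for illustration only.
alerts = [
    Alert(
        "app-server-down",
        runbook="docs/runbooks/app-server.txt",
        owner="ops-oncall",
        escalation=["engineering-lead"],
    ),
    Alert("high-load"),
]

for alert in alerts:
    problems = missing_fields(alert)
    status = "actionable" if not problems else "missing: " + ", ".join(problems)
    print(f"{alert.name}: {status}")
```

A check like this could run before new alert definitions are deployed, so that a page without a runbook, an owner and an escalation contact never reaches anyone at 3am.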

7 comments:

Christa Dixon said...

This is the most incredible blog post I have ever read. From now on, I am monitoring with PURPOSE! This is brilliant!!!

Jon Jensen said...

It's good to see Jason has such an enthusiastic following. :)

Anonymous said...

What do normal or large companies use for monitoring, if Pingdom and Nagios are just for small or really small companies?

I know several really large organizations that use Nagios and they are really happy with it.

Jason Dixon said...

Yes, you'll find Nagios deployed throughout companies of all sizes. It's the leader in open source monitoring because it's relatively simple to set up, has an enormous community following and has checks for virtually any scenario.

Many large companies look to commercial offerings because of compliance requirements and internal demand for commercial support. Examples include Nimsoft, IBM Tivoli, Hyperic HQ, HP OpenView, BMC SIM and CA's SIM.

Leendert Brouwer said...

That's a good view on how to approach monitoring. Monitoring something "just because you can" is useless.

About an earlier comment on Nagios - large companies definitely use it as well. We use Nagios with Opsview on top of it.

One can say a lot about monitoring, aside from the technical aspect of it. It's important that it is part of the development processes inside an organisation, and good monitoring will often demand custom monitoring solutions. That's why I like flexible, straightforward systems such as Nagios. We have tons of scripts that check very specific transactions inside applications, for example making calls to web services and making sure a valid response is received, knowing how to deal with SOAP faults, etc.

It's a huge territory and (to my surprise) literature on it, especially the organizational side of it, hardly scratches the surface.

Dmitriy said...

"Talk to your DBAs, software engineers and architects. Learn how the various components of your system(s) interact and relate, both in good times and bad. Review your performance trends (graphs) to see how each metric evolves over time."

Could you please clarify this? One might get the impression that you regard a sysadmin as someone who receives pages for all of the above systems and is expected to be able to fix them. I don't think that's what you are trying to say, but it sure sounds like it.

Jason Dixon said...

Dmitriy-

I think you'd be surprised just how commonplace this is. Many organizations are squeezed for resources and typically utilize the operations staff for triage support before escalating to the engineering team for more advanced troubleshooting.