Maturity model for service monitoring

Over the years I've encountered organizations and individuals at very different levels of service monitoring maturity, and I figured it was time to talk a little about them. By monitoring I mean the things you look at to understand the health of a service, and in particular how quickly you can tell good health from bad.

The Novice

The Novice is going to monitor a service from the system's point of view. This means monitoring CPU levels, memory usage, etc. While this data is important to look at when a service is having a problem, a particular behavior in these metrics does not necessarily mean that the service is having a problem. And since the system as a whole is monitored, you get lots of false alarms caused by unrelated things on the machine that never affected the health of your service.
A typical alert rule is: alert if CPU is over 90%.
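
To make the false-alarm problem concrete, here is a minimal sketch of that rule. The function name and the sample numbers are made up for illustration; the point is only that the rule fires on machine-wide CPU even when no request has failed.

```python
def novice_alert(cpu_percent: float, threshold: float = 90.0) -> bool:
    """Fire when machine-wide CPU usage is over the threshold."""
    return cpu_percent > threshold


# Hypothetical sample: an unrelated job saturates the CPU
# while every request to the service still succeeds.
sample = {"cpu_percent": 96.0, "requests": 1200, "failures": 0}

print(novice_alert(sample["cpu_percent"]))  # True, although failures == 0
```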

The Logger

The Logger is going to search logs for certain strings (like "ERROR"), count the number of occurrences, and then apply some simple threshold logic to those counts. This is a huge improvement over the Novice, since the Logger is monitoring application-specific metrics, but it does not scale: the thresholds need to be tweaked as the service takes on more load.
A typical alert rule is: alert if event 4711 happens more than 10 times in one second.
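
As a rough sketch of what such a rule looks like in practice, the snippet below counts an event per second and applies a fixed limit. The log format ("<epoch second> <message>") and the function name are assumptions made for this example, not taken from any particular logging system.

```python
from collections import Counter


def logger_alerts(log_lines, event="4711", limit=10):
    """Return the seconds in which `event` appeared more than `limit` times.

    Assumes each line looks like "<epoch second> <message>".
    """
    per_second = Counter()
    for line in log_lines:
        timestamp, _, message = line.partition(" ")
        if event in message:
            per_second[int(timestamp)] += 1
    return sorted(sec for sec, count in per_second.items() if count > limit)


# Hypothetical log: 11 occurrences in second 100, 3 in second 101.
lines = ["100 event 4711 failed"] * 11 + ["101 event 4711 failed"] * 3
print(logger_alerts(lines))  # [100]
```

The limit of 10 is an absolute number, which is exactly why it has to be revisited whenever the load on the service changes.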

The Analyzer

Similar to the Logger, the Analyzer uses data from the service to count certain events. But rather than looking at individual metrics in isolation, the Analyzer looks at a few of them together to determine what is good or bad. This approach scales, and you should be very happy if your organization is this mature. The only real danger is that the metrics monitored may not match the health of your service very well.
A typical alert rule is: alert if number of failures is more than 5% of attempts per second.
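
A minimal sketch of a ratio-based rule follows. The function name and the choice to stay quiet when there is no traffic are assumptions for illustration. Because the rule compares failures to attempts, the same 5% works at any load.

```python
def analyzer_alert(failures: int, attempts: int, max_failure_ratio: float = 0.05) -> bool:
    """Fire when failures exceed the given share of attempts in the same interval."""
    if attempts == 0:
        return False  # assumption: no traffic means there is nothing to judge
    return failures / attempts > max_failure_ratio


# The same 60 failures mean very different things at different loads.
print(analyzer_alert(failures=60, attempts=1_000))   # True  (6% failed)
print(analyzer_alert(failures=60, attempts=10_000))  # False (0.6% failed)
```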

The Expert

I named this maturity level the Expert since it is really about knowing your domain and how your service behaves in this domain. For a lot of services there is something obvious that will change rapidly when the service has a problem compared to when it is healthy. For Xbox Live, when I worked there, that was the number of online users, and for Netflix it is the number of starts per second. When things don't work as they should, your users will deviate from their normal behavior patterns, and all you need to do in order to quickly know your service health is to look at that one metric and compare it with history.
Please note that this does not mean the Expert is not doing all the other things. The data is still there, but it is used to diagnose problems rather than to trigger the investigation.
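
One way to picture "compare it with history" is the sketch below. It is only an illustration under stated assumptions: the baseline is the median of the same time slot in earlier weeks, the 20% tolerance is arbitrary, and the numbers are invented. A real service would pick its own baseline and tolerance.

```python
from statistics import median
from typing import Sequence


def expert_alert(current: float, history: Sequence[float], tolerance: float = 0.2) -> bool:
    """Fire when `current` deviates from the historical median by more than `tolerance`.

    Assumes `history` is non-empty and holds the same metric for comparable
    time slots (for example, the same hour on previous weeks).
    """
    baseline = median(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance


# Hypothetical online-user counts for the same hour in the last four weeks.
history = [98_000, 101_000, 100_000, 103_000]
print(expert_alert(99_500, history))  # False, within normal variation
print(expert_alert(62_000, history))  # True, users are behaving abnormally
```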

The key to quickly determining whether a service is healthy is not necessarily to look at a lot of data, but rather to look at the right data. And I bet most services have a fairly easy way to measure that in terms of some action the users take. Naturally, you need a lot of users for this model to work, since the statistical variance needs to be small. So the Expert should always use the Analyzer to improve the monitoring.
