Being Cellfish: The Good Diagnostics Paradox

If you want your system to have good diagnostics (aka logging), I find there is a kind of catch-22 in achieving it. That holds whether the diagnostics are meant to explain why things went wrong or to support business analysis. And rather than calling it a catch-22 I'll call it a paradox. Because it sounds cooler...

The paradox is most obvious when we think about diagnostics as tools to understand why something went wrong in our system. I believe that in order to get good diagnostics you actually need to investigate real problems. But to investigate, you first need to know that something is wrong, so you need good monitoring. And in order to get good monitoring you need good diagnostics...

So let's take a step back. When I say good monitoring I mean two things. First, good monitoring alerts (or notifies) based on real problems. Alerting because CPU usage is high is in general not good (unless it is the only thing you can monitor). Second, good monitoring rarely cries wolf, so there are very few (if any) false positives.
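To make those two properties concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Window` record, the `should_alert` function, the 5% failure threshold and the three-window rule are hypothetical and not taken from any particular monitoring product. It alerts on a symptom users actually feel (failed requests) and ignores CPU usage, and it requires the problem to persist across several windows so a single spike does not cry wolf.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One monitoring interval (hypothetical shape, for illustration only)."""
    requests: int
    failures: int
    cpu_percent: float  # kept for context; deliberately never used to alert


def should_alert(windows, max_failure_rate=0.05, consecutive=3):
    """Return True only if the failure rate exceeded the threshold
    in each of the last `consecutive` windows."""
    if len(windows) < consecutive:
        return False  # not enough history to rule out a one-off spike
    recent = windows[-consecutive:]
    return all(
        w.requests > 0 and w.failures / w.requests > max_failure_rate
        for w in recent
    )
```

With this rule, three windows at 99% CPU and zero failures stay quiet, and so does a single bad window surrounded by healthy ones. The thresholds are made up; finding the right ones is exactly the kind of thing you only learn by iterating.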

And when I say good diagnostics I mean performance counters and log messages that can explain why there is (or was) a problem without debugging the code. Because if you need to start the debugger to understand what your code is doing, you have kind of failed.
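As a rough illustration of what I mean, here is a small Python sketch. The `process_order` function, the logger name and the counter names are all hypothetical, and a plain `collections.Counter` stands in for real performance counters. The point is that each log message carries the values that drove the decision, and each outcome bumps its own counter, so you can tell what happened and how often without attaching a debugger.

```python
import logging
from collections import Counter

logger = logging.getLogger("orders")  # hypothetical logger name
counters = Counter()  # stand-in for real performance counters


def process_order(order_id, quantity, stock):
    """Accept or reject an order, recording why in logs and counters."""
    counters["orders.received"] += 1

    if quantity <= 0:
        counters["orders.rejected.invalid_quantity"] += 1
        logger.warning(
            "Rejected order %s: quantity %d is not positive",
            order_id, quantity,
        )
        return False

    if quantity > stock:
        counters["orders.rejected.out_of_stock"] += 1
        logger.warning(
            "Rejected order %s: requested %d but only %d in stock",
            order_id, quantity, stock,
        )
        return False

    counters["orders.accepted"] += 1
    logger.info(
        "Accepted order %s: quantity %d, stock left %d",
        order_id, quantity, stock - quantity,
    )
    return True
```

Compare a message like "Rejected order A1: requested 5 but only 3 in stock" with a bare "order failed". The first one answers the question you would otherwise answer with a breakpoint.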

So there you have it: if you want good diagnostics you need to iterate on and improve both diagnostics and monitoring several times. Just accept that it will not be perfect from the start and that it will take time. Naturally, experienced developers will need fewer iterations, but assuming you will get it right the first time is naive.

2014-11-20
