So this happened a few years ago in the days between Christmas and New Year. I was using the slow days after Christmas to get some work done in the office when suddenly a guy from the operations team came running down the hallway shouting "a developer, a developer, a kingdom for a developer!"
Actually, that is not what he shouted (but it sounds more fun that way). But he really was running, and he was looking for a random developer because he needed somebody who "could look at some code".
Since I was apparently the only one in the office who could be trusted to "look at random code", I asked what was going on. It turned out that an alert the operations team had never seen before had triggered. The message indicated a very large number of cache misses in a system, and they needed to know what the problem was.
I shivered when I realized that this was in the "old legacy service", which we tried not to touch even with a ten-foot pole since it was being replaced, but I gave it my best. A quick text search found the one place where the error mentioned in the alert was logged. It took a few minutes to figure out what the code was supposed to be doing, and then it dawned on me...
I called up the operations team and asked how many active users we estimated at the moment. They gave me a number and I asked, "Is that an all-time high by any chance?" The operations team confirmed that we were indeed experiencing an all-time high, something this team usually did a few days after Christmas.
What had happened was that somebody had set the threshold for cache misses to a value that, several years earlier, was considered ridiculously high. But given the implementation, a certain percentage of requests was always expected to be cache misses. The problem was that the alert used an absolute value as its threshold rather than a value relative to load. Hence it triggered once traffic increased enough.
And this is why I hate absolute thresholds for alerting. Alerts should (almost) always be triggered relative to some other metric in my opinion.
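To make the difference concrete, here is a minimal sketch in Python. It is not the code from the legacy service. Every name and number in it (the 5% baseline miss rate, the limit of 10,000 misses, the 10% relative limit) is made up for illustration.

```python
# All names and numbers are hypothetical, chosen only to illustrate the idea.
BASELINE_MISS_PERCENT = 5      # assumed share of requests that always miss the cache
ABSOLUTE_MISS_LIMIT = 10_000   # assumed fixed limit, picked when traffic was much lower
RELATIVE_MISS_LIMIT = 0.10     # assumed limit on misses as a fraction of all requests


def absolute_alert(cache_misses: int) -> bool:
    """Fire when the raw number of cache misses exceeds a fixed limit."""
    return cache_misses > ABSOLUTE_MISS_LIMIT


def relative_alert(cache_misses: int, requests: int) -> bool:
    """Fire when cache misses exceed a fixed fraction of total requests."""
    if requests == 0:
        return False  # no traffic means there is no miss ratio to evaluate
    return cache_misses / requests > RELATIVE_MISS_LIMIT


# The miss ratio stays at the healthy baseline while traffic grows.
for requests in (100_000, 150_000, 300_000):
    misses = requests * BASELINE_MISS_PERCENT // 100
    print(requests, misses, absolute_alert(misses), relative_alert(misses, requests))
```

With these made-up numbers, the absolute check fires at the highest traffic level even though the miss ratio never moves from its baseline, while the relative check stays quiet until the ratio itself gets worse.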