Predictive Analytics for IT Operations – The Prequel: Silencing Alert Noise

06 Jan, 2022

Artificial Intelligence (AI) and Machine Learning (ML) are the next hot things in analytics. In fact, they have been for the past 25 years. Since these techniques are not going anywhere anytime soon, let’s take a moment to consider what these approaches produce.

The most useful application of an AI/ML model is to produce a probability estimate that a certain event will occur for an observed object within a specified time frame, based on the current values of selected metrics. For example, server abc123 is 71.5% likely to encounter an incident in the next 7 days. This probability is based upon the correlation of those values with occurrences of that event in the past. If the probability passes a certain threshold, a notification is sent to either a person or a machine so that action can be taken.
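To make the threshold step concrete, here is a minimal sketch in Python. It assumes a model has already produced the probability; the `Prediction` class, the `notify_if_needed` function, and the 0.7 threshold are all hypothetical names and values chosen for illustration, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    target: str          # observed object, e.g. a server name
    event: str           # the event being predicted
    probability: float   # model output, between 0.0 and 1.0
    horizon_days: int    # time frame the estimate covers


def notify_if_needed(pred: Prediction, threshold: float = 0.7) -> Optional[str]:
    """Return a notification message if the probability passes the threshold.

    The default threshold of 0.7 is an assumption; in practice it would be
    tuned to how much noise the recipient can tolerate.
    """
    if pred.probability >= threshold:
        return (f"{pred.target}: {pred.probability:.1%} likely to encounter "
                f"{pred.event} in the next {pred.horizon_days} days")
    return None  # below the threshold, so stay silent


# The example from the text: server abc123, 71.5%, 7 days.
message = notify_if_needed(Prediction("abc123", "an incident", 0.715, 7))
quiet = notify_if_needed(Prediction("xyz789", "an incident", 0.20, 7))
print(message)
print(quiet)
```

The returned message could go to a person (a ticket or a page) or to a machine (an automation hook); the sketch deliberately leaves that routing out.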

The upshot of this is that these models allow IT Operations to see the future! In many cases, operations can even react to the prediction in an automated fashion, so the problem is solved without any person ever really being aware it occurred. This is really no exaggeration. Look at the trading desks across all the major investment banks. They are using these techniques to execute trades that represent the annual salaries for the entire IT department without ever being witnessed by human eyes. These techniques are well tested in other industries and functions, and there are straightforward applications that IT is not taking advantage of – at least not broadly enough.

This probably sounds great . . . unless you’re already working in the world of IT Operations. Upon receiving the suggestion that there is a new mathematical model that will be providing incessant warnings, an infrastructure owner may be forgiven for experiencing the following internal monologue: “So, what you’re telling me is that there is another set of alarms going off, but instead of responding to things that are happening, they go off because something might happen. Hmm, no thanks. I may have shortages of budget, staff, executive attention, and understanding of the value of my contributions to organizational success, but I do not have a shortage of alarm bells.”

The Alert Environment

The root of this problem is the noise produced within the existing alert environment. It would be important to address even if one were not interested in introducing predictive analytics, but it is nearly certain that failing to eliminate this noise will cause the adoption of the output from predictive models to founder. This output takes the form of new alerts introduced into your environment alongside those already being produced by your monitoring system.

The vast majority of alerts are produced by monitoring systems. These systems look at a piece of infrastructure and evaluate the state of operations for a variety of metrics – like CPU Usage, Memory Usage, etc. These systems produce a message if the value of a metric goes above or below a certain threshold. If this sounds a little familiar, that’s the point.

Let us also recognize that monitoring systems are amazing even if we are all completely irritated by the noise that they produce. Properly utilized, monitoring systems would produce much of the benefit that we expect to get from predictive analytics. They warn us when it looks like something is about to go wrong without requiring highly trained eyes to be devoted to looking at an indicator.

I come not to bury monitoring, but to praise it [sic]. There can be no meaningful predictive models if an organization has not installed reliable monitoring systems as a foundation. However, there are a few differences between alerts produced by monitoring systems and those produced by AI/ML, and the differences highlight the value of introducing AI/ML models.

First of all, the rules for a monitoring alert are typically created by a person. These are usually common-sense evaluations or reflect a problem that the infrastructure owner has encountered previously. In contrast, an AI/ML alert is created by the impartial observation that certain values correspond with a negative outcome.

Monitoring alerts usually react to only one metric – e.g., CPU Utilization > 98%. AI/ML alerts are typically compound – e.g., servers with CPU Utilization > 98% and Memory Utilization < 25% have a 73% probability of experiencing an incident. Clearly, Memory Utilization < 25% is not worthy of concern by itself, but in conjunction with high CPU, it can be a real red flag.
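The difference between a single-metric rule and a compound one can be sketched with a simple historical incident rate. This is only an illustration: a real model would be fitted rather than counted by hand, the `incident_rate` helper is hypothetical, and the ten history records below are made up purely to show the mechanics.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]  # one historical observation of a server


def incident_rate(history: List[Record],
                  condition: Callable[[Record], bool]) -> float:
    """Share of matching observations that were followed by an incident."""
    matching = [r for r in history if condition(r)]
    if not matching:
        return 0.0  # no evidence either way
    return sum(r["incident"] for r in matching) / len(matching)


# Invented observations: CPU %, memory %, and whether an incident followed.
history = [
    {"cpu": 99, "mem": 20, "incident": 1},
    {"cpu": 99, "mem": 15, "incident": 1},
    {"cpu": 99, "mem": 22, "incident": 1},
    {"cpu": 99, "mem": 10, "incident": 0},
    {"cpu": 99, "mem": 80, "incident": 0},
    {"cpu": 99, "mem": 75, "incident": 0},
    {"cpu": 99, "mem": 60, "incident": 1},
    {"cpu": 50, "mem": 20, "incident": 0},
    {"cpu": 40, "mem": 10, "incident": 0},
    {"cpu": 60, "mem": 70, "incident": 0},
]

# Single-metric rule, as a monitoring system would express it.
single = incident_rate(history, lambda r: r["cpu"] > 98)

# Compound rule, as a model might discover it.
compound = incident_rate(history, lambda r: r["cpu"] > 98 and r["mem"] < 25)

print(f"CPU > 98%: {single:.0%}")
print(f"CPU > 98% and memory < 25%: {compound:.0%}")
```

In this toy data the compound condition isolates a riskier group than high CPU alone, which is the kind of pattern a person writing rules by hand is unlikely to look for.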

Finally, monitoring alerts remain in place until removed. This will become important in a moment (see Phase II). AI/ML rules remain relevant until they are not. This horizon depends on how often a model is refreshed, the range of dates included in the model, and the weight that is placed on historical data. In a subsequent article, we will discuss how 2020 massacred predictive models and how to recover from that.
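One way to picture "the weight that is placed on historical data" is a recency weight that halves every fixed number of days. This is an assumption made for illustration, not a method described in this article; the function names and the 90-day half-life are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def recency_weight(observed: date, today: date, half_life_days: float) -> float:
    """Weight that halves for every half_life_days of age."""
    age_days = (today - observed).days
    return 0.5 ** (age_days / half_life_days)


def weighted_incident_rate(observations: List[Tuple[date, bool]],
                           today: date,
                           half_life_days: float = 90.0) -> float:
    """Incident rate in which recent observations count more than old ones."""
    weights = [recency_weight(d, today, half_life_days) for d, _ in observations]
    total = sum(weights)
    if total == 0:
        return 0.0
    hit = sum(w for w, (_, incident) in zip(weights, observations) if incident)
    return hit / total


today = date(2022, 1, 6)
observations = [
    (today - timedelta(days=360), True),   # old incident, heavily discounted
    (today - timedelta(days=10), False),   # recent quiet period
    (today - timedelta(days=5), False),
]
half = recency_weight(today - timedelta(days=90), today, 90.0)
rate = weighted_incident_rate(observations, today)
print(half, rate)
```

A shorter half-life lets the model forget an unusual period faster; a longer one keeps more history in play. An unweighted rate over the same three observations would be one in three, while the weighted rate here is much lower because the only incident is almost a year old.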

Monitoring Optimization

The only thing more dangerous than an alarm that doesn’t work is an alarm that goes off so often that nobody pays attention to it. – Fred B. Schneider

The most relevant statement that I have ever heard concerning security applies equally to monitoring optimization, as the practices share a lot of DNA. Optimizing the monitoring system means optimizing the work that goes into configuring monitoring. That work consists of two phases that are born out of the quote above:

Phase I: Alarms that don’t work

Phase I involves deploying the relevant monitors and rules to all the critical infrastructure. A piece of infrastructure unmonitored is a broken alarm. This phase never leaves us, but it has a heavy initial investment in deploying the system. The recurring work here is to make sure new pieces of infrastructure have the monitors deployed and that new rules are configured to capture newly encountered situations. To maintain the ongoing work, visibility is important. At a glance, one should be able to identify infrastructure that is not covered by a relevant monitor.
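The "at a glance" coverage check can be sketched as a set difference between the monitors each kind of infrastructure should have and the monitors actually deployed. The host names, infrastructure kinds, and monitor names below are invented for the example, and `coverage_gaps` is a hypothetical helper.

```python
from typing import Dict, List, Set


def coverage_gaps(inventory: Dict[str, str],
                  required: Dict[str, Set[str]],
                  deployed: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each host to the sorted list of monitors it is missing.

    inventory: host -> kind of infrastructure (e.g. "web", "database")
    required:  kind -> monitors that kind should have
    deployed:  host -> monitors currently deployed on it
    """
    gaps: Dict[str, List[str]] = {}
    for host, kind in inventory.items():
        missing = required.get(kind, set()) - deployed.get(host, set())
        if missing:
            gaps[host] = sorted(missing)
    return gaps


inventory = {"abc123": "web", "db01": "database", "new-host": "web"}
required = {"web": {"cpu", "memory"}, "database": {"cpu", "memory", "disk"}}
deployed = {"abc123": {"cpu", "memory"}, "db01": {"cpu", "memory"}}

gaps = coverage_gaps(inventory, required, deployed)
print(gaps)
```

Here `new-host` has no monitors at all and `db01` is missing its disk monitor; both are broken alarms in the sense used above.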

Phase II: Alarms that nobody pays attention to

Phase II involves paring back alerts that are no longer relevant. Most organizations are solidly in this phase, but they don’t have a systematic method for handling this critical workload. This gives rise to what we all encounter as noise. Most of this occurs because an alert was created to handle a situation that is no longer relevant. Whether this is a threshold that is set too low for the current situation, or monitors looking for a transaction or piece of infrastructure that no longer exists, the chatter created by these irrelevant messages may swamp that of relevant alerts. We all tune this out to some extent, but not enough of us turn it off. Bear in mind that no matter how low priority your alerts are, they still have the potential to divert attention from your AI/ML alerts, because they have the benefit of actually happening.

[Figure: IT Analytics - Northcraft - Willow]

Visibility is again key here, as is responsibility. Eliminating the noise is largely a blocking and tackling activity. Ranking the number of alerts by the rule that produces them and then being able to quickly identify their source by a piece of infrastructure should do it. This is similar to the rollout portion of Phase I. The initial project will take some time, but the ongoing maintenance is relatively minimal with clear and effective reporting (see above). The benefit is a relatively silent background against which the alerts from your AI/ML models can be heard and acted upon. The benefits can be enormous.
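The ranking described above can be sketched in a few lines: count alerts per rule, then break each noisy rule down by the infrastructure that triggered it. The alert tuples and rule names are made up for the example, and `noisiest_rules` is a hypothetical helper rather than a feature of any monitoring tool.

```python
from collections import Counter
from typing import List, Tuple

Alert = Tuple[str, str]  # (rule name, source host)


def noisiest_rules(alerts: List[Alert],
                   top: int = 3) -> List[Tuple[str, int, List[Tuple[str, int]]]]:
    """Rank rules by alert volume, then list each rule's sources by volume."""
    by_rule = Counter(rule for rule, _ in alerts)
    report = []
    for rule, count in by_rule.most_common(top):
        by_source = Counter(src for r, src in alerts if r == rule)
        report.append((rule, count, by_source.most_common()))
    return report


alerts = (
    [("cpu_high", "abc123")] * 5
    + [("cpu_high", "db01")] * 2
    + [("disk_full", "db01")] * 3
    + [("ping_lost", "old-host")]
)

report = noisiest_rules(alerts, top=2)
for rule, count, sources in report:
    print(rule, count, sources)
```

The top of this list is where a threshold review or a retired monitor is likely to remove the most noise for the least effort.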

“Where did you go to, if I may ask?” said Thorin to Gandalf as they rode along.

“To look ahead,” said he.

“And what brought you back in the nick of time?”

“Looking behind,” said he.