At AIMS, we strive to alert as early as possible when a problem is arising in a business-critical system. Our goal is to go further and predict a possible problem before it arises. However, this is an incredibly complicated task, especially considering that AIMS builds technology-agnostic monitoring tools that simultaneously monitor a very large number of parameters.
For this reason, we decided it was time to venture into research and team up with the experts at the Norwegian Computing Center (Norsk Regnesentral). The Norwegian Research Council will support our research project for the next three years. The project is called PReVENT (prediction + events), and it will focus on finding new ways of predicting problems by combining analysis of time series data (data consisting of a numerical value and a timestamp) with natural language processing applied to what we call events. "Events" are all the messages and logs, collected by AIMS in text form, which indicate that something occurred in an IT system.
General manager and one of the founders of AIMS Innovation Ivar Sagemo (left), PhD in theoretical physics Alessandra Cagnazzo, and researcher at the Norwegian Computing Center Annabelle Redelmeier. Photo: Odd Richard Valmot
What is the difference between reacting and predicting?
Monitoring tools are, for the most part, designed to alert when something out of the ordinary happens. As precious as this information is, it only allows us to react once something is already happening. But what if, by looking simultaneously at different factors that are perfectly normal when considered individually, we could find patterns across events and data, predict that things might go in the wrong direction, and pre-alert a problem?
Take, for example, a situation in which I notice that my laptop slows down to an annoying degree while I am using a browser. It does not always happen: it happens only if I have more than 5 tabs open, two of them are a news website and a social media page, and the social media page has been open for more than 5 minutes. Having more than 5 tabs open does not create any problem by itself, and neither does having the news website or the social media page open. Also, if I close the social media page soon enough, I don’t get any performance problems. I only have a problem when all the factors are present at the same time. In this scenario, making a prediction would mean, for example, pre-alerting that something might go wrong as soon as I open a social media page while more than 5 tabs are open with the news site among them, before the 5 minutes have passed. It might be that I will not keep the social media page open for 5 minutes, or that in the meantime I will close some of the other tabs. But that is what we get when we pre-alert based on a prediction: a warning that something might go wrong, not that it necessarily will.
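To make the difference between reacting and predicting concrete, here is a minimal Python sketch of the browser example. The names `BrowserState`, `slowdown` and `pre_alert` are made up for this illustration and the rule is hard-coded by hand; in a real system the pattern would have to be discovered from data, which is exactly the hard part.

```python
from dataclasses import dataclass


@dataclass
class BrowserState:
    """Hypothetical snapshot of the factors in the browser example."""
    tabs_open: int
    news_open: bool
    social_open: bool
    social_minutes: float  # minutes the social media page has been open


def risky_combination(state: BrowserState) -> bool:
    # All factors except the time threshold: each is harmless on its own.
    return state.tabs_open > 5 and state.news_open and state.social_open


def slowdown(state: BrowserState) -> bool:
    # Reactive alert: fires only once every factor is present,
    # i.e. when the laptop is already slow.
    return risky_combination(state) and state.social_minutes > 5


def pre_alert(state: BrowserState) -> bool:
    # Predictive alert: the risky combination is in place, but the
    # 5 minutes have not passed yet, so the problem may still be avoided.
    return risky_combination(state) and state.social_minutes <= 5


just_opened = BrowserState(tabs_open=6, news_open=True,
                           social_open=True, social_minutes=1)
print(pre_alert(just_opened), slowdown(just_opened))
```

The pre-alert fires for `just_opened` even though the slowdown may never occur, for instance if the social media page is closed after two minutes.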
This example gives me the chance to talk about another problem: how to prevent something from going wrong. Predicting that something might go wrong does not mean you are automatically proposing a solution. In the example, the solution was clear: close something before the laptop slows down. In real life, however, thousands of factors can interplay, and pinpointing which one we have to act on to prevent the problem is not an easy task.
Even for a single laptop, numerous combinations of factors might slow it down, and finding all of them is a gargantuan task, even in such a small system. Imagine the struggle for a business that has hundreds or thousands of machines.
Our project will focus on unveiling patterns in order to make predictions and alert that something might go wrong, or that a problem is on its way to resolving itself if nothing major happens in the meantime. Finding a way to propose solutions is out of the scope of the project, but a better understanding of the patterns within data and events will also guide us on the path to resolution.
Why combine time series and events?
Many monitoring solutions focus either on time series (numerical data) or on events (text data describing that something has happened). But if we want to unveil the full picture, we need to combine both types of information. In the browser example, to know what is going on we need to keep track of numerical factors, like the number of tabs open and how many minutes the social media page has been open, and of events, like the fact that we opened a news web page or a social media page.
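As a rough illustration of what combining the two kinds of data could look like for the browser example, the Python sketch below merges a numerical series (tabs open over time) with text events into a single snapshot. Everything in it is an assumption made for illustration: the `Sample` and `Event` types, the `snapshot` function, and the "opened <page>" / "closed <page>" message format. Real logs are far messier and would not be parsed this simply.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One time series point: a timestamp (in minutes) and a numerical value."""
    timestamp: float
    tabs_open: int


@dataclass
class Event:
    """One text event, assumed here to read 'opened <page>' or 'closed <page>'."""
    timestamp: float
    message: str


def snapshot(samples, events, now):
    """Combine both sources into the state of the browser at time `now`."""
    # Numerical side: take the most recent sample at or before `now`.
    past = [s for s in samples if s.timestamp <= now]
    tabs = max(past, key=lambda s: s.timestamp).tabs_open if past else 0

    # Event side: replay the messages in order to learn which pages are open.
    opened_at = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > now:
            break
        action, _, page = event.message.partition(" ")
        if action == "opened":
            opened_at[page] = event.timestamp
        elif action == "closed":
            opened_at.pop(page, None)

    social_open = "social" in opened_at
    return {
        "tabs_open": tabs,
        "news_open": "news" in opened_at,
        "social_open": social_open,
        "social_minutes": now - opened_at["social"] if social_open else 0.0,
    }


samples = [Sample(0, 3), Sample(2, 6)]
events = [Event(1, "opened news"), Event(3, "opened social")]
print(snapshot(samples, events, now=4))
```

Neither source alone is enough here: the samples say how many tabs are open but not which pages they hold, and the events say which pages were opened but not how loaded the browser is.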
There are many techniques and studies on time series prediction, and a growing number of successful applications of natural language processing (NLP), but when we started to look into how to apply them, we faced many challenges. Prediction is sometimes included in monitoring solutions that are highly customized to a single technology or company. These types of models are costly to develop and maintain, and they require extensive and constant training. How can we predict efficiently in a technology-agnostic setting with thousands of variables? How can we apply NLP to text that is not a human language? How can we combine time series and events in an automated way and with a close-to-real-time response?
None of these questions is a standard problem with a ready-made solution in the literature, so we need to research answers, explore limits and push the boundaries!