Anomaly Detection is a fundamental requirement for succeeding in IT Operations as IT complexity keeps increasing.
In this guide, we dive into the key components, variants, and fundamental differences between approaches, and end with notes on what to consider when evaluating Anomaly Detection.
This definition is probably uncontroversial. However, the Wikipedia article continues with:
“Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text.”
This is wrong. Anomaly Detection is just as much about identifying positive outliers, observations that deviate from the majority of the data in a favorable direction. The definition seems influenced by domain-specific uses of Anomaly Detection in scenarios where the objective is to identify problems.
There is no anomaly detection without data, and data availability has increased extraordinarily in recent years. Humans are better equipped to identify anomalies when data sets are small, as humans can leverage experience and context better than any machine.
When the volume of data explodes, with software and sensors everywhere, humans can no longer cope. Adding more human resources to solve a massive data analytics problem is neither feasible nor efficient.
Data in itself does not have an opinion. Hence in Anomaly Detection, the use case is impacted by the data you harvest. You can look at Anomaly Detection as a source to identify anomalous signals that represent problems or opportunities.
The overall driver of the market – the pain and the opportunity – is the explosion of data that is represented by a fast-increasing digitalized world across consumers and enterprises.
In IT Operations, Anomaly Detection is a fundamental building block of AIOps (Artificial Intelligence for IT Operations). IT Operations is essentially about ensuring SLAs and the availability of IT systems. Hence, Anomaly Detection in IT Operations traditionally tends to focus on avoiding system disruption.
The key driver of the need for Anomaly Detection in IT Operations is business digital transformation: a consequence is that the complexity of the broad range of IT Operations has exploded beyond what traditional tools and human labour can handle.
Performance problems can now have an immediate impact on a company's P&L, as exemplified by British Airways cancelling all flights from Heathrow and Gatwick for a weekend due to problems with its IT systems.
Identifying the root cause of a problem is now like finding a needle in a haystack. And today's complexity and interconnectivity mean that finding that needle is required to avoid a ripple effect that can bring down the business.
Merge increasing data, increasing complexity, increasing vulnerability, and the need for business insight, and it becomes clear that IT Operations will fail without AI and machine learning.
As further explained in the AIOps Building Blocks, AIOps (artificial intelligence for IT operations) requires Anomaly Detection: it is the foundation on which AIOps takes action to resolve the issues that Anomaly Detection identifies.
Anomaly Detection is also necessary for Observability. Observability, in essence, means monitoring or analyzing the state of a system based on its external outputs. A system's external behavior is measured in the form of metrics that can be harvested. Scale observability from one system to the thousands of systems enterprises tend to run today, and you have a data problem that humans cannot analyze. Again, Anomaly Detection comes to the rescue.
Hence, without massive data harvesting, there is no anomaly detection.
So, with all that data available, harvesting the data becomes important. Data comes in many forms. The “three V’s” are commonly referred to in data science:
Volume: the massive amount of data available, which continuously increases. Historically, company employees created data. Now, data is produced by software, hardware, sensors, and consumers (social media), which drives the following two V's.
Variety: the differing nature of data from various sources, structured and unstructured: time series and text from systems, or from third-party sources such as Twitter.
Velocity: the speed of data, which is often real-time.
Various sources suggest that companies collect large volumes of data but leave more than 50% of it unused. A significant amount of this data is likely unstructured.
What you should consider from an anomaly detection perspective is:
What data is relevant for your use case?
What data is currently available in a structured or unstructured format?
What is the gap, and how can you capture that data?
Anomaly Detection platforms should have a flexible API that lets you harvest and feed data from existing sources. Several proprietary and open-source agents/data sources are also available and should be supported. For IT Operations, consider open-source agents such as Prometheus, Influx Telegraf, and StatsD; together, these support hundreds of technologies. You should also be able to tap into the APIs exposed by public cloud providers to harvest performance data for services and infrastructure running in their clouds. Lastly, you should have the capability to create custom scripts or “agents” from published open-source examples to fill any gaps, such as custom code or business data required to enrich the anomaly insight and understand business impact.
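As a minimal sketch of filling such a gap with a custom “agent”: the StatsD wire protocol is a simple text format over UDP, so a business metric (here a hypothetical purchases-per-minute gauge) can be emitted in a few lines of Python, assuming a StatsD-compatible agent is listening on the default port 8125:

```python
import socket

def send_statsd_gauge(name, value, host="localhost", port=8125):
    """Emit a StatsD gauge over UDP (fire-and-forget).

    The payload follows the StatsD text protocol: "metric.name:value|g".
    Returns the payload string for inspection.
    """
    payload = f"{name}:{value}|g".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # UDP: no ack, no error if nothing listens
    sock.close()
    return payload.decode("ascii")

# Example: report minute-level flight purchases as a custom business metric.
# The metric name is illustrative, not from any specific platform.
print(send_statsd_gauge("airline.purchases.per_minute", 42))
```

Because UDP is connectionless, this kind of emitter adds negligible overhead to the instrumented application, which is one reason the StatsD approach is popular for custom business metrics.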
For massive data harvesting to work, data needs to be cleansed and normalized. An Anomaly Detection engine can be domain-agnostic or domain-specific, with or without reinforced learning; hence, how data is treated and interpreted will vary by engine. Still, data harvesting alone is not sufficient: you need to ensure the quality of the data. Quality of data can have several meanings, including necessary metadata, time-resolution aggregation, and labelling.
Anomaly Detection engines are typically built to handle a set of time resolutions for time-series data, e.g. per-minute and per-hour data. If the source provides millisecond or 5-minute data, that data needs to be normalized into the appropriate time resolutions. That can be a massive undertaking if the platform does not have such pre-processing capability.
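The core of such pre-processing is bucket aggregation. A minimal, dependency-free sketch of normalizing finer-grained samples into per-minute averages (the timestamps and values are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime

def normalize_to_minutes(samples):
    """Aggregate (timestamp, value) samples into per-minute averages.

    samples: iterable of (datetime, float) pairs at any finer resolution
    (e.g. millisecond or second data). Returns {minute: mean_value}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        minute = ts.replace(second=0, microsecond=0)  # truncate to the minute
        buckets[minute].append(value)
    return {minute: sum(vs) / len(vs) for minute, vs in sorted(buckets.items())}

# Three raw samples spanning two minutes:
raw = [
    (datetime(2024, 1, 1, 9, 0, 10), 120.0),
    (datetime(2024, 1, 1, 9, 0, 40), 130.0),
    (datetime(2024, 1, 1, 9, 1, 5), 90.0),
]
print(normalize_to_minutes(raw))
```

Real platforms must additionally choose the right aggregation function per metric (mean for gauges, sum for counters, max for latency percentiles), which is part of what makes normalization at scale non-trivial.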
According to Harvard Business Review, less than 1% of unstructured data is used by organizations. Normalization of data enables behavior analysis.
With normal behavior patterns, we can identify anomalies. A key component of Anomaly Detection is the capability to decide what normal looks like. Normalized, structured data allows machine learning algorithms to learn the normal behavior of every metric.
Such normal behavior could be represented as one or more cyclical patterns that capture the learned behavior of that data. Say we harvest minute-resolution data on how many flight purchases are made on an airline website; we should expect a rich pattern of expected purchases per minute, cycling through the week. With that expected normal behavior, we start to build the “Digital DNA” of the business: how the business behaves, learned from digital signals. With this behavior analysis, we can start to look for deviations, or anomalies, from the normal.
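A crude stand-in for learning such a weekly pattern is to average historical values per (weekday, hour) slot. This is a simplified sketch, not any particular platform's algorithm; the synthetic purchase history below is invented for illustration:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def learn_weekly_pattern(history):
    """Learn a per-(weekday, hour) average from (datetime, value) history.

    Returns {(weekday, hour): mean_value}, a toy 'normal behavior' profile.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in history:
        key = (ts.weekday(), ts.hour)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Two weeks of synthetic hourly purchase counts, starting on a Monday,
# with a simple daily cycle (value depends on hour of day).
start = datetime(2024, 1, 1)  # a Monday
history = [(start + timedelta(hours=h), 100 + (h % 24)) for h in range(24 * 14)]

pattern = learn_weekly_pattern(history)
print(pattern[(0, 9)])  # expected purchases on Mondays at 09:00
```

Production systems use far richer models (seasonality decomposition, trend handling, confidence bands), but the principle is the same: summarize history into an expected value per point in the cycle.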
Expanding the data sources to include information from all the relevant IT systems that ticket purchasing relies on, we can widen the behavior analysis and start to build a rich Digital DNA of how a business process behaves. And this can be expanded to cover any relevant data source.
Harvesting of data, cleansing & normalization, and behavior analysis allows algorithms to identify deviations, outliers, or anomalies. In the simplest sense, a deviation or anomaly is abnormal behavior of one metric from the behavior pattern learned (digital DNA).
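In its simplest form, such a deviation can be flagged with a z-score test: how many standard deviations a point sits from the learned mean. This is a minimal sketch (the series and threshold are illustrative; real engines use learned behavior patterns rather than a flat mean):

```python
def zscore_anomalies(values, threshold=2.5):
    """Return indices of points deviating more than `threshold`
    standard deviations from the series mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # guard against a constant series
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# A mostly flat metric with one spike at index 7:
series = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
print(zscore_anomalies(series))  # [7]
```

Note that the outlier itself inflates the mean and standard deviation it is judged against, which is why practical detectors compute the baseline from a trailing window or from the learned normal pattern instead of the raw series.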
Accurate data, data context, and data width provide better insight. In a broader sense, an anomaly can represent an abnormal situation that breaches several previously learned behaviors, keeps trending away from the normal behavior, shows correlated deviations across several other related metrics (systems/services), and is identified as impacting the completion of flight ticket purchases.
But there is more to it. Human feedback loops, through reinforced learning, can be a benefit or a pitfall.
The term “junk-in-junk-out” obviously applies to Anomaly Detection. Your use case for Anomaly Detection, combined with the data requirements (scope and quality), should inform whether an autonomous/generic model is appropriate or whether you need a customizable model that allows for reinforced learning.
By reinforced learning, we mean a feedback loop whereby the anomaly detection algorithms are adjusted by humans based on an evaluation of the results. Whether you need a model that can be adjusted this way depends on the use case, the out-of-the-box algorithms, and the data.
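At its simplest, such a feedback loop can be sketched as analysts labelling alerts and the sensitivity being nudged accordingly. The function and labels below are hypothetical illustrations of human-in-the-loop tuning, not a full reinforcement-learning algorithm:

```python
def adjust_threshold(threshold, feedback, step=0.1):
    """Nudge an anomaly threshold from human feedback labels.

    feedback: list of analyst labels, either "false_positive"
    (alert was noise -> be less sensitive) or "missed_anomaly"
    (a real issue was not flagged -> be more sensitive).
    """
    for label in feedback:
        if label == "false_positive":
            threshold += step   # raise the bar: fewer alerts
        elif label == "missed_anomaly":
            threshold -= step   # lower the bar: more alerts
    return round(threshold, 2)

# Two noisy alerts and one missed incident shift the threshold upward on net.
print(adjust_threshold(2.5, ["false_positive", "false_positive", "missed_anomaly"]))
```

The pitfall mentioned above is visible even here: biased or inconsistent analyst labels steer the threshold in the wrong direction, so the quality of the feedback matters as much as the mechanism.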
“The discovery of correlations in a large dataset makes that dataset much more useful. IT operations teams can begin to see how different aspects of a complex IT system relate to one another and take steps toward predicting system behavior and planning to circumvent possible outages or even brownouts.”
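Discovering such correlations can start as simply as measuring how strongly two metrics' movements track each other. A minimal sketch using the Pearson correlation coefficient (the CPU and latency series below are invented; both spike at the same point in time):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.

    Returns a value in [-1, 1]; values near 1 indicate the two
    metrics deviate together.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

cpu_pct =    [20, 22, 21, 80, 23, 21]       # CPU utilisation, %
latency_ms = [5.0, 5.2, 5.1, 30.0, 5.3, 5.1]  # request latency, ms
print(round(pearson(cpu_pct, latency_ms), 3))  # close to 1.0
```

Correlation is of course not causation; in practice it serves as a candidate-generation step, pointing operators at metric pairs worth investigating together.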
Context does not start and stop with knowledge databases. Anomaly Detection, and in particular domain-agnostic Anomaly Detection, supports any source of data. This essentially means that knowledge about the systems generating the data may be limited. For example, it is easy to connect systems and services running in a public cloud. This is typically done in minutes, and the connection authorization is typically granted at the subscription or account level. Hence, your organization's use of cloud resources does not need to be extensive before you lose track of which systems are connected. This is an opportunity, as it illustrates how easily you can capture a wide footprint of systems in your observability journey. However, it also means that you may not understand the nature of the data harvested.
This is where multiple elements of context enter:
Context should be used to prioritize anomalies, match insight with the appropriate stakeholders, and build relevant business dashboards.
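A minimal sketch of context-driven prioritization: attach a business-criticality tag per source system and rank anomalies by it. The system names and criticality scores below are hypothetical:

```python
def prioritize(anomalies, criticality):
    """Rank anomalies by the business criticality of their source system.

    anomalies: list of dicts with at least a "system" key.
    criticality: {system_name: score}; unknown systems rank lowest.
    """
    return sorted(anomalies,
                  key=lambda a: criticality.get(a["system"], 0),
                  reverse=True)

# Illustrative context: which systems matter most to the business.
criticality = {"payments-api": 3, "checkout-web": 2, "dev-sandbox": 0}

anomalies = [
    {"system": "dev-sandbox", "metric": "cpu"},
    {"system": "payments-api", "metric": "latency"},
]
print([a["system"] for a in prioritize(anomalies, criticality)])
```

The same context mapping can drive routing (which team is paged) and dashboard grouping, which is why capturing it early pays off.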
This quote from Gartner’s Market Guide for AIOps platforms says a lot:
“Domain-centric approaches to AIOps are relevant for organizations that have limited data variety (that is, only a few point solutions) and that prioritize a small number of focused use cases. Such organizations have limited need or ability to look at data across multiple silos simultaneously. As use cases within organizations grow, they are likely to move to domain-agnostic tools.”
The drive for AIOps, and hence Anomaly Detection, comes from two parts of an organization: the IT Operations team and the business. Business executives now ask for appropriate governance and insight from IT systems, as their performance directly impacts business performance. Combine the insight needs of IT Operations and business executives, and it becomes clear that IT-Ops domain-specific Anomaly Detection platforms do not support the necessary breadth of performance data.
Harvard Business Review (Building an Insight Engine) agrees:
According to the i2020 research, 67% of the executives at overperforming firms (those that outpaced competitors in revenue growth) said that their company was skilled at linking disparate data sources, whereas only 34% of the executives at underperformers made the same claim.
Let’s get back to the first part of the definition:
“In data analysis, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.”
Without context, it is impossible to decide whether an anomaly is an important event. Adding context brings actionable insight to the anomaly, letting you sort, filter, and prioritize the ones that matter. Without context, anomalies are simply “observations differing significantly from the majority of data.”
Implementing Anomaly Detection is not a trivial task. Researching the available options in the market is critical to be able to choose an appropriate implementation that is strategically right for your organization.
Sustainability and Scalability should be considered together with product capabilities and ROI.
Sustainability means the ability to easily build, extend, and scale in “all directions” without massive overhead. With Anomaly Detection, you are looking to automate manual processes, so make sure your choice of platform is sustainable from that perspective. Ensure that all phases are sustainable: data harvesting, normalization, machine learning algorithms, anomaly detection, and context.
Scalability means that your choice of platform will last for at least a decade. Choose a platform you can grow with to extend its use, and a vendor with a robust product development roadmap that gives you confidence your choice is future-proof.
Together with evaluating the steps of Anomaly Detection, keep in mind that the platform is only as good as its weakest link. If data harvesting is limited, your use cases are limited (and not sustainable). If data cleansing and normalization are limited, you have a “junk-in-junk-out” situation. If the machine learning is poor, you have yet another problem, and so on.
So, think through your use cases short term and long term, and start evaluating the solutions in the market. Be careful with platforms that require a long time to value, and prioritize opportunities that let you demonstrate value with limited investment. With Anomaly Detection and AIOps, you are making decisions for the long term. Get started, test, do proofs-of-concept, prove value to the business, and keep iterating by extending and growing value.
Achieve effortless IT monitoring with a truly automated AIOps platform.