Download Your Copy of Anomaly Detection in IT Operations

  • What Anomaly Detection is.
  • Why you need it.
  • Understand the fundamental capabilities, differences and approaches.
  • What you should consider when evaluating your Anomaly Detection platform.

Get it in your inbox ;)

An anomaly is a deviation from the normal

What is Anomaly Detection?

Let’s get the definition of Anomaly Detection set for this article. According to Wikipedia, the definition is:
 

 “In data analysis, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.”

 This definition is probably uncontroversial. However, the Wikipedia article continues with:

 

“Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text.”

This is wrong. Anomaly Detection is as much about identifying positive outliers, that is, observations that differ from the majority of the data in a favourable way. The Wikipedia definition seems to be influenced by domain-specific usage of Anomaly Detection, where the objective is to identify problems.

Pain or the opportunity?

Data explosion and complexity drive Anomaly detection.

There is no anomaly detection without data, and data availability has increased extraordinarily in recent years. Humans are better equipped to identify anomalies when the data sets are small, as humans can leverage experience and context better than any machine.

 When data explodes – with software and sensors everywhere – humans no longer cope. Adding additional human resources to solve a massive data analytics problem is not feasible or efficient.

Data in itself does not have an opinion. Hence in Anomaly Detection, the use case is impacted by the data you harvest. You can look at Anomaly Detection as a source to identify anomalous signals that represent problems or opportunities.

 The overall driver of the market – the pain and the opportunity – is the explosion of data that is represented by a fast-increasing digitalized world across consumers and enterprises.

Data is exploding driven by consumers

  • In 2020, each person created 1.7 MB of data every second.

  • By 2022, 70% of the globe’s GDP will have undergone digitization.

  • In 2021, 68% of Instagram users view photos from brands.

  • By 2025, 200+ zettabytes of data will be in cloud storage around the globe.

  • In 2020, users sent around 500 million Tweets per day.

  • By the end of 2020, 44 zettabytes will make up the entire digital universe.

  • Every day, 306.4 billion emails are sent, and 500 million tweets are made.

Anomaly Detection is primarily driven by business needs

The Key drivers for Anomaly Detection in IT Operations

In IT Operations, Anomaly Detection is a fundamental building block of AIOps – Artificial Intelligence in IT Operations. IT Operations is essentially about ensuring SLAs and the availability of IT systems. Hence, Anomaly Detection in IT Operations traditionally tends to be focused on avoiding system disruption.

The key drivers of the need for Anomaly Detection in IT Operations are:

  1. The enterprise core business is now digitalized and generating data.
  2. Agile development.
  3. Distributed applications.
  4. Technology proliferation.
  5. Interconnectivity of applications, services, and processes.
  6. Dynamic and elastic nature of applications and deployments.

This is driven by business digital transformation but a consequence is that the complexity of the IT Operations tasks has exploded beyond what traditional IT Operations is capable of handling with traditional tools and human labour.

Performance problems can now have an immediate impact on a company’s P&L as exemplified by British Airways shutting down all flights for a weekend (Heathrow and Gatwick) due to problems with their IT systems.

Finding the cause of a problem is now like finding a needle in a haystack. And with today's complexity and interconnectivity, finding that needle is required to avoid a ripple effect that can bring down the business.


 

IT Operations without anomaly detection sets you up for failure

The importance of Anomaly detection in Observability and AIOps

Merge increasing data, increasing complexity, increasing vulnerability, and the need for business insight and it becomes clear that IT Operations will fail without Anomaly Detection.
 
As further explained in the AIOps Building Blocks, AIOps requires Anomaly Detection: it is the foundation that allows AIOps to take action to resolve an issue that the Anomaly Detection has identified.
Anomaly Detection is also necessary for Observability. Observability in essence means monitoring or analyzing the state of a system based on the external outputs of a system. A system's external behavior is measured in the form of metrics that can be harvested. Scale observability from one system to the thousands of systems enterprises today tend to have and you have a data problem that humans cannot analyze. Again, Anomaly Detection comes to the rescue.
 
Hence, without massive data harvesting, there is no anomaly detection.

Junk data in - junk information out

Data harvesting

So, with all that data available, harvesting the data becomes important. Data comes in many forms. The “three V’s” are commonly referred to in data science: 
 
  • Volume: is the massive amount of data available which continuously increases. Historically, company employees created data. Now, data is produced by software, hardware, sensors, and consumers (social media) which drives the following two V’s.

  • Variety: is the differing nature of data from various sources, structured and unstructured. Time series and text from systems or third-party sources such as Twitter.

  • Velocity: is the speed of data – often real-time.

Various sources suggest that companies collect a vast amount of data and more than 50% of data is unused. A significant amount of this data is likely unstructured.
 
What you should consider from an anomaly detection perspective is:
 
  • What data is relevant for your use case.

  • What data is currently available in a structured/unstructured format.

  • What is the gap and how can you capture that data.

Anomaly Detection platforms should have a flexible API to let you harvest and feed data from existing sources. There are also several proprietary and open source agents/data sources available that should be supported. For IT operations consider using open source agents such as Prometheus, Influx Telegraf, and StatsD. These open-source agents support hundreds of technologies. You should also be able to tap into APIs exposed by public cloud providers to harvest performance data for services and infrastructure running in their clouds. Lastly, you should have the capability to custom create scripts or “agents” using published open-source examples to fill any gaps. These gaps could be custom code or business data that is required to enrich the anomaly detection insight to understand business impact.
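
As a minimal sketch of such a custom "agent", assume a StatsD-compatible collector is listening on UDP. The host, port, and metric name below are hypothetical placeholders, not part of any specific platform. The script formats a business metric as a plain-text StatsD gauge and pushes it to the collector:

```python
import socket


def format_statsd_gauge(name: str, value: float) -> str:
    """Render one gauge sample in the StatsD line format: <name>:<value>|g"""
    return f"{name}:{value}|g"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one StatsD line over UDP (fire-and-forget, no delivery guarantee).

    The default host and port are assumptions; point them at your own collector.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Calling `send_metric(format_statsd_gauge("tickets.purchases_per_minute", 42))` once a minute from a scheduled job would be one way to feed business data alongside the infrastructure metrics.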
Get some structure to the data

Data Normalization

For massive data harvesting to work, data needs to be cleansed or normalized. An Anomaly Detection engine can be agnostic or domain-specific, with or without reinforced learning. Hence, how data is treated and interpreted will vary depending on the Anomaly Detection engine. Still, data harvesting is not sufficient: you also need to ensure the quality of the data. Quality of data can have several meanings, including the necessary meta-data, time resolution aggregation, and labelling.

Anomaly Detection engines are typically built to handle a set of time resolutions for time series data, for example per-minute and per-hour data. If the source provides millisecond data or 5-minute data, that data needs to be normalized into the appropriate time resolutions. That can be a massive undertaking if the Anomaly Detection platform does not have such pre-processing capability.
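
To illustrate the downsampling half of this, here is a small sketch that collapses sub-minute samples into one value per minute. The function name is hypothetical, and averaging is an assumption: a counter such as "purchases" would more likely be summed.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_minutes(samples):
    """Average (epoch_seconds, value) samples into one value per minute.

    Returns a dict mapping each minute's start (epoch seconds) to the mean
    of all samples that fall inside that minute.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        minute_start = int(timestamp // 60) * 60  # floor to the minute
        buckets[minute_start].append(value)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}
```

Going the other way, from 5-minute data to per-minute data, requires a choice about interpolation or repetition that depends on the metric, so it is left out here.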

According to Harvard Business Review, less than 1% of unstructured data is being used by organizations.

Normalization of data allows behaviour analysis.

You need to learn what normal is

Behaviour Analysis

With normal behavior patterns, we can identify anomalies. A key component of Anomaly Detection is the capability of deciding what normal looks like. Normalized, structured data allows algorithms (machine learning) to learn the normal behaviour of every data metric.

Such a normal behaviour metric could be represented as one or more patterns of cyclical behaviors that represent the learned behavior of that data. Let’s say we harvest minute-resolution data for how many flight purchases are made on an airline website. You should expect to get a rich pattern of the purchases expected each minute, repeating cyclically through the week. With that expected normal behavior we start to build the “Digital DNA” of the business, or how the business behaves based on learning from digital signals. With this behavior analysis, we can start to look for deviations or anomalies from the normal.
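
A deliberately simple sketch of learning such a weekly pattern follows. It assumes one sample per minute and keeps a mean and standard deviation for every minute of the week. Real engines use far richer models; the function name and approach here are illustrative only.

```python
from collections import defaultdict
from statistics import mean, pstdev

MINUTES_PER_WEEK = 7 * 24 * 60


def learn_weekly_baseline(samples):
    """Learn the expected value for each minute of the week.

    samples: iterable of (epoch_seconds, value), one value per minute.
    Returns {minute_of_week: (mean, standard_deviation)}.
    Slot 0 is simply the minute containing epoch 0; only the weekly
    repetition matters, not which weekday the slots start on.
    """
    slots = defaultdict(list)
    for timestamp, value in samples:
        minute_of_week = int(timestamp // 60) % MINUTES_PER_WEEK
        slots[minute_of_week].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in slots.items()}
```

Each additional week of data adds one more observation per slot, which is why the learned pattern becomes richer over time.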

By expanding the data sources to include information from all relevant IT systems that the purchasing of tickets relies on, we can widen the behavior analysis and start to build a rich Digital DNA of how a business process behaves. And this can be expanded to cover any relevant data source.

Deviations and Anomalies

Harvesting of data, cleansing & normalization, and behavior analysis allows algorithms to identify deviations, outliers, or anomalies. In the simplest sense, a deviation or anomaly is abnormal behavior of one metric from the behavior pattern learned (digital DNA).
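
In this simplest sense, a deviation check can be sketched as a distance from the learned normal, measured in standard deviations. The threshold of 3 and the function names are assumptions for illustration, not a recommendation.

```python
def deviation_score(value, expected, spread, min_spread=1e-9):
    """How many standard deviations `value` lies from the learned normal."""
    return (value - expected) / max(spread, min_spread)


def is_anomaly(value, expected, spread, threshold=3.0):
    """Flag values more than `threshold` standard deviations from normal.

    The sign of the score is ignored, so unusually high values are
    flagged as well as unusually low ones.
    """
    return abs(deviation_score(value, expected, spread)) > threshold
```

Because the check is symmetric, a sudden surge in ticket purchases is surfaced just like a sudden drop, in line with the point that anomalies can be opportunities as well as problems.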

Accurate data, data context, and data width provide a better basis for detection. In a broader sense, an anomaly can represent an abnormal situation that breaches several previously experienced behaviors, keeps trending out of the normal behavior, shows correlated deviations across several other related metrics (systems/services), and is identified to impact the completion of flight ticket purchases.

But there is more to it. Human feedback loops, through reinforced learning, can be a benefit or a pitfall.

A human feedback loop?

Autonomous – eat-all algorithms or reinforced learning.

The term “junk-in-junk-out” obviously applies to Anomaly Detection. Your use case for Anomaly Detection, combined with the data requirement (scope & quality), needs to be weighed when deciding whether an autonomous/generic model is appropriate or whether you need a customizable model that allows for reinforced learning.

With reinforced learning, we mean a feedback loop whereby the algorithms for anomaly detection are adjusted based on human evaluation of the results. The need for a model that can be adjusted in this way depends on the use case, the algorithm available out of the box, and the data.
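
One very reduced way to picture such a feedback loop is a detection threshold that is nudged each time a human marks a reported anomaly as useful or as a false alarm. The update rule, step size, and bounds below are purely hypothetical.

```python
def adjust_threshold(threshold, was_useful, step=0.1, lower=2.0, upper=6.0):
    """Nudge the detection threshold based on one human verdict.

    was_useful=False (a false alarm) raises the threshold so fewer
    anomalies are reported; was_useful=True lowers it slightly.
    The result is clamped to the range [lower, upper].
    """
    threshold += -step if was_useful else step
    return min(upper, max(lower, threshold))
```

The clamp is where the pitfall shows up: without bounds, a run of careless verdicts could tune the detector into silence or into constant noise.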

Probable root cause or nearest root cause?

The root cause analysis and correlation.

The root cause is not the root cause. A root cause can always be further investigated. It's like peeling an onion, layer by layer. What Anomaly Detection can provide is the sequence in which an anomaly propagated, that is, the ripple effect from the initial deviation through other systems along a timeline. That allows you to search for the root cause to the extent you have harvested data.
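
The propagation sequence can be sketched by ordering systems by the time each first deviated. The system names in the usage note are hypothetical, and the input is assumed to come from an earlier detection step.

```python
def propagation_sequence(first_deviation):
    """Order systems by when each first deviated from its normal behavior.

    first_deviation: {system_name: epoch_seconds_of_first_anomaly}
    Returns a list of (system_name, seconds_after_initial_deviation).
    """
    if not first_deviation:
        return []
    start = min(first_deviation.values())
    ordered = sorted(first_deviation.items(), key=lambda item: item[1])
    return [(system, when - start) for system, when in ordered]
```

For example, `propagation_sequence({"web": 130, "db": 100, "payment": 160})` puts `db` first, which suggests where to start peeling the onion. It does not prove that `db` caused the rest.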
 
The root cause also ties into correlation and causation. With correlation, it is possible to identify the dependencies and interactions between the sources you harvest data from, simply by running a correlation analysis of the data behavior. With this correlation capability, an anomaly will be focused on data sources that relate to each other, and ongoing deviations that are not relevant will be filtered out. This allows a finer focus on the right metrics to ease root cause analysis. This is not causation, i.e., deciding that one caused the other, but a statistical approach to suggest where to investigate. Causation is knowing how a change in one system impacts another.
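
As a rough sketch of that filtering step, a plain Pearson correlation can pick out the metrics that move together with the one under investigation. The 0.8 cut-off and the function names are assumptions; production engines typically also account for time lags and seasonality.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def related_metrics(target, others, min_abs_corr=0.8):
    """Names of series whose correlation with `target` is strong.

    Both strongly positive and strongly negative correlations count.
    """
    return [name for name, series in others.items()
            if abs(pearson(target, series)) >= min_abs_corr]
```

The result is a shortlist of where to look, not a statement about which system caused the deviation.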
 
The next step after identifying an anomaly and the probable root cause is the recommended action that should be taken to resolve the issue. For domain-specific anomaly detection engines, it may be possible to provide domain-specific guidance. For more flexible data sets and domain-agnostic Anomaly Detection, recommended actions need to be built on user knowledge databases that allow users to document potential resolutions to (previous similar) problems.
 
Gartner says:
“The discovery of correlations in a large dataset makes that dataset much more useful. IT operations teams can begin to see how different aspects of a complex IT system relate to one another and take steps toward predicting system behavior and planning to circumvent possible outages or even brownouts.”

 

Context is King!

The importance of context and knowledge.

Context does not start and stop with knowledge databases. Anomaly Detection, and in particular domain-agnostic Anomaly Detection, supports any data source. This essentially means that knowledge about the systems generating the data could be limited. For example, it is easy to connect systems and services running in a public cloud. This is typically done in minutes, and the connection authorization is typically granted with rights at subscription or account level. Hence, your organization’s use of cloud resources does not need to be extensive before you lose track of which systems are connected. This represents an opportunity, as it illustrates the ease of capturing a wide footprint of systems in your observability journey. However, it also means that you may not necessarily understand the nature of the data harvested.

Here enter multiple elements of context:

  1. Source/system context: meta-data and properties from the source of the data.
  2. Correlation context: dependencies between data sources (systems), identified through correlation, can show groups of systems that support a business process or service. Together with “source/system context” this may provide highly useful insight.
  3. External (CMDB) context: additional context residing in configuration management databases (CMDBs).
  4. User-labeled context: tagging or labelling by users adds organizational human knowledge for customized context.

 The context should be used to prioritize anomalies, match insight with appropriate stakeholders and build relevant business dashboards.
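
A small sketch of how context could drive prioritization: each system carries a set of tags (from meta-data, a CMDB, or user labels), and tags map to weights that boost an anomaly's score. The tag names, weights, and data shapes are all hypothetical.

```python
def prioritize(anomalies, context, weights):
    """Sort anomalies so those on business-critical systems come first.

    anomalies: list of {"system": str, "score": float}
    context:   {system: set of tags}, e.g. from meta-data, a CMDB or users
    weights:   {tag: multiplier}; systems without weighted tags keep 1.0
    """
    def priority(anomaly):
        tags = context.get(anomaly["system"], set())
        boost = max((weights.get(tag, 1.0) for tag in tags), default=1.0)
        return anomaly["score"] * boost

    return sorted(anomalies, key=priority, reverse=True)
```

With a `revenue-critical` tag weighted at 2.0, a moderate deviation on the checkout service would outrank a larger one on an untagged test machine.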

Domain agnostic vs Domain-specific.

This quote from Gartner’s Market Guide for AIOps platforms says a lot:

“Domain-centric approaches to AIOps are relevant for organizations that have limited data variety (that is, only a few point solutions) and that prioritize a small number of focused use cases. Such organizations have limited need or ability to look at data across multiple silos simultaneously. As use cases within organizations grow, they are likely to move to domain-agnostic tools.”

 

The drive for AIOps, requiring Anomaly Detection, comes from two parts of an organization - the IT Operations team and the Business. Business executives are now asking for appropriate governance and insight from IT systems as the performance directly impacts business performance. When you combine the need for insight for both IT Operations and Business executives it should become clear that IT Ops Domain-Specific Anomaly Detection platforms do not support the necessary breadth in performance data.

Harvard Business Review (Building an Insight Engine) agrees:

According to the i2020 research, 67% of the executives at overperforming firms (those that outpaced competitors in revenue growth) said that their company was skilled at linking disparate data sources, whereas only 34% of the executives at underperformers made the same claim.

 

Identifying change or an important event?

Let’s get back to the first part of the Wikipedia definition:

“In data analysis, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.”

Without context, it is impossible to decide if an anomaly is an important event. Adding context brings actionable insight to the anomaly to sort, filter, and prioritize those that are important events. Without context, Anomalies are simply “observations differing significantly from the majority of data.”

Looking to get going?

Scalable & Sustainable Anomaly Detection

Implementing Anomaly Detection is not a trivial task. Researching the available options in the market is critical to be able to choose an appropriate implementation that is strategically right for your organization.

Sustainability and Scalability should be considered together with product capabilities and ROI.

Sustainability means the ability to easily build, extend and scale in “all directions” without massive overhead. With Anomaly Detection you look to automate manual processes. Make sure your choice of Anomaly Detection platform is sustainable from a manual process perspective. Ensure that all phases of data science are sustainable, from data harvesting and normalization through machine learning algorithms and anomaly detection to context.

Scalability means that your choice of platform will last for a decade (at least). Choose a platform you can grow with to extend use cases and a vendor that has a robust product development roadmap that gives you comfort that your choice is future-proof.

Together with evaluating the steps of Anomaly Detection, you should keep in mind that the Anomaly Detection platform is only as good as its weakest link. If data harvesting is limited – your use cases are limited (and not sustainable). If data cleansing and normalization are limited, you have a “junk-in-junk-out” situation. If the machine learning is poor, then you have another problem, etc.

So, think through your use cases short term and long term and start evaluating the solutions in the market. Be careful with platforms that require a long time to value and prioritize opportunities that allow you to demonstrate value with limited investments. With Anomaly Detection and AIOps, you are making decisions for the long term. Get started, test, do proofs-of-concept, prove value to the business, and keep iterating by extending and growing value.

Transform your IT Operations

We help organizations to capture the value of Artificial Intelligence in IT Operations - AIOps. 

Book a meeting to understand how to validate the value and feasibility of AIOps for your organization without a massive project.