What is AIOps?

February 21 2020

AIOps, short for Artificial Intelligence (AI) Operations (Ops), is a new buzzword that has emerged in IT. As with Cloud, Edge Computing, or any other buzzword you can think of, every vendor has their own definition. But when you cut through all the marketing, what really is AIOps?

As AIMS is a vendor selling an AIOps solution, it naturally has its own definition. As with every other vendor, the Very Official Marketing Definition of AIOps can be summed up as "AIOps is what AIMS sells. Now please shovel sacks of cash in the general direction of the AIMS sales team."

So let us dispense with marketing definitions, attempts to position AIMS favourably against competitors, and all the rest of what you'd expect to see on a vendor blog. This blog isn't about selling you AIMS's AIOpsy goodness. This is about the philosophy behind AIOps, and a cold, hard examination of what the state of the market for AIOps is really like.

More importantly, this blog aims to answer what value AIOps has to real-world IT practitioners, and will even address the omnipresent boogeyman of the robots coming for our jobs.

What is AIOps?

Boiled down to a single phrase, AIOps is a digital assistant for systems administrators. Think Siri, or Alexa. Instead of turning lightbulbs on, randomly buying stuff for you from Amazon, or playing the wrong song, AIOps is about being a digital assistant for your datacenter.

Terrifying, no?

In most cases, "Artificial Intelligence" is anything but intelligent. This is not exactly a secret. Apple Maps still randomly tells people to do crazy things like drive off bridges, or get lost in the desert, even 8 years after the first snarky news reports emerged. Letting AI anywhere near our precious mission-critical workloads sounds like pure lunacy!

This is both true and false at the same time. Unlike a consumer-focused AI assistant, or the AI underpinning maps applications, the utility of AIOps can be thought of as more quantum than binary. AIOps occupies a superposition of possible levels of utility that don't collapse into an actual outcome until the humans interfacing with it decide how much effort they're going to put into learning from it.

Let's consider Google Search for a moment. Not Google Search in 2018, but Google Search in the year 2000. In the year 2000 there were many search engines that had mainstream mindshare, and they were all crap. Google started gaining market share – and mind share – for the simple reason that it was less crap than the competition.

Google was founded in September of 1998. By the year 2000 its competitors were three and four years older than it. Google began indexing the web in 1996, but its prominent competitors had started in 1994, giving them at least a two-year head start in indexing the web, and another two-year head start in gathering data about user search behaviour. So how did this tiny startup end up being less awful than the competition?

The answer is that Google wasn't just about indexing the web. Google ranked sites with a proprietary algorithm. Google used machine learning to study everything from how to tell whether or not sites were malicious, to what drove user attention and clicks. On top of this knowledge, they built an empire.

This is the kind of AI that underpins AIOps. Every vendor's approach is different, but most boil down to some form of software that comes pre-loaded with some basic knowledge, and a learning engine that learns about your data center and/or applications, and how IT practitioners actually go about solving problems.

Ticketing machines

The above is rather vague. Most people who work in IT could guess that AI-anything involves some flavour of machine learning. Isn't there something more specific that can be said?

Again, the answer is both yes and no. Beyond the generalization provided above, the AIOps solutions offered by vendors really start to differ. Some AIOps solutions are heavily infrastructure-focused. Others are aimed at helping developers. AIMS is an AIOps solution focused on application administration and application insight. AIOps comes in flavours.

The broad strokes do serve a purpose, however, and there are some common threads to AIOps, regardless of the vendor. Perhaps the most important thread woven through AIOps offerings is that they can almost be thought of as a next-generation ticketing system, with added AI sauce.

Put aside for a moment how awful most ticketing systems are, and think about what a ticketing system is supposed to be. In a perfect world, ticketing systems not only serve as a means to track what needs to be done, they are also a knowledge base. If the humans involved do their jobs, every problem that IT encounters is logged in the ticketing system, along with a root cause analysis, and the eventual solution.

On day 1, the ticketing system is nothing more than a glorified to-do list. Ten years in, however, the knowledge contained in that ticketing system is absolutely invaluable. Hypothetically, of course.

Ticketing systems have a lot of problems. The search capabilities of ticketing systems are almost universally panned by practitioners. Humans are lazy and don't enter all relevant information. A lot of the time tickets are closed simply by rebooting something, and root cause analysis is never performed.

AIOps tries to automate as much of this as possible, and provide a search that's actually useful. Again, while every vendor in this space has a different area of interest, and a different approach, all of them are, on some level, trying to do the following things:

  • Learn what "good" looks like
  • Identify when things are not "good"
  • Find an adult if things are not "good"
  • Tell the adult what the AI thinks the solution is to get back to "good"
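
As a rough illustration, those four steps can be sketched in a few lines of Python. This is not how AIMS or any other vendor implements them: the function names, the "three standard deviations" rule, the queue_depth metric, and the table of suggested fixes are all assumptions made up for this example.

```python
from statistics import mean, stdev

def learn_good(history):
    """Step 1: summarise 'good' as the mean and spread of past readings."""
    return mean(history), stdev(history)

def is_not_good(value, baseline, tolerance=3.0):
    """Step 2: flag a reading more than `tolerance` standard deviations from the mean."""
    mu, sigma = baseline
    return abs(value - mu) > tolerance * sigma

def find_an_adult(metric, value, baseline, suggestions):
    """Steps 3 and 4: build an alert for a human, with a suggested fix if one is known."""
    mu, _ = baseline
    hint = suggestions.get(metric, "no known fix; needs investigation")
    return f"{metric} is {value} (normally about {mu:.0f}). Suggested action: {hint}"

# Hypothetical readings for a made-up metric called queue_depth.
history = [48, 52, 50, 49, 51, 50, 47, 53]
baseline = learn_good(history)
suggestions = {"queue_depth": "restart the stalled consumer"}

for reading in (51, 95):
    if is_not_good(reading, baseline):
        print(find_an_adult("queue_depth", reading, baseline, suggestions))
```

Only the reading of 95 produces an alert. A real product would learn from far more than eight data points, and its suggestions would come from what practitioners actually did in the past rather than from a hard-coded table.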

A practical example: AIMS

In order to usefully discuss AIOps in any more detail, one must pick a vendor and explore that specific solution. This being the AIMS blog, dissecting AIMS makes the most sense, so for the purposes of this blog, AIMS will serve as the standard candle against which other AIOps solutions are compared.

As mentioned above, AIMS is an AIOps solution focusing on application administration and application insight. Although the AIMS platform is generic, AIMS started out focusing on Microsoft integration technologies, in large part because this is where its founders had the most experience.

AIMS uses OS agents to gather information. Windows Server is the supported operating system. AIMS also supports BizTalk, SQL Server, all Microsoft Azure infrastructure and services, IIS, generic file monitoring, and HTTP/S endpoint monitoring. The next step is to extend beyond Microsoft to AWS and other non-Microsoft technologies commonly used by enterprises in their core application integration scenarios.

Like other AIOps solutions, AIMS monitors the various technologies and products it supports. The AI builds and continuously updates baselines for the supported technologies and products, while continuously looking for anomalies based on correlation of metric deviations from the current baseline. That baseline spans thousands of metrics and follows the cycles of the business it supports, so it becomes a dynamic, self-enhancing digital fingerprint or DNA of the system.
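
To make the idea of a cyclical baseline a little more concrete, here is a minimal Python sketch. It assumes the cycle is the hour of the day and measures deviation in standard deviations; both choices, along with the CyclicalBaseline class itself, are illustrative assumptions and not a description of the AIMS engine.

```python
from collections import defaultdict
from statistics import mean, pstdev

class CyclicalBaseline:
    """Keeps a separate notion of 'normal' for each hour of the day."""

    def __init__(self):
        self.samples = defaultdict(list)  # hour of day -> observed values

    def update(self, hour, value):
        """Fold a new observation into the baseline for that hour."""
        self.samples[hour].append(value)

    def deviation(self, hour, value):
        """How many standard deviations `value` sits from the norm for this hour."""
        seen = self.samples[hour]
        if len(seen) < 2:
            return 0.0  # not enough history to judge
        spread = pstdev(seen) or 1.0  # avoid dividing by zero on flat data
        return (value - mean(seen)) / spread

# Hypothetical message counts: busy at 09:00, quiet at 03:00.
baseline = CyclicalBaseline()
for count in (980, 1000, 1020):
    baseline.update(9, count)
for count in (8, 10, 12):
    baseline.update(3, count)

print(abs(baseline.deviation(9, 1000)) > 3)  # False: normal for a weekday morning
print(abs(baseline.deviation(3, 1000)) > 3)  # True: very strange at 3 a.m.
```

The same reading of 1,000 messages is unremarkable at 09:00 and alarming at 03:00, which is exactly what a single fixed threshold cannot express.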

The metrics are primarily performance metrics, ranging from message counts on ports and orchestrations, CPU load, and execution counts on stored procedures in a database to in/out transfer rates to and from cloud storage. All this data (most often north of 10,000 metrics) is relevant to business processes that are often critical for driving revenue or productivity for the business. With this data AIMS provides Business Insights or Business Signals directly from the underlying technologies that are the building blocks for digital transformation.

AIMS monitors as deep into the technologies and products it supports as possible, and looks for any irregularities.

A traditional, non-AI monitoring solution would be functionally useless at this task. Consider an application that relies on a database residing on shared storage. Anything else that uses that shared storage is going to cause deviations in performance. In order to prevent a flood of meaningless alerts, pre-AI monitoring software would have to use thresholding.

Thresholding is complicated. It requires a great deal of effort and expertise to determine useful thresholds for various performance counters, and in highly dynamic, shared environments (for example, public clouds), the reality is that thresholds should really change on a regular basis.

The two uses of thresholding

Thresholds have two uses, depending on who you talk to. For one group of people, application performance is what really matters: if the application becomes too slow, then IT should be alerted, and they should do something about it.

IT operations teams, however, tend to have a more nuanced view. Yes, performance thresholds are useful, but applications operating on shared infrastructure dip below their target thresholds all the time. In the overwhelming majority of cases, these performance excursions are highly transient, and unnoticed.

The pattern of threshold violations, however, could help someone determine if there was an infrastructure problem, assuming anyone was willing to stare at the monitoring output for long enough. Staring at the monitoring output is required because thresholding and alerting are, despite decades of work, still a hot mess of false positives. There are vendors applying AI to try to stem the flood, but something better is called for.
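
The difference between a transient excursion and a pattern worth worrying about can be shown with a small Python sketch. It treats latency above a fixed threshold as a violation and looks at how long each unbroken run of violations lasts. The 400 ms threshold, the three-reading minimum, and the sample data are all invented for illustration.

```python
def violation_runs(values, threshold):
    """Return the length of each unbroken run of readings above `threshold`."""
    runs, current = [], 0
    for value in values:
        if value > threshold:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def sustained(values, threshold, min_run=3):
    """True if any run of violations lasts at least `min_run` readings."""
    return any(run >= min_run for run in violation_runs(values, threshold))

# Hypothetical latency readings in milliseconds.
latency_ms = [120, 130, 450, 125, 118, 470, 480, 510, 495, 122]
print(violation_runs(latency_ms, threshold=400))  # [1, 4]
print(sustained(latency_ms, threshold=400))       # True
```

A naive alerting rule would fire five times on this data. Looking at the runs shows one blip that nobody would have noticed and one four-reading stretch that probably deserves a human's attention.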

This brings us back around to AIMS.

The sysadmin's digital assistant

AIMS learns what normal operations looks like. AIMS also keeps track of thousands of metrics, and performs metric correlation, so it has a much better chance of determining whether a deviation in performance is a legitimate problem, or whether it is just an irrelevant transient, and can be ignored.
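
One very simple way to picture this is sketched below in Python. It stands in for real metric correlation with a plain co-occurrence rule: a lone metric wandering off its baseline is ignored, while several doing so together are reported. The function, the limits, and the metric names are assumptions for the sake of the example, not the AIMS algorithm.

```python
def correlated_anomaly(deviations, limit=3.0, min_metrics=2):
    """Report a problem only when several metrics deviate at the same time.

    `deviations` maps a metric name to how far it sits from its baseline,
    in standard deviations.
    """
    off_baseline = sorted(name for name, dev in deviations.items() if abs(dev) > limit)
    return off_baseline if len(off_baseline) >= min_metrics else []

# Hypothetical snapshots of three metrics at two moments in time.
transient = {"cpu_load": 3.4, "message_count": 0.2, "sql_exec_count": -0.5}
incident = {"cpu_load": 4.1, "message_count": -5.0, "sql_exec_count": -3.8}

print(correlated_anomaly(transient))  # []
print(correlated_anomaly(incident))   # ['cpu_load', 'message_count', 'sql_exec_count']
```

In the first snapshot only CPU load is off, so nothing is raised. In the second, CPU is up while message and query counts have dropped, which looks much more like a genuine problem.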

AIMS can also be taught about the interconnected nature of applications, and can use this knowledge to perform automated root cause analysis. Before AIMS, operations teams trying to dig themselves out from under a crashed application would have to walk back through the cascade of failed components and services to determine what actually went boom. AIMS tracks all of that, and can surface the root cause to administrators quickly, and efficiently.
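
Walking back through a cascade of failures is easy to sketch once the dependencies are written down. The Python below assumes a hand-made dependency map and treats a failed component as a root cause when none of the things it depends on have also failed. The topology and the component names are hypothetical, and a real product would work from much richer data.

```python
def root_causes(failed, depends_on):
    """Return failed components none of whose own dependencies have failed.

    `depends_on` maps a component to the components it relies on.
    """
    failed = set(failed)
    return sorted(
        component
        for component in failed
        if not failed.intersection(depends_on.get(component, ()))
    )

# Hypothetical application topology.
depends_on = {
    "web_app": ["biztalk_port", "sql_database"],
    "biztalk_port": ["sql_database"],
    "sql_database": ["shared_storage"],
    "shared_storage": [],
}

everything_down = ["web_app", "biztalk_port", "sql_database", "shared_storage"]
print(root_causes(everything_down, depends_on))  # ['shared_storage']
```

Four components are reporting failures, but only one of them is where things actually went boom. Doing this by hand at 3 a.m. across hundreds of services is the part nobody misses.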

AIMS doesn't do anything a systems administrator can't do. And it doesn't really replace a ticketing system. What it does, however, is analyse both real-time monitoring data and its own historic archives to spot deviations from established application behaviour. This lets it identify problems before they become noticeable to end users, and/or perform root cause analysis that would take an experienced sysadmin hours or days of digging through a traditional ticketing system.

The reality of AIOps

As discussed several times above, each AIOps vendor has a different area of focus. This keeps getting repeated because it is important. The AIOps market today is much like the search market in the late 90s. There are AIOps vendors, like AIMS, that specialize in doing one thing, and doing it extremely well.

There are also AIOps vendors that are essentially traditional monitoring and ticketing system vendors with some AI slathered on top. These vendors can track way more data points than a targeted AIOps vendor, but lack AIs with domain-specific knowledge.

None of the AIOps vendors today have solutions that can monitor everything in a data center, let alone all of an organization's IT across multiple public cloud vendors, service providers, on-premises, mobile, IoT, etc. Getting there will take a decade or more of development, mergers, acquisitions, and so forth.

This doesn't mean that AIOps is useless. It also doesn't mean that a decade from now the creepy AIOps army is coming for everyone's job.

Google didn't wipe out research assistants, paralegals, journalists or other jobs focused on uncovering information. What Google did was make finding that information easier. It made individual researchers able to more quickly find the information relevant to them, and thus able to handle more, larger, and/or more complex research tasks than was possible in the physical paper era.

In a similar vein, AIOps products are simply tools which help IT operations teams do their jobs more efficiently. Like automation and orchestration platforms, AIOps products help IT practitioners manage more applications than would be possible without AIOps.

The goal of every AIOps vendor is to build a tool that becomes an extension of the IT practitioner. A tool that becomes so much a part of us that we choose not to remember what life was like before it, just like most of us choose not to dwell on what life was like before Google, or what life was like before we all started carrying around the sum total of human knowledge in our back pocket in the form of a smartphone.

As an extension of the IT practitioner, AIOps can help bring a sense of pride and accomplishment back to beleaguered IT teams drowning under the sheer volume of workloads they have to manage. A competent IT team could do everything AIOps does. With AIOps, however, they can do these things faster, for more workloads, and have to face fewer meetings with angry suits where IT can only repeat "we don't know…yet".

That’s what AIOps is. It provides a means to automate arduous and miserably annoying tasks so that we can focus on more meaningful work. It allows us to take pride in what we do, because we aren’t bogged down with the mundane. So why not book an AIMS demo today?


