Who is going to be my anomaly detector?

February 4 2019

One important aspect of integration monitoring is anomaly detection. After collecting metrics on the performance of our integration, we can use them to define what the normal behavior of our integration looks like, and on that basis build an anomaly detection and alerting mechanism.

But who is going to look at all our data to decide what is normal and what is not? This is a delicate task that weighs heavily on the overall effectiveness of the monitoring. There are essentially two options: you ask a human being to do the dirty job manually, or you rely on algorithms, combined with a smaller and more targeted amount of human interaction.

The human detector: the Modern Times of modern times


When I think of manually monitoring an integration, I think of Chaplin as the factory worker in Modern Times: the desperate job of a stressed man trying to keep up with defining the minimum and maximum allowed value for each parameter. A complex integration can have thousands of nodes, each of which has up to hundreds of parameters to be monitored. Human beings alone cannot deal efficiently with this multitude of data. To set thresholds, one first has to clean the data and then interpret them. As a result, the gold standard of manual monitoring is setting static thresholds based on the all-time (or recent) maximum and minimum of each parameter. These thresholds are called static because they don't account for daily, weekly, monthly, yearly, or any other time cycles. Going static is risky, because you might miss important alerts.
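To see why static thresholds miss alerts, here is a minimal sketch of min/max thresholding on a single parameter. The sample values are hypothetical response times; the point is that one legitimate nightly spike in the history widens the band so much that a genuinely abnormal daytime value slips through unnoticed.

```python
def static_thresholds(history):
    """Derive fixed alert bounds from the historical values of one parameter."""
    return min(history), max(history)

def is_anomaly(value, bounds):
    """A value is flagged only if it falls outside the fixed band."""
    lo, hi = bounds
    return value < lo or value > hi

# Hypothetical response-time samples (ms); 480 ms came from a normal nightly batch.
samples = [120, 135, 110, 480, 125, 140, 130]
bounds = static_thresholds(samples)   # (110, 480): the batch spike widens the band

is_anomaly(500, bounds)  # True  - outside the all-time range
is_anomaly(300, bounds)  # False - clearly abnormal for daytime, but masked by the max
```

A 300 ms daytime response here is more than double the typical value, yet the static band never raises an alert for it.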

The algorithmic detector: a dynamic approach


Dynamic thresholds, which adapt and change according to the different time cycles of your business, make your anomaly detection far more accurate. A dynamic threshold is a threshold that changes every week, every day, or even every hour. Dynamic thresholds are almost impossible to set correctly by hand, even for a small number of parameters, and because they are so tightly tied to all the different time cycles, they need to be maintained constantly. The only way to keep these thresholds up to date is to rely on algorithms.
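One simple algorithmic approach to dynamic thresholds is to compute a separate band per hour of the day from historical data, e.g. mean plus or minus a few standard deviations. This is a sketch with made-up numbers, not a production algorithm; real tools layer daily, weekly, and seasonal cycles on top of this idea.

```python
from collections import defaultdict
import statistics

def hourly_thresholds(samples, k=3.0):
    """Per-hour bounds: mean +/- k standard deviations, one band per hour of day."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (statistics.mean(vs) - k * statistics.pstdev(vs),
                   statistics.mean(vs) + k * statistics.pstdev(vs))
            for hour, vs in by_hour.items()}

def is_anomaly(hour, value, bounds):
    """Check the value against the band for its own hour, not a global one."""
    lo, hi = bounds[hour]
    return not (lo <= value <= hi)

# Hypothetical history of (hour_of_day, response_time_ms) pairs:
# a nightly batch around 02:00 is normally slow; daytime traffic is fast.
history = [(2, v) for v in (440, 450, 460, 445, 455)] + \
          [(14, v) for v in (118, 120, 122, 119, 121)]
bounds = hourly_thresholds(history)

is_anomaly(14, 300, bounds)  # True  - abnormal for 14:00, even though 02:00 allows it
is_anomaly(2, 460, bounds)   # False - normal for the nightly batch window
```

The same 300 ms value that a static min/max band would accept is correctly flagged here, because the threshold is tied to the hour of day in which the value occurs.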

First, you need someone to write a good algorithm for you and to maintain it. Even if you are using machine learning, which makes the algorithm adaptive, there will always be a need to update it with new techniques and make sure it keeps performing optimally. So even if it might be tempting to hire a consultant to write the algorithm for you, remember that in the long run this might be the wrong choice: once the consultant is gone, no one will know the algorithm well enough to quickly change and adapt it.

Moreover, to make the algorithm perform well, you need clean data, so you will need someone responsible for cleaning and pre-processing it. And that is not all: extracting clean data and processing it requires memory and computational power, so you might end up with your anomaly detection engine impacting the performance of the very system it monitors. To avoid that, you need skilled developers and data engineers. In short, if you are serious about your anomaly detection, you should have a team of data scientists, engineers, and developers working on it full time. And let's not forget that monitoring means handling a lot of data and performing heavy calculations, so eventually you may want to run the algorithm in the cloud, which can result in substantial costs for cloud services.

Not every company has the resources for a monitoring team with a dedicated budget. If your company is among them, don't despair: you can still count on monitoring tools to do the job for you. You can then use your resources to fix the problems that get alerted, give occasional feedback to the monitoring tool to improve its analysis (not even a machine learning algorithm performs at its best when left without feedback), and, most importantly, develop new business-related ideas!

Remember: if you want reliable and ever-improving anomaly detection, make sure the monitoring tool you choose has a team behind it with the skills to deliver everything you need.
