A Beginner's Guide to Anomaly Detection Techniques in Data Science – KDnuggets

Anomaly detection is an important task that you will eventually encounter if you work with data. It is widely applied in many fields, such as manufacturing, finance, and cybersecurity.

Getting started with this topic on your own can be challenging without a guide that orients you step by step. In my first experience as a data scientist, I remember struggling a lot to master this discipline.

First of all, anomaly detection involves identifying rare observations whose values deviate drastically from the rest of the data points. These anomalies, often called outliers, are a minority, while most of the items belong to the normal class. This means that we are dealing with an imbalanced dataset.

Another challenge is that, most of the time, there is no labelled data when working in industry, and it's hard to interpret predictions without any target. This means that you can't use the evaluation metrics typically applied to classification models, and you need other methods to interpret and trust the output of your model. Let's get started!

"Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These nonconforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains." — Anomaly Detection: A Survey (Chandola et al., 2009)

This is a good definition of anomaly detection in a few words. Anomalies are often associated with errors introduced during data collection, and in those cases they simply end up being removed. But there are also cases where new items show a completely different variability compared to the rest of the data, and appropriate approaches are needed to recognize this type of observation. Identifying these observations can be very useful for decision-making in companies operating in many sectors, such as finance and manufacturing.

There are three main types of anomalies: point anomalies, contextual anomalies and collective anomalies.

As you may deduce, point anomalies constitute the simplest case: a single observation is anomalous compared to the rest of the data, so it's identified as an outlier/anomaly. For example, suppose we want to detect credit card fraud in the transactions of a bank's clients. In that case, a single fraudulent transaction can be considered a point anomaly.
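
As a toy illustration (the transaction amounts and the 2.5-standard-deviation cut-off are my own assumptions, not from the article), a simple statistical way to flag point anomalies is the z-score:

    import numpy as np

    # Hypothetical transaction amounts: mostly small purchases plus one extreme value.
    amounts = np.array([25.0, 12.5, 40.0, 33.2, 18.9, 27.4, 9.99, 5000.0, 22.1, 30.5])

    # Z-score: how many standard deviations each amount lies from the mean.
    z_scores = (amounts - amounts.mean()) / amounts.std()

    # Flag values far from the mean; the 2.5 cut-off is an arbitrary choice.
    print(amounts[np.abs(z_scores) > 2.5])  # -> [5000.]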

Another case is the contextual anomaly: an observation that is anomalous only in a specific context. An example is the history of summer heat waves in the United States. Plotting that data, you would notice a huge spike around 1930, which corresponds to an extreme event called the Dust Bowl. It's called that because it was a period of dust storms that damaged the south-central United States.
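
Here is a minimal sketch of this idea with hypothetical monthly temperatures (the numbers, the three-sigma rule, and the helper function are my own assumptions): the same value can be normal in July and anomalous in January.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical monthly mean temperatures (°C), plus one row of noise per year.
    monthly_means = np.array([2, 4, 9, 14, 19, 23, 26, 25, 21, 15, 8, 3])
    temps = monthly_means + rng.normal(0, 1.5, size=(30, 12))  # 30 years x 12 months

    def is_contextual_anomaly(value, month, history, k=3.0):
        """Flag a reading that deviates strongly from its month's historical values."""
        mu = history[:, month].mean()
        sigma = history[:, month].std()
        return abs(value - mu) > k * sigma

    print(is_contextual_anomaly(26.0, month=6, history=temps))  # July: False
    print(is_contextual_anomaly(26.0, month=0, history=temps))  # January: True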

The third and last type of anomaly is the collective anomaly. The most intuitive example is the months-long absence of precipitation Italy is experiencing this year: comparing it with data from the last 50 years, no similar behaviour has ever been observed. The single data instances in an anomalous collection may not be identified as outliers by themselves, but together they indicate a collective anomaly. In this context, a single day without precipitation is not anomalous by itself, while many consecutive days without precipitation can be considered anomalous compared to previous years.
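
A rough sketch of how one might flag such a dry spell (the rainfall data, the dry-day probability, and the threshold are all my own assumptions):

    import numpy as np

    def longest_dry_spell(daily_rain_mm):
        """Length of the longest run of consecutive days without precipitation."""
        longest = current = 0
        for mm in daily_rain_mm:
            current = current + 1 if mm == 0 else 0
            longest = max(longest, current)
        return longest

    rng = np.random.default_rng(1)
    # Hypothetical history: 50 years of daily rainfall, roughly 70% dry days.
    history = [rng.choice([0.0, 5.0], size=365, p=[0.7, 0.3]) for _ in range(50)]
    past_spells = np.array([longest_dry_spell(y) for y in history])

    this_year_spell = 90  # e.g., three months without rain
    # Flag a collective anomaly if the spell is far beyond anything seen before.
    threshold = past_spells.mean() + 3 * past_spells.std()
    print(this_year_spell > threshold)  # -> True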

There are several approaches that can be applied to anomaly detection; in industry, where labels are usually missing, they typically have to operate in an unsupervised setting.
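
As one example of such an approach (my choice for illustration; the text above doesn't commit to a specific algorithm), scikit-learn's Isolation Forest isolates points with random splits, so that anomalies end up in short branches:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    # Synthetic 2-D data: a dense normal cluster plus a few scattered anomalies.
    X = np.vstack([rng.normal(0, 1, size=(500, 2)),
                   rng.uniform(-6, 6, size=(10, 2))])

    # contamination is the expected fraction of anomalies -- a guess in practice.
    model = IsolationForest(contamination=0.02, random_state=42).fit(X)
    labels = model.predict(X)            # +1 = normal, -1 = anomaly
    scores = model.decision_function(X)  # lower = more anomalous
    print((labels == -1).sum(), "points flagged")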

In an unsupervised setting, there are no labels, so there are no evaluation metrics to tell you what fraction of the flagged anomalies are real (precision) or what fraction of the real anomalies were caught (recall).

Without any possibility of evaluating the model's performance, it's more important than ever to provide an explanation of the model's predictions. This can be achieved with interpretability approaches such as SHAP and LIME.

There are two possible levels of interpretation: global and local. Global interpretability aims to explain the model as a whole, while local interpretability aims to explain the model's prediction for a single instance.
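
As a sketch of a local explanation, here is SHAP's model-agnostic KernelExplainer applied to an Isolation Forest's anomaly score; the dataset is synthetic and my own, and the exact API details are worth checking against your installed shap version:

    import numpy as np
    import shap
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    X = rng.normal(0, 1, size=(300, 4))
    X[0] = [8.0, 0.0, 0.0, 0.0]  # plant one anomaly driven by feature 0

    model = IsolationForest(random_state=0).fit(X)

    # Explain the continuous anomaly score (lower = more anomalous).
    explainer = shap.KernelExplainer(model.decision_function, shap.sample(X, 50))
    shap_values = explainer.shap_values(X[:1])  # local explanation, one instance

    # Each value is a feature's contribution to this instance's score;
    # feature 0 should dominate for the planted anomaly.
    print(shap_values)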

I hope you found this quick overview of anomaly detection techniques useful. As you have noticed, it's a challenging problem to solve, and the most suitable technique changes depending on the context. I should also highlight that it's important to do some exploratory analysis before applying any anomaly detection model, such as boxplots and PCA to visualize the data in a lower-dimensional space.
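
As a closing illustration, here is a minimal exploratory sketch along those lines; the dataset, the planted anomalies, and the plotting choices are my own assumptions, not from the article:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(7)
    # Hypothetical 10-dimensional data: one normal cluster plus a few planted anomalies.
    X = np.vstack([rng.normal(0, 1, size=(500, 10)),
                   rng.normal(6, 1, size=(5, 10))])

    # Project to two principal components to eyeball the structure before modelling.
    X_2d = PCA(n_components=2).fit_transform(X)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=10)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()

If you want to go deeper, check the resources linked in the original post. Thanks for reading! Have a nice day!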

Eugenia Anello is currently a research fellow at the Department of Information Engineering of the University of Padova, Italy. Her research project is focused on Continual Learning combined with Anomaly Detection.

Originally published at KDnuggets: A Beginner's Guide to Anomaly Detection Techniques in Data Science.
