Practical anomaly detection

A two-day introduction to unsupervised ML techniques for anomaly detection and their strengths and weaknesses in different application areas.

An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

Hawkins [Haw80]

Anomalies are characterized by their deviation from the rest of the data (image source)

About the workshop

Anomaly detection in predictive maintenance can prevent system failures (image source)

Detecting anomalies is of high interest in multiple industries for identifying safety and security risks, ensuring production quality, or finding new business opportunities. However, anomaly detection faces some unique challenges. First, identifying anomalies by hand is difficult, especially in multidimensional data. Second, anomalies are usually poorly represented in datasets.

Anomaly detection must therefore rely largely on unsupervised learning with possibly contaminated nominal data. These methods need additional assumptions about the data to be able to identify anomalies reliably. In this workshop, we will review common approaches for anomaly detection and discuss their strengths and weaknesses in different application areas.

Learning outcomes

  • Understand qualitative and quantitative definitions of anomalies.
  • Get an overview of the theoretical foundations and practical implementations of several anomaly detection algorithms.
  • Understand which algorithms are suitable for which application areas.
  • Learn how to evaluate and compare the performance of different algorithms.
  • Learn how to find thresholds for anomaly detection using extreme value theory.

Structure of the workshop

Part 1: Introduction to anomaly detection

We start with a brief introduction to anomaly detection and its applications. We discuss the special challenges of anomaly detection and the different types of anomalies. We then introduce the contamination framework. Finally, we introduce evaluation metrics for anomaly detection and discuss the class imbalance problem.

Visualization of “few, sparse, different” assumption under the contamination framework

Evaluation metrics for anomaly detection.

  • The informal notion of anomaly and definition attempts.
  • The contamination framework.
  • Class imbalance.
  • Evaluation metrics.
  • Mahalanobis distance (see the sketch after this list).
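To give a flavour of the hands-on exercises, here is a minimal sketch of the Mahalanobis-distance baseline, using NumPy on a synthetic Gaussian sample; the data and all numbers are illustrative assumptions rather than material from the workshop notebooks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nominal data: a correlated 2-D Gaussian sample (purely illustrative).
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean, cov, size=500)

# Estimate location and scatter from the (possibly contaminated) data.
mu = X.mean(axis=0)
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_score(x: np.ndarray) -> float:
    """Squared Mahalanobis distance of x to the estimated centre."""
    d = x - mu
    return float(d @ sigma_inv @ d)

print(mahalanobis_score(np.array([0.1, 0.2])))   # small score: looks nominal
print(mahalanobis_score(np.array([3.0, -3.0])))  # large score: looks anomalous
```

Under a Gaussian model, the squared Mahalanobis distance of a nominal point follows a chi-squared distribution with one degree of freedom per feature, which already hints at how probabilistically interpretable thresholds can be obtained.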

Part 2: Anomaly detection via density estimation and robustness

Density estimation is a common approach to anomaly detection. It rests on the assumption that anomalies appear in unlikely regions of the feature space. We discuss kernel density estimation (KDE) as a generic example of a density estimation method. When the training data might contain unrecognized anomalies, robustness is an important property of the estimation procedure. We discuss robust variants of KDE and apply them to a real-world dataset with mislabelled data.

Kernel density estimation (image source)

  • Kernel density estimation.
  • Robust variants of kernel density estimation.
  • Example: Housing prices and mislabelled data.
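As a sketch of the density-estimation idea, the snippet below scores points by their negative log-density under a KDE fitted with scikit-learn. The synthetic data, the bandwidth of 0.5, and the 98%-quantile cut-off are illustrative assumptions; principled threshold selection is the topic of Part 6.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Mostly nominal data with a few planted outliers (a contaminated training set).
X_nominal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))
X = np.vstack([X_nominal, X_outliers])

# Fit a Gaussian KDE on the full, contaminated sample.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# Anomaly score = negative log-density; high scores mark unlikely regions.
scores = -kde.score_samples(X)

# Flag, say, the top 2% of scores as anomalies.
threshold = np.quantile(scores, 0.98)
print(f"{(scores > threshold).sum()} points flagged as anomalous")
```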

Part 3: Anomaly detection via isolation

The isolation forest algorithm is a tree-based approach to anomaly detection. It is based on the assumption that anomalies are rare and isolated. It has drawn a lot of attention in recent years and is considered a state-of-the-art algorithm for anomaly detection, with excellent performance across a multitude of benchmarks. We use the isolation forest for network intrusion detection on the KDD99 dataset.

Partition diagram of an isolation tree

Isolation depth of a nominal point (green) and an anomaly (red) in an isolation tree

  • Isolation forest.
  • Example: Network intrusion detection with KDD99 dataset.
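A minimal isolation forest sketch with scikit-learn; the synthetic feature matrix merely stands in for the KDD99 features, and the assumed contamination of 2% is an illustrative guess, not a property of that dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for a network-traffic feature matrix.
X_nominal = rng.normal(size=(1000, 5))
X_outliers = rng.normal(loc=5.0, size=(20, 5))
X = np.vstack([X_nominal, X_outliers])

# 'contamination' encodes our guess of the anomaly fraction in the data.
clf = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
clf.fit(X)

labels = clf.predict(X)          # +1 = nominal, -1 = anomaly
scores = -clf.score_samples(X)   # higher score = easier to isolate = more anomalous
print(f"{(labels == -1).sum()} points flagged as anomalous")
```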

Part 4: Anomaly detection via reconstruction error

Anomaly detection is particularly challenging when the data is high-dimensional. The previously introduced methods suffer from the curse of dimensionality and usually degrade quickly in performance once the dimension rises above a few dozen. Auto-encoders are a class of neural networks that learn a representation of the data that is more compact than the original data. They can be used to detect anomalies because anomalous inputs typically incur a much larger reconstruction error than nominal inputs. We apply auto-encoders to the MNIST dataset to detect corrupted images.

Schematic view of an auto-encoder

Auto-encoder reconstruction error

  • Auto-encoders.
  • Example: Identify corrupted images in the MNIST dataset.
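A compact sketch of reconstruction-based scoring with a small fully connected auto-encoder in PyTorch. The architecture, the tiny training loop, and the random tensor standing in for flattened MNIST images are illustrative assumptions only.

```python
import torch
from torch import nn

# A small auto-encoder for flattened 28x28 images (784 inputs).
class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch standing in for nominal MNIST images, values in [0, 1].
x_train = torch.rand(256, 784)

# Train the auto-encoder to reconstruct nominal data.
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), x_train)
    loss.backward()
    optimizer.step()

# Per-sample reconstruction error serves as the anomaly score.
with torch.no_grad():
    errors = ((model(x_train) - x_train) ** 2).mean(dim=1)
```

Inputs that the network never learned to reconstruct, such as corrupted images, tend to receive noticeably larger errors.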

Part 5: Anomaly detection in time series

Time series are a common type of data in anomaly detection applications. We discuss the challenges of anomaly detection in time series and give some background on time series analysis that is useful for anomaly detection. We then introduce the SARIMA model as a simple forecasting model that can be used to detect anomalies in time series. Finally, we apply the SARIMA model to the New York taxi dataset to detect anomalies in the number of taxi rides.

Anomaly detection in time series with SARIMA

  • Introduction to time series analysis.
  • Anomaly types: Point, context and pattern anomalies.
  • Preprocessing techniques for anomaly detection in time series.
  • Anomaly detection via forecasting error: SARIMA models.
  • Example: Detecting anomalies in New York taxi data.
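The forecasting-error idea can be sketched with SARIMAX from statsmodels: fit a seasonal model, inspect the residuals, and flag points whose residual is extreme. The synthetic hourly series, the (1,0,1)×(1,0,1,24) orders, and the z-score cut-off of 4 are assumptions chosen for illustration, not the settings used on the taxi data.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)

# Synthetic hourly series with a daily seasonal pattern and one injected anomaly.
n = 24 * 30
t = np.arange(n)
y = 100 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=2.0, size=n)
y[500] += 40  # injected point anomaly
series = pd.Series(y, index=pd.date_range("2024-01-01", periods=n, freq="h"))

# Fit a SARIMA model with a 24-hour seasonal period.
model = SARIMAX(series, order=(1, 0, 1), seasonal_order=(1, 0, 1, 24))
result = model.fit(disp=False)

# Large standardized one-step-ahead residuals indicate suspicious points.
z = (result.resid - result.resid.mean()) / result.resid.std()
print(series[np.abs(z) > 4])
```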

Part 6: Extreme value theory and GEV distributions

Most anomaly detection methods return a score for each data point that indicates how anomalous the point is. However, the algorithms usually do not provide a threshold for classifying a point as anomalous. We introduce extreme value theory (EVT) as a method for finding thresholds for anomaly detection. EVT is based on the assumption that the scores of anomalous points are significantly higher than the scores of nominal points. It estimates the tail of the score distribution and uses this to derive a probabilistically interpretable threshold. We apply EVT to the New York taxi dataset to find a threshold for detecting anomalies in the number of taxi rides.

Peaks over threshold method

Fitting a GEV distribution

  • Relevance of EVT for anomaly detection.
  • GEV distributions.
  • Fitting GEV distributions.
  • Example: Find detection threshold for anomalies in New York taxi data.
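A sketch of the block-maxima route to a threshold: fit a GEV distribution to block maxima of the anomaly scores with SciPy and read off a high quantile. The synthetic Gumbel-distributed scores, the block size of 100, and the 99% quantile are illustrative assumptions; the peaks-over-threshold variant pictured above follows the same logic with a generalized Pareto fit.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

# Anomaly scores from some detector; here a synthetic Gumbel-distributed sample.
scores = rng.gumbel(loc=0.0, scale=1.0, size=10_000)

# Block-maxima approach: split the scores into blocks and keep each block's maximum.
block_size = 100
usable = len(scores) // block_size * block_size
maxima = scores[:usable].reshape(-1, block_size).max(axis=1)

# Fit a GEV distribution to the block maxima.
shape, loc, scale = genextreme.fit(maxima)

# Threshold = score that a block maximum exceeds with probability 1% under the fit.
threshold = genextreme.ppf(0.99, shape, loc=loc, scale=scale)
print(f"anomaly threshold: {threshold:.2f}")
print(f"points above threshold: {(scores > threshold).sum()}")
```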

Prerequisites

  • We assume some prior exposure to machine learning and the underlying concepts.
  • Basic knowledge of Python is required to complete the exercises.

References

[Haw80] D. M. Hawkins. Identification of Outliers. Chapman and Hall, London, 1980.