Anomaly detection

In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data. Anomaly detection finds application in many domains including cyber security, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few.

Anomalies were initially sought for outright rejection or omission from the data in order to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to improve the predictions of models such as linear regression, and more recently their removal aids the performance of machine learning algorithms. However, in many applications the anomalies themselves are of interest and are often the most valuable observations in the entire data set, which need to be identified and separated from noise or irrelevant outliers.

Three broad categories of anomaly detection techniques exist. Supervised anomaly detection techniques require a data set that has been labelled as "normal" and "abnormal" and involve training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherently unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not the techniques construct a model representing normal behaviour from a given ''normal'' training data set, and then test the likelihood of a test instance being generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used, because unlabelled data is what is most widely available in practice.
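A minimal sketch of the semi-supervised setting described above, assuming normal behaviour can be approximated by a single multivariate Gaussian (the model choice, the 0.1th-percentile threshold and all variable names are illustrative, not taken from any cited method):

    # Fit a model of "normal" behaviour on labelled-normal training data and
    # flag test instances whose likelihood under the model is too low.
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # ''normal'' training set

    mu = X_normal.mean(axis=0)
    cov = np.cov(X_normal, rowvar=False)
    model = multivariate_normal(mean=mu, cov=cov)

    # Threshold chosen from the training log-likelihoods (illustrative choice)
    threshold = np.percentile(model.logpdf(X_normal), 0.1)

    X_test = np.array([[0.2, -0.5],    # looks like the training data
                       [6.0, 6.0]])    # far from anything seen in training
    print(model.logpdf(X_test) < threshold)   # expected: [False  True]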


Definition

Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent ones include:
* An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
* Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.
* An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.
* An anomaly is a point or collection of points that is relatively distant from other points in the multi-dimensional space of features.
* Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.
* Let T be a set of observations from a univariate Gaussian distribution and O a point from T. Then O is an outlier if and only if the z-score of O is greater than a pre-selected threshold.
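To make the last definition concrete, the following sketch computes z-scores for a small univariate sample and flags values above a pre-selected threshold (the sample and the threshold of 2 are illustrative choices):

    import numpy as np

    def zscore_outliers(x, threshold=2.0):
        """Return a boolean mask marking points whose |z-score| exceeds the threshold."""
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()
        return np.abs(z) > threshold

    # The value 250 lies far from the rest of the sample.
    print(zscore_outliers([10, 12, 11, 9, 13, 10, 250]))
    # [False False False False False False  True]
    # In such a tiny sample the extreme value inflates the standard deviation,
    # so the stricter (and more common) threshold of 3 would fail to flag it.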


Applications

Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement.

Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing and inductive learning. Types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. The counterpart of anomaly detection in intrusion detection is misuse detection.

Anomaly detection is often used in preprocessing to remove anomalous data from the dataset. This is done for a number of reasons. Statistics of the data such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of the data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy. Anomalies are also often the most important observations to be found in the data, such as in intrusion detection or in detecting abnormalities in medical images.
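As a small illustration of the preprocessing use mentioned above, the following sketch shows how summary statistics change once injected anomalies are removed (the synthetic data and the 1.5 x IQR "Tukey fence" removal rule are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(loc=10.0, scale=1.0, size=100),
                           [80.0, 95.0]])                 # two injected anomalies

    print("with anomalies:    mean=%.2f  std=%.2f" % (data.mean(), data.std()))

    # Drop points outside Tukey's fences (1.5 x IQR beyond the quartiles)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    keep = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)
    clean = data[keep]

    print("without anomalies: mean=%.2f  std=%.2f" % (clean.mean(), clean.std()))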


Popular techniques

Many anomaly detection techniques have been proposed in the literature. Some of the popular techniques are:
* Statistical methods (Z-score, Tukey's range test and Grubbs's test)
* Density-based techniques (k-nearest neighbor, local outlier factor, isolation forests, and many more variations of this concept); a short scikit-learn sketch of some of these follows this list
* Subspace-, correlation-based and tensor-based outlier detection for high-dimensional data
* One-class support vector machines
* Replicator neural networks, autoencoders, variational autoencoders, long short-term memory neural networks
* Bayesian networks
* Hidden Markov models (HMMs)
* Minimum Covariance Determinant
* Cluster analysis-based outlier detection
* Deviations from association rules and frequent itemsets
* Fuzzy logic-based outlier detection
* Ensemble techniques, using feature bagging, score normalization and different sources of diversity

The performance of these methods depends on the data set and parameters, and no method shows a systematic advantage over another when compared across many data sets and parameters.
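A brief comparison of three of the listed techniques on toy data, using the scikit-learn library mentioned in the Software section (the data, n_neighbors and nu values are illustrative; this is not the reference implementation of any particular paper):

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(42)
    inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
    outliers = rng.uniform(low=-8.0, high=8.0, size=(10, 2))
    X = np.vstack([inliers, outliers])

    detectors = {
        "local outlier factor": LocalOutlierFactor(n_neighbors=20),
        "isolation forest": IsolationForest(random_state=0),
        "one-class SVM": OneClassSVM(nu=0.05),
    }

    for name, detector in detectors.items():
        labels = detector.fit_predict(X)   # +1 for inliers, -1 for flagged outliers
        print(f"{name}: flagged {np.sum(labels == -1)} of {len(X)} points")

Running the sketch usually shows the detectors flagging somewhat different sets of points, in line with the observation above that no method has a systematic advantage across all data sets and parameter settings.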


Software

* ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.
* PyOD is an open-source Python library developed specifically for anomaly detection (a brief usage sketch follows this list).
* scikit-learn is an open-source Python library that has built-in functionality to provide unsupervised anomaly detection.
* Wolfram Mathematica provides functionality for unsupervised anomaly detection across multiple data types (see the Mathematica documentation).
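A minimal PyOD usage sketch, assuming the package is installed (pip install pyod) and its documented detector interface (fit, labels_, decision_scores_, predict); the kNN detector and the contamination value are illustrative choices, not recommendations:

    import numpy as np
    from pyod.models.knn import KNN

    rng = np.random.default_rng(0)
    X_train = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),     # bulk of the data
                         rng.uniform(-6.0, 6.0, size=(15, 2))])   # scattered anomalies

    clf = KNN(contamination=0.05)       # expected fraction of outliers
    clf.fit(X_train)

    print(clf.labels_.sum())            # number of training points flagged (0/1 labels)
    print(clf.decision_scores_[:5])     # raw outlier scores; larger = more anomalous

    # Score previously unseen points: one typical, one far outside the data
    X_new = np.array([[0.5, -0.3], [8.0, 8.0]])
    print(clf.predict(X_new))           # expected: [0 1]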


Datasets


* Anomaly detection benchmark data repository with carefully chosen data sets, maintained at the Ludwig-Maximilians-Universität München; mirror at the University of São Paulo.
* ODDS – a large collection of publicly available outlier detection datasets with ground truth in different domains.
* Unsupervised Anomaly Detection Benchmark at Harvard Dataverse: datasets for unsupervised anomaly detection with ground truth.
* KMASH Data Repository at Research Data Australia, with more than 12,000 anomaly detection datasets with ground truth.


See also

* Change detection
* Statistical process control
* Novelty detection
* Hierarchical temporal memory

