Anomaly detection

In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behaviour. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data. Anomaly detection finds application in many domains including cyber security, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few.

Anomalies were initially sought for outright rejection or omission from the data in order to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to improve the predictions of models such as linear regression, and more recently their removal aids the performance of machine learning algorithms. However, in many applications the anomalies themselves are of interest and are often the most valuable observations in the entire data set, which need to be identified and separated from noise or irrelevant outliers.

Three broad categories of anomaly detection techniques exist. Supervised anomaly detection techniques require a data set that has been labelled as "normal" and "abnormal" and involve training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherently unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not the techniques construct a model representing normal behaviour from a given ''normal'' training data set, and then test the likelihood of a test instance being generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used, because unlabelled data is what is most widely available in practice.
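A minimal sketch of the semi-supervised setting described above, assuming normal behaviour can be approximated by a single multivariate Gaussian (the model choice, the 0.1th-percentile threshold and all variable names are illustrative, not taken from any cited method):

    # Fit a model of "normal" behaviour on labelled-normal training data and
    # flag test instances whose likelihood under the model is too low.
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # ''normal'' training set

    mu = X_normal.mean(axis=0)
    cov = np.cov(X_normal, rowvar=False)
    model = multivariate_normal(mean=mu, cov=cov)

    # Threshold chosen from the training log-likelihoods (illustrative choice)
    threshold = np.percentile(model.logpdf(X_normal), 0.1)

    X_test = np.array([[0.2, -0.5],    # looks like the training data
                       [6.0, 6.0]])    # far from anything seen in training
    print(model.logpdf(X_test) < threshold)   # expected: [False  True]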


Definition

Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent ones include:
* An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
* Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.
* An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.
* An anomaly is a point or collection of points that is relatively distant from other points in the multi-dimensional space of features.
* Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour.
* Let T be a set of observations from a univariate Gaussian distribution and O a point from T. Then O is an outlier if and only if the z-score of O is greater than a pre-selected threshold.
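To make the last definition concrete, the following sketch computes z-scores for a small univariate sample and flags values above a pre-selected threshold (the sample and the threshold of 2 are illustrative choices):

    import numpy as np

    def zscore_outliers(x, threshold=2.0):
        """Return a boolean mask marking points whose |z-score| exceeds the threshold."""
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()
        return np.abs(z) > threshold

    # The value 250 lies far from the rest of the sample.
    print(zscore_outliers([10, 12, 11, 9, 13, 10, 250]))
    # [False False False False False False  True]
    # In such a tiny sample the extreme value inflates the standard deviation,
    # so the stricter (and more common) threshold of 3 would fail to flag it.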


Applications

Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement.

Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing and inductive learning. Types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. The counterpart of anomaly detection in intrusion detection is misuse detection.

Anomaly detection is often used in preprocessing to remove anomalous data from the dataset. This is done for a number of reasons. Statistics of the data such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of the data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy. Anomalies are also often the most important observations to be found in the data, such as in intrusion detection or in detecting abnormalities in medical images.
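As a small illustration of the preprocessing use mentioned above, the following sketch shows how summary statistics change once injected anomalies are removed (the synthetic data and the 1.5 x IQR "Tukey fence" removal rule are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(loc=10.0, scale=1.0, size=100),
                           [80.0, 95.0]])                 # two injected anomalies

    print("with anomalies:    mean=%.2f  std=%.2f" % (data.mean(), data.std()))

    # Drop points outside Tukey's fences (1.5 x IQR beyond the quartiles)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    keep = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)
    clean = data[keep]

    print("without anomalies: mean=%.2f  std=%.2f" % (clean.mean(), clean.std()))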


Popular techniques

Many anomaly detection techniques have been proposed in the literature. Some of the popular techniques are:
* Statistical methods (Z-score, Tukey's range test and Grubbs's test)
* Density-based techniques (k-nearest neighbor, local outlier factor, isolation forests, and many more variations of this concept); a short scikit-learn sketch of some of these follows this list
* Subspace-, correlation-based and tensor-based outlier detection for high-dimensional data
* One-class support vector machines
* Replicator neural networks, autoencoders, variational autoencoders, long short-term memory neural networks
* Bayesian networks
* Hidden Markov models (HMMs)
* Minimum Covariance Determinant
* Cluster analysis-based outlier detection
* Deviations from association rules and frequent itemsets
* Fuzzy logic-based outlier detection
* Ensemble techniques, using feature bagging, score normalization and different sources of diversity

The performance of these methods depends on the data set and parameters, and no method shows a systematic advantage over another when compared across many data sets and parameters.
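A brief comparison of three of the listed techniques on toy data, using the scikit-learn library mentioned in the Software section (the data, n_neighbors and nu values are illustrative; this is not the reference implementation of any particular paper):

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.ensemble import IsolationForest
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(42)
    inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
    outliers = rng.uniform(low=-8.0, high=8.0, size=(10, 2))
    X = np.vstack([inliers, outliers])

    detectors = {
        "local outlier factor": LocalOutlierFactor(n_neighbors=20),
        "isolation forest": IsolationForest(random_state=0),
        "one-class SVM": OneClassSVM(nu=0.05),
    }

    for name, detector in detectors.items():
        labels = detector.fit_predict(X)   # +1 for inliers, -1 for flagged outliers
        print(f"{name}: flagged {np.sum(labels == -1)} of {len(X)} points")

Running the sketch usually shows the detectors flagging somewhat different sets of points, in line with the observation above that no method has a systematic advantage across all data sets and parameter settings.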


Software

* ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.
* PyOD is an open-source Python library developed specifically for anomaly detection (a brief usage sketch follows this list).
* scikit-learn is an open-source Python library that has built-in functionality to provide unsupervised anomaly detection.
* Wolfram Mathematica provides functionality for unsupervised anomaly detection across multiple data types (see the Mathematica documentation).
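A minimal PyOD usage sketch, assuming the package is installed (pip install pyod) and its documented detector interface (fit, labels_, decision_scores_, predict); the kNN detector and the contamination value are illustrative choices, not recommendations:

    import numpy as np
    from pyod.models.knn import KNN

    rng = np.random.default_rng(0)
    X_train = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),     # bulk of the data
                         rng.uniform(-6.0, 6.0, size=(15, 2))])   # scattered anomalies

    clf = KNN(contamination=0.05)       # expected fraction of outliers
    clf.fit(X_train)

    print(clf.labels_.sum())            # number of training points flagged (0/1 labels)
    print(clf.decision_scores_[:5])     # raw outlier scores; larger = more anomalous

    # Score previously unseen points: one typical, one far outside the data
    X_new = np.array([[0.5, -0.3], [8.0, 8.0]])
    print(clf.predict(X_new))           # expected: [0 1]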


Datasets


* Anomaly detection benchmark data repository with carefully chosen data sets, maintained at the Ludwig-Maximilians-Universität München; mirror at the University of São Paulo.
* ODDS – a large collection of publicly available outlier detection datasets with ground truth in different domains.
* Unsupervised Anomaly Detection Benchmark at Harvard Dataverse: datasets for unsupervised anomaly detection with ground truth.
* KMASH Data Repository at Research Data Australia, with more than 12,000 anomaly detection datasets with ground truth.


See also

* Change detection
* Statistical process control
* Novelty detection
* Hierarchical temporal memory

