anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority ...

, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig,

Hans-Peter Kriegel Hans-Peter Kriegel (1 October 1948, Germany) is a German computer scientist and professor at the Ludwig Maximilian University of Munich and leading the Database Systems Group in the Department of Computer Science. He was previously professor at ...

, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbours. LOF shares some concepts with

DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: gi ...

and

OPTICS Optics is the branch of physics that studies the behaviour and properties of light, including its interactions with matter and the construction of instruments that use or detect it. Optics usually describes the behaviour of visible, ultraviole ...

such as the concepts of "core distance" and "reachability distance", which are used for local density estimation.

Basic idea

The local outlier factor is based on a concept of a local density, where locality is given by ''k'' nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be

outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...

s. The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters. The "reachability distance" used by LOF has some subtle details that are often found incorrect in secondary sources, e.g., in the textbook of Ethem Alpaydin.

Formal

Let be the distance of the object ''A'' to the ''k''-th nearest neighbor. Note that the set of the ''k'' nearest neighbors includes all objects at this distance, which can in the case of a "tie" be more than ''k'' objects. We denote the set of ''k'' nearest neighbors as . Reachability-distance

This distance is used to define what is called ''reachability distance'': In words, the ''reachability distance'' of an object ''A'' ''from'' ''B'' is the true distance of the two objects, but at least the of ''B''. Objects that belong to the ''k'' nearest neighbors of ''B'' (the "core" of ''B'', see DBSCAN cluster analysis) are considered to be equally distant. The reason for this is to reduce the statistical fluctuations between all points ''A'' close to ''B'', where increasing the value for ''k'' increases the smoothing effect. Note that this is not a

distance Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...

in the mathematical definition, since it is not symmetric. (While it is a common mistake to always use the , this yields a slightly different method, referred to as Simplified-LOF) The ''local reachability density'' of an object ''A'' is defined by which is the inverse of the average reachability distance of the object ''A'' ''from'' its neighbors. Note that it is not the average reachability of the neighbors from ''A'' (which by definition would be the ), but the distance at which ''A'' can be "reached" ''from'' its neighbors. With duplicate points, this value can become infinite. The local reachability densities are then compared with those of the neighbors using which is the ''average local reachability density of the neighbors'' divided by the object's own local reachability density. A value of approximately indicates that the object is comparable to its neighbors (and thus not an outlier). A value below indicates a denser region (which would be an inlier), while values significantly larger than indicate outliers. means Similar density as neighbors, means Higher density than neighbors (Inlier), means Lower density than neighbors (Outlier)

Advantages

Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors. While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context a dissimilarity function can be defined. It has experimentally been shown to work very well in numerous setups, often outperforming the competitors, for example in network intrusion detection and on processed classification benchmark data. The LOF family of methods can be easily generalized and then applied to various other problems, such as detecting outliers in geographic data, video streams or authorship networks.

Disadvantages and Extensions

The resulting values are

quotient In arithmetic, a quotient (from lat, quotiens 'how many times', pronounced ) is a quantity produced by the division of two numbers. The quotient has widespread use throughout mathematics, and is commonly referred to as the integer part of a ...

-values and hard to interpret. A value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier, in another dataset and parameterization (with strong local fluctuations) a value of 2 could still be an inlier. These differences can also occur within a dataset due to the locality of the method. There exist extensions of LOF that try to improve over LOF in these aspects: * ''Feature Bagging for Outlier Detection'' runs LOF on multiple projections and combines the results for improved detection qualities in high dimensions. This is the first

ensemble learning In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statisti ...

approach to outlier detection, for other variants see ref. * ''Local Outlier Probability'' (LoOP) is a method derived from LOF but using inexpensive local statistics to become less sensitive to the choice of the parameter ''k''. In addition, the resulting values are scaled to a value range of . * ''Interpreting and Unifying Outlier Scores'' proposes a normalization of the LOF outlier scores to the interval using statistical scaling to increase

usability Usability can be described as the capacity of a system to provide a condition for its users to perform the tasks safely, effectively, and efficiently while enjoying the experience. In software engineering, usability is the degree to which a soft ...

and can be seen an improved version of the LoOP ideas. * ''On Evaluation of Outlier Rankings and Outlier Scores'' proposes methods for measuring similarity and diversity of methods for building advanced outlier detection ensembles using LOF variants and other algorithms and improving on the Feature Bagging approach discussed above. * ''Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection''{{Cite journal , last1 = Schubert , first1 = E. , last2 = Zimek , first2 = A. , last3 = Kriegel , first3 = H. -P. , author-link3 = Hans-Peter Kriegel, doi = 10.1007/s10618-012-0300-z , title = Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection , journal = Data Mining and Knowledge Discovery , volume = 28 , pages = 190–237 , year = 2012 , s2cid = 19036098 discusses the general pattern in various local outlier detection methods (including, e.g., LOF, a simplified version of LOF and LoOP) and abstracts from this into a general framework. This framework is then applied, e.g., to detecting outliers in geographic data, video streams and authorship networks.

References

Statistical outliers Data mining Machine learning algorithms