ELKI
   HOME

TheInfoList



OR:

ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases)
software framework In computer programming, a software framework is an abstraction in which software, providing generic functionality, can be selectively changed by additional user-written code, thus providing application-specific software. It provides a standard ...
developed for use in research and teaching. It was originally at the database systems research unit of Professor
Hans-Peter Kriegel Hans-Peter Kriegel (1 October 1948, Germany) is a German computer scientist and professor at the Ludwig Maximilian University of Munich and leading the Database Systems Group in the Department of Computer Science. He was previously professor at ...
at the
Ludwig Maximilian University of Munich The Ludwig Maximilian University of Munich (simply University of Munich or LMU; german: Ludwig-Maximilians-Universität München) is a public research university in Munich, Germany. It is Germany's sixth-oldest university in continuous operatio ...
, Germany, and now continued at the
Technical University of Dortmund TU Dortmund University (german: Technische Universität Dortmund) is a technical university in Dortmund, North Rhine-Westphalia, Germany with over 35,000 students, and over 6,000 staff including 300 professors, offering around 80 Bachelor's an ...
, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.


Description

The ELKI framework is written in
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
and built around a modular architecture. Most currently included algorithms belong to clustering,
outlier detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
and database indexes. The object-oriented architecture allows the combination of arbitrary algorithms, data types,
distance function In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general setting ...
s, indexes, and evaluation measures. The Java
just-in-time compiler In computing, just-in-time (JIT) compilation (also dynamic translation or run-time compilations) is a way of executing computer code that involves compilation during execution of a program (at run time) rather than before execution. This may cons ...
optimizes all combinations to a similar extent, making benchmarking results more comparable if they share large parts of the code. When developing new algorithms or index structures, the existing components can be easily reused, and the
type safety In computer science, type safety and type soundness are the extent to which a programming language discourages or prevents type errors. Type safety is sometimes alternatively considered to be a property of facilities of a computer language; that is ...
of Java detects many programming errors at compile time. ELKI has been used in
data science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a br ...
for example to cluster
sperm whale The sperm whale or cachalot (''Physeter macrocephalus'') is the largest of the toothed whales and the largest toothed predator. It is the only living member of the genus ''Physeter'' and one of three extant species in the sperm whale famil ...
codas,
phoneme In phonology and linguistics, a phoneme () is a unit of sound that can distinguish one word from another in a particular language. For example, in most dialects of English, with the notable exception of the West Midlands and the north-west o ...
clustering, for anomaly detection in
spaceflight Spaceflight (or space flight) is an application of astronautics to fly spacecraft into or through outer space, either with or without humans on board. Most spaceflight is uncrewed and conducted mainly with spacecraft such as satellites in o ...
operations, for
bike sharing A bicycle-sharing system, bike share program, public bicycle scheme, or public bike share (PBS) scheme, is a shared transport service where bicycles are available for shared use by individuals at low cost. The programmes themselves include bot ...
redistribution, and traffic prediction.


Objectives

The university project is developed for use in ''teaching and research''. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental
evaluation Evaluation is a systematic determination and assessment of a subject's merit, worth and significance, using criteria governed by a set of standards. It can assist an organization, program, design, project or any other intervention or initiative ...
of algorithms depends on many environmental factors and implementation details can have a large impact on the runtime. ELKI aims at providing a shared codebase with comparable implementations of many algorithms. As research project, it currently does not offer integration with
business intelligence Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical pr ...
applications or an interface to common database management systems via SQL. The
copyleft Copyleft is the legal technique of granting certain freedoms over copies of copyrighted works with the requirement that the same rights be preserved in derivative works. In this sense, ''freedoms'' refers to the use of the work for any purpose ...
( AGPL) license may also be a hindrance to an integration in commercial products; nevertheless it can be used to evaluate algorithms prior to developing an own implementation for a commercial product. Furthermore, the application of the algorithms requires knowledge about their usage, parameters, and study of original literature. The audience are
student A student is a person enrolled in a school or other educational institution. In the United Kingdom and most commonwealth countries, a "student" attends a secondary school or higher (e.g., college or university); those in primary or elementar ...
s,
researcher Research is "creative and systematic work undertaken to increase the stock of knowledge". It involves the collection, organization and analysis of evidence to increase understanding of a topic, characterized by a particular attentiveness t ...
s,
data scientists Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a ...
, and
software engineer Software engineering is a systematic engineering approach to software development. A software engineer is a person who applies the principles of software engineering to design, develop, maintain, test, and evaluate computer software. The term ''p ...
s.


Architecture

ELKI is modeled around a
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
-inspired core, which uses a vertical data layout that stores data in column groups (similar to
column families {{Short description, A database project that organizes data in packed columns A column family is a database object that contains columns of related data. It is a tuple (pair) that consists of a key–value pair, where the key is mapped to a value t ...
in NoSQL databases). This database core provides
nearest neighbor search Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. k-nearest-neighbor algorithm,
local outlier factor In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data poin ...
and
DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: gi ...
) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory efficient collections for object collections and associative structures such as nearest neighbor lists. ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration. ELKI uses a service loader architecture to allow publishing extensions as separate jar files. ELKI uses optimized collections for performance rather than the standard Java API.
For loop In computer science a for-loop or for loop is a control flow statement for specifying iteration. Specifically, a for loop functions by running a section of code repeatedly until a certain condition has been satisfied. For-loops have two par ...
s for example are written similar to C++ iterators: for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced
garbage collection Waste collection is a part of the process of waste management. It is the transfer of solid waste from the point of use and disposal to the point of treatment or landfill. Waste collection also includes the curbside collection of recyclabl ...
improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g.,
nearest neighbor search Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
) using such optimizations.


Visualization

The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into
PostScript PostScript (PS) is a page description language in the electronic publishing and desktop publishing realm. It is a dynamically typed, concatenative programming language. It was created at Adobe Systems by John Warnock, Charles Geschke, Doug Br ...
and
PDF Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
for easy inclusion in scientific publications in
LaTeX Latex is an emulsion (stable dispersion) of polymer microparticles in water. Latexes are found in nature, but synthetic latexes are common as well. In nature, latex is found as a milky fluid found in 10% of all flowering plants (angiosperms ...
. Exported files can be edited with SVG editors such as
Inkscape Inkscape is a free and open-source vector graphics editor used to create vector images, primarily in Scalable Vector Graphics (SVG) format. Other formats can be imported and exported. Inkscape can render primitive vector shapes (e.g. rec ...
. Since
cascading style sheets Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language such as HTML or XML (including XML dialects such as SVG, MathML or XHTML). CSS is a cornerstone techno ...
are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets (for larger data sets, only a subsample of the data is visualized by default).


Awards

Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection, won the conference's "best demonstration paper award".


Included algorithms

Select included algorithms: *
Cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
: ** K-means clustering (including fast algorithms such as Elkan, Hamerly, Annulus, and Exponion k-Means, and robust variants such as k-means--) **
K-medians clustering In statistics, ''k''-medians clusteringP. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via Concave Minimization," in Advances in Neural Information Processing Systems, vol. 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambrid ...
** K-medoids clustering (PAM) (including FastPAM and approximations such as CLARA, CLARANS) ** Expectation-maximization algorithm for Gaussian mixture modeling ** Hierarchical clustering (including the fast SLINK, CLINK, NNChain and Anderberg algorithms) **
Single-linkage clustering In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of el ...
**Leader clustering **
DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: gi ...
(Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions) **
OPTICS Optics is the branch of physics that studies the behaviour and properties of light, including its interactions with matter and the construction of instruments that use or detect it. Optics usually describes the behaviour of visible, ultraviole ...
(Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH **HDBSCAN **
Mean-shift Mean shift is a non-parametric feature-space mathematical analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing. ...
clustering **
BIRCH A birch is a thin-leaved deciduous hardwood tree of the genus ''Betula'' (), in the family Betulaceae, which also includes alders, hazels, and hornbeams. It is closely related to the beech-oak family Fagaceae. The genus ''Betula'' contains 30 ...
clustering ** SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data) **CLIQUE clustering **ORCLUS and PROCLUS clustering **COPAC, ERiC and 4C clustering **CASH clustering **DOC and FastDOC subspace clustering **P3C clustering **
Canopy clustering algorithm The canopy clustering algorithm is an unsupervised pre- clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000. It is often used as preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm ...
*
Anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
: ** k-Nearest-Neighbor outlier detection **
LOF Lof (Spanish: ''levo'' and ''lov'') or caví (Spanish: ''cahuín''); formed the basic social organization of the Mapuche, Mapuche-Huilliche and the extinct Picunche peoples, consisting of a familial clan or lineage that recognizes the authority o ...
(Local outlier factor) **LoOP (Local Outlier Probabilities) **
OPTICS Optics is the branch of physics that studies the behaviour and properties of light, including its interactions with matter and the construction of instruments that use or detect it. Optics usually describes the behaviour of visible, ultraviole ...
-OF **DB-Outlier (Distance-Based Outliers) **LOCI (Local Correlation Integral) **LDOF (Local Distance-Based Outlier Factor) ** EM-Outlier **SOD (Subspace Outlier Degree) **COP (Correlation Outlier Probabilities) * Frequent Itemset Mining and association rule learning **
Apriori algorithm AprioriRakesh Agrawal and Ramakrishnan SrikanFast algorithms for mining association rules Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994. is an algorithm for frequent ...
**Eclat **FP-growth *
Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...
**
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
**
Multidimensional scaling Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of n objects or individuals" into a configurati ...
**
T-distributed stochastic neighbor embedding t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally d ...
(t-SNE) *
Spatial index A spatial database is a general-purpose database (usually a relational database) that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data. Most sp ...
structures and other search indexes: ** R-tree **
R*-tree In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a ...
** M-tree ** k-d tree **
X-tree In computer science tree data structures, an X-tree (for ''eXtended node tree'') is an index tree structure based on the R-tree used for storing data in many dimensions. It appeared in 1996, and differs from R-trees (1984), R+-trees (1987) an ...
**Cover tree **iDistance **NN descent **
Locality sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...
(LSH) *Evaluation: ** Precision and recall,
F1 score In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the ...
, Average Precision **
Receiver operating characteristic A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of m ...
(ROC curve) ** Discounted cumulative gain (including NDCG) ** Silhouette index **
Davies–Bouldin index The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been d ...
**
Dunn index The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms. This is part of a group of validity indices including the Davies–Bouldin index or Silhouette (clustering), Silhouette index, in that it is an ...
**Density-based cluster validation (DBCV) *Visualization ** Scatter plots ** Histograms **
Parallel coordinates Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an ''n''-dimensional space, a backdrop is drawn consisting of ''n'' parallel lines, typically vertical and equally spaced. ...
(also in 3D, using OpenGL) *Other: ** Statistical distributions and many parameter estimators, including robust MAD based and
L-moment In statistics, L-moments are a sequence of statistics used to summarize the shape of a probability distribution. They are linear combinations of order statistics ( L-statistics) analogous to conventional moments, and can be used to calculate qu ...
based estimators **
Dynamic time warping In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walk ...
** Change point detection in time series **
Intrinsic dimension The intrinsic dimension for a data set can be thought of as the number of variables needed in a minimal representation of the data. Similarly, in signal processing of multidimensional signals, the intrinsic dimension of the signal describes how many ...
ality estimators


Version history

Version 0.1 (July 2008) contained several Algorithms from
cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
and
anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
, as well as some index structures such as the
R*-tree In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a ...
. The focus of the first release was on
subspace clustering Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology c ...
and
correlation clustering Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. De ...
algorithms. Version 0.2 (July 2009) added functionality for
time series analysis In mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in m ...
, in particular distance functions for time series. Version 0.3 (March 2010) extended the choice of
anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...
algorithms and visualization modules. Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures. Version 0.5 (April 2012) focuses on the evaluation of
cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...
results, adding new visualizations and some new algorithms. Version 0.6 (June 2013) introduces a new 3D adaption of
parallel coordinates Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an ''n''-dimensional space, a backdrop is drawn consisting of ''n'' parallel lines, typically vertical and equally spaced. ...
for data visualization, apart from the usual additions of algorithms and index structures. Version 0.7 (August 2015) adds support for uncertain data types, and algorithms for the analysis of uncertain data. Version 0.7.5 (February 2019) adds additional clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures. Version 0.8 (October 2020) adds automatic index creation, garbage collection, and incremental priority search, as well as many more algorithms such as
BIRCH A birch is a thin-leaved deciduous hardwood tree of the genus ''Betula'' (), in the family Betulaceae, which also includes alders, hazels, and hornbeams. It is closely related to the beech-oak family Fagaceae. The genus ''Betula'' contains 30 ...
.


Similar applications

*
scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector ...
: machine learning library in python *
Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus '' Gallirallus''. Four subspecies are recogni ...
: A similar project by the University of Waikato, with a focus on classification algorithms *
RapidMiner RapidMiner is a data science platform designed for enterprises that analyses the collective impact of organizations’ employees, expertise and data. Rapid Miner's data science platform is intended to support many analytics users across a broad A ...
: An application available commercially (a restricted version is available as open source) *
KNIME KNIME (), the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks ...
: An open source platform which integrates various components for machine learning and data mining


See also

*
Comparison of statistical packages The following tables compare general and technical information for a number of statistical analysis packages. General information Operating system support ANOVA Support for various ANOVA methods Regression Support for various regression ...


References


External links

* of ELKI with download and documentation. {{DEFAULTSORT:Environment For Developing Kdd-Applications Supported By Index-Structures Data mining and machine learning software Free artificial intelligence applications Free data analysis software Free science software Free software programmed in Java (programming language) Software using the GNU AGPL license