ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases)

software framework In computer programming, a software framework is an abstraction in which software, providing generic functionality, can be selectively changed by additional user-written code, thus providing application-specific software. It provides a standard ...

developed for use in research and teaching. It was originally at the database systems research unit of Professor

Hans-Peter Kriegel Hans-Peter Kriegel (1 October 1948, Germany) is a German computer scientist and professor at the Ludwig Maximilian University of Munich and leading the Database Systems Group in the Department of Computer Science. He was previously professor at ...

at the

Ludwig Maximilian University of Munich The Ludwig Maximilian University of Munich (simply University of Munich or LMU; german: Ludwig-Maximilians-Universität München) is a public research university in Munich, Germany. It is Germany's sixth-oldest university in continuous operatio ...

, Germany, and now continued at the

Technical University of Dortmund TU Dortmund University (german: Technische Universität Dortmund) is a technical university in Dortmund, North Rhine-Westphalia, Germany with over 35,000 students, and over 6,000 staff including 300 professors, offering around 80 Bachelor's an ...

, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Description

The ELKI framework is written in

Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...

and built around a modular architecture. Most currently included algorithms belong to clustering,

outlier detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...

and database indexes. The object-oriented architecture allows the combination of arbitrary algorithms, data types,

distance function In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general settin ...

s, indexes, and evaluation measures. The Java

just-in-time compiler In computing, just-in-time (JIT) compilation (also dynamic translation or run-time compilations) is a way of executing computer code that involves compilation during execution of a program (at run time) rather than before execution. This may cons ...

optimizes all combinations to a similar extent, making benchmarking results more comparable if they share large parts of the code. When developing new algorithms or index structures, the existing components can be easily reused, and the type safety of Java detects many programming errors at compile time. ELKI has been used in

data science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a br ...

for example to cluster

sperm whale The sperm whale or cachalot (''Physeter macrocephalus'') is the largest of the toothed whales and the largest toothed predator. It is the only living member of the genus ''Physeter'' and one of three extant species in the sperm whale famil ...

codas,

phoneme In phonology and linguistics, a phoneme () is a unit of sound that can distinguish one word from another in a particular language. For example, in most dialects of English, with the notable exception of the West Midlands and the north-west o ...

clustering, for anomaly detection in

spaceflight Spaceflight (or space flight) is an application of astronautics to fly spacecraft into or through outer space, either with or without humans on board. Most spaceflight is uncrewed and conducted mainly with spacecraft such as satellites in or ...

operations, for

bike sharing A bicycle-sharing system, bike share program, public bicycle scheme, or public bike share (PBS) scheme, is a shared transport service where bicycles are available for shared use by individuals at low cost. The programmes themselves include bot ...

redistribution, and traffic prediction.

Objectives

The university project is developed for use in ''teaching and research''. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental evaluation of algorithms depends on many environmental factors and implementation details can have a large impact on the runtime. ELKI aims at providing a shared codebase with comparable implementations of many algorithms. As research project, it currently does not offer integration with

business intelligence Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical pr ...

applications or an interface to common

database management system In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases span ...

s via SQL. The

copyleft Copyleft is the legal technique of granting certain freedoms over copies of copyrighted works with the requirement that the same rights be preserved in derivative works. In this sense, ''freedoms'' refers to the use of the work for any purpose, ...

( AGPL) license may also be a hindrance to an integration in commercial products; nevertheless it can be used to evaluate algorithms prior to developing an own implementation for a commercial product. Furthermore, the application of the algorithms requires knowledge about their usage, parameters, and study of original literature. The audience are

student A student is a person enrolled in a school or other educational institution. In the United Kingdom and most commonwealth countries, a "student" attends a secondary school or higher (e.g., college or university); those in primary or elementar ...

s, researchers, data scientists, and software engineers.

Architecture

ELKI is modeled around a

database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...

-inspired core, which uses a vertical data layout that stores data in column groups (similar to

column families {{Short description, A database project that organizes data in packed columns A column family is a database object that contains columns of related data. It is a tuple (pair) that consists of a key–value pair, where the key is mapped to a value t ...

in NoSQL databases). This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g.

k-nearest-neighbor algorithm In statistics, the ''k''-nearest neighbors algorithm (''k''-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regressi ...

local outlier factor In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data poin ...

and DBSCAN) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory efficient collections for object collections and associative structures such as nearest neighbor lists. ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration. ELKI uses a service loader architecture to allow publishing extensions as separate jar files. ELKI uses optimized collections for performance rather than the standard Java API. For loops for example are written similar to C++ iterators: for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced

garbage collection Waste collection is a part of the process of waste management. It is the transfer of solid waste from the point of use and disposal to the point of treatment or landfill. Waste collection also includes the curbside collection of recyclable m ...

improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g., nearest neighbor search) using such optimizations.

Visualization

The visualization module uses SVG for scalable graphics output, and

Apache Batik Batik is a pure-Java library that can be used to render, generate, and manipulate SVG graphics (SVG is an XML markup language for describing two-dimensional vector graphics). IBM supported the project and then donated the code to the Apache Softw ...

for rendering of the user interface as well as lossless export into

PostScript PostScript (PS) is a page description language in the electronic publishing and desktop publishing realm. It is a dynamically typed, concatenative programming language. It was created at Adobe Systems by John Warnock, Charles Geschke, Doug Br ...

and

PDF Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...

for easy inclusion in scientific publications in

LaTeX Latex is an emulsion (stable dispersion) of polymer microparticles in water. Latexes are found in nature, but synthetic latexes are common as well. In nature, latex is found as a milky fluid found in 10% of all flowering plants (angiosperms ...

. Exported files can be edited with SVG editors such as Inkscape. Since

cascading style sheets Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language such as HTML or XML (including XML dialects such as SVG, MathML or XHTML). CSS is a cornerstone techno ...

are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets (for larger data sets, only a subsample of the data is visualized by default).

Awards

Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection, won the conference's "best demonstration paper award".

Included algorithms

Select included algorithms: *

Cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

: **

K-means clustering ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or ...

(including fast algorithms such as Elkan, Hamerly, Annulus, and Exponion k-Means, and robust variants such as k-means--) **

K-medians clustering In statistics, ''k''-medians clusteringP. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via Concave Minimization," in Advances in Neural Information Processing Systems, vol. 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambrid ...

** K-medoids clustering (PAM) (including FastPAM and approximations such as CLARA, CLARANS) ** Expectation-maximization algorithm for Gaussian mixture modeling **

Hierarchical clustering In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into ...

(including the fast SLINK, CLINK, NNChain and Anderberg algorithms) ** Single-linkage clustering **Leader clustering ** DBSCAN (Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions) **

OPTICS Optics is the branch of physics that studies the behaviour and properties of light, including its interactions with matter and the construction of instruments that use or detect it. Optics usually describes the behaviour of visible, ultraviole ...

(Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH **HDBSCAN ** Mean-shift clustering **

BIRCH A birch is a thin-leaved deciduous hardwood tree of the genus ''Betula'' (), in the family Betulaceae, which also includes alders, hazels, and hornbeams. It is closely related to the beech-oak family Fagaceae. The genus ''Betula'' contains 30 ...

clustering **

SUBCLU SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger.Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data'. In: ''Proc. SIAM ...

(Density-Connected Subspace Clustering for High-Dimensional Data) **CLIQUE clustering **ORCLUS and PROCLUS clustering **COPAC, ERiC and 4C clustering **CASH clustering **DOC and FastDOC subspace clustering **P3C clustering ** Canopy clustering algorithm * Anomaly detection: ** k-Nearest-Neighbor outlier detection **

LOF Lof (Spanish: ''levo'' and ''lov'') or caví (Spanish: ''cahuín''); formed the basic social organization of the Mapuche, Mapuche-Huilliche and the extinct Picunche peoples, consisting of a familial clan or lineage that recognizes the authority o ...

(Local outlier factor) **LoOP (Local Outlier Probabilities) **

-OF **DB-Outlier (Distance-Based Outliers) **LOCI (Local Correlation Integral) **LDOF (Local Distance-Based Outlier Factor) ** EM-Outlier **SOD (Subspace Outlier Degree) **COP (Correlation Outlier Probabilities) * Frequent Itemset Mining and association rule learning **

Apriori algorithm AprioriRakesh Agrawal and Ramakrishnan SrikanFast algorithms for mining association rules Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994. is an algorithm for frequent ...

**Eclat **FP-growth * Dimensionality reduction **

Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...

** Multidimensional scaling **

T-distributed stochastic neighbor embedding t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally de ...

(t-SNE) *

Spatial index A spatial database is a general-purpose database (usually a relational database) that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data. Most spa ...

structures and other search indexes: **

R-tree R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found sign ...

R*-tree In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a ...

M-tree In computer science, M-trees are tree data structures that are similar to R-trees and B-trees. It is constructed using a metric and relies on the triangle inequality for efficient range and k-nearest neighbor (k-NN) queries. While M-trees can perf ...

k-d tree In computer science, a ''k''-d tree (short for ''k-dimensional tree'') is a space-partitioning data structure for organizing points in a ''k''-dimensional space. ''k''-d trees are a useful data structure for several applications, such as search ...

X-tree In computer science tree data structures, an X-tree (for ''eXtended node tree'') is an index tree structure based on the R-tree used for storing data in many dimensions. It appeared in 1996, and differs from R-trees (1984), R+-trees (1987) an ...

**Cover tree **iDistance **NN descent **

Locality sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...

(LSH) *Evaluation: **

Precision and recall In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space. Precision (also called ...

, F1 score, Average Precision **

Receiver operating characteristic A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of ...

(ROC curve) **

Discounted cumulative gain Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search- ...

(including NDCG) ** Silhouette index **

Davies–Bouldin index The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been d ...

Dunn index The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms. This is part of a group of validity indices including the Davies–Bouldin index or Silhouette index, in that it is an internal evaluation sch ...

**Density-based cluster validation (DBCV) *Visualization **

Scatter plot A scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. ...

s **

Histogram A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to " bin" (or "bucket") the range of values—that is, divide the ent ...

s ** Parallel coordinates (also in 3D, using

OpenGL OpenGL (Open Graphics Library) is a cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. The API is typically used to interact with a graphics processing unit (GPU), to achieve hardwa ...

) *Other: **

Statistical distributions In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon i ...

and many parameter estimators, including robust MAD based and L-moment based estimators ** Dynamic time warping ** Change point detection in time series ** Intrinsic dimensionality estimators

Version history

Version 0.1 (July 2008) contained several Algorithms from

cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

and anomaly detection, as well as some index structures such as the

. The focus of the first release was on subspace clustering and correlation clustering algorithms. Version 0.2 (July 2009) added functionality for time series analysis, in particular distance functions for time series. Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules. Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures. Version 0.5 (April 2012) focuses on the evaluation of

results, adding new visualizations and some new algorithms. Version 0.6 (June 2013) introduces a new 3D adaption of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures. Version 0.7 (August 2015) adds support for uncertain data types, and algorithms for the analysis of uncertain data. Version 0.7.5 (February 2019) adds additional clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures. Version 0.8 (October 2020) adds automatic index creation, garbage collection, and incremental priority search, as well as many more algorithms such as

Similar applications

* scikit-learn: machine learning library in python *

Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus ''Gallirallus''. Four subspecies are recognize ...

: A similar project by the University of Waikato, with a focus on

classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...

algorithms *

RapidMiner RapidMiner is a data science platform designed for enterprises that analyses the collective impact of organizations’ employees, expertise and data. Rapid Miner's data science platform is intended to support many analytics users across a broad AI ...

: An application available commercially (a restricted version is available as open source) *

KNIME KNIME (), the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks ...

: An open source platform which integrates various components for machine learning and data mining

References

External links

* of ELKI with download and documentation. {{DEFAULTSORT:Environment For Developing Kdd-Applications Supported By Index-Structures Data mining and machine learning software Free artificial intelligence applications Free data analysis software Free science software Free software programmed in Java (programming language) Software using the GNU AGPL license