ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and now continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Description

The ELKI framework is written in

Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...

and built around a modular architecture. Most currently included algorithms belong to clustering,

outlier detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...

and database indexes. The object-oriented architecture allows the combination of arbitrary algorithms, data types, distance functions, indexes, and evaluation measures. The Java just-in-time compiler optimizes all combinations to a similar extent, making benchmarking results more comparable if they share large parts of the code. When developing new algorithms or index structures, the existing components can be easily reused, and the

type safety In computer science, type safety and type soundness are the extent to which a programming language discourages or prevents type errors. Type safety is sometimes alternatively considered to be a property of facilities of a computer language; that i ...

of Java detects many programming errors at compile time. ELKI has been used in data science for example to cluster sperm whale codas,

phoneme In phonology and linguistics, a phoneme () is a unit of sound that can distinguish one word from another in a particular language. For example, in most dialects of English, with the notable exception of the West Midlands and the north-wes ...

clustering, for anomaly detection in

spaceflight Spaceflight (or space flight) is an application of astronautics to fly spacecraft into or through outer space, either with or without humans on board. Most spaceflight is uncrewed and conducted mainly with spacecraft such as satellites in o ...

operations, for bike sharing redistribution, and traffic prediction.

Objectives

The university project is developed for use in ''teaching and research''. The source code is written with extensibility and reusability in mind, but is also optimized for performance. The experimental

evaluation Evaluation is a systematic determination and assessment of a subject's merit, worth and significance, using criteria governed by a set of standards. It can assist an organization, program, design, project or any other intervention or initiative to ...

of algorithms depends on many environmental factors and implementation details can have a large impact on the runtime. ELKI aims at providing a shared codebase with comparable implementations of many algorithms. As research project, it currently does not offer integration with

business intelligence Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical p ...

applications or an interface to common database management systems via SQL. The

copyleft Copyleft is the legal technique of granting certain freedoms over copies of copyrighted works with the requirement that the same rights be preserved in derivative works. In this sense, ''freedoms'' refers to the use of the work for any purpose ...

( AGPL) license may also be a hindrance to an integration in commercial products; nevertheless it can be used to evaluate algorithms prior to developing an own implementation for a commercial product. Furthermore, the application of the algorithms requires knowledge about their usage, parameters, and study of original literature. The audience are students,

researcher Research is "creative and systematic work undertaken to increase the stock of knowledge". It involves the collection, organization and analysis of evidence to increase understanding of a topic, characterized by a particular attentiveness t ...

data scientists Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a ...

, and

software engineer Software engineering is a systematic engineering approach to software development. A software engineer is a person who applies the principles of software engineering to design, develop, maintain, test, and evaluate computer software. The term ''p ...

Architecture

ELKI is modeled around a

database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases s ...

-inspired core, which uses a vertical data layout that stores data in column groups (similar to column families in NoSQL databases). This database core provides

nearest neighbor search Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...

, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. k-nearest-neighbor algorithm, local outlier factor and

DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: gi ...

) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory efficient collections for object collections and associative structures such as nearest neighbor lists. ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration. ELKI uses a service loader architecture to allow publishing extensions as separate jar files. ELKI uses optimized collections for performance rather than the standard Java API.

For loop In computer science a for-loop or for loop is a control flow statement for specifying iteration. Specifically, a for loop functions by running a section of code repeatedly until a certain condition has been satisfied. For-loops have two par ...

s for example are written similar to C++ iterators: for (DBIDIter iter = ids.iter(); iter.valid(); iter.advance()) In contrast to typical Java iterators (which can only iterate over objects), this conserves memory, because the iterator can internally use primitive values for data storage. The reduced

garbage collection Waste collection is a part of the process of waste management. It is the transfer of solid waste from the point of use and disposal to the point of treatment or landfill. Waste collection also includes the curbside collection of recyclabl ...

improves the runtime. Optimized collections libraries such as GNU Trove3, Koloboke, and fastutil employ similar optimizations. ELKI includes data structures such as object collections and heaps (for, e.g.,

) using such optimizations.

Visualization

The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in

LaTeX Latex is an emulsion (stable dispersion) of polymer microparticles in water. Latexes are found in nature, but synthetic latexes are common as well. In nature, latex is found as a milky fluid found in 10% of all flowering plants (angiosperms ...

. Exported files can be edited with SVG editors such as

Inkscape Inkscape is a free and open-source vector graphics editor used to create vector images, primarily in Scalable Vector Graphics (SVG) format. Other formats can be imported and exported. Inkscape can render primitive vector shapes (e.g. rec ...

. Since cascading style sheets are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets (for larger data sets, only a subsample of the data is visualized by default).

Awards

Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection, won the conference's "best demonstration paper award".

Included algorithms

Select included algorithms: * Cluster analysis: ** K-means clustering (including fast algorithms such as Elkan, Hamerly, Annulus, and Exponion k-Means, and robust variants such as k-means--) ** K-medians clustering ** K-medoids clustering (PAM) (including FastPAM and approximations such as CLARA, CLARANS) ** Expectation-maximization algorithm for Gaussian mixture modeling ** Hierarchical clustering (including the fast SLINK, CLINK, NNChain and Anderberg algorithms) **

Single-linkage clustering In statistics, single-linkage clustering is one of several methods of hierarchical clustering. It is based on grouping clusters in bottom-up fashion (agglomerative clustering), at each step combining two clusters that contain the closest pair of el ...

**Leader clustering **

(Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions) **

OPTICS Optics is the branch of physics that studies the behaviour and properties of light, including its interactions with matter and the construction of instruments that use or detect it. Optics usually describes the behaviour of visible, ultrav ...

(Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH **HDBSCAN **

Mean-shift Mean shift is a non-parametric feature-space mathematical analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing. ...

clustering **

BIRCH A birch is a thin-leaved deciduous hardwood tree of the genus ''Betula'' (), in the family Betulaceae, which also includes alders, hazels, and hornbeams. It is closely related to the beech- oak family Fagaceae. The genus ''Betula'' contains ...

clustering ** SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data) **CLIQUE clustering **ORCLUS and PROCLUS clustering **COPAC, ERiC and 4C clustering **CASH clustering **DOC and FastDOC subspace clustering **P3C clustering **

Canopy clustering algorithm The canopy clustering algorithm is an unsupervised pre- clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000. It is often used as preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm ...

Anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...

: ** k-Nearest-Neighbor outlier detection **

LOF Lof (Spanish: ''levo'' and ''lov'') or caví (Spanish: ''cahuín''); formed the basic social organization of the Mapuche, Mapuche-Huilliche and the extinct Picunche peoples, consisting of a familial clan or lineage that recognizes the authority o ...

(Local outlier factor) **LoOP (Local Outlier Probabilities) **

-OF **DB-Outlier (Distance-Based Outliers) **LOCI (Local Correlation Integral) **LDOF (Local Distance-Based Outlier Factor) ** EM-Outlier **SOD (Subspace Outlier Degree) **COP (Correlation Outlier Probabilities) * Frequent Itemset Mining and association rule learning ** Apriori algorithm **Eclat **FP-growth *

Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

** Principal component analysis **

Multidimensional scaling Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of n objects or individuals" into a configurati ...

** T-distributed stochastic neighbor embedding (t-SNE) * Spatial index structures and other search indexes: ** R-tree ** R*-tree ** M-tree ** k-d tree ** X-tree **Cover tree **iDistance **NN descent ** Locality sensitive hashing (LSH) *Evaluation: ** Precision and recall,

F1 score In statistical analysis of binary classification, the F-score or F-measure is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the ...

, Average Precision **

Receiver operating characteristic A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of m ...

(ROC curve) ** Discounted cumulative gain (including NDCG) ** Silhouette index ** Davies–Bouldin index **

Dunn index The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms. This is part of a group of validity indices including the Davies–Bouldin index or Silhouette (clustering), Silhouette index, in that it is an ...

**Density-based cluster validation (DBCV) *Visualization ** Scatter plots ** Histograms **

Parallel coordinates Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an ''n''-dimensional space, a backdrop is drawn consisting of ''n'' parallel lines, typically vertical and equally spaced. ...

(also in 3D, using OpenGL) *Other: ** Statistical distributions and many parameter estimators, including robust MAD based and

L-moment In statistics, L-moments are a sequence of statistics used to summarize the shape of a probability distribution. They are linear combinations of order statistics ( L-statistics) analogous to conventional moments, and can be used to calculate qu ...

based estimators **

Dynamic time warping In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walk ...

** Change point detection in time series **

Intrinsic dimension The intrinsic dimension for a data set can be thought of as the number of variables needed in a minimal representation of the data. Similarly, in signal processing of multidimensional signals, the intrinsic dimension of the signal describes how many ...

ality estimators

Version history

Version 0.1 (July 2008) contained several Algorithms from cluster analysis and

anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...

, as well as some index structures such as the R*-tree. The focus of the first release was on

subspace clustering Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology c ...

and

correlation clustering Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. De ...

algorithms. Version 0.2 (July 2009) added functionality for

time series analysis In mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in m ...

, in particular distance functions for time series. Version 0.3 (March 2010) extended the choice of

algorithms and visualization modules. Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures. Version 0.5 (April 2012) focuses on the evaluation of cluster analysis results, adding new visualizations and some new algorithms. Version 0.6 (June 2013) introduces a new 3D adaption of

parallel coordinates Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. To show a set of points in an ''n''-dimensional space, a backdrop is drawn consisting of ''n'' parallel lines, typically vertical and equally spaced. ...

for data visualization, apart from the usual additions of algorithms and index structures. Version 0.7 (August 2015) adds support for uncertain data types, and algorithms for the analysis of uncertain data. Version 0.7.5 (February 2019) adds additional clustering algorithms, anomaly detection algorithms, evaluation measures, and indexing structures. Version 0.8 (October 2020) adds automatic index creation, garbage collection, and incremental priority search, as well as many more algorithms such as

Similar applications

scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector ...

: machine learning library in python * Weka: A similar project by the University of Waikato, with a focus on classification algorithms *

RapidMiner RapidMiner is a data science platform designed for enterprises that analyses the collective impact of organizations’ employees, expertise and data. Rapid Miner's data science platform is intended to support many analytics users across a broad A ...

: An application available commercially (a restricted version is available as open source) *

KNIME KNIME (), the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks ...

: An open source platform which integrates various components for machine learning and data mining

References

External links

* of ELKI with download and documentation. {{DEFAULTSORT:Environment For Developing Kdd-Applications Supported By Index-Structures Data mining and machine learning software Free artificial intelligence applications Free data analysis software Free science software Free software programmed in Java (programming language) Software using the GNU AGPL license