In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables into discretized or nominal attributes, features, variables or intervals. This can be useful when creating probability mass functions (formally, in density estimation). It is a form of discretization in general and also of binning, as in making a histogram. Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce that error to a level considered negligible for the modeling purposes at hand.
Typically, data is discretized into K partitions of equal width (equal intervals) or into partitions that each hold K% of the total data (equal frequencies).
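The two partitioning schemes can be illustrated with a minimal Python sketch. The function names `equal_width_bins` and `equal_frequency_bins` are hypothetical, and the sketch assumes the number of bins `k` is given directly, so each equal-frequency bin holds roughly 100/k percent of the data.

```python
def equal_width_bins(values, k):
    """Label each value with the index of one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    # min() clamps the maximum value into the last bin (index k - 1).
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Label each value with one of k bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may land in different bins;
    a production implementation would need a tie-handling policy.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    return labels


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
width_labels = equal_width_bins(data, 4)
freq_labels = equal_frequency_bins(data, 4)
print(width_labels)
print(freq_labels)
```

With a single large outlier in `data`, equal-width binning places almost every value in the first interval, while equal-frequency binning spreads the values evenly across the bins. This is one reason the choice between the two schemes depends on the shape of the data.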
Mechanisms for discretizing continuous data include Fayyad & Irani's MDL method, which uses mutual information to recursively define the best bins, as well as CAIM, CACC, Ameva, and many others.
Many machine learning algorithms are known to produce better models by discretizing continuous attributes.
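The recursive, information-based splitting used by the MDL method can be sketched as follows. This is a simplified illustration rather than a reference implementation: the function names are hypothetical, the information gain of a cut is used as the mutual information between the cut and the class label, and the stopping rule follows the commonly cited form of the Fayyad & Irani criterion (accept a cut only if the gain exceeds (log2(N - 1) + Δ) / N).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Return (gain, index) of the cut with the highest information gain.

    pairs is a list of (value, label) tuples sorted by value.
    Returns None when no cut between distinct values exists.
    """
    n = len(pairs)
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i)
    return best


def mdl_accepts(pairs, gain, i):
    """MDL stopping rule: is the gain worth the cost of encoding the cut?"""
    n = len(pairs)
    labels = [y for _, y in pairs]
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (log2(n - 1) + delta) / n


def mdl_cut_points(values, labels):
    """Recursively collect the cut points accepted by the MDL rule."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    cuts = []

    def recurse(segment):
        if len(segment) < 2:
            return
        found = best_split(segment)
        if found is None:
            return
        gain, i = found
        if not mdl_accepts(segment, gain, i):
            return
        # Place the cut midway between the two neighbouring values.
        cuts.append((segment[i - 1][0] + segment[i][0]) / 2)
        recurse(segment[:i])
        recurse(segment[i:])

    recurse(pairs)
    return sorted(cuts)


values = [1, 2, 3, 4, 10, 11, 12, 13]
classes = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(mdl_cut_points(values, classes))
```

Unlike equal-width or equal-frequency binning, this approach is supervised: the number and position of the bins are driven by the class labels, and a segment whose labels carry no usable signal is left unsplit.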
Software
This is a partial list of software that implements the MDL algorithm.
* discretize4crf, a tool designed to work with popular CRF implementations (C++)
* mdlp in the R package discretization
* Discretize in the R package RWeka
See also
* Density estimation
* Continuity correction