In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables into discretized or nominal attributes, features, variables or intervals. This can be useful when creating probability mass functions, formally in density estimation. It is a form of discretization in general and also of binning, as in making a histogram. Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce that error to a level considered negligible for the modeling purposes at hand.

Typically, data is discretized into K partitions of equal length or width (equal intervals) or into partitions that each hold K% of the total data (equal frequencies). Mechanisms for discretizing continuous data include Fayyad & Irani's MDL method, which uses mutual information to recursively define the best bins, as well as CAIM, CACC, Ameva, and many others. Many machine learning algorithms are known to produce better models when continuous attributes are discretized.
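The two common schemes mentioned above, equal intervals and equal frequencies, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names (`equal_width_edges`, `equal_frequency_edges`, `assign_bins`) and the sample data are hypothetical and not taken from any particular package.

```python
from bisect import bisect_right

def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so that each bin holds about len(values)/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges); a value equal to an edge goes right."""
    return [bisect_right(edges, v) for v in values]

# Illustrative data with one outlier (assumed, not from the article).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]

width_bins = assign_bins(data, equal_width_edges(data, 3))
freq_bins = assign_bins(data, equal_frequency_edges(data, 3))

print(width_bins)  # the outlier stretches the range, so most points share one bin
print(freq_bins)   # each bin receives three points
```

With skewed data such as this, equal-width bins can end up nearly empty, while equal-frequency bins stay balanced at the cost of uneven widths. Repeated values can still make equal-frequency bins unequal in size, since identical values always fall into the same bin.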


Software

This is a partial list of software that implements the MDL algorithm.
* discretize4crf, a tool designed to work with popular CRF implementations (C++)
* mdlp in the R package discretization
* Discretize in the R package RWeka
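For readers who want to see the idea behind these tools, the following is a minimal Python sketch of recursive entropy-based splitting with an MDL stopping rule. It assumes the commonly cited form of the Fayyad & Irani criterion: a cut is accepted only if its information gain exceeds (log2(N - 1) + Δ) / N, where Δ = log2(3^k - 2) - [k·Ent(S) - k1·Ent(S1) - k2·Ent(S2)]. The information gain of a cut is the mutual information between the side of the cut and the class label. All function names here are hypothetical, and the sketch is not the code of any package listed above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cut, gain, left_labels, right_labels) for the best binary cut, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point exists between two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
            best = (cut, gain, left, right)
    return best

def mdl_accepts(labels, gain, left, right):
    """Assumed MDL stopping rule: accept the cut only if the gain pays for its coding cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (log2(n - 1) + delta) / n

def mdlp_cuts(values, labels):
    """Recursively collect accepted cut points in ascending order."""
    found = best_split(values, labels)
    if found is None:
        return []
    cut, gain, left, right = found
    if not mdl_accepts(labels, gain, left, right):
        return []
    lo = [(v, y) for v, y in zip(values, labels) if v < cut]
    hi = [(v, y) for v, y in zip(values, labels) if v >= cut]
    return mdlp_cuts(*zip(*lo)) + [cut] + mdlp_cuts(*zip(*hi))

# Illustrative data (assumed): the class changes once, between 5 and 6.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = ["a"] * 5 + ["b"] * 5
cuts = mdlp_cuts(values, labels)
print(cuts)
```

Unlike equal-width or equal-frequency binning, this approach is supervised: the number of bins is not fixed in advance but follows from where the class labels change and whether each cut passes the stopping rule.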


See also

* Density estimation
* Continuity correction

