In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables to discretized or nominal attributes/features/variables/intervals. This can be useful when creating probability mass functions (formally, in density estimation). It is a form of discretization in general and also of binning, as in making a histogram. Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce the amount to a level considered negligible for the modeling purposes at hand.
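To make the idea of discretization error concrete, the following minimal sketch replaces each value with the midpoint of its equal-width bin and reports the largest resulting absolute error. The function name `midpoint_reconstruction_error` and the sample data are assumptions made for illustration; they are not taken from any package mentioned in this article.

```python
def midpoint_reconstruction_error(values, k):
    """Largest absolute error when each value is replaced by the midpoint
    of its equal-width bin. Assumes max(values) > min(values) and k >= 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    worst = 0.0
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        midpoint = lo + (idx + 0.5) * width
        worst = max(worst, abs(v - midpoint))
    return worst


# Illustrative data: 0.00, 0.01, ..., 1.00
samples = [i / 100 for i in range(101)]
errors = {k: midpoint_reconstruction_error(samples, k) for k in (2, 5, 10, 50)}
```

Under this midpoint scheme the error is bounded by about half the bin width, so it shrinks as the number of bins grows. How small is "negligible" still depends on the modeling task.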
Typically data is discretized into ''K'' partitions of equal width (equal intervals) or into partitions that each hold the same fraction of the total data (equal frequencies).
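The two partitioning schemes can be sketched in a few lines of standard-library Python. The helper names (`equal_width_edges`, `equal_frequency_edges`, `assign_bins`) and the sample data are hypothetical, and the tie-breaking rule (a value equal to a cut point goes to the upper bin) is one possible convention among several.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so that each bin holds roughly len(values)/k items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to a bin index in 0..len(edges); ties go to the upper bin."""
    return [bisect_right(edges, v) for v in values]


# Illustrative data with one outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = assign_bins(data, equal_width_edges(data, 3))
freq_bins = assign_bins(data, equal_frequency_edges(data, 3))
```

With this sample, the outlier stretches the equal-width bins so that eight of the nine values share the first bin, while the equal-frequency cut points place three values in each bin.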
Mechanisms for discretizing continuous data include Fayyad & Irani's MDL method, which uses mutual information to recursively define the best bins, as well as CAIM, CACC, Ameva, and many others.
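The recursive, information-based splitting can be sketched as follows. This is a simplified illustration, not the code of any package listed in this article. It picks the cut point with the highest information gain (the mutual information between the split side and the class label) and accepts it only if it passes the MDL stopping rule as it is commonly stated for Fayyad and Irani's method. All function names are hypothetical, and for simplicity every midpoint between distinct adjacent values is tried as a candidate cut.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """pairs: (value, label) tuples sorted by value.
    Return (gain, split_index, cut_value) for the best binary split, or None."""
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical values
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


def mdl_accepts(pairs, gain, i):
    """MDL stopping rule: accept the split only if the gain exceeds the coding cost."""
    n = len(pairs)
    labels = [lab for _, lab in pairs]
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (log2(n - 1) + delta) / n


def mdlp_cuts(values, labels):
    """Return the sorted list of accepted cut points for one continuous attribute."""
    def recurse(pairs):
        found = best_cut(pairs)
        if found is None:
            return []
        gain, i, cut = found
        if not mdl_accepts(pairs, gain, i):
            return []
        return recurse(pairs[:i]) + [cut] + recurse(pairs[i:])

    return recurse(sorted(zip(values, labels), key=lambda p: p[0]))


# Illustrative usage: a clean class boundary yields one cut, noise yields none.
clean_cuts = mdlp_cuts(list(range(1, 11)), list("aaaaabbbbb"))
noisy_cuts = mdlp_cuts(list(range(1, 9)), list("abababab"))
```

In the first call the classes separate cleanly between 5 and 6, so a single cut at 5.5 is accepted. In the second, no split gains enough information to pass the stopping rule, and the attribute is left as one interval.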
Many machine learning algorithms are known to produce better models by discretizing continuous attributes.
Software

This is a partial list of software that implements the MDL algorithm.

* discretize4crf: tool designed to work with popular CRF implementations (C++)
* mdlp in the R package discretization
* Discretize in the R package RWeka

See also

* Density estimation
* Continuity correction
References