Data binning, also called discrete binning or bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall into a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization. Statistical data binning is a way to group numbers of more or less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping every five years together). It can also be used in multivariate statistics, binning in several dimensions at once.
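
As a minimal sketch of the age example, the following Python snippet replaces each value with the central value of its bin. The function name `bin_to_midpoint`, the five-year width, the half-open bin convention, and the sample ages are illustrative assumptions and do not come from any particular library.

```python
import math

def bin_to_midpoint(value, width=5.0, origin=0.0):
    """Replace a value with the central value of the bin it falls in.

    Bins are assumed to be half-open intervals
    [origin + k*width, origin + (k+1)*width).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    index = math.floor((value - origin) / width)  # which bin the value is in
    return origin + (index + 0.5) * width         # midpoint of that bin

ages = [23, 24, 27, 31, 34, 35]  # hypothetical sample data
binned = [bin_to_midpoint(a) for a in ages]
print(binned)  # [22.5, 22.5, 27.5, 32.5, 32.5, 37.5]
```

For binning in several dimensions at once, the same function could be applied to each coordinate separately, so that every point is mapped to the centre of a rectangular cell.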


Image data processing

In the context of image processing, binning is the procedure of combining a cluster of pixels into a single pixel. In 2x2 binning, for example, an array of 4 pixels becomes a single larger pixel, reducing the overall number of pixels. This aggregation, although associated with loss of information, reduces the amount of data to be processed, facilitating the analysis. Binning may also reduce the impact of read noise on the processed image, at the cost of a lower resolution.
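
The 2x2 case can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the image is a list of rows of numbers, each block is combined by summing (averaging would be an equally valid choice), and incomplete blocks at the edges are dropped. The name `bin_pixels` is hypothetical.

```python
def bin_pixels(image, factor=2):
    """Sum each factor x factor block of pixels into one output pixel.

    `image` is a list of equal-length rows of numbers. Rows and columns
    that do not fill a complete block are dropped (an assumption).
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    rows = len(image) // factor
    cols = (len(image[0]) // factor) if image else 0
    return [
        [
            sum(
                image[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(bin_pixels(image))  # [[14, 22], [46, 54]]
```

The 4x4 input holds 16 values and the binned output holds 4, which is the reduction in data volume described above.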


Example usage

Histograms are an example of data binning used in order to observe underlying distributions. They typically occur in one-dimensional space and in equal intervals for ease of visualization.

Data binning may be used when small instrumental shifts in the spectral dimension of mass spectrometry (MS) or nuclear magnetic resonance (NMR) experiments would otherwise be falsely interpreted as representing different components when a collection of data profiles is subjected to pattern recognition analysis. A straightforward way to cope with this problem is to use binning techniques in which the spectrum is reduced in resolution to a sufficient degree to ensure that a given peak remains in its bin despite small spectral shifts between analyses. For example, in NMR the chemical shift axis may be discretized and coarsely binned, and in MS the spectral accuracies may be rounded to integer atomic mass unit values. Also, several digital camera systems incorporate an automatic pixel binning function to improve image contrast.

Binning is also used in machine learning to speed up the decision-tree boosting method for supervised classification and regression in algorithms such as Microsoft's LightGBM and scikit-learn's Histogram-based Gradient Boosting Classification Tree.
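
The spectral binning described above can be sketched as follows. This is an illustration only: the function name `bin_spectrum`, the 0.05 bin width, and the two peak lists are hypothetical, and intensities within a bin are simply summed.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, width):
    """Sum peak intensities into fixed-width bins along the spectral axis.

    `peaks` is an iterable of (position, intensity) pairs. The result maps
    each integer bin index, floor(position / width), to the total
    intensity that fell inside that bin.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    totals = defaultdict(float)
    for position, intensity in peaks:
        totals[math.floor(position / width)] += intensity
    return dict(sorted(totals.items()))

# Two hypothetical runs of the same sample; run_b is shifted by 0.02.
run_a = [(1.31, 10.0), (3.72, 4.0)]
run_b = [(1.33, 10.0), (3.74, 4.0)]

binned_a = bin_spectrum(run_a, width=0.05)
binned_b = bin_spectrum(run_b, width=0.05)
print(binned_a)              # {26: 10.0, 74: 4.0}
print(binned_a == binned_b)  # True
```

Before binning the two runs look different; after binning they are identical. A peak lying close to a bin edge could still move into the neighbouring bin, which is why the bin width has to be chosen large enough relative to the expected shift.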


See also

* Histogram
* Grouped data
* Level of measurement
* Quantization (signal processing)
* Discretization of continuous features

