Data Reduction

Data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered, and simplified form. The purpose of data reduction can be two-fold: to reduce the number of data records by eliminating invalid data, or to produce summary data and statistics at different aggregation levels for various applications.

When information is derived from instrument readings, there may also be a transformation from analog to digital form. When the data are already in digital form, the 'reduction' of the data typically involves some editing, scaling, encoding, sorting, collating, and producing tabular summaries. When the observations are discrete but the underlying phenomenon is continuous, smoothing and interpolation are often needed. Data reduction is often undertaken in the presence of reading or measurement errors. Some idea of the nature of these errors is needed before the most likely value can be determined.

An example in astronomy is the data reduction in the Kepler satellite. This satellite records 95-megapixel images once every six seconds, generating dozens of megabytes of data per second, which is orders of magnitude more than the downlink bandwidth of 550 kB/s. The on-board data reduction encompasses co-adding the raw frames for thirty minutes, reducing the bandwidth by a factor of 300. Furthermore, interesting targets are pre-selected and only the relevant pixels are processed, which is 6% of the total. This reduced data is then sent to Earth, where it is processed further.

Research has also been carried out on the use of data reduction in wearable (wireless) devices for health monitoring and diagnosis applications. For example, in the context of epilepsy diagnosis, data reduction has been used to increase the battery lifetime of a wearable EEG device by selecting and transmitting only the EEG data that is relevant for diagnosis and discarding background activity.
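
The factor of 300 quoted for Kepler follows from simple arithmetic: one frame every six seconds, co-added over thirty minutes, folds 300 raw frames into a single frame. The following is a minimal sketch of that idea in Python, assuming frames are plain lists of pixel counts. The function name `co_add` and the four-pixel toy frames are illustrative assumptions, not Kepler's actual on-board software.

```python
def co_add(frames):
    """Sum equally sized frames pixel by pixel into one frame."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame")
    width = len(frames[0])
    total = [0] * width
    for frame in frames:
        if len(frame) != width:
            raise ValueError("all frames must have the same size")
        for i, value in enumerate(frame):
            total[i] += value
    return total


FRAME_INTERVAL_S = 6        # one raw frame every six seconds (from the text)
CO_ADD_WINDOW_S = 30 * 60   # co-add for thirty minutes (from the text)
frames_per_window = CO_ADD_WINDOW_S // FRAME_INTERVAL_S

# Toy data (assumption): 4-pixel frames instead of 95-megapixel images.
raw = [[1, 2, 3, 4] for _ in range(frames_per_window)]
stacked = co_add(raw)

frames_in, frames_out = len(raw), 1
reduction_factor = frames_in // frames_out

print(frames_per_window, reduction_factor, stacked)
```

Co-adding trades time resolution for volume: the summed frame keeps the total signal per pixel over the window but discards variation within it.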


Types of Data Reduction


Dimensionality Reduction

When dimensionality increases, data becomes increasingly sparse, while density and distance between points, which are critical to clustering and outlier analysis, become less meaningful. Dimensionality reduction helps reduce noise in the data and allows for easier visualization, for example when 3-dimensional data is transformed into 2 dimensions to show hidden parts. One method of dimensionality reduction is the wavelet transform, in which data is transformed to preserve the relative distance between objects at different levels of resolution; it is often used for image compression.
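
To make the wavelet idea concrete, here is a minimal sketch of one level of the Haar transform, the simplest wavelet. The text does not say which wavelet is meant, so the choice of Haar, the toy signal, and the threshold value are all assumptions made for illustration. The transform splits a signal into coarse "approximation" coefficients and fine "detail" coefficients; reduction comes from discarding small details.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list.

    Returns (approximation, detail), each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_step; exact when the detail coefficients are unmodified."""
    s = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


# Toy signal (assumption): neighbouring samples are nearly equal.
signal = [4.0, 4.2, 8.0, 7.8, 1.0, 1.1, 5.0, 5.0]
approx, detail = haar_step(signal)

# Reduction step: zero out detail coefficients below a hypothetical threshold.
threshold = 0.2
kept_detail = [d if abs(d) > threshold else 0.0 for d in detail]
reconstructed = haar_inverse(approx, kept_detail)

print(approx)
print(reconstructed)
```

With all details zeroed, only the four approximation coefficients need to be stored, and each reconstructed pair is the average of the original pair. Applying `haar_step` again to `approx` would give the next, coarser level of resolution.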


Numerosity Reduction

This method of data reduction reduces the data volume by choosing alternate, smaller forms of data representation. Numerosity reduction can be split into two groups: parametric and non-parametric methods. Parametric methods (regression, for example) assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data. One example is reducing the volume of data to be processed based on more specific criteria. Another example is a log-linear model, which obtains a value at a point in m-dimensional space as the product on appropriate marginal subspaces. Non-parametric methods do not assume a model; examples include histograms, clustering, and sampling.
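
The contrast between the two groups can be sketched in a few lines of Python. The parametric half fits a straight line by ordinary least squares and keeps only the slope and intercept; the non-parametric half summarises the same values as an equal-width histogram. The function names `fit_line` and `equal_width_histogram`, the noise-free toy data, and the choice of four bins are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def equal_width_histogram(values, n_bins):
    """Return (lowest value, bin width, counts) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything lands in bin 0
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum
        counts[idx] += 1
    return lo, width, counts


# Toy data (assumption): 100 noise-free points on the line y = 2x + 1.
xs = list(range(100))
ys = [2.0 * x + 1.0 for x in xs]

# Parametric: 100 points are replaced by two stored parameters.
slope, intercept = fit_line(xs, ys)

# Non-parametric: 100 values are replaced by four bin counts.
lo, width, counts = equal_width_histogram(ys, 4)

print(slope, intercept)
print(lo, width, counts)
```

The parametric summary is only as good as the model assumption: if the data did not follow a line, the two parameters would reconstruct it poorly, whereas the histogram makes no such assumption but loses the ordering of the points.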


Statistical modelling

Data reduction can be obtained by assuming a statistical model for the data. Classical principles of data reduction include sufficiency, likelihood, conditionality and equivariance.
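
Sufficiency is the easiest of these principles to illustrate. If the data are assumed to be independent draws from a normal distribution, the sample size together with the sum and the sum of squares of the observations is a sufficient statistic for the mean and variance, so the raw sample can be discarded once those three numbers are stored. The sketch below assumes that normal model; the function names and the sample values are illustrative.

```python
import statistics


def sufficient_stats(sample):
    """Reduce a sample to (n, sum of x, sum of x squared)."""
    n = len(sample)
    s1 = sum(sample)
    s2 = sum(x * x for x in sample)
    return n, s1, s2


def mle_from_stats(n, s1, s2):
    """Maximum-likelihood mean and variance under the normal model.

    Uses only the three stored numbers, not the original sample.
    Note: s2 / n - mean**2 can lose precision for large, tightly
    clustered values; it is used here for clarity.
    """
    mean = s1 / n
    variance = s2 / n - mean ** 2
    return mean, variance


# Illustrative sample (assumption).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, s1, s2 = sufficient_stats(sample)
mean, variance = mle_from_stats(n, s1, s2)

# Cross-check against the full-sample computation in the standard library.
print(mean, statistics.fmean(sample))
print(variance, statistics.pvariance(sample))
```

The three stored numbers can also be merged across batches by simple addition, which is one reason sufficient statistics are convenient for streaming or distributed data.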


See also

* Data cleansing
* Data editing
* Data pre-processing
* Data wrangling

