In statistics, data transformation is the application of a deterministic mathematical function to each point in a data set; that is, each data point ''z_i'' is replaced with the transformed value ''y_i'' = ''f''(''z_i''), where ''f'' is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs. Nearly always, the function that is used to transform the data is invertible, and generally it is continuous. The transformation is usually applied to a collection of comparable measurements. For example, if we are working with data on people's incomes in some currency unit, it would be common to transform each person's income value by the logarithm function.

Motivation

Guidance for how data should be transformed, or whether a transformation should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard error units. However, the constant factor 2 used here is particular to the normal distribution, and is only applicable if the sample mean varies approximately normally. The central limit theorem states that in many situations, the sample mean is approximately normally distributed if the sample size is reasonably large. However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data.

Data can also be transformed to make them easier to visualize. For example, suppose we have a scatterplot in which the points are the countries of the world, and the data values being plotted are the land area and population of each country. If the plot is made using untransformed data (e.g. square kilometers for area and the number of people for population), most of the countries would be plotted in a tight cluster of points in the lower left corner of the graph. The few countries with very large areas and/or populations would be spread thinly around most of the graph's area. Simply rescaling units (e.g., to thousand square kilometers, or to millions of people) will not change this. However, following logarithmic transformations of both area and population, the points will be spread more uniformly in the graph.

Another reason for applying data transformation is to improve interpretability, even if no formal statistical analysis or visualization is to be performed. For example, suppose we are comparing cars in terms of their fuel economy. These data are usually presented as "kilometers per liter" or "miles per gallon". However, if the goal is to assess how much additional fuel a person would use in one year when driving one car compared to another, it is more natural to work with the data transformed by applying the reciprocal function, yielding liters per kilometer, or gallons per mile.
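The idea of building a confidence interval on a transformed scale and mapping it back can be sketched in Python. This is only an illustration under stated assumptions: the income values are made up, the multiplier 2 is the rough normal-theory factor mentioned in the text, and the function name `log_scale_interval` is hypothetical. Note that the back-transformed interval brackets the geometric mean of the data rather than the arithmetic mean.

```python
import math
import statistics

def log_scale_interval(values, multiplier=2.0):
    """Approximate interval computed on the log scale, then back-transformed.

    Assumes all values are positive. The result brackets the geometric mean
    of the data, not the arithmetic mean.
    """
    logs = [math.log(v) for v in values]
    mean_log = statistics.fmean(logs)
    # Standard error of the mean of the logged values.
    se_log = statistics.stdev(logs) / math.sqrt(len(logs))
    lower = mean_log - multiplier * se_log
    upper = mean_log + multiplier * se_log
    # exp is the inverse of log, so it maps the interval back to the original scale.
    return math.exp(lower), math.exp(upper)

# Made-up, right-skewed incomes (hypothetical data for illustration only).
incomes = [18_000, 22_000, 25_000, 31_000, 40_000, 52_000, 75_000, 140_000, 600_000]
low, high = log_scale_interval(incomes)
geometric_mean = statistics.geometric_mean(incomes)
```

Because the logarithm is monotone, the ordering of the interval endpoints is preserved by the back-transformation, but the interval is no longer symmetric about its center on the original scale.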

In regression

Data transformation may be used as a remedial measure to make data suitable for modeling with linear regression if the original data violate one or more assumptions of linear regression. For example, the simplest linear regression models assume a linear relationship between the expected value of ''Y'' (the response variable to be predicted) and each independent variable (when the other independent variables are held fixed). If linearity fails to hold, even approximately, it is sometimes possible to transform either the independent or dependent variables in the regression model to improve the linearity. For example, addition of quadratic functions of the original independent variables may lead to a linear relationship with the expected value of ''Y'', resulting in a polynomial regression model, a special case of linear regression.

Another assumption of linear regression is homoscedasticity, that is, the variance of the errors must be the same regardless of the values of the predictors. If this assumption is violated (i.e. if the data are heteroscedastic), it may be possible to find a transformation of ''Y'' alone, or transformations of both ''X'' (the predictor variables) and ''Y'', such that the homoscedasticity assumption (in addition to the linearity assumption) holds for the transformed variables, so that linear regression may be applied to them.

Yet another application of data transformation is to address the problem of lack of normality in error terms. Univariate normality is not needed for least squares estimates of the regression parameters to be meaningful (see Gauss–Markov theorem). However, confidence intervals and hypothesis tests will have better statistical properties if the variables exhibit multivariate normality. Transformations that stabilize the variance of error terms (i.e. those that address heteroscedasticity) often also help make the error terms approximately normal.

Examples

Equation: $Y = a + bX$

:Meaning: A unit increase in ''X'' is associated with an average increase of ''b'' units in ''Y''.

Equation: $\log(Y) = a + bX$

:(From exponentiating both sides of the equation: $Y = e^{a} e^{bX}$)

:Meaning: A unit increase in ''X'' is associated with an average increase of ''b'' units in $\log(Y)$, or equivalently, ''Y'' increases on average by a multiplicative factor of $e^{b}$. For illustrative purposes, if the base-10 logarithm were used instead of the natural logarithm in the above transformation and the same symbols (''a'' and ''b'') were used to denote the regression coefficients, then a unit increase in ''X'' would lead to a $10^{b}$-fold increase in ''Y'' on average. If ''b'' were 1, this would imply a 10-fold increase in ''Y'' for a unit increase in ''X''.

Equation: $Y = a + b \log(X)$

:Meaning: A ''k''-fold increase in ''X'' is associated with an average increase of $b \times \log(k)$ units in ''Y''. For illustrative purposes, if the base-10 logarithm were used instead of the natural logarithm in the above transformation and the same symbols (''a'' and ''b'') were used to denote the regression coefficients, then a tenfold increase in ''X'' would result in an average increase of $b \times \log_{10}(10) = b$ units in ''Y''.

Equation: $\log(Y) = a + b \log(X)$

:(From exponentiating both sides of the equation: $Y = e^{a} X^{b}$)

:Meaning: A ''k''-fold increase in ''X'' is associated with a $k^{b}$ multiplicative increase in ''Y'' on average. Thus if ''X'' doubles, ''Y'' changes by a multiplicative factor of $2^{b}$.
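The interpretation of the log-log model can be checked numerically. The following Python sketch fits a straight line to logged values by ordinary least squares; the data are synthetic and noise-free (generated from an assumed relationship ''Y'' = 3·''X''^1.5), and the helper `fit_line` is a hypothetical name introduced only for this illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic, noise-free data following Y = 3 * X**1.5 (chosen for illustration).
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 1.5 for x in xs]

# Log-log model: log(Y) = a + b*log(X).
a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])

# Doubling X multiplies Y by 2**b under this model.
doubling_factor = 2 ** b
```

With real, noisy data the fitted ''b'' would only approximate the underlying exponent, and the multiplicative interpretation would hold on average rather than exactly.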

Alternative

Generalized linear models (GLMs) provide a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. GLMs allow the linear model to be related to the response variable via a link function and allow the magnitude of the variance of each measurement to be a function of its predicted value.

Common cases

The logarithm transformation and square root transformation are commonly used for positive data, and the multiplicative inverse transformation (reciprocal transformation) can be used for non-zero data. The ''power transformation'' is a family of transformations parameterized by a real value λ that includes the logarithm (λ = 0), square root (λ = 1/2), and multiplicative inverse (λ = −1) transformations as special cases. To approach data transformation systematically, it is possible to use statistical estimation techniques to estimate the parameter λ in the power transformation, thereby identifying the transformation that is approximately the most appropriate in a given setting. Since the power transformation family also includes the identity transformation, this approach can also indicate whether it would be best to analyze the data without a transformation. In regression analysis, this approach is known as the ''Box–Cox transformation''.

The reciprocal transformation, some power transformations such as the Yeo–Johnson transformation, and certain other transformations such as applying the inverse hyperbolic sine, can be meaningfully applied to data that include both positive and negative values (the power transformation is invertible over all real numbers if λ is an odd integer). However, when both negative and positive values are observed, it is common to begin by adding a constant to all values, producing a set of non-negative data to which any power transformation can be applied.

A common situation where a data transformation is applied is when a value of interest ranges over several orders of magnitude. Many physical and social phenomena exhibit such behavior: incomes, species populations, galaxy sizes, and rainfall volumes, to name a few. Power transforms, and in particular the logarithm, can often be used to induce symmetry in such data. The logarithm is often favored because it is easy to interpret its result in terms of "fold changes."

The logarithm also has a useful effect on ratios. If we are comparing positive quantities ''X'' and ''Y'' using the ratio ''X'' / ''Y'', then if ''X'' < ''Y'', the ratio is in the interval (0,1), whereas if ''X'' > ''Y'', the ratio is in the half-line (1,∞), where the ratio of 1 corresponds to equality. In an analysis where ''X'' and ''Y'' are treated symmetrically, the log-ratio log(''X'' / ''Y'') is zero in the case of equality, and it has the property that if ''X'' is ''K'' times greater than ''Y'', the log-ratio is as far from zero as in the situation where ''Y'' is ''K'' times greater than ''X'' (the log-ratios are log(''K'') and −log(''K'') in these two situations).

If values are naturally restricted to be in the range 0 to 1, not including the end-points, then a logit transformation may be appropriate: this yields values in the range (−∞,∞).
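Several of the transformations in this section can be written in a few lines of Python. The sketch below assumes the commonly used Box–Cox form (''x''^λ − 1)/λ, with the logarithm as the λ = 0 case; this explicit formula is not given in the text above, and the function names `box_cox` and `logit` are chosen here for illustration.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transformation of a positive value x (assumed standard form)."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def logit(p):
    """Map a proportion in the open interval (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

# lam = 1 is the identity up to a shift, lam = 0.5 is a shifted and rescaled
# square root, and lam = -1 is a shifted and rescaled reciprocal.
# As lam approaches 0, the transform approaches the logarithm.
near_log = box_cox(5.0, 1e-8)

# Log-ratios are symmetric about zero when the roles of X and Y are swapped.
forward = math.log(8.0 / 2.0)
backward = math.log(2.0 / 8.0)
```

Subtracting 1 and dividing by λ does not change the shape of the transformed data; it only makes the family vary continuously in λ, which is convenient when λ is estimated from the data.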

Transforming to normality

1. It is not always necessary or desirable to transform a data set to resemble a normal distribution. However, if symmetry or normality is desired, it can often be induced through one of the power transformations.

2. A linguistic power function is distributed according to the Zipf-Mandelbrot law. The distribution is extremely spiky and leptokurtic, which is why researchers have had to turn away from standard statistics to solve problems such as authorship attribution. Nevertheless, the use of Gaussian statistics is perfectly possible after applying a data transformation.

3. To assess whether normality has been achieved after transformation, any of the standard normality tests may be used. A graphical approach is usually more informative than a formal statistical test, and hence a normal quantile plot is commonly used to assess the fit of a data set to a normal population. Alternatively, rules of thumb based on the sample skewness and kurtosis have also been proposed.
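As a rough illustration of a skewness-based check, the Python sketch below computes the sample skewness before and after a logarithmic transformation. It assumes the simple moment-based estimator ''m''₃ / ''m''₂^1.5 (other estimators exist), and the data are synthetic values constructed so that their logarithms are exactly symmetric.

```python
import math

def sample_skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic data whose logarithms are symmetric about zero (illustrative only).
log_values = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
raw_values = [math.exp(v) for v in log_values]

skew_before = sample_skewness(raw_values)                        # clearly positive
skew_after = sample_skewness([math.log(v) for v in raw_values])  # essentially zero
```

A skewness near zero indicates symmetry only; it does not by itself establish normality, which is why a normal quantile plot remains useful.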

Transforming to a uniform distribution or an arbitrary distribution

If we observe a set of ''n'' values ''X''₁, ..., ''X''_''n'' with no ties (i.e., there are ''n'' distinct values), we can replace ''X''_''i'' with the transformed value ''Y''_''i'' = ''k'', where ''k'' is defined such that ''X''_''i'' is the ''k''^th largest among all the ''X'' values. This is called the ''rank transform'', and creates data with a perfect fit to a uniform distribution. This approach has a population analogue.

Using the probability integral transform, if ''X'' is any random variable, and ''F'' is the cumulative distribution function of ''X'', then as long as ''F'' is invertible, the random variable ''U'' = ''F''(''X'') follows a uniform distribution on the unit interval [0,1].

From a uniform distribution, we can transform to any distribution with an invertible cumulative distribution function. If ''G'' is an invertible cumulative distribution function, and ''U'' is a uniformly distributed random variable, then the random variable ''G''⁻¹(''U'') has ''G'' as its cumulative distribution function.

Putting the two together, if ''X'' is any random variable, ''F'' is the invertible cumulative distribution function of ''X'', and ''G'' is an invertible cumulative distribution function, then the random variable ''G''⁻¹(''F''(''X'')) has ''G'' as its cumulative distribution function.
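Both constructions can be sketched in Python. The example below makes several assumptions that are not in the text: ranks are assigned in ascending order (rank 1 for the smallest value; reversing the order gives the "''k''^th largest" convention), ''F'' is taken to be an exponential distribution function, and ''G'' is the standard normal distribution, whose inverse is available as `NormalDist().inv_cdf` in the standard library.

```python
import math
from statistics import NormalDist

def rank_transform(values):
    """Replace each value by its rank (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def exponential_cdf(x, rate=1.0):
    """F(x) for an exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)

def to_standard_normal(x, rate=1.0):
    """G^{-1}(F(x)) with F exponential and G standard normal; requires x > 0."""
    u = exponential_cdf(x, rate)  # uniform on (0, 1) if x is Exponential(rate)
    return NormalDist().inv_cdf(u)

ranks = rank_transform([3.2, 0.5, 7.1, 1.8])
# The median of the exponential distribution, log(2)/rate, maps to the normal median.
median_mapped = to_standard_normal(math.log(2.0))
```

In practice ''F'' is rarely known exactly; it is typically replaced by an estimate, in which case the transformed values are only approximately distributed according to ''G''.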

Variance stabilizing transformations

Many types of statistical data exhibit a "variance-on-mean relationship", meaning that the variability is different for data values with different expected values. As an example, in comparing different populations in the world, the variance of income tends to increase with mean income. If we consider a number of small area units (e.g., counties in the United States) and obtain the mean and variance of incomes within each county, it is common that the counties with higher mean income also have higher variances.

A variance-stabilizing transformation aims to remove a variance-on-mean relationship, so that the variance becomes constant relative to the mean. Examples of variance-stabilizing transformations are the Fisher transformation for the sample correlation coefficient, the square root transformation or Anscombe transform for Poisson data (count data), the Box–Cox transformation for regression analysis, and the arcsine square root transformation or angular transformation for proportions (binomial data). While commonly used for statistical analysis of proportional data, the arcsine square root transformation is not recommended because logistic regression or a logit transformation are more appropriate for binomial or non-binomial proportions, respectively, especially due to decreased type-II error.
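The effect of a variance-stabilizing transformation on count data can be illustrated by simulation. The Python sketch below assumes the usual form of the Anscombe transform, 2·√(''x'' + 3/8), draws Poisson counts with a simple multiplication method (attributed to Knuth), and compares variances before and after transformation; the seed, sample size, and means are arbitrary choices for illustration.

```python
import math
import random
import statistics

def poisson_sample(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def anscombe(x):
    """Anscombe transform for count data (assumed standard form)."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
results = {}
for mean in (5.0, 50.0):
    counts = [poisson_sample(rng, mean) for _ in range(5000)]
    raw_variance = statistics.variance(counts)
    stabilized_variance = statistics.variance([anscombe(c) for c in counts])
    results[mean] = (raw_variance, stabilized_variance)
```

For Poisson data the raw variance grows with the mean, whereas the variance of the transformed counts should stay close to 1 for both means, provided the mean is not too small.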

Transformations for multivariate data

Univariate functions can be applied point-wise to multivariate data to modify their marginal distributions. It is also possible to modify some attributes of a multivariate distribution using an appropriately constructed transformation. For example, when working with time series and other types of sequential data, it is common to difference the data to improve stationarity. If data generated by a random vector ''X'' are observed as vectors ''X''_i of observations with covariance matrix Σ, a linear transformation can be used to decorrelate the data. To do this, the Cholesky decomposition is used to express Σ = ''A'' ''A''ᵀ. Then the transformed vector ''Y''_i = ''A''⁻¹''X''_i has the identity matrix as its covariance matrix.
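The decorrelation step can be sketched in plain Python. In this illustration Σ is replaced by the sample covariance matrix of a small made-up data set, ''A'' is its lower-triangular Cholesky factor, and ''A''⁻¹''X''_i is computed by forward substitution rather than by forming the inverse; all function names are hypothetical.

```python
import math

def sample_covariance(rows):
    """Sample covariance matrix of a list of equal-length observation vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def cholesky(sigma):
    """Lower-triangular A with A A^T = sigma (sigma must be positive definite)."""
    d = len(sigma)
    a = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            partial = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                a[i][j] = (sigma[i][j] - partial) / a[j][j]
    return a

def solve_lower(a, x):
    """Compute A^{-1} x by forward substitution, without forming the inverse."""
    y = []
    for i in range(len(x)):
        y.append((x[i] - sum(a[i][k] * y[k] for k in range(i))) / a[i][i])
    return y

# Made-up correlated observations (hypothetical data for illustration only).
data = [[2.0, 1.0], [4.0, 3.5], [6.0, 4.0], [8.0, 7.5], [10.0, 8.0], [3.0, 3.0]]
a = cholesky(sample_covariance(data))
whitened = [solve_lower(a, row) for row in data]
whitened_covariance = sample_covariance(whitened)
```

Because the factor is computed from the sample covariance itself, the transformed data have a sample covariance equal to the identity matrix up to rounding error; with a known population Σ the identity would hold only in expectation.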

External links

Log Transformations for Skewed and Wide Distributions: discusses the log and the "signed logarithm" transformations (a chapter from "Practical Data Science with R").