Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
Motivation
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
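This effect is easy to demonstrate numerically. The sketch below uses hypothetical (age, income) features, with illustrative ranges, to show how the wide-range feature dominates the unscaled Euclidean distance:

```python
import math

# Two people described by (age in years, annual income in dollars).
# Income spans a far wider numeric range than age.
a = (25, 50_000)
b = (55, 52_000)

# Unscaled Euclidean distance: the income difference dominates,
# and the 30-year age gap is almost invisible.
unscaled = math.dist(a, b)

# After min-max scaling each feature to [0, 1] (assuming, for this toy
# data, that age spans 20-70 and income spans 20k-120k), both features
# contribute on comparable terms.
def rescale(x, lo, hi):
    return (x - lo) / (hi - lo)

a_scaled = (rescale(25, 20, 70), rescale(50_000, 20_000, 120_000))
b_scaled = (rescale(55, 20, 70), rescale(52_000, 20_000, 120_000))
scaled = math.dist(a_scaled, b_scaled)
```

Here the unscaled distance is roughly 2000 (driven almost entirely by income), while after scaling the age difference contributes most of the distance.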
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
It is also important to apply feature scaling if regularization is used as part of the loss function, so that coefficients are penalized appropriately.
Methods
Rescaling (min-max normalization)
Also known as min-max scaling or min-max normalization, rescaling is the simplest method and consists in rescaling the range of features to scale the range in [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max of [0, 1] is given as:

x' = (x − min(x)) / (max(x) − min(x))

where x is an original value and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
To rescale a range between an arbitrary set of values [a, b], the formula becomes:

x' = a + ((x − min(x)) (b − a)) / (max(x) − min(x))

where a and b are the min-max values.
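The two formulas above can be sketched as a single helper; the function name and default target range here are illustrative:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # difference between the maximum and minimum
    return [new_min + (x - lo) * (new_max - new_min) / span for x in values]

weights = [160, 170, 180, 200]
min_max_scale(weights)          # maps 160 -> 0.0 and 200 -> 1.0
min_max_scale(weights, -1, 1)   # maps into [-1, 1] instead
```

With the default arguments this reduces to the [0, 1] formula: for the weight data it subtracts 160 and divides by 40, exactly as in the example above.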
Mean normalization
x' = (x − x̄) / (max(x) − min(x))

where x is an original value, x' is the normalized value, and x̄ is the mean of that feature vector. There is another form of mean normalization which divides by the standard deviation; it is also called standardization.
Standardization (Z-score Normalization)
In machine learning, we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks).
The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation:

x' = (x − x̄) / σ

where x is the original feature vector, x̄ is the mean of that feature vector, and σ is its standard deviation.
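The steps above, sketched with the standard library (the choice of population rather than sample standard deviation here is a design decision, not mandated by the formula):

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / std for x in values]

z = standardize([160, 170, 180, 200])
# z now has zero mean and unit variance (up to floating-point error).
```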
Scaling to unit length
Another option that is widely used in machine learning is to scale the components of a feature vector such that the complete vector has length one. This usually means dividing each component by the Euclidean length of the vector:

x' = x / ‖x‖

In some applications (e.g., histogram features), it can be more practical to use the L1 norm (i.e., taxicab geometry) of the feature vector. This is especially important if in the following learning steps the scalar metric is used as a distance measure. Note that this only works for x ≠ 0.
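Both norms can be sketched in one helper; the function name and the `ord` parameter are illustrative (mirroring the convention used by `numpy.linalg.norm`):

```python
import math

def unit_length(vector, ord=2):
    """Scale a vector to unit length under the L2 (default) or L1 norm."""
    if ord == 2:
        norm = math.sqrt(sum(x * x for x in vector))  # Euclidean length
    else:
        norm = sum(abs(x) for x in vector)            # L1 / taxicab norm
    if norm == 0:
        raise ValueError("cannot scale the zero vector")
    return [x / norm for x in vector]

unit_length([3.0, 4.0])          # -> [0.6, 0.8], Euclidean length 1
unit_length([3.0, 4.0], ord=1)   # absolute components sum to 1
```

The explicit zero check reflects the caveat above: the zero vector has no direction and cannot be normalized.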
Application
In stochastic gradient descent, feature scaling can sometimes improve the convergence speed of the algorithm. In support vector machines, it can reduce the time to find support vectors. Note that feature scaling changes the SVM result.
See also
* Normalization (statistics)
* Standard score
* fMLLR, Feature space Maximum Likelihood Linear Regression
Further reading
* {{cite book , first1=Jiawei , last1=Han , first2=Micheline , last2=Kamber , first3=Jian , last3=Pei , title=Data Mining: Concepts and Techniques , publisher=Elsevier , year=2011 , chapter=Data Transformation and Data Discretization , pages=111–118 , isbn=9780123814807 , chapter-url=https://www.google.com/books/edition/Data_Mining_Concepts_and_Techniques/pQws07tdpjoC?hl=en&gbpv=1&pg=PA111
External links
Lecture by Andrew Ng on feature scaling