Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
Motivation
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
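This effect is easy to demonstrate numerically. The sketch below uses hypothetical (age, income) features, with illustrative ranges, to show how the wide-range feature dominates the unscaled Euclidean distance:

```python
import math

# Two people described by (age in years, annual income in dollars).
# Income spans a far wider numeric range than age.
a = (25, 50_000)
b = (55, 52_000)

# Unscaled Euclidean distance: the income difference dominates,
# and the 30-year age gap is almost invisible.
unscaled = math.dist(a, b)

# After min-max scaling each feature to [0, 1] (assuming, for this toy
# data, that age spans 20-70 and income spans 20k-120k), both features
# contribute on comparable terms.
def rescale(x, lo, hi):
    return (x - lo) / (hi - lo)

a_scaled = (rescale(25, 20, 70), rescale(50_000, 20_000, 120_000))
b_scaled = (rescale(55, 20, 70), rescale(52_000, 20_000, 120_000))
scaled = math.dist(a_scaled, b_scaled)
```

Here the unscaled distance is roughly 2000 (driven almost entirely by income), while after scaling the age difference contributes most of the distance.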
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
It is also important to apply feature scaling if regularization is used as part of the loss function, so that coefficients are penalized appropriately.
Methods
Rescaling (min-max normalization)
Also known as min-max scaling or min-max normalization, rescaling is the simplest method and consists in rescaling the range of features to scale the range in [0, 1] or [−1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max of [0, 1] is given as:

x' = (x − min(x)) / (max(x) − min(x))

where x is an original value and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
To rescale a range between an arbitrary set of values [a, b], the formula becomes:

x' = a + ((x − min(x)) (b − a)) / (max(x) − min(x))

where a and b are the min-max values.
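The two formulas above can be sketched as a single helper; the function name and default target range here are illustrative:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # difference between the maximum and minimum
    return [new_min + (x - lo) * (new_max - new_min) / span for x in values]

weights = [160, 170, 180, 200]
min_max_scale(weights)          # maps 160 -> 0.0 and 200 -> 1.0
min_max_scale(weights, -1, 1)   # maps into [-1, 1] instead
```

With the default arguments this reduces to the [0, 1] formula: for the weight data it subtracts 160 and divides by 40, exactly as in the example above.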
Mean normalization
x' = (x − x̄) / (max(x) − min(x))

where x is an original value, x' is the normalized value, and x̄ is the mean of that feature vector. There is another form of mean normalization which divides by the standard deviation; it is also called standardization.
Standardization (Z-score Normalization)
In machine learning, we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in many machine learning algorithms (e.g., support vector machines, logistic regression, and artificial neural networks).
The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation:

x' = (x − x̄) / σ

where x is the original feature vector, x̄ is the mean of that feature vector, and σ is its standard deviation.
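The steps above, sketched with the standard library (the choice of population rather than sample standard deviation here is a design decision, not mandated by the formula):

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / std for x in values]

z = standardize([160, 170, 180, 200])
# z now has zero mean and unit variance (up to floating-point error).
```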
Scaling to unit length
Another option that is widely used in machine learning is to scale the components of a feature vector such that the complete vector has length one. This usually means dividing each component by the Euclidean length of the vector:

x' = x / ‖x‖

In some applications (e.g., histogram features), it can be more practical to use the L1 norm (i.e., taxicab geometry) of the feature vector. This is especially important if in the following learning steps the scalar metric is used as a distance measure. Note that this only works for x ≠ 0.
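Both norms can be sketched in one helper; the function name and the `ord` parameter are illustrative (mirroring the convention used by `numpy.linalg.norm`):

```python
import math

def unit_length(vector, ord=2):
    """Scale a vector to unit length under the L2 (default) or L1 norm."""
    if ord == 2:
        norm = math.sqrt(sum(x * x for x in vector))  # Euclidean length
    else:
        norm = sum(abs(x) for x in vector)            # L1 / taxicab norm
    if norm == 0:
        raise ValueError("cannot scale the zero vector")
    return [x / norm for x in vector]

unit_length([3.0, 4.0])          # -> [0.6, 0.8], Euclidean length 1
unit_length([3.0, 4.0], ord=1)   # absolute components sum to 1
```

The explicit zero check reflects the caveat above: the zero vector has no direction and cannot be normalized.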
Application
In stochastic gradient descent, feature scaling can sometimes improve the convergence speed of the algorithm. In support vector machines, it can reduce the time to find support vectors. Note that feature scaling changes the SVM result.
See also
* Normalization (statistics)
* Standard score
* fMLLR, Feature space Maximum Likelihood Linear Regression
Further reading
* {{cite book , first1=Jiawei , last1=Han , first2=Micheline , last2=Kamber , first3=Jian , last3=Pei , title=Data Mining: Concepts and Techniques , publisher=Elsevier , year=2011 , chapter=Data Transformation and Data Discretization , pages=111–118 , isbn=9780123814807 , chapter-url=https://www.google.com/books/edition/Data_Mining_Concepts_and_Techniques/pQws07tdpjoC?hl=en&gbpv=1&pg=PA111
External links
Lecture by Andrew Ng on feature scaling