Batch normalization (also known as batch norm) is a normalization technique used to make training of artificial neural networks faster and more stable by adjusting the inputs to each layer—re-centering them around zero and re-scaling them to a standard size. It was introduced by Sergey Ioffe and Christian Szegedy in 2015.
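As a rough, illustrative sketch (not taken from the original paper), the following NumPy snippet shows the core operation on a toy batch of activations for a single unit: subtracting the batch mean re-centers the values around zero, and dividing by the batch standard deviation re-scales them to roughly unit size.

<syntaxhighlight lang="python">
import numpy as np

# Toy mini-batch of activations for one unit (values chosen arbitrarily).
batch = np.array([2.0, 4.0, 6.0, 8.0])

# Re-center around zero and re-scale to unit standard deviation.
normalized = (batch - batch.mean()) / batch.std()

print(normalized.mean())  # approximately 0
print(normalized.std())   # approximately 1
</syntaxhighlight>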
Experts still debate why batch normalization works so well. It was initially thought to tackle ''internal covariate shift'', a problem where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network.
However, newer research suggests it doesn't fix this shift but instead smooths the objective function—a mathematical guide the network follows to improve—enhancing performance.
In very deep networks, batch normalization can initially cause a severe
gradient explosion—where updates to the network grow uncontrollably large—but this is managed with shortcuts called skip connections in residual networks.
Another theory is that batch normalization decouples the length and direction of the underlying weight vectors, allowing each to be adjusted separately and thereby speeding up training.
Internal covariate shift
Each layer in a neural network has inputs that follow a specific distribution, which shifts during training due to two main factors: the random starting values of the network’s settings (
parameter initialization) and the natural variation in the input data. This shifting pattern affecting the inputs to the network’s inner layers is called internal covariate shift. While a strict definition isn’t fully agreed upon, experiments show that it involves changes in the means and variances of these inputs during training.
Batch normalization was first developed to address internal covariate shift.
During training, as the parameters of preceding layers adjust, the distribution of inputs to the current layer changes accordingly, such that the current layer needs to constantly readjust to new distributions. This issue is particularly severe in deep networks, because small changes in shallower hidden layers will be amplified as they propagate within the network, resulting in a significant shift in deeper hidden layers. Batch normalization was proposed to reduce these unwanted shifts to speed up training and produce more reliable models.
Beyond possibly tackling internal covariate shift, batch normalization offers several additional advantages. It allows the network to use a higher
learning rate—a setting that controls how quickly the network learns—without causing problems like vanishing or exploding gradients, where updates become too small or too large. It also appears to have a regularizing effect, improving the network’s ability to generalize to new data, reducing the need for
dropout, a technique used to prevent
overfitting (when a model learns the training data too well and fails on new data). Additionally, networks using batch normalization are less sensitive to the choice of starting settings or learning rates, making them more robust and adaptable.
Procedures
Transformation
In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but to use this step jointly with
stochastic optimization methods, it is impractical to use the global information. Thus, normalization is restricted to each mini-batch in the training process.
Let us use ''B'' to denote a mini-batch of size ''m'' of the entire training set. The empirical mean and variance of ''B'' could thus be denoted as <math>\mu_B = \frac{1}{m}\sum_{i=1}^m x_i</math> and <math>\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2</math>.
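As a hedged illustration of these two formulas (the array shapes and random values below are assumptions chosen for the example, not taken from the source), the empirical per-dimension mean and variance of a mini-batch can be computed in NumPy as follows; NumPy's variance uses the same 1/m normalization as the formula above.

<syntaxhighlight lang="python">
import numpy as np

m, d = 4, 3                        # mini-batch size m and input dimension d (arbitrary)
B = np.random.randn(m, d)          # mini-batch B: m examples, each d-dimensional

mu_B = B.mean(axis=0)              # empirical mean of B, one value per dimension
sigma2_B = B.var(axis=0)           # empirical variance of B (1/m normalization)

print(mu_B.shape, sigma2_B.shape)  # (3,) (3,)
</syntaxhighlight>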
For a layer of the network with ''d''-dimensional input, <math>x = (x^{(1)}, \ldots, x^{(d)})</math>, each dimension of its input is then normalized (i.e. re-centered and re-scaled) separately, <math>\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}}</math>, where <math>k \in [1, d]</math> and <math>i \in [1, m]</math>; <math>\mu_B^{(k)}</math> and <math>\left(\sigma_B^{(k)}\right)^2</math> are the per-dimension mean and variance, respectively, and <math>\epsilon</math> is a small constant added to the denominator for numerical stability.
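A minimal sketch of the full per-dimension normalization step, continuing the notation above; the value of epsilon and the array shapes are illustrative assumptions rather than values given in the source.

<syntaxhighlight lang="python">
import numpy as np

m, d = 4, 3                        # mini-batch size and input dimension (arbitrary)
B = np.random.randn(m, d)          # mini-batch of d-dimensional inputs
eps = 1e-5                         # small constant for numerical stability

mu_B = B.mean(axis=0)              # per-dimension mean
sigma2_B = B.var(axis=0)           # per-dimension variance (1/m normalization)

# Re-center and re-scale each dimension separately.
x_hat = (B - mu_B) / np.sqrt(sigma2_B + eps)

print(x_hat.mean(axis=0))          # approximately zero in every dimension
print(x_hat.std(axis=0))           # approximately one in every dimension
</syntaxhighlight>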