Batch Normalization

Batch normalization (also known as batch norm) is a normalization technique used to make training of artificial neural networks faster and more stable by adjusting the inputs to each layer: re-centering them around zero and re-scaling them to a standard size. It was introduced by Sergey Ioffe and Christian Szegedy in 2015.

Experts still debate why batch normalization works so well. It was initially thought to tackle ''internal covariate shift'', a problem where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network. However, newer research suggests it does not fix this shift but instead smooths the objective function (the mathematical guide the network follows to improve), enhancing performance. In very deep networks, batch normalization can initially cause a severe gradient explosion, where updates to the network grow uncontrollably large, but this is managed with shortcuts called skip connections in residual networks. Another theory is that batch normalization adjusts data by handling its length and direction separately, speeding up training.


Internal covariate shift

Each layer in a neural network has inputs that follow a specific distribution, which shifts during training due to two main factors: the random starting values of the network's settings (parameter initialization) and the natural variation in the input data. This shifting pattern affecting the inputs to the network's inner layers is called internal covariate shift. While a strict definition is not fully agreed upon, experiments show that it involves changes in the means and variances of these inputs during training.

Batch normalization was first developed to address internal covariate shift. During training, as the parameters of preceding layers adjust, the distribution of inputs to the current layer changes accordingly, so the current layer needs to constantly readjust to new distributions. This issue is particularly severe in deep networks, because small changes in shallower hidden layers are amplified as they propagate through the network, resulting in significant shift in deeper hidden layers. Batch normalization was proposed to reduce these unwanted shifts, speeding up training and producing more reliable models.

Beyond possibly tackling internal covariate shift, batch normalization offers several additional advantages. It allows the network to use a higher learning rate (a setting that controls how quickly the network learns) without causing problems like vanishing or exploding gradients, where updates become too small or too large. It also appears to have a regularizing effect, improving the network's ability to generalize to new data and reducing the need for dropout, a technique used to prevent overfitting (when a model learns the training data too well and fails on new data). Additionally, networks using batch normalization are less sensitive to the choice of starting settings or learning rates, making them more robust and adaptable.


Procedures


Transformation

In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but to use this step jointly with stochastic optimization methods, it is impractical to use the global information. Thus, normalization is restrained to each mini-batch in the training process.

Let us use ''B'' to denote a mini-batch of size ''m'' of the entire training set. The empirical mean and variance of ''B'' could thus be denoted as

\mu_B = \frac{1}{m}\sum_{i=1}^m x_i \quad \text{and} \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^m \left(x_i-\mu_B\right)^2.

For a layer of the network with ''d''-dimensional input, x = (x^{(1)},\ldots,x^{(d)}), each dimension of its input is then normalized (i.e. re-centered and re-scaled) separately,

\hat{x}_i^{(k)} = \frac{x_i^{(k)} - \mu_B^{(k)}}{\sqrt{\left(\sigma_B^{(k)}\right)^2 + \epsilon}},

where k \in [1,d] and i \in [1,m]; \mu_B^{(k)} and \sigma_B^{(k)} are the per-dimension mean and standard deviation, respectively. \epsilon is added in the denominator for numerical stability and is an arbitrarily small constant. The resulting normalized activation \hat{x}^{(k)} has zero mean and unit variance, if \epsilon is not taken into account. To restore the representation power of the network, a transformation step then follows as

y_i^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)},

where the parameters \gamma^{(k)} and \beta^{(k)} are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform BN_{\gamma^{(k)},\beta^{(k)}}: x_{1\ldots m}^{(k)} \rightarrow y_{1\ldots m}^{(k)} called the Batch Normalizing transform. The output of the BN transform y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)}) is then passed to other network layers, while the normalized output \hat{x}_i^{(k)} remains internal to the current layer.
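As a concrete illustration, the following NumPy sketch (an illustrative implementation, not code from the original paper) applies the Batch Normalizing transform to a mini-batch; the names gamma, beta, and eps mirror \gamma, \beta, and \epsilon above.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing transform for one mini-batch.

    x:     array of shape (m, d) -- m examples, d features
    gamma: array of shape (d,)   -- learned scale, one per dimension
    beta:  array of shape (d,)   -- learned shift, one per dimension
    """
    mu = x.mean(axis=0)                    # per-dimension empirical mean mu_B
    var = x.var(axis=0)                    # per-dimension empirical variance sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # re-center and re-scale
    y = gamma * x_hat + beta               # restore representation power
    return y, x_hat, mu, var

# Example: a mini-batch of m=4 examples with d=3 features
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 3))
y, x_hat, mu, var = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(x_hat.mean(axis=0))  # close to 0 in each dimension
print(x_hat.var(axis=0))   # close to 1 in each dimension (up to eps)
```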


Backpropagation

The described BN transform is a differentiable operation, and the gradient of the loss ''l'' with respect to the different parameters can be computed directly with the chain rule. Specifically, \frac{\partial l}{\partial y_i^{(k)}} depends on the choice of activation function, and the gradient against other parameters could be expressed as a function of \frac{\partial l}{\partial y_i^{(k)}}:

\frac{\partial l}{\partial \hat{x}_i^{(k)}} = \frac{\partial l}{\partial y_i^{(k)}}\gamma^{(k)},

\frac{\partial l}{\partial \gamma^{(k)}} = \sum_{i=1}^m \frac{\partial l}{\partial y_i^{(k)}}\hat{x}_i^{(k)}, \qquad \frac{\partial l}{\partial \beta^{(k)}} = \sum_{i=1}^m \frac{\partial l}{\partial y_i^{(k)}},

\frac{\partial l}{\partial \left(\sigma_B^{(k)}\right)^2} = \sum_{i=1}^m \frac{\partial l}{\partial \hat{x}_i^{(k)}} \left(x_i^{(k)}-\mu_B^{(k)}\right)\left(-\frac{1}{2}\right)\left(\left(\sigma_B^{(k)}\right)^2+\epsilon\right)^{-3/2},

\frac{\partial l}{\partial \mu_B^{(k)}} = \sum_{i=1}^m \frac{\partial l}{\partial \hat{x}_i^{(k)}}\frac{-1}{\sqrt{\left(\sigma_B^{(k)}\right)^2+\epsilon}}+\frac{\partial l}{\partial \left(\sigma_B^{(k)}\right)^2}\frac{1}{m}\sum_{i=1}^m (-2)\left(x_i^{(k)}-\mu^{(k)}_B\right),

and

\frac{\partial l}{\partial x_i^{(k)}} = \frac{\partial l}{\partial \hat{x}_i^{(k)}}\frac{1}{\sqrt{\left(\sigma_B^{(k)}\right)^2+\epsilon}}+\frac{\partial l}{\partial \left(\sigma_B^{(k)}\right)^2}\frac{2\left(x_i^{(k)}-\mu_B^{(k)}\right)}{m}+\frac{\partial l}{\partial \mu_B^{(k)}}\frac{1}{m}.
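Translated into code, these chain-rule expressions give the following NumPy sketch of the backward pass (an illustration that pairs with the forward sketch in the Transformation section above; dy stands for \partial l/\partial y):

```python
import numpy as np

def batch_norm_backward(dy, x, x_hat, mu, var, gamma, eps=1e-5):
    """Backward pass of batch normalization for one mini-batch.

    dy:    upstream gradient dl/dy, shape (m, d)
    x:     original inputs from the forward pass, shape (m, d)
    x_hat: normalized activations from the forward pass, shape (m, d)
    mu:    per-dimension mini-batch mean, shape (d,)
    var:   per-dimension mini-batch variance sigma_B^2, shape (d,)
    """
    m = dy.shape[0]
    dgamma = np.sum(dy * x_hat, axis=0)                 # dl/dgamma
    dbeta = np.sum(dy, axis=0)                          # dl/dbeta
    dx_hat = dy * gamma                                 # dl/dx_hat
    dvar = np.sum(dx_hat * (x - mu), axis=0) * (-0.5) * (var + eps) ** (-1.5)
    dmu = np.sum(dx_hat, axis=0) * (-1.0 / np.sqrt(var + eps)) \
          + dvar * np.mean(-2.0 * (x - mu), axis=0)     # dl/dmu_B
    dx = dx_hat / np.sqrt(var + eps) \
         + dvar * 2.0 * (x - mu) / m \
         + dmu / m                                      # dl/dx_i
    return dx, dgamma, dbeta
```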


Inference

During the training stage, the normalization steps depend on the mini-batches to ensure efficient and reliable training. In the inference stage, however, this dependence is no longer useful. Instead, the normalization step in this stage is computed with the population statistics, so that the output depends on the input in a deterministic manner. The population mean, E[x^{(k)}], and variance, \operatorname{Var}[x^{(k)}], are computed as:

E[x^{(k)}] = E_B\left[\mu^{(k)}_B\right] \quad \text{and} \quad \operatorname{Var}[x^{(k)}] = \frac{m}{m-1}E_B\left[\left(\sigma^{(k)}_B\right)^2\right].

The population statistics are thus a complete representation of the mini-batches. The BN transform in the inference step becomes

y^{(k)} = BN^{\text{inf}}_{\gamma^{(k)},\beta^{(k)}}(x^{(k)})=\gamma^{(k)}\frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]+\epsilon}} + \beta^{(k)},

where y^{(k)} is passed on to future layers instead of x^{(k)}. Since the parameters are fixed in this transformation, the batch normalization procedure at inference is essentially applying a linear transform to the activation.
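Because all statistics are fixed at inference time, the BN step can be folded into a single affine map, as noted above. The sketch below makes this explicit (running_mean and running_var are assumed precomputed estimates of the population statistics):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time BN using fixed population statistics."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

def fold_bn(gamma, beta, running_mean, running_var, eps=1e-5):
    """Precompute the equivalent linear transform y = a * x + b."""
    a = gamma / np.sqrt(running_var + eps)
    b = beta - a * running_mean
    return a, b

gamma, beta = np.ones(3), np.zeros(3)
mean, var = np.array([1.0, 2.0, 3.0]), np.array([4.0, 4.0, 4.0])
a, b = fold_bn(gamma, beta, mean, var)
x = np.random.default_rng(1).normal(size=(5, 3))
# The folded affine map reproduces inference-mode BN exactly:
assert np.allclose(batch_norm_inference(x, gamma, beta, mean, var), a * x + b)
```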


Theory

Although batch normalization has become popular due to its strong empirical performance, the working mechanism of the method is not yet well understood. The explanation in the original paper was that batch norm works by reducing internal covariate shift, but this has been challenged by more recent work. One experiment trained a VGG-16 network under three different training regimes: standard (no batch norm), batch norm, and batch norm with noise added to each layer during training. In the third model, the noise has non-zero mean and non-unit variance, i.e. it explicitly introduces covariate shift. Despite this, it showed similar accuracy to the second model, and both performed better than the first, suggesting that covariate shift is not the reason batch norm improves performance. Using batch normalization also causes the items in a batch to no longer be independent and identically distributed (iid), which can make training harder due to lower-quality gradient estimation.


Smoothness

One alternative explanation is that the improvement with batch normalization is instead due to it producing a smoother parameter space and smoother gradients, as formalized by a smaller Lipschitz constant.

Consider two identical networks, one containing batch normalization layers and the other not; the behaviors of these two networks are then compared. Denote the loss functions as \hat{L} and L, respectively. Let the input to both networks be x, and the output be y, for which y = Wx, where W is the layer weights. For the second network, y additionally goes through a batch normalization layer. Denote the normalized activation as \hat{y}, which has zero mean and unit variance. Let the transformed activation be z = \gamma\hat{y}+\beta, and suppose \gamma and \beta are constants. Finally, denote the standard deviation of \hat{y}_j \in \mathbb{R}^m over a mini-batch as \sigma_j.

First, it can be shown that the gradient magnitude of a batch normalized network, \|\nabla_{y_j}\hat{L}\|, is bounded, with the bound expressed as

\|\nabla_{y_j}\hat{L}\|^2 \le \frac{\gamma^2}{\sigma_j^2}\left(\|\nabla_{y_j}L\|^2-\frac{1}{m}\langle \mathbf{1},\nabla_{y_j}L\rangle^2-\frac{1}{m}\langle\nabla_{y_j}L,\hat{y}_j\rangle^2\right).

Since the gradient magnitude represents the Lipschitzness of the loss, this relationship indicates that a batch normalized network could achieve greater Lipschitzness comparatively. Notice that the bound gets tighter when the gradient \nabla_{y_j}\hat{L} correlates with the activation \hat{y}_j, which is a common phenomenon. The scaling by \frac{\gamma^2}{\sigma_j^2} is also significant, since the variance is often large.

Secondly, the quadratic form of the loss Hessian with respect to activation in the gradient direction can be bounded as

\left(\nabla_{y_j}\hat{L}\right)^{\top} \frac{\partial \hat{L}}{\partial y_j \partial y_j}\left(\nabla_{y_j}\hat{L}\right) \le \frac{\gamma^2}{\sigma_j^2}\left(\frac{\partial L}{\partial y_j}\right)^{\top} \frac{\partial L}{\partial y_j \partial y_j}\left(\frac{\partial L}{\partial y_j}\right)-\frac{\gamma}{m\sigma_j^2}\langle\nabla_{y_j}L,\hat{y}_j\rangle \left\|\frac{\partial \hat{L}}{\partial y_j}\right\|^2.

The scaling by \frac{\gamma^2}{\sigma_j^2} indicates that the loss Hessian is resilient to the mini-batch variance, whereas the second term on the right hand side suggests that the landscape becomes smoother when the Hessian and the inner product are non-negative. If the loss is locally convex, then the Hessian is positive semi-definite, while the inner product is positive if \hat{y}_j is in the direction towards the minimum of the loss. It can thus be concluded from this inequality that the gradient generally becomes more predictive with the batch normalization layer.

It then follows to translate the bounds on the loss with respect to the normalized activation into a bound on the loss with respect to the network weights:

\hat{g}_j \le \frac{\gamma^2}{\sigma_j^2}\left(g^2_j-m\mu^2_{g_j}-\lambda^2\langle \nabla_{y_j}L,\hat{y}_j\rangle^2\right), \quad \text{where } g_j = \max_{\|X\|\le\lambda} \|\nabla_W L\|^2 \text{ and } \hat{g}_j = \max_{\|X\|\le\lambda} \|\nabla_W \hat{L}\|^2.

In addition to the smoother landscape, it is further shown that batch normalization could result in a better initialization, via the following inequality:

\|W_0-\hat{W}^*\|^2 \le \|W_0-W^*\|^2-\frac{1}{\|W^*\|^2}\left(\|W^*\|^2-\langle W^*,W_0\rangle\right)^2,

where W^* and \hat{W}^* are the local optimal weights for the two networks, respectively.

Some scholars argue that the above analysis cannot fully capture the performance of batch normalization, because the proof only concerns the largest eigenvalue, or equivalently, one direction in the landscape at all points. It is suggested that the complete eigenspectrum needs to be taken into account to make a conclusive analysis.
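Returning to the first (gradient) bound: where the two subtracted inner-product terms come from can be seen from the Jacobian of the normalization step. The following is a standard computation, sketched here under the same setup as above (per-dimension statistics, \hat{y} with zero mean and unit variance over the batch):

```latex
% Jacobian of the mini-batch normalization \hat{y}_i = (y_i - \mu_B)/\sigma
% over a batch of size m (\epsilon omitted for clarity):
\frac{\partial \hat{y}_i}{\partial y_j}
  = \frac{1}{\sigma}\left(\delta_{ij} - \frac{1}{m} - \frac{\hat{y}_i\,\hat{y}_j}{m}\right)
\quad\Longrightarrow\quad
\nabla_{y}\hat{L}
  = \frac{\gamma}{\sigma}\left(I - \frac{\mathbf{1}\mathbf{1}^{\top}}{m}
      - \frac{\hat{y}\,\hat{y}^{\top}}{m}\right)\nabla_{z}\hat{L}.
% Since \mathbf{1} and \hat{y} are orthogonal with squared norm m, the matrix
% in parentheses is an orthogonal projection: it removes the components of the
% upstream gradient along the mean direction (\mathbf{1}) and the variance
% direction (\hat{y}), producing the two subtracted terms in the bound, while
% the 1/\sigma factor produces the \gamma^2/\sigma_j^2 scaling.
```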


Measure

Since it is hypothesized that batch normalization layers could reduce internal covariate shift, an experiment was set up to measure quantitatively how much covariate shift is reduced. First, the notion of internal covariate shift needs to be defined mathematically. Specifically, to quantify the adjustment that a layer's parameters make in response to updates in previous layers, the correlation between the gradients of the loss before and after all previous layers are updated is measured, since gradients can capture the shifts introduced by the first-order training method. If the shift introduced by the changes in previous layers is small, then the correlation between the gradients would be close to 1. The correlation between the gradients is computed for four models: a standard VGG network, a VGG network with batch normalization layers, a 25-layer deep linear network (DLN) trained with full-batch gradient descent, and a DLN network with batch normalization layers. Interestingly, the standard VGG and DLN models both show higher correlations of gradients compared with their batch-normalized counterparts, indicating that the additional batch normalization layers are not reducing internal covariate shift.
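A minimal sketch of how such a measurement could be implemented (a hypothetical toy setup, not the cited experiment's code: a two-layer network stands in for VGG/DLN, and cosine similarity is used as the correlation measure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# A small toy network; layer2's gradient is the quantity being probed.
layer1 = nn.Linear(10, 10)
layer2 = nn.Linear(10, 2)
x = torch.randn(64, 10)
y = torch.randint(0, 2, (64,))

def grad_of_layer2():
    loss = F.cross_entropy(layer2(torch.relu(layer1(x))), y)
    g, = torch.autograd.grad(loss, layer2.weight)
    return g.flatten()

g_before = grad_of_layer2()

# Update only the *preceding* layer, mimicking the covariate-shift probe.
loss = F.cross_entropy(layer2(torch.relu(layer1(x))), y)
g1, = torch.autograd.grad(loss, layer1.weight)
with torch.no_grad():
    layer1.weight -= 0.1 * g1

g_after = grad_of_layer2()

# Correlation close to 1 means the update introduced little shift.
print(F.cosine_similarity(g_before, g_after, dim=0).item())
```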


Vanishing/exploding gradients

Even though batchnorm was originally introduced to alleviate gradient vanishing or explosion problems, a deep batchnorm network in fact ''suffers from gradient explosion'' at initialization time, no matter what it uses for nonlinearity. Thus the optimization landscape is very far from smooth for a randomly initialized, deep batchnorm network. More precisely, if the network has L layers, then the gradient of the first layer weights has norm > c\lambda^L for some \lambda>1, c>0 depending only on the nonlinearity. For any fixed nonlinearity, \lambda decreases as the batch size increases. For example, for ReLU, \lambda decreases to \pi/(\pi-1)\approx 1.467 as the batch size tends to infinity. Practically, this means deep batchnorm networks are untrainable. This is only relieved by skip connections in the fashion of residual networks.

On the surface, this gradient explosion contradicts the ''smoothness'' property explained in the previous section, but in fact the two are consistent. The previous section studies the effect of inserting a single batchnorm layer into a network, while the gradient explosion depends on stacking batchnorm layers in the fashion typical of modern deep neural networks.
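The predicted growth can be checked empirically. The sketch below (a toy experiment with assumed width, batch size, and depths, not the referenced paper's setup) measures the first-layer gradient norm at random initialization as the depth increases; without skip connections the norm grows roughly geometrically:

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, width=100, batch=32):
    """Gradient norm of the first layer of a randomly initialized
    Linear -> BatchNorm -> ReLU stack (no skip connections)."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.BatchNorm1d(width), nn.ReLU()]
    net = nn.Sequential(*layers)
    x = torch.randn(batch, width)
    net(x).sum().backward()  # a dummy loss; only gradient magnitude matters
    return net[0].weight.grad.norm().item()

torch.manual_seed(0)
for depth in [2, 10, 20, 40]:
    # Expect roughly geometric growth (> c * lambda^L) with depth L.
    print(depth, first_layer_grad_norm(depth))
```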


Decoupling

Another possible reason for the success of batch normalization is that it decouples the length and direction of the weight vectors and thus facilitates better training. By interpreting batch norm as a reparametrization of weight space, it can be shown that the length and the direction of the weights are separated and can thus be trained separately.

For a particular neural network unit with input x and weight vector w, denote its output as f(w) = E_x[\phi(x^{\top}w)], where \phi is the activation function, and denote S = E_x[xx^{\top}]. Assume that E[x] = 0, and that the spectrum of the matrix S is bounded as 0<\mu = \lambda_{\min}(S), L=\lambda_{\max}(S)<\infty, such that S is symmetric positive definite. Adding batch normalization to this unit thus results in

f_{BN}(w,\gamma,\beta) = E_x[\phi(BN(x^{\top}w))]= E_x\left[\phi\left(\gamma\frac{x^{\top}w - E_x[x^{\top}w]}{\sqrt{\operatorname{var}_x[x^{\top}w]}}+\beta\right)\right],

by definition. The variance term can be simplified such that \operatorname{var}_x[x^{\top}w] = w^{\top}Sw. Assume that x has zero mean and that \beta can be omitted; then it follows that

f_{BN}(w,\gamma) = E_x\left[\phi\left(\gamma\frac{x^{\top}w}{(w^{\top}Sw)^{1/2}}\right)\right],

where (w^{\top}Sw)^{1/2} is the induced norm of S, \|w\|_S. Hence, it could be concluded that f_{BN}(w,\gamma) = E_x[\phi(x^{\top}\tilde{w})], where \tilde{w}=\gamma \frac{w}{\|w\|_S}, and \gamma and w account for its length and direction separately. This property could then be used to prove the faster convergence of problems with batch normalization.
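This decoupling is easy to verify numerically. In the sketch below (an illustration using an empirical estimate of S), rescaling w leaves \tilde{w} unchanged, while \gamma alone rescales its length:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))         # zero-mean inputs
S = X.T @ X / len(X)                    # empirical second-moment matrix S

def w_tilde(w, gamma):
    # Batch norm reparametrization: direction from w, length from gamma.
    return gamma * w / np.sqrt(w @ S @ w)   # ||w||_S = sqrt(w^T S w)

w = rng.normal(size=5)
# True: the scale of w is irrelevant, only its direction matters.
print(np.allclose(w_tilde(w, 1.5), w_tilde(7.3 * w, 1.5)))
# 2.0: doubling gamma exactly doubles the length of w_tilde.
print(np.linalg.norm(w_tilde(w, 3.0)) / np.linalg.norm(w_tilde(w, 1.5)))
```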


Linear convergence


Least-square problem

With the reparametrization interpretation, it could then be proved that applying batch normalization to the ordinary least squares problem achieves a linear convergence rate in gradient descent, which is faster than regular gradient descent with its only sub-linear convergence.

Denote the objective of minimizing an ordinary least squares problem as

\min_{\tilde{w}} f_{OLS}(\tilde{w})=\min_{\tilde{w}} E_{x,y}\left[(y-x^{\top}\tilde{w})^2\right]=\min_{\tilde{w}}\left(2u^{\top}\tilde{w}+\tilde{w}^{\top}S\tilde{w}\right),

where u=-E[yx], S = E[xx^{\top}], and the constant term E[y^2] is dropped. Since \tilde{w}=\gamma\frac{w}{\|w\|_S}, the objective thus becomes

\min_{w,\gamma} f_{OLS}(w,\gamma)=\min_{w,\gamma}\left(2\gamma\frac{u^{\top}w}{\|w\|_S}+\gamma^2\right),

where w = 0 is excluded to avoid a zero denominator. Since the objective is convex with respect to \gamma, its optimal value could be calculated by setting the partial derivative of the objective against \gamma to 0. The objective could be further simplified to be

\min_{w}\rho(w)=\min_{w}\left(-\frac{(u^{\top}w)^2}{w^{\top}Sw}\right).

Note that this objective is a form of the generalized Rayleigh quotient

\tilde{\rho}(w)=\frac{w^{\top}Bw}{w^{\top}Aw},

where B\in \mathbb{R}^{d\times d} is a symmetric matrix and A\in \mathbb{R}^{d\times d} is a symmetric positive definite matrix. It is proven that gradient descent on the generalized Rayleigh quotient converges at a linear rate governed by the gaps between the largest eigenvalue \lambda_1 of B, the second largest eigenvalue \lambda_2 of B, and the smallest eigenvalue \lambda_{\min} of B. In our case, B=uu^{\top} is a rank-one matrix, and the convergence result can be simplified accordingly. Specifically, consider gradient descent steps of the form w_{t+1}=w_t-\eta_t\nabla\rho(w_t) with a suitably chosen adaptive step size \eta_t, starting from \rho(w_0)\ne 0; then

\rho(w_t)-\rho(w^*)\le \left(1-\frac{\mu}{L}\right)^{2t}\left(\rho(w_0)-\rho(w^*)\right).
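The following NumPy sketch illustrates the claimed behavior on a synthetic instance (a simplification: a small fixed step size stands in for the adaptive \eta_t of the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
S = A @ A.T + 0.1 * np.eye(d)          # symmetric positive definite S = E[x x^T]
u = rng.normal(size=d)                 # u = -E[yx]

def rho(w):
    # Objective after optimizing out gamma: rho(w) = -(u^T w)^2 / (w^T S w)
    return -(u @ w) ** 2 / (w @ S @ w)

def grad_rho(w):
    uw, wSw = u @ w, w @ S @ w
    return -2 * uw / wSw * u + 2 * uw ** 2 / wSw ** 2 * (S @ w)

rho_star = -u @ np.linalg.solve(S, u)  # optimum, attained at w* parallel to S^{-1} u
w = rng.normal(size=d)
for t in range(201):
    w = w - 0.05 * grad_rho(w)
    w = w / np.linalg.norm(w)          # harmless: rho is scale-invariant in w
    if t % 50 == 0:
        print(t, rho(w) - rho_star)    # optimality gap shrinks geometrically
```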


Learning halfspace problem

The problem of learning halfspaces refers to the training of the perceptron, which is the simplest form of neural network. The optimization problem in this case is

\min_{\tilde{w}} f_{LH}(\tilde{w})=E_{x,y}\left[\phi(z^{\top}\tilde{w})\right],

where z=-yx and \phi is an arbitrary loss function. Suppose that \phi is infinitely differentiable and has a bounded derivative. Assume that the objective function f_{LH} is \zeta-smooth, and that a solution \alpha^* = \arg\min_{\alpha}\|\nabla f(\alpha w)\|^2 exists and is bounded such that -\infty < \alpha^* < \infty. Also assume z is a multivariate normal random variable. With the Gaussian assumption, it can be shown that all critical points lie on the same line, for any choice of loss function \phi. Specifically, the gradient of f_{LH} could be represented as

\nabla_{\tilde{w}}f_{LH}(\tilde{w})=c_1(\tilde{w})u+c_2(\tilde{w})S\tilde{w},

where c_1(\tilde{w})=E_z\left[\phi^{(1)}(z^{\top}\tilde{w})\right]+E_z\left[\phi^{(2)}(z^{\top}\tilde{w})\right](u^{\top}\tilde{w}), c_2(\tilde{w})=E_z\left[\phi^{(2)}(z^{\top}\tilde{w})\right], and \phi^{(i)} is the i-th derivative of \phi. By setting the gradient to 0, it thus follows that the bounded critical points \tilde{w}_* can be expressed as \tilde{w}_* = g_*S^{-1}u, where g_* is a scalar depending on \tilde{w}_* and \phi.

Combining this global property with length-direction decoupling, it could thus be proved that this optimization problem converges linearly. First, a variation of gradient descent with batch normalization, Gradient Descent in Normalized Parameterization (GDNP), is designed for the objective function \min_{w,\gamma}f_{LH}(w,\gamma), such that the direction and length of the weights are updated separately. Denote the stopping criterion of GDNP as

h(w_t,\gamma_t)=E_z\left[\phi'(z^{\top}\tilde{w}_t)\right](u^{\top}w_t)-E_z\left[\phi''(z^{\top}\tilde{w}_t)\right](u^{\top}w_t)^2,

and let s_t=s(w_t,\gamma_t) be a suitably chosen adaptive step size. For each step, if h(w_t,\gamma_t)\ne 0, then update the direction as

w_{t+1}=w_t-s_t\nabla_w f(w_t,\gamma_t).

Then update the length according to \gamma_t = \text{Bisection}(T_s,f,w_t), where \text{Bisection} is the classical bisection algorithm and T_s is the total number of iterations run in the bisection step. Denote the total number of direction iterations as T_d; then the final output of GDNP is

\tilde{w}_{T_d} = \gamma_{T_d}\frac{w_{T_d}}{\|w_{T_d}\|_S}.

The GDNP algorithm thus slightly modifies the batch normalization step for the ease of mathematical analysis. It can be shown that in GDNP the partial derivative of f_{LH} against the length component converges to zero at a linear rate in the number of bisection steps T_s, where the bisection starts from two points a_t^{(0)} and b_t^{(0)} bracketing the optimum on the left and on the right, respectively. Further, for each iteration, the norm of the gradient of f_{LH} with respect to w converges linearly, such that

\|w_t\|_S^2\,\left\|\nabla_w f_{LH}(w_t,\gamma_t)\right\|_{S^{-1}}^2\le \left(1-\frac{\mu}{L}\right)^{2t}\Phi^2\gamma_t^2\left(\rho(w_0)-\rho^*\right).

Combining these two inequalities, a bound could thus be obtained for the gradient with respect to \tilde{w}_{T_d}:

\left\|\nabla_{\tilde{w}}f(\tilde{w}_{T_d})\right\|^2\le \left(1-\frac{\mu}{L}\right)^{2T_d}\Phi^2\left(\rho(w_0)-\rho^*\right)+\frac{C}{2^{T_s}},

where C is a constant depending on the bisection initialization, such that the algorithm is guaranteed to converge linearly. Although the proof stands on the assumption of Gaussian input, it is also shown in experiments that GDNP could accelerate optimization without this constraint.
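A simplified sketch of GDNP on a synthetic halfspace problem is given below (illustrative assumptions, none from the original analysis: a softplus loss for \phi, an empirical estimate of S, a fixed direction step size, and a fixed bisection bracket assumed to contain the sign change):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 3
X = rng.normal(size=(n, d))                      # Gaussian inputs, as assumed
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=n))
Z = -y[:, None] * X                              # z = -y x
S = Z.T @ Z / n                                  # empirical second-moment matrix

phi  = lambda t: np.logaddexp(0.0, t)            # softplus standing in for phi
dphi = lambda t: 0.5 * (1.0 + np.tanh(t / 2.0))  # its (bounded) derivative

def f(w, gamma):
    return phi(Z @ (gamma * w / np.sqrt(w @ S @ w))).mean()

def grad_w(w, gamma):
    wS = np.sqrt(w @ S @ w)
    g = Z.T @ dphi(Z @ (gamma * w / wS)) / n     # gradient w.r.t. w_tilde
    # Chain rule through w_tilde = gamma * w / ||w||_S:
    return gamma * (g - (S @ w) * (w @ g) / wS**2) / wS

def bisection_gamma(w, lo=-50.0, hi=50.0, iters=40):
    """Length step: find gamma with df/dgamma = 0 by classical bisection."""
    s = Z @ w / np.sqrt(w @ S @ w)
    dfdg = lambda g: np.mean(dphi(g * s) * s)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dfdg(lo) * dfdg(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

w, gamma = rng.normal(size=d), 1.0
for t in range(100):
    w = w - 0.1 * grad_w(w, gamma)               # direction update
    gamma = bisection_gamma(w)                   # length update
print(f(w, gamma))                               # loss after alternating updates
```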


Neural networks

Consider a multilayer perceptron (MLP) with one hidden layer and m hidden units, with mapping from input x\in \mathbb{R}^d to a scalar output described as

F_x(\tilde{W},\Theta)=\sum_{i=1}^m\theta_i\phi(x^{\top}\tilde{w}^{(i)}),

where \tilde{w}^{(i)} and \theta_i are the input and output weights of unit i respectively, and \phi is the activation function, assumed to be a tanh function. The input and output weights could then be optimized with

\min_{\tilde{W},\Theta} f_{NN}(\tilde{W},\Theta)=E_{x,y}\left[l(-yF_x(\tilde{W},\Theta))\right],

where l is a loss function, \tilde{W}=\{\tilde{w}^{(1)},\ldots,\tilde{w}^{(m)}\}, and \Theta=\{\theta_1,\ldots,\theta_m\}. Consider \Theta fixed and only \tilde{W} optimized; it can be shown that the critical points of f_{NN}(\tilde{W}) for a particular hidden unit i, \hat{w}^{(i)}, all align along one line depending on the incoming information into the hidden layer, such that

\hat{w}^{(i)} = \hat{c}^{(i)}S^{-1}u,

where \hat{c}^{(i)}\in \mathbb{R} is a scalar, i =1,\ldots,m. This result could be proved by setting the gradient of f_{NN} to zero and solving the system of equations.

Apply the GDNP algorithm to this optimization problem by alternating optimization over the different hidden units. Specifically, for each hidden unit, run GDNP to find the optimal w and \gamma. With the same choice of stopping criterion and step size, it follows that

\left\|\nabla_{\tilde{w}^{(i)}}f(\tilde{w}_t^{(i)})\right\|^2_{S^{-1}}\le\left(1-\frac{\mu}{L}\right)^{2t}C\left(\rho(w_0)-\rho^*\right)+\frac{C'}{2^{T_s}},

where C and C' are constants, the latter from the bisection step. Since the parameters of each hidden unit converge linearly, the whole optimization problem has a linear rate of convergence.


Further reading

* Ioffe, Sergey; Szegedy, Christian (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". ''ICML'15: Proceedings of the 32nd International Conference on Machine Learning'', Vol. 37, July 2015, pp. 448–456.