The LogSumExp (LSE) (also called RealSoftMax or multivariable

softplus In the context of artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the positive part of its argument: : f(x) = x^+ = \max(0, x), where ''x'' is the input to a neu ...

)

function Function or functionality may refer to: Computing * Function key, a type of key on computer keyboards * Function model, a structured representation of processes in a system * Function object or functor or functionoid, a concept of object-oriente ...

is a

smooth maximum In mathematics, a smooth maximum of an indexed family ''x''1, ..., ''x'n'' of numbers is a smooth approximation to the maximum function \max(x_1,\ldots,x_n), meaning a parametric family of functions m_\alpha(x_1,\ldots,x_n) such that ...

– a

smooth Smooth may refer to: Mathematics * Smooth function, a function that is infinitely differentiable; used in calculus and topology * Smooth manifold, a differentiable manifold for which all the transition maps are smooth functions * Smooth algebraic ...

approximation An approximation is anything that is intentionally similar but not exactly equality (mathematics), equal to something else. Etymology and usage The word ''approximation'' is derived from Latin ''approximatus'', from ''proximus'' meaning ''very ...

to the

maximum In mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest value of the function, either within a given ran ...

function, mainly used by

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

algorithms. It is defined as the

logarithm In mathematics, the logarithm is the inverse function to exponentiation. That means the logarithm of a number to the base is the exponent to which must be raised, to produce . For example, since , the ''logarithm base'' 10 o ...

of the sum of the

exponentials Exponential may refer to any of several mathematical topics related to exponentiation, including: *Exponential function, also: **Matrix exponential, the matrix analogue to the above * Exponential decay, decrease at a rate proportional to value *Exp ...

of the arguments: :

\mathrm(x_1, \dots, x_n) = \log\left( \exp(x_1) + \cdots + \exp(x_n) \right).

Properties

The LogSumExp function

domain Domain may refer to: Mathematics *Domain of a function, the set of input values for which the (total) function is defined **Domain of definition of a partial function **Natural domain of a partial function **Domain of holomorphy of a function * Do ...

\R^n

, the

real coordinate space In mathematics, the real coordinate space of dimension , denoted ( ) or is the set of the -tuples of real numbers, that is the set of all sequences of real numbers. With component-wise addition and scalar multiplication, it is a real vector ...

, and its codomain is

\R

, the

real line In elementary mathematics, a number line is a picture of a graduated straight line (geometry), line that serves as visual representation of the real numbers. Every point of a number line is assumed to correspond to a real number, and every real ...

. It is an approximation to the maximum

\max_i x_i

with the following bounds :

\max \leq \mathrm(x_1, \dots, x_n) \leq \max + \log(n).

The first

inequality Inequality may refer to: Economics * Attention inequality, unequal distribution of attention across users, groups of people, issues in etc. in attention economy * Economic inequality, difference in economic well-being between population groups * ...

is strict unless

n = 1

. The second inequality is strict unless all arguments are equal. (Proof: Let

m = \max_i x_i

. Then

\exp(m) \leq \sum_^n \exp(x_i) \leq n \exp(m)

. Applying the logarithm to the inequality gives the result.) In addition, we can scale the function to make the bounds tighter. Consider the function

\frac 1 t \mathrm(tx)

. Then :

\max < \frac 1 t \mathrm(tx) \leq \max + \frac.

(Proof: Replace each

x_i

with

tx_i

for some

t>0

in the inequalities above, to give :

\max < \mathrm(tx_1, \dots, tx_n)\leq \max + \log(n).

and, since

t>0

t \max < \mathrm(tx_1, \dots, tx_n)\leq t\max + \log(n).

finally, dividing by

t

gives the result.) Also, if we multiply by a negative number instead, we of course find a comparison to the

\min

function: :

\min - \frac \leq \frac 1  \mathrm(-tx) < \min.

The LogSumExp function is

convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytope ...

, and is strictly increasing everywhere in its domain (but not strictly convex everywhere). Writing

\mathbf = (x_1, \dots, x_n),

the

partial derivative In mathematics, a partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary). Part ...

s are: :

\frac = 
\frac,

which means the

gradient In vector calculus, the gradient of a scalar-valued differentiable function of several variables is the vector field (or vector-valued function) \nabla f whose value at a point p is the "direction and rate of fastest increase". If the gradi ...

of LogSumExp is the

softmax function The softmax function, also known as softargmax or normalized exponential function, converts a vector of real numbers into a probability distribution of possible outcomes. It is a generalization of the logistic function to multiple dimensions, a ...

. The

convex conjugate In mathematics and mathematical optimization, the convex conjugate of a function is a generalization of the Legendre transformation which applies to non-convex functions. It is also known as Legendre–Fenchel transformation, Fenchel transformation ...

of LogSumExp is the

negative entropy In information theory and statistics, negentropy is used as a measure of distance to normality. The concept and phrase "negative entropy" was introduced by Erwin Schrödinger in his 1944 popular-science book '' What is Life?'' Later, Léon Bril ...

log-sum-exp trick for log-domain calculations

The LSE function is often encountered when the usual arithmetic computations are performed on a

logarithmic scale A logarithmic scale (or log scale) is a way of displaying numerical data over a very wide range of values in a compact way—typically the largest numbers in the data are hundreds or even thousands of times larger than the smallest numbers. Such a ...

, as in

log probability In probability theory and computer science, a log probability is simply a logarithm of a probability. The use of log probabilities means representing probabilities on a logarithmic scale, instead of the standard

, 1 The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline (t ...

unit interval. Since the p ...

. Similar to multiplication operations in linear-scale becoming simple additions in log-scale, an addition operation in linear-scale becomes the LSE in log-scale: :

\mathrm(\log(x_1), ..., \log(x_n)) = \log(x_1 + \dots + x_n)

A common purpose of using log-domain computations is to increase accuracy and avoid underflow and overflow problems when very small or very large numbers are represented directly (i.e. in a linear domain) using limited-precision floating point numbers. Unfortunately, the use of LSE directly in this case can again cause overflow/underflow problems. Therefore, the following equivalent must be used instead (especially when the accuracy of the above 'max' approximation is not sufficient). Therefore, many math libraries such as IT++ provide a default routine of LSE and use this formula internally. :

\mathrm(x_1, \dots, x_n) = x^* + \log\left( \exp(x_1-x^*)+ \cdots + \exp(x_n-x^*) \right)

where

x^* = \max

A strictly convex log-sum-exp type function

LSE is convex but not strictly convex. We can define a strictly convex log-sum-exp type function by adding an extra argument set to zero: :

\mathrm_0^+(x_1,...,x_n) = \mathrm(0,x_1,...,x_n)

This function is a proper Bregman generator (strictly convex and

differentiable In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non-vertical tangent line at each interior point in its ...

). It is encountered in machine learning, for example, as the cumulant of the multinomial/binomial family. In

tropical analysis In the mathematical discipline of idempotent analysis, tropical analysis is the study of the tropical semiring. Applications The max tropical semiring can be used appropriately to determine marking times within a given Petri net and a vector fil ...

, this is the sum in the

log semiring In mathematics, in the field of tropical analysis, the log semiring is the semiring structure on the logarithmic scale, obtained by considering the extended real numbers as logarithms. That is, the operations of addition and multiplication are defi ...

References

{{refend Logarithms

Properties

log-sum-exp trick for log-domain calculations

A strictly convex log-sum-exp type function

See also

References