Softmax
   HOME
*





Softmax
The softmax function, also known as softargmax or normalized exponential function, converts a vector of real numbers into a probability distribution of possible outcomes. It is a generalization of the logistic function to multiple dimensions, and used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes, based on Luce's choice axiom. Definition The softmax function takes as input a vector of real numbers, and normalizes it into a probability distribution consisting of probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval (0, 1), and the components will add up to 1, so that they can be interpreted as proba ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Multinomial Logistic Regression
In statistics, multinomial logistic regression is a statistical classification, classification method that generalizes logistic regression to multiclass classification, multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorical distribution, categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.). Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, Softmax activation function, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model. Background Multinomial logistic regression is used when the dependent variable in question is Level of measurement#Nominal measurement, nominal (equivalently ''categorical'', meaning that it fall ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Smooth Maximum
In mathematics, a smooth maximum of an indexed family ''x''1, ..., ''x''''n'' of numbers is a smooth approximation to the maximum function \max(x_1,\ldots,x_n), meaning a parametric family of functions m_\alpha(x_1,\ldots,x_n) such that for every , the function is smooth, and the family converges to the maximum function as . The concept of smooth minimum is similarly defined. In many cases, a single family approximates both: maximum as the parameter goes to positive infinity, minimum as the parameter goes to negative infinity; in symbols, as and as . The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family. Examples For large positive values of the parameter \alpha > 0, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum. : ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Artificial Neural Network
Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron receives signals then processes them and can signal neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called ''edges''. Neurons and edges typically have a ''weight'' that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


LogSumExp
The LogSumExp (LSE) (also called RealSoftMax or multivariable softplus) function is a smooth maximum – a smooth approximation to the maximum function, mainly used by machine learning algorithms. It is defined as the logarithm of the sum of the exponentials of the arguments: :\mathrm(x_1, \dots, x_n) = \log\left( \exp(x_1) + \cdots + \exp(x_n) \right). Properties The LogSumExp function domain is \R^n, the real coordinate space, and its codomain is \R, the real line. It is an approximation to the maximum \max_i x_i with the following bounds :\max \leq \mathrm(x_1, \dots, x_n) \leq \max + \log(n). The first inequality is strict unless n = 1. The second inequality is strict unless all arguments are equal. (Proof: Let m = \max_i x_i. Then \exp(m) \leq \sum_^n \exp(x_i) \leq n \exp(m). Applying the logarithm to the inequality gives the result.) In addition, we can scale the function to make the bounds tighter. Consider the function \frac 1 t \mathrm(tx). Then : \max 0 in the ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Gibbs Distribution
In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution Translated by J.B. Sykes and M.J. Kearsley. See section 28) is a probability distribution or probability measure that gives the probability that a system will be in a certain microstate (statistical mechanics), state as a function of that state's energy and the temperature of the system. The distribution is expressed in the form: :p_i \propto e^ where is the probability of the system being in state , is the energy of that state, and a constant of the distribution is the product of the Boltzmann constant and thermodynamic temperature . The symbol \propto denotes proportionality (mathematics), proportionality (see for the proportionality constant). The term ''system'' here has a very wide meaning; it can range from a collection of 'sufficient number' of atoms or a single atom to a macroscopic system such as a Natural gas storage, natural gas storage tank. Therefore the Bolt ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Luce's Choice Axiom
In probability theory, Luce's choice axiom, formulated by R. Duncan Luce (1959), states that the probability of selecting one item over another from a pool of many items is not affected by the presence or absence of other items in the pool. Selection of this kind is said to have "independence of irrelevant alternatives, independence from irrelevant alternatives" (IIA). Overview Consider a set X of possible outcomes, and consider a selection rule P, such that for any a\in A \subset X with A a finite set, the selector selects a from A with probability P(a , A). Luce proposed two choice axioms. The second one is usually meant by "Luce's choice axiom", as the first one is usually called "independence of irrelevant alternatives, independence from irrelevant alternatives" (IIA). Luce's choice axiom 1 (IIA): if P(a, A) = 0, P(b, A) > 0, then for any a, b \in B \subset \textA, we still have P(a, B) = 0. Luce's choice axiom 2 ("path independence"): P(a , A) = P(a, B)\sum_P(b, A)for any a ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Categorical Distribution
In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can take on one of ''K'' possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution, (e.g. 1 to ''K''). The ''K''-dimensional categorical distribution is the most general distribution over a ''K''-way event; any other discrete distribution over a size-''K'' sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1. The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a dis ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Activation Function
In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard integrated circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the linear perceptron in neural networks. However, only ''nonlinear'' activation functions allow such networks to compute nontrivial problems using only a small number of nodes, and such activation functions are called nonlinearities. Classification of activation functions The most common activation functions can be divided in three categories: ridge functions, radial functions and fold functions. An activation function f is saturating if \lim_ , \nabla f(v), = 0. It is nonsaturating if it is not saturating. Non-saturating activation functions, such as ReLU, may be better than saturating activation functions, as they don't suffer from vanishing gradient. Ridge activation functions ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Logistic Function
A logistic function or logistic curve is a common S-shaped curve (sigmoid curve) with equation f(x) = \frac, where For values of x in the domain of real numbers from -\infty to +\infty, the S-curve shown on the right is obtained, with the graph of f approaching L as x approaches +\infty and approaching zero as x approaches -\infty. The logistic function finds applications in a range of fields, including biology (especially ecology), biomathematics, chemistry, demography, economics, geoscience, mathematical psychology, probability, sociology, political science, linguistics, statistics, and artificial neural networks. A generalization of the logistic function is the hyperbolastic function of type I. The standard logistic function, where L=1,k=1,x_0=0, is sometimes simply called ''the sigmoid''. It is also sometimes called the ''expit'', being the inverse of the logit. History The logistic function was introduced in a series of three papers by Pierre François Verhulst ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Jump Discontinuity
Continuous functions are of utmost importance in mathematics, functions and applications. However, not all functions are continuous. If a function is not continuous at a point in its domain, one says that it has a discontinuity there. The set of all points of discontinuity of a function may be a discrete set, a dense set, or even the entire domain of the function. This article describes the classification of discontinuities in the simplest case of functions of a single real variable taking real values. The oscillation of a function at a point quantifies these discontinuities as follows: * in a removable discontinuity, the distance that the value of the function is off by is the oscillation; * in a jump discontinuity, the size of the jump is the oscillation (assuming that the value ''at'' the point lies between these limits of the two sides); * in an essential discontinuity, oscillation measures the failure of a limit to exist; the limit is constant. A special case is if the fun ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Probability Theory
Probability theory is the branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms. Typically these axioms formalise probability in terms of a probability space, which assigns a measure taking values between 0 and 1, termed the probability measure, to a set of outcomes called the sample space. Any specified subset of the sample space is called an event. Central subjects in probability theory include discrete and continuous random variables, probability distributions, and stochastic processes (which provide mathematical abstractions of non-deterministic or uncertain processes or measured quantities that may either be single occurrences or evolve over time in a random fashion). Although it is not possible to perfectly predict random events, much can be said about their behavior. Two major results in probability ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  




Min-plus Semiring
In idempotent analysis, the tropical semiring is a semiring of extended real numbers with the operations of minimum (or maximum) and addition replacing the usual ("classical") operations of addition and multiplication, respectively. The tropical semiring has various applications (see tropical analysis), and forms the basis of tropical geometry. The name ''tropical'' is a reference to the Hungarian-born computer scientist Imre Simon, so named because he lived and worked in Brazil. Definition The ' (or or ) is the semiring (ℝ ∪ , ⊕, ⊗), with the operations: : x \oplus y = \min\, : x \otimes y = x + y. The operations ⊕ and ⊗ are referred to as ''tropical addition'' and ''tropical multiplication'' respectively. The unit for ⊕ is +∞, and the unit for ⊗ is 0. Similarly, the ' (or or or ) is the semiring (ℝ ∪ , ⊕, ⊗), with operations: : x \oplus y = \max\, : x \otimes y = x + y. The unit for ⊕ is −∞, and the unit for ⊗ is 0. The two semirings are ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]