Stochastic Variance Reduction





Stochastic Variance Reduction
(Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite-sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical stochastic approximation setting. Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines, as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.

Finite sum objectives
A function f is considered to have finite sum structure if it can be decomposed into a summation or average: f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x), where the function value and derivative of each f_i can be queried independently. Although variance reduction methods can be applied for any positive n and any f_i structure, their favorable theor ...
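As an illustration of how the finite-sum structure can be exploited, the following is a minimal sketch of one well-known variance reduction method, SVRG (stochastic variance reduced gradient). The excerpt above does not name a specific method, so the choice of SVRG, the one-dimensional least-squares terms f_i(x) = 0.5 (a_i x - b_i)^2, the step size, and the loop lengths are all assumptions made for this example.

```python
import random


def svrg_1d(a, b, step=0.01, epochs=40, inner_steps=50, seed=0):
    """Minimise f(x) = (1/n) * sum_i 0.5 * (a[i] * x - b[i]) ** 2 with SVRG."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        # Derivative of the i-th term f_i(x) = 0.5 * (a[i] * x - b[i]) ** 2.
        return a[i] * (a[i] * x - b[i])

    x = 0.0
    for _ in range(epochs):
        snapshot = x
        # Full gradient at the snapshot: one pass over all n terms.
        full_grad = sum(grad_i(i, snapshot) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Unbiased gradient estimate whose variance shrinks as x and
            # the snapshot both approach the minimiser.
            g = grad_i(i, x) - grad_i(i, snapshot) + full_grad
            x -= step * g
    return x


# Hypothetical data, chosen only for illustration.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 3.0, 7.0, 9.0]
x_hat = svrg_1d(a, b)
# Closed-form minimiser of the least-squares objective, for comparison.
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

The correction term `grad_i(i, snapshot) - full_grad` has zero mean over a uniformly sampled i, so the estimate stays unbiased. It can only be computed because the sum is finite and the full gradient at the snapshot is available.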


Stochastic Approximation
Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations. In a nutshell, stochastic approximation algorithms deal with a function of the form f(\theta) = \operatorname{E}_{\xi}[F(\theta,\xi)], which is the expected value of a function depending on a random variable \xi. The goal is to recover properties of such a function f without evaluating it directly. Instead, stochastic approximation algorithms use random samples of F(\theta,\xi) to efficiently approximate properties of f such as zeros or extrema. Recently, stochastic approximations have found extensive applications in the fields of statistics and machine lea ...
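The idea of locating a zero of f from noisy samples of F(\theta,\xi) can be sketched with a Robbins–Monro style recursion. This is a minimal example under stated assumptions: the function F(\theta,\xi) = \theta - \xi with \xi drawn from a normal distribution with mean 3, the step sizes 1/n, and all names are chosen for illustration only.

```python
import random


def robbins_monro(noisy_f, theta0, n_steps):
    """Iterate theta <- theta - a_n * F(theta, xi_n) with step sizes a_n = 1/n."""
    theta = theta0
    for n in range(1, n_steps + 1):
        theta -= (1.0 / n) * noisy_f(theta)
    return theta


rng = random.Random(0)


def noisy_f(theta):
    # One sample of F(theta, xi) = theta - xi with xi ~ Normal(3, 1),
    # so that f(theta) = E[F(theta, xi)] = theta - 3 has its zero at 3.
    return theta - rng.gauss(3.0, 1.0)


theta_hat = robbins_monro(noisy_f, theta0=0.0, n_steps=20000)
```

With this particular F and step size, the iterate equals the running mean of the samples of \xi, which is why it settles near 3 without f ever being evaluated directly.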


Proximal Operator
In mathematical optimization, the proximal operator is an operator associated with a proper, lower semi-continuous convex function f from a Hilbert space \mathcal{X} to [-\infty,+\infty], and is defined by: \operatorname{prox}_f(v) = \arg\min_{x \in \mathcal{X}} \left(f(x) + \frac{1}{2} \|x - v\|_{\mathcal{X}}^2\right). (An extended real-valued function f on a Hilbert space is said to be proper if it is not identically equal to +\infty, and -\infty is not in its image.) For any function in this class, the minimizer of the right-hand side above is unique, hence making the proximal operator well-defined. The proximal operator is used in proximal gradient methods, which are frequently used in optimization algorithms associated with non-differentiable optimization problems such as total variation denoising.

Properties
The \operatorname{prox} of a proper, lower semi-continuous convex function f enjoys several useful properties for optimization.
* Fixed points of \operatorname{prox}_f are minimizers of f: \{x \in \mathcal{X} : \operatorname{prox}_f(x) = x\} = \arg\min f.
* ...
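To make the definition concrete, the sketch below evaluates the proximal operator of f(x) = lam * |x| on the real line, whose closed form is the soft-thresholding map, and compares it against a brute-force minimisation of f(x) + 0.5 (x - v)^2 on a grid. The choice of f, the parameter name `lam`, and the grid are assumptions made for this example.

```python
import math


def prox_abs(v, lam):
    """Closed-form prox of f(x) = lam * |x|: soft thresholding."""
    return math.copysign(max(abs(v) - lam, 0.0), v)


def prox_by_grid(f, v, grid):
    """Brute-force argmin of f(x) + 0.5 * (x - v) ** 2 over a finite grid."""
    return min(grid, key=lambda x: f(x) + 0.5 * (x - v) ** 2)


lam = 1.0
grid = [i / 1000 for i in range(-5000, 5001)]  # points in [-5, 5], spacing 0.001


def f(x):
    return lam * abs(x)


for v in (3.0, -2.5, 0.4):
    closed_form = prox_abs(v, lam)
    numeric = prox_by_grid(f, v, grid)
    # The two values agree up to the grid resolution.
    assert abs(closed_form - numeric) < 2e-3
```

The fixed-point property can be seen here as well: `prox_abs(0.0, lam)` returns 0, and 0 is the minimizer of lam * |x|.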



Gradient Methods
In vector calculus, the gradient of a scalar-valued differentiable function of several variables is the vector field (or vector-valued function) \nabla f whose value at a point p is the "direction and rate of fastest increase". If the gradient of a function is non-zero at a point p, the direction of the gradient is the direction in which the function increases most quickly from p, and the magnitude of the gradient is the rate of increase in that direction, the greatest absolute directional derivative. Further, a point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to maximize a function by gradient ascent. In coordinate-free terms, the gradient of a function f(\mathbf{r}) may be defined by: df = \nabla f \cdot d\mathbf{r}, where df is the total infinitesimal change in f for an infinitesimal displacement d\mathbf{r}, and is seen to be maximal when d\mathbf{r} is in the direction of the gradi ...


Stochastic Optimization
Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective functions or random constraints. Stochastic optimization methods also include methods with random iterates. Some stochastic optimization methods use random iterates to solve stochastic problems, combining both meanings of stochastic optimization. Stochastic optimization methods generalize deterministic methods for deterministic problems.

Methods for stochastic functions
Partly random input data arise in such areas as real-time estimation and control, simulation-based optimization where Monte Carlo simulations are run as estimates of an actual system, and problems where there is experimental (random) error in the measurements of the criterion. In such cases, knowledge that the function values are contaminated by random "noise" leads natural ...






Online Machine Learning
In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique in areas of machine learning where it is computationally infeasible to train over the entire dataset, which calls for out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

Introduction
In the setting of supervised learning, a function f : X \to Y is to be learned, where X is thought of as a ...


Stochastic Gradient Descent
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in trade for a lower convergence rate. While the basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s, stochastic gradient descent has become an important optimization method in machine learning.

Background
Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: Q(w) = \frac{1}{n}\sum_{i=1}^n Q_i(w), ...
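The replacement of the full gradient by a single-term estimate can be sketched as follows. This is an illustrative example, assuming Q_i(w) = 0.5 (w0 + w1 x_i - y_i)^2 for a straight-line fit; the data are hypothetical and deliberately noise-free, which is what lets a constant step size reach the exact solution here.

```python
import random


def sgd_line_fit(xs, ys, step=0.05, n_steps=5000, seed=0):
    """Fit y = w0 + w1 * x by SGD on Q_i(w) = 0.5 * (w0 + w1 * x_i - y_i) ** 2."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        i = rng.randrange(n)  # pick one term of the sum at random
        residual = w0 + w1 * xs[i] - ys[i]
        # Gradient of Q_i with respect to (w0, w1) is residual * (1, x_i).
        w0 -= step * residual
        w1 -= step * residual * xs[i]
    return w0, w1


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
w0, w1 = sgd_line_fit(xs, ys)
```

Each step touches a single data point, so its cost does not grow with n. With noisy data a constant step size would generally leave the iterate fluctuating around the minimizer rather than converging to it.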



Coordinate Descent
Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a coordinate or coordinate block via a coordinate selection rule, then exactly or inexactly minimizes over the corresponding coordinate hyperplane while fixing all other coordinates or coordinate blocks. A line search along the coordinate direction can be performed at the current iterate to determine the appropriate step size. Coordinate descent is applicable in both differentiable and derivative-free contexts.

Description
Coordinate descent is based on the idea that the minimization of a multivariable function F(\mathbf{x}) can be achieved by minimizing it along one direction at a time, i.e., solving univariate (or at least much simpler) optimization problems in a loop. In the simplest case of cyclic coordinate descent, one cyclically iterates through the directions, one at a time, minimizing the objec ...
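Cyclic coordinate descent with exact minimization along each coordinate can be sketched on a small quadratic. The objective F(x) = 0.5 x^T A x - b^T x, the matrix A, and the vector b are assumptions chosen for this example; A is symmetric positive definite so that each one-dimensional problem has a closed-form minimizer.

```python
def cyclic_coordinate_descent(A, b, n_sweeps=50):
    """Minimise F(x) = 0.5 * x^T A x - b^T x one coordinate at a time."""
    n = len(b)
    x = [0.0] * n
    for _ in range(n_sweeps):
        for i in range(n):
            # With all other coordinates fixed, dF/dx_i = 0 gives
            # A[i][i] * x_i = b[i] - sum_{j != i} A[i][j] * x_j.
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diag) / A[i][i]
    return x


# Hypothetical symmetric positive definite problem.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cyclic_coordinate_descent(A, b)
# The minimiser solves A x = b, which here is x = (1/11, 7/11).
```

For a quadratic objective, this update coincides with the Gauss–Seidel iteration for the linear system A x = b.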


Convex Conjugate
In mathematics and mathematical optimization, the convex conjugate of a function is a generalization of the Legendre transformation which applies to non-convex functions. It is also known as the Legendre–Fenchel transformation, Fenchel transformation, or Fenchel conjugate (after Adrien-Marie Legendre and Werner Fenchel). It allows in particular for a far-reaching generalization of Lagrangian duality.

Definition
Let X be a real topological vector space and let X^{*} be the dual space to X. Denote by \langle \cdot , \cdot \rangle : X^{*} \times X \to \mathbb{R} the canonical dual pairing, which is defined by \left( x^{*}, x \right) \mapsto x^{*}(x). For a function f : X \to \mathbb{R} \cup \{-\infty, +\infty\} taking values on the extended real number line, its convex conjugate is the function f^{*} : X^{*} \to \mathbb{R} \cup \{-\infty, +\infty\} whose value at x^{*} \in X^{*} is defined to be the supremum: f^{*}\left( x^{*} \right) := \sup \left\{ \left\langle x^{*}, x \right\rangle - f(x) : x \in X \right\}, or, equivalently, in terms of the infimum: f^{*}\left( x^{*} \right) := - \inf \left\{ f(x) - \left\langle x^{*}, x \right\rangle : x \in X \right\}. This definition can be ...
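The supremum in the definition can be approximated numerically in one dimension, where the dual pairing reduces to ordinary multiplication. The sketch below assumes f(x) = 0.5 x^2, whose conjugate is known to be f^{*}(y) = 0.5 y^2, and replaces the supremum over the real line by a maximum over a finite grid; the grid and function names are illustrative.

```python
def conjugate_on_grid(f, y, grid):
    """Approximate f*(y) = sup_x (y * x - f(x)) by a maximum over grid points."""
    return max(y * x - f(x) for x in grid)


def f(x):
    return 0.5 * x * x


grid = [i / 1000 for i in range(-10000, 10001)]  # points in [-10, 10]

for y in (2.0, -3.0, 0.0):
    approx = conjugate_on_grid(f, y, grid)
    exact = 0.5 * y * y  # known conjugate of 0.5 * x ** 2
    assert abs(approx - exact) < 1e-6
```

The grid approximation is only trustworthy when the maximizing x lies inside the grid, which holds here for |y| up to 10.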



Logistic Regression
In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, h ...
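The relationship between log-odds and probability described above can be sketched in a few lines. The coefficient values and the single independent variable are hypothetical, chosen only to show the conversion.

```python
import math


def logistic(t):
    """Convert log-odds t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def log_odds(p):
    """Inverse of the logistic function (the logit) for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical model: log-odds = intercept + slope * x.
intercept, slope = -1.0, 0.5
x = 2.0
t = intercept + slope * x   # log-odds are 0 here, i.e. even odds
p = logistic(t)             # probability of the outcome labeled "1"
```

Because `log_odds` inverts `logistic`, a fitted linear combination can be read either as log-odds or, after conversion, as a probability.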




Duality (optimization)
In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. If the primal is a minimization problem then the dual is a maximization problem (and vice versa). The objective value of any feasible solution to the primal (minimization) problem is at least as large as the objective value of any feasible solution to the dual (maximization) problem. Therefore, the optimal value of the primal is an upper bound on the optimal value of the dual, and the optimal value of the dual is a lower bound on the optimal value of the primal. This fact is called weak duality. In general, the optimal values of the primal and dual problems need not be equal. Their difference is called the duality gap. For convex optimization problems, the duality gap is zero under a constraint qualification condition. This fact is called strong duality.

Dual problem
Usually the term "dual problem" refers to the Lagrangian dual problem but other ...



Random Variable
A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the possible upper sides of a flipped coin such as heads H and tails T) in a sample space (e.g., the set \{H, T\}) to a measurable space, often the real numbers (e.g., \{-1, 1\}, in which 1 corresponds to H and -1 corresponds to T). Informally, randomness typically represents some fundamental element of chance, such as in the roll of a die; it may also represent uncertainty, such as measurement error. However, the interpretation of probability is philosophically complicated, and even in specific cases is not always straightforward. The purely mathematical analysis of random variables is independent of such interpretational difficulties, and can be based upon a rigorous axiomatic setup. In the formal mathematical language of measure theory, a random var ...