Stochastic Variance Reduction

	Stochastic Variance Reduction (Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting. Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction. Finite sum objectives A function f is considered to have finite sum structure if it can be decomposed into a summation or average: :f(x) = \frac\sum_^n f_i(x), where the function value and derivative of each f_i can be queried independently. Although variance reduction methods can be applied for any positive n and any f_i structure, their favorable theor ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Stochastic Approximation Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations. In a nutshell, stochastic approximation algorithms deal with a function of the form f(\theta) = \operatorname E_ (\theta,\xi) which is the expected value of a function depending on a random variable \xi . The goal is to recover properties of such a function f without evaluating it directly. Instead, stochastic approximation algorithms use random samples of F(\theta,\xi) to efficiently approximate properties of f such as zeros or extrema. Recently, stochastic approximations have found extensive applications in the fields of statistics and machine lea ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Proximal Operator In mathematical optimization, the proximal operator is an Operator (mathematics), operator associated with a proper,An Extended real number line, (extended) real-valued function ''f'' on a Hilbert space is said to be ''proper'' if it is not identically equal to +\infty, and -\infty is not in its image. Semi-continuity, lower semi-continuous convex function f from a Hilbert space \mathcal to [-\infty,+\infty], and is defined by: ::\operatorname_f(v) = \arg \min_ \left(f(x) + \frac 1 2 \, x - v\, _\mathcal^2\right). For any function in this class, the minimizer of the right-hand side above is unique, hence making the proximal operator well-defined. The proximal operator is used in proximal gradient methods, which is frequently used in optimization algorithms associated with non-Differentiable function, differentiable optimization problems such as total variation denoising. Properties The \text of a proper, lower semi-continuous convex function f enjoys several useful propertie ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Gradient Methods In vector calculus, the gradient of a scalar-valued function, scalar-valued differentiable function f of Multivalued function, several variables is the vector field (or vector-valued function) \nabla f whose value at a point p gives the direction and the rate of fastest increase. The gradient transforms like a vector under change of basis of the space of variables of f. If the gradient of a function is non-zero at a point p, the direction of the gradient is the direction in which the function increases most quickly from p, and the magnitude (mathematics), magnitude of the gradient is the rate of increase in that direction, the greatest absolute value, absolute directional derivative. Further, a point where the gradient is the zero vector is known as a stationary point. The gradient thus plays a fundamental role in optimization theory, where it is used to minimize a function by gradient descent. In coordinate-free terms, the gradient of a function f(\mathbf) may be defined by: df ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Stochastic Optimization Stochastic optimization (SO) are optimization methods that generate and use random variables. For stochastic optimization problems, the objective functions or constraints are random. Stochastic optimization also include methods with random iterates. Some hybrid methods use random iterates to solve stochastic problems, combining both meanings of stochastic optimization. Stochastic optimization methods generalize deterministic methods for deterministic problems. Methods for stochastic functions Partly random input data arise in such areas as real-time estimation and control, simulation-based optimization where Monte Carlo simulations are run as estimates of an actual system, and problems where there is experimental (random) error in the measurements of the criterion. In such cases, knowledge that the function values are contaminated by random "noise" leads naturally to algorithms that use statistical inference tools to estimate the "true" values of the function and/or make sta ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Proximal Operator In mathematical optimization, the proximal operator is an Operator (mathematics), operator associated with a proper,An Extended real number line, (extended) real-valued function ''f'' on a Hilbert space is said to be ''proper'' if it is not identically equal to +\infty, and -\infty is not in its image. Semi-continuity, lower semi-continuous convex function f from a Hilbert space \mathcal to [-\infty,+\infty], and is defined by: ::\operatorname_f(v) = \arg \min_ \left(f(x) + \frac 1 2 \, x - v\, _\mathcal^2\right). For any function in this class, the minimizer of the right-hand side above is unique, hence making the proximal operator well-defined. The proximal operator is used in proximal gradient methods, which is frequently used in optimization algorithms associated with non-Differentiable function, differentiable optimization problems such as total variation denoising. Properties The \text of a proper, lower semi-continuous convex function f enjoys several useful propertie ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Online Machine Learning In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in the financial international markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches. Introduction In the setting of supervised learning, a function of f : X \to Y is to ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Stochastic Gradient Descent Stochastic gradient descent (often abbreviated SGD) is an Iterative method, iterative method for optimizing an objective function with suitable smoothness properties (e.g. Differentiable function, differentiable or Subderivative, subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high Computational complexity, computational burden, achieving faster iterations in exchange for a lower Rate of convergence, convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning. Background Both statistics, statistical M-estimation, estimation and ma ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Coordinate Descent Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. At each iteration, the algorithm determines a coordinate or coordinate block via a coordinate selection rule, then exactly or inexactly minimizes over the corresponding coordinate hyperplane while fixing all other coordinates or coordinate blocks. A line search along the coordinate direction can be performed at the current iterate to determine the appropriate step size. Coordinate descent is applicable in both differentiable and derivative-free contexts. Description Coordinate descent is based on the idea that the minimization of a multivariable function F(\mathbf) can be achieved by minimizing it along one direction at a time, i.e., solving univariate (or at least much simpler) optimization problems in a loop. In the simplest case of ''cyclic coordinate descent'', one cyclically iterates through the directions, one at a time, minimizing the ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Convex Conjugate In mathematics and mathematical optimization, the convex conjugate of a function is a generalization of the Legendre transformation which applies to non-convex functions. It is also known as Legendre–Fenchel transformation, Fenchel transformation, or Fenchel conjugate (after Adrien-Marie Legendre and Werner Fenchel). The convex conjugate is widely used for constructing the dual problem in optimization theory, thus generalizing Lagrangian duality. Definition Let X be a real topological vector space and let X^ be the dual space to X. Denote by :\langle \cdot , \cdot \rangle : X^ \times X \to \mathbb the canonical dual pairing, which is defined by \left\langle x^, x \right\rangle \mapsto x^ (x). For a function f : X \to \mathbb \cup \ taking values on the extended real number line, its is the function :f^ : X^ \to \mathbb \cup \ whose value at x^* \in X^ is defined to be the supremum: :f^ \left( x^ \right) := \sup \left\, or, equivalently, in terms of the in ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Logistic Regression In statistics, a logistic model (or logit model) is a statistical model that models the logit, log-odds of an event as a linear function (calculus), linear combination of one or more independent variables. In regression analysis, logistic regression (or logit regression) estimation theory, estimates the parameters of a logistic model (the coefficients in the linear or non linear combinations). In binary logistic regression there is a single binary variable, binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Duality (optimization) In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. If the primal is a minimization problem then the dual is a maximization problem (and vice versa). Any feasible solution to the primal (minimization) problem is at least as large as any feasible solution to the dual (maximization) problem. Therefore, the solution to the primal is an upper bound to the solution of the dual, and the solution of the dual is a lower bound to the solution of the primal. This fact is called weak duality. In general, the optimal values of the primal and dual problems need not be equal. Their difference is called the duality gap. For convex optimization problems, the duality gap is zero under a constraint qualification condition. This fact is called strong duality. Dual problem Usually the term "dual problem" refers to the ''Lagrangian dual problem'' but o ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Random Variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a Mathematics, mathematical formalization of a quantity or object which depends on randomness, random events. The term 'random variable' in its mathematical definition refers to neither randomness nor variability but instead is a mathematical function (mathematics), function in which * the Domain of a function, domain is the set of possible Outcome (probability), outcomes in a sample space (e.g. the set \ which are the possible upper sides of a flipped coin heads H or tails T as the result from tossing a coin); and * the Range of a function, range is a measurable space (e.g. corresponding to the domain above, the range might be the set \ if say heads H mapped to -1 and T mapped to 1). Typically, the range of a random variable is a subset of the Real number, real numbers. Informally, randomness typically represents some fundamental element of chance, such as in the roll of a dice, d ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]