In computer science, online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning is commonly used in areas of machine learning where it is computationally infeasible to train over the entire dataset, necessitating out-of-core algorithms. It is also used in situations where the algorithm must dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.


Introduction

In the setting of supervised learning, a function f : X \to Y is to be learned, where X is thought of as a space of inputs and Y as a space of outputs, such that f predicts well on instances drawn from a joint probability distribution p(x,y) on X \times Y. In reality, the learner never knows the true distribution p(x,y) over instances. Instead, the learner usually has access to a training set of examples (x_1, y_1), \ldots, (x_n, y_n). In this setting, the loss function is given as V : Y \times Y \to \mathbb{R}, such that V(f(x), y) measures the difference between the predicted value f(x) and the true value y. The ideal goal is to select a function f \in \mathcal{H}, where \mathcal{H} is a space of functions called a hypothesis space, so that some notion of total loss is minimised. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.


Statistical view of online learning

In statistical learning models, the training samples (x_i,y_i) are assumed to have been drawn from the true distribution p(x,y), and the objective is to minimize the expected "risk"
: I[f] = \mathbb{E}[V(f(x), y)] = \int V(f(x), y)\,dp(x, y) \ .
A common paradigm in this situation is to estimate a function \hat{f} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input (x_{t+1},y_{t+1}), the current best predictor f_t, and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used, where f_{t+1} is permitted to depend on f_t and all previous data points (x_1, y_1), \ldots, (x_t, y_t). In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point than with batch learning techniques. A common strategy to overcome these issues is to learn using mini-batches, which process a small batch of b \ge 1 data points at a time; this can be considered pseudo-online learning for b much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.


Example: linear least squares

The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions.


Batch learning

Consider the setting of supervised learning with f being a linear function to be learned:
: f(x_j) = \langle w,x_j\rangle = w \cdot x_j
where x_j \in \mathbb{R}^d is a vector of inputs (data points) and w \in \mathbb{R}^d is a linear filter vector. The goal is to compute the filter vector w. To this end, a square loss function
: V(f(x_j), y_j) = (f(x_j) - y_j)^2 = (\langle w,x_j\rangle - y_j)^2
is used to compute the vector w that minimizes the empirical loss
: I_n[w] = \sum_{j=1}^{n} V(\langle w,x_j\rangle,y_j) = \sum_{j=1}^{n} (x_j^T w - y_j)^2
where y_j \in \mathbb{R}.

Let X be the i \times d data matrix and y \in \mathbb{R}^i the column vector of target values after the arrival of the first i data points. Assuming that the covariance matrix \Sigma_i = X^T X is invertible (otherwise it is preferable to proceed in a similar fashion with Tikhonov regularization), the best solution f^*(x) = \langle w^*, x \rangle to the linear least squares problem is given by
: w^* = (X^T X)^{-1} X^T y = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j \ .
Now, calculating the covariance matrix \Sigma_i = \sum_{j=1}^{i} x_j x_j^T takes time O(id^2), inverting the d \times d matrix takes time O(d^3), and the rest of the multiplication takes time O(d^2), giving a total time of O(id^2 + d^3). When there are n total points in the dataset and the solution must be recomputed after the arrival of every data point i=1, \ldots, n, the naive approach has total complexity O(n^2 d^2 + nd^3). Note that if the matrix \Sigma_i is stored, updating it at each step requires only adding x_{i+1} x_{i+1}^T, which takes O(d^2) time, reducing the total time to O(nd^2 + nd^3) = O(nd^3), at the cost of an additional O(d^2) of storage for \Sigma_i (L. Rosasco, T. Poggio, ''Machine Learning: a Regularization Approach'', MIT-9.520 Lecture Notes, Manuscript, Dec. 2015, Chapter 7, Online Learning).
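
To make the complexity bookkeeping concrete, the following Python/NumPy sketch (illustrative, not taken from the cited notes; the names sigma and xty are assumptions) computes the batch solution w^* and maintains \Sigma_i incrementally with O(d^2) work per arriving point:

    import numpy as np

    # Minimal sketch on synthetic data.
    rng = np.random.default_rng(0)
    n, d = 100, 3
    X = rng.normal(size=(n, d))                  # data matrix, one row per point
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Batch solution w* = (X^T X)^{-1} X^T y, via a linear solve
    # rather than an explicit inverse.
    w_star = np.linalg.solve(X.T @ X, X.T @ y)

    # Incremental bookkeeping: add x_i x_i^T to Sigma in O(d^2) per point;
    # the O(d^3) solve is paid only when a fresh estimate is needed.
    sigma, xty = np.zeros((d, d)), np.zeros(d)
    for x_i, y_i in zip(X, y):
        sigma += np.outer(x_i, x_i)              # Sigma_i = Sigma_{i-1} + x_i x_i^T
        xty += x_i * y_i
    assert np.allclose(np.linalg.solve(sigma, xty), w_star)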


Online learning: recursive least squares

The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising w_0 = 0 \in \mathbb{R}^d and \Gamma_0 = I \in \mathbb{R}^{d \times d}, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:
: \Gamma_i = \Gamma_{i-1} - \frac{\Gamma_{i-1} x_i x_i^T \Gamma_{i-1}}{1 + x_i^T \Gamma_{i-1} x_i}
: w_i = w_{i-1} - \Gamma_i x_i (x_i^T w_{i-1} - y_i)
The above iteration algorithm can be proved by induction on i. The proof also shows that \Gamma_i = \Sigma_i^{-1}. RLS can also be viewed in the context of adaptive filters (see RLS). The complexity for n steps of this algorithm is O(nd^2), which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirement at every step i is for the matrix \Gamma_i, which is constant at O(d^2). For the case when \Sigma_i is not invertible, consider the regularised version of the problem with loss function \sum_{j=1}^{n} (x_j^T w - y_j)^2 + \lambda \| w \|_2^2. Then, it is easy to show that the same algorithm works with \Gamma_0 = (I + \lambda I)^{-1}, and the iterations proceed to give \Gamma_i = (\Sigma_i + \lambda I)^{-1}.
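
A direct NumPy transcription of the RLS iteration above might look as follows; this is a sketch, and the function name rls and the lam argument for the regularised variant are illustrative assumptions:

    import numpy as np

    def rls(X, y, lam=0.0):
        """Recursive least squares: Gamma_0 = (I + lam*I)^{-1}, w_0 = 0."""
        n, d = X.shape
        w = np.zeros(d)
        gamma = np.eye(d) / (1.0 + lam)               # (I + lam*I)^{-1}
        for x_i, y_i in zip(X, y):
            gx = gamma @ x_i                          # Gamma_{i-1} x_i
            gamma -= np.outer(gx, gx) / (1.0 + x_i @ gx)  # rank-one Gamma update
            w -= gamma @ x_i * (x_i @ w - y_i)        # w_i update
        return w

Each step costs O(d^2) for the outer product and matrix-vector products, matching the O(nd^2) total stated above.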


Stochastic gradient descent

When the update
: w_i = w_{i-1} - \Gamma_i x_i (x_i^T w_{i-1} - y_i)
is replaced by
: w_i = w_{i-1} - \gamma_i x_i (x_i^T w_{i-1} - y_i) = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_i \rangle, y_i)
i.e. the matrix \Gamma_i \in \mathbb{R}^{d \times d} is replaced by a scalar \gamma_i \in \mathbb{R}, this becomes the stochastic gradient descent algorithm. In this case, the complexity for n steps of this algorithm reduces to O(nd). The storage requirement at every step i is constant at O(d). However, the stepsize \gamma_i needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size \gamma_i \approx \frac{1}{\sqrt{i}}, one can prove the convergence of the average iterate \overline{w}_n = \frac{1}{n} \sum_{i=1}^{n} w_i. This setting is a special case of stochastic optimization, a well-known problem in optimization.
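
As a sketch of the loop this describes (the function name is illustrative, and the step size \gamma_i = 1/\sqrt{i} follows the choice above):

    import numpy as np

    def sgd_least_squares(X, y):
        n, d = X.shape
        w, avg = np.zeros(d), np.zeros(d)
        for i, (x_i, y_i) in enumerate(zip(X, y), start=1):
            gamma = 1.0 / np.sqrt(i)               # decaying step size
            w -= gamma * x_i * (x_i @ w - y_i)     # gradient of the square loss
            avg += (w - avg) / i                   # running average iterate
        return avg                                 # the average iterate \bar{w}_n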


Incremental stochastic gradient descent

In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called the incremental gradient method and corresponds to the iteration
: w_i = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_{t_i} \rangle, y_{t_i})
The main difference with the stochastic gradient method is that here a sequence t_i is chosen to decide which training point is visited in the i-th step. Such a sequence can be stochastic or deterministic. The number of iterations is thus decoupled from the number of points (each point can be considered more than once). The incremental gradient method can be shown to provide a minimizer to the empirical risk (Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85). Incremental techniques can be advantageous when considering objective functions made up of a sum of many terms, e.g. an empirical error corresponding to a very large dataset.
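
A sketch of the incremental gradient method with either a stochastic (shuffled) or deterministic (cyclic) visiting sequence t_i; the constant step size and function name are illustrative assumptions:

    import numpy as np

    def incremental_gd(X, y, epochs=10, gamma=0.01, shuffle=True, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):                   # multiple passes over the data
            order = rng.permutation(n) if shuffle else np.arange(n)
            for t in order:                       # t_i: which point to visit
                w -= gamma * X[t] * (X[t] @ w - y[t])
        return w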


Kernel methods

Kernels can be used to extend the above algorithms to non-parametric models (or models where the parameters form an infinite-dimensional space). The corresponding procedure will no longer be truly online and instead involves storing all the data points, but is still faster than the brute force method. This discussion is restricted to the case of the square loss, though it can be extended to any convex loss. It can be shown by an easy induction that if X_i is the data matrix and w_i is the output after i steps of the SGD algorithm, then
: w_i = X_i^T c_i
where c_i = ((c_i)_1, (c_i)_2, \ldots, (c_i)_i) \in \mathbb{R}^i and the sequence c_i satisfies the recursion:
: c_0 = 0
: (c_i)_j = (c_{i-1})_j, \quad j=1,2,\ldots,i-1
: (c_i)_i = \gamma_i \Big(y_i - \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j, x_i \rangle\Big)
Notice that here \langle x_j, x_i \rangle is just the standard kernel on \mathbb{R}^d, and the predictor is of the form
: f_i(x) = \langle w_{i-1}, x \rangle = \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j, x \rangle \ .
Now, if a general kernel K is introduced instead and the predictor is
: f_i(x) = \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j, x)
then the same proof will also show that the predictor minimising the least squares loss is obtained by changing the above recursion to
: (c_i)_i = \gamma_i \Big(y_i - \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j, x_i) \Big)
The above expression requires storing all the data for updating c_i. The total time complexity for the recursion when evaluating for the n-th datapoint is O(n^2 d k), where k is the cost of evaluating the kernel on a single pair of points. Thus, the use of the kernel has allowed the movement from a finite-dimensional parameter space w_i \in \mathbb{R}^d to a possibly infinite-dimensional feature space represented by a kernel K, by instead performing the recursion on the space of parameters c_i \in \mathbb{R}^i, whose dimension is the same as the size of the training dataset. In general, this is a consequence of the representer theorem.
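
The coefficient recursion can be written down directly. The following sketch (illustrative names, with a Gaussian kernel as an assumed example choice) sets one new coefficient per point and never forms w explicitly:

    import numpy as np

    def online_kernel_ls(X, y, kernel):
        n = len(X)
        c = np.zeros(n)                           # (c_i)_j; older entries stay frozen
        for i in range(n):
            gamma = 1.0 / np.sqrt(i + 1)          # decaying step size
            pred = sum(c[j] * kernel(X[j], X[i]) for j in range(i))
            c[i] = gamma * (y[i] - pred)          # only the newest coefficient is set
        return lambda x: sum(c[j] * kernel(X[j], x) for j in range(n))

    # Example usage with an (assumed) Gaussian kernel:
    # f = online_kernel_ls(X, y, lambda a, b: np.exp(-np.sum((a - b) ** 2)))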


Online convex optimization

Online convex optimization (OCO) is a general framework for decision making which leverages convex optimization to allow for efficient algorithms. The framework is that of repeated game playing, as follows:

For t = 1,2,...,T
* Learner receives input x_t
* Learner outputs w_t from a fixed convex set S
* Nature sends back a convex loss function v_t : S \rightarrow \mathbb{R}
* Learner suffers loss v_t(w_t) and updates its model

The goal is to minimize regret, i.e. the difference between the cumulative loss and the loss of the best fixed point u \in S in hindsight. As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set S = \mathbb{R}^d, and nature sends back the convex loss function v_t(w) = (\langle w,x_t \rangle - y_t)^2. Note here that y_t is implicitly sent with v_t.

Some online prediction problems, however, cannot fit into the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are used: randomisation and surrogate loss functions. Some simple online convex optimisation algorithms are described in the following subsections.
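
Before turning to them, note that the repeated game above is just a loop. The following sketch shows its shape; learner and nature are assumed interfaces here, not any particular library's API:

    def play(learner, nature, T):
        cumulative_loss = 0.0
        for t in range(T):
            w_t = learner.predict()        # learner outputs w_t from the convex set S
            v_t = nature.reveal(t)         # nature sends back a convex loss v_t
            cumulative_loss += v_t(w_t)    # learner suffers loss v_t(w_t)
            learner.update(v_t)            # learner updates its model
        # regret = cumulative_loss - min over fixed u in S of sum_t v_t(u)
        return cumulative_loss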


Follow the leader (FTL)

The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the Leader, and round t is simply given by:
: w_t = \operatorname{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w)
This method can thus be viewed as a greedy algorithm. For the case of online quadratic optimization (where the loss function is v_t(w) = \| w - x_t \|_2^2), one can show a regret bound that grows as \log(T). However, similar bounds cannot be obtained for the FTL algorithm for other important families of models like online linear optimization. To do so, one modifies FTL by adding regularisation.
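
For this quadratic case, the FTL argmin has a closed form: the mean of the points seen so far. A sketch (the function name is illustrative; w_1 is set to 0 since the empty sum leaves it arbitrary):

    import numpy as np

    def ftl_quadratic(xs):
        """Follow the Leader for v_t(w) = ||w - x_t||^2: w_t = mean(x_1..x_{t-1})."""
        total = np.zeros_like(xs[0])
        predictions = []
        for t, x_t in enumerate(xs, start=1):
            predictions.append(total / max(t - 1, 1))  # argmin of past losses
            total = total + x_t
        return predictions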


Follow the regularised leader (FTRL)

This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. A regularisation function R : S \rightarrow \mathbb{R} is chosen, and learning is performed in round t as follows:
: w_t = \operatorname{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w) + R(w)
As a special example, consider the case of online linear optimisation, i.e. where nature sends back loss functions of the form v_t(w) = \langle w, z_t \rangle. Also, let S = \mathbb{R}^d. Suppose the regularisation function R(w) = \frac{1}{2\eta} \| w \|_2^2 is chosen for some positive number \eta. Then, one can show that the regret-minimising iteration becomes
: w_{t+1} = - \eta \sum_{i=1}^{t} z_i = w_t - \eta z_t
Note that this can be rewritten as w_{t+1} = w_t - \eta \nabla v_t(w_t), which looks exactly like online gradient descent. If S is instead some convex subspace of \mathbb{R}^d, S would need to be projected onto, leading to the modified update rule
: w_{t+1} = \Pi_S\Big(- \eta \sum_{i=1}^{t} z_i\Big) = \Pi_S(- \eta \theta_{t+1})
where \theta_{t+1} = \theta_t + z_t accumulates the gradients. This algorithm is known as lazy projection. It is also known as Nesterov's dual averaging algorithm. In this scenario of linear loss functions and quadratic regularisation, the regret is bounded by O(\sqrt{T}), and thus the average regret goes to 0 as desired.
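
For linear losses with quadratic regularisation, the lazy-projection iteration above is one line per round. In this sketch, the gradient sequence z_t and the projection \Pi_S are supplied by the caller (an identity projection recovers the unconstrained case); the function name is illustrative:

    import numpy as np

    def ftrl_linear(zs, eta, project=lambda w: w):
        theta = np.zeros_like(zs[0])
        ws = []
        for z_t in zs:
            ws.append(project(-eta * theta))   # w_t = Pi_S(-eta * theta_t)
            theta = theta + z_t                # theta accumulates the gradients
        return ws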


Online subgradient descent (OSD)

The above proved a regret bound for linear loss functions v_t(w) = \langle w, z_t \rangle. To generalise the algorithm to any convex loss function, the subgradient \partial v_t(w_t) of v_t is used as a linear approximation to v_t near w_t, leading to the online subgradient descent algorithm:

Initialise parameter \eta, w_1 = 0

For t = 1,2,...,T
* Predict using w_t, receive v_t from nature.
* Choose z_t \in \partial v_t(w_t)
* If S = \mathbb{R}^d, update as w_{t+1} = w_t - \eta z_t
* If S \subset \mathbb{R}^d, project cumulative gradients onto S, i.e. w_{t+1} = \Pi_S(-\eta \theta_{t+1}), \theta_{t+1} = \theta_t + z_t

One can use the OSD algorithm to derive O(\sqrt{T}) regret bounds for the online version of SVMs for classification, which use the hinge loss v_t(w) = \max\{0, 1 - y_t \langle w, x_t \rangle\}.
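
A sketch of OSD with the hinge loss in the unconstrained case S = \mathbb{R}^d, so no projection step is needed; the function name and fixed \eta are illustrative assumptions:

    import numpy as np

    def osd_hinge(X, y, eta=0.1):
        """Online subgradient descent for the hinge loss (online linear SVM)."""
        n, d = X.shape
        w = np.zeros(d)
        for x_t, y_t in zip(X, y):            # labels y_t in {-1, +1}
            if y_t * (w @ x_t) < 1:           # margin violated
                z_t = -y_t * x_t              # a subgradient of v_t at w_t
                w = w - eta * z_t
            # otherwise 0 is a valid subgradient and w is unchanged
        return w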


Other algorithms

Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. The optimal regularization in hindsight can be derived for linear loss functions; this leads to the AdaGrad algorithm. For the Euclidean regularisation, one can show a regret bound of O(\sqrt{T}), which can be improved further to O(\log T) for strongly convex and exp-concave loss functions.


Continual learning

Continual learning means constantly improving the learned model by processing continuous streams of information. Continual learning capabilities are essential for software systems and autonomous agents interacting in an ever-changing real world. However, continual learning is a challenge for machine learning and neural network models, since the continual acquisition of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting.


Interpretations of online learning

The paradigm of online learning has different interpretations depending on the choice of the learning model, each of which has distinct implications about the predictive quality of the sequence of functions f_1, f_2, \ldots, f_n. The prototypical stochastic gradient descent algorithm is used for this discussion. As noted above, its recursion is given by
: w_t = w_{t-1} - \gamma_t \nabla V(\langle w_{t-1}, x_t \rangle, y_t)
The first interpretation considers the stochastic gradient descent method as applied to the problem of minimizing the expected risk I[f] defined above. Indeed, in the case of an infinite stream of data, since the examples (x_1, y_1), (x_2, y_2), \ldots are assumed to be drawn i.i.d. from the distribution p(x,y), the sequence of gradients of V(\cdot, \cdot) in the above iteration is an i.i.d. sample of stochastic estimates of the gradient of the expected risk I[f], and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation I[w_t] - I[w^\ast], where w^\ast is the minimizer of I[f] (''Stochastic Approximation Algorithms and Applications'', Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997; 2nd ed., titled ''Stochastic Approximation and Recursive Algorithms and Applications'', 2003). This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, complexity results can still be obtained in special cases.

The second interpretation applies to the case of a finite training set and considers the SGD algorithm as an instance of the incremental gradient descent method. In this case, one instead looks at the empirical risk:
: I_n[w] = \frac{1}{n}\sum_{i=1}^{n} V(\langle w, x_i \rangle, y_i) \ .
Since the gradients of V(\cdot, \cdot) in the incremental gradient descent iterations are also stochastic estimates of the gradient of I_n[w], this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations I_n[w_t] - I_n[w^\ast_n], where w^\ast_n is the minimizer of I_n[w].


Implementations

* Vowpal Wabbit: Open-source, fast, out-of-core online learning system, notable for supporting a number of machine learning reductions, importance weighting, and a selection of different loss functions and optimisation algorithms. It uses the hashing trick for bounding the size of the set of features independently of the amount of training data.
* scikit-learn: Provides out-of-core implementations of algorithms for
** Classification: Perceptron, SGD classifier, naive Bayes classifier.
** Regression: SGD regressor, Passive Aggressive regressor.
** Clustering: Mini-batch k-means.
** Feature extraction: Mini-batch dictionary learning, incremental PCA.


See also

Learning paradigms
* Incremental learning
* Lazy learning
* Offline learning, the opposite model
* Reinforcement learning
* Multi-armed bandit
* Supervised learning

General algorithms
* Online algorithm
* Online optimization
* Streaming algorithm
* Stochastic gradient descent

Learning models
* Adaptive Resonance Theory
* Hierarchical temporal memory
* k-nearest neighbor algorithm
* Learning vector quantization
* Perceptron


External links


* 6.883: Online Methods in Machine Learning: Theory and Applications. Alexander Rakhlin. MIT