Empirical Risk Minimization

	Empirical Risk Minimization Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk). Background Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects X and Y and would like to learn a function \ h: X \to Y (often called ''hypothesis'') which outputs an object y \in Y, given x \in X. To do so, we have at our disposal a ''training set'' of n examples \ (x_1, y_1), \ldots, (x_n, y_n) where x_i \in X is an input and y_i \in Y is the corresponding response that we wish to get from h(x_i). To put it more formally, we assu ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Statistical Learning Theory Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics. Introduction The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data. Every point in the training is an input-output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to pre ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Empirical Measure In probability theory, an empirical measure is a random measure arising from a particular realization of a (usually finite) sequence of random variables. The precise definition is found below. Empirical measures are relevant to mathematical statistics. The motivation for studying empirical measures is that it is often impossible to know the true underlying probability measure P. We collect observations X_1, X_2, \dots , X_n and compute relative frequencies. We can estimate P, or a related distribution function F by means of the empirical measure or empirical distribution function, respectively. These are uniformly good estimates under certain conditions. Theorems in the area of empirical processes provide rates of this convergence. Definition Let X_1, X_2, \dots be a sequence of independent identically distributed random variables with values in the state space ''S'' with probability distribution ''P''. Definition :The ''empirical measure'' ''P''''n'' is defined for mea ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	M-estimator In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. 48 samples of robust M-estimators can be found in a recent review study. More generally, an M-estimator may be defined to be a zero of an estimating function. This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is the point where the derivative of the likelihood function with respect to the parameter is zero; thus, a maximum-likelihood estimator is a critical point of the score function. In many applications, such M-estimators can be thought of as estimating characteristics of the pop ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Maximum Likelihood Estimation In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate. The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference. If the likelihood function is differentiable, the derivative test for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the ordinary least squares estimator for a linear regression model maximizes the likelihood when all observed outcomes are assumed to have Normal distributions with the same variance. From the perspective of Bayesian infere ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Support Vector Machine In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues (Boser et al., 1992, Guyon et al., 1993, Cortes and Vapnik, 1995, Vapnik et al., 1997) SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik (1982, 1995) and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. N ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Hinge Loss In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output and a classifier score , the hinge loss of the prediction is defined as :\ell(y) = \max(0, 1-t \cdot y) Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = \mathbf \cdot \mathbf + b, where (\mathbf,b) are the parameters of the hyperplane and \mathbf is the input variable(s). When and have the same sign (meaning predicts the right class) and , y, \ge 1, the hinge loss \ell(y) = 0. When they have opposite signs, \ell(y) increases linearly with , and similarly if , y, < 1, even if it has the same sign (correct prediction, but not by enough margin). Extensions While binary SVMs are commonly extended to [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Convex Optimization Convex optimization is a subfield of mathematical optimization that studies the problem of minimizing convex functions over convex sets (or, equivalently, maximizing concave functions over convex sets). Many classes of convex optimization problems admit polynomial-time algorithms, whereas mathematical optimization is in general NP-hard. Convex optimization has applications in a wide range of disciplines, such as automatic control systems, estimation and signal processing, communications and networks, electronic circuit design, data analysis and modeling, finance, statistics (optimal experimental design), and structural optimization, where the approximation concept has proven to be efficient. With recent advancements in computing and optimization algorithms, convex programming is nearly as straightforward as linear programming. Definition A convex optimization problem is an optimization problem in which the objective function is a convex function and the feasible set ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Linearly Separable In Euclidean geometry, linear separability is a property of two sets of points. This is most easily visualized in two dimensions (the Euclidean plane) by thinking of one set of points as being colored blue and the other set of points as being colored red. These two sets are ''linearly separable'' if there exists at least one line in the plane with all of the blue points on one side of the line and all the red points on the other side. This idea immediately generalizes to higher-dimensional Euclidean spaces if the line is replaced by a hyperplane. The problem of determining if a pair of sets is linearly separable and finding a separating hyperplane if they are, arises in several areas. In statistics and machine learning, classifying certain types of data is a problem for which good algorithms exist that are based on this concept. Mathematical definition Let X_ and X_ be two sets of points in an ''n''-dimensional Euclidean space. Then X_ and X_ are ''linearly separable'' if there ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Linear Classifier In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables ( features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use. Definition If the input feature vector to the classifier is a real vector \vec x, then the output score is :y = f(\vec\cdot\vec) = f\left(\sum_j w_j x_j\right), where \vec w is a real vector of weights and ''f'' is a function that converts the dot product of the two vectors into the desired outp ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	NP-hard In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard problem is the subset sum problem. A more precise specification is: a problem ''H'' is NP-hard when every problem ''L'' in NP can be reduced in polynomial time to ''H''; that is, assuming a solution for ''H'' takes 1 unit time, ''H''s solution can be used to solve ''L'' in polynomial time. As a consequence, finding a polynomial time algorithm to solve any NP-hard problem would give polynomial time algorithms for all the problems in NP. As it is suspected that P≠NP, it is unlikely that such an algorithm exists. It is suspected that there are no polynomial-time algorithms for NP-hard problems, but that has not been proven. Moreover, the class P, in which all problems can be solved in polynomial time, is contained in the NP class. Def ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Mathematical Optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfields: discrete optimization and continuous optimization. Optimization problems of sorts arise in all quantitative disciplines from computer science and engineering to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries. In the more general approach, an optimization problem consists of maxima and minima, maximizing or minimizing a Function of a real variable, real function by systematically choosing Argument of a function, input values from within an allowed set and computing the Value (mathematics), value of the function. The generalization of optimization theory and techniques to other formulations constitutes a large area of applied mathematics. More generally, op ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Agnostic Learning Agnosticism is the view or belief that the existence of God, of the divine or the supernatural is unknown or unknowable. (page 56 in 1967 edition) Another definition provided is the view that "human reason is incapable of providing sufficient rational grounds to justify either the belief that God exists or the belief that God does not exist." The English biologist Thomas Henry Huxley coined the word ''agnostic'' in 1869, and said "It simply means that a man shall not say he knows or believes that which he has no scientific grounds for professing to know or believe." Earlier thinkers, however, had written works that promoted agnostic points of view, such as Sanjaya Belatthaputta, a 5th-century BCE Indian philosopher who expressed agnosticism about any afterlife;Bhaskar (1972). and Protagoras, a 5th-century BCE Greek philosopher who expressed agnosticism about the existence of "the gods". Defining agnosticism Being a scientist, above all else, Huxley presented agnos ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]

Extensions