HOME

TheInfoList



OR:

The distributional learning theory or learning of probability distribution is a framework in
computational learning theory In computer science, computational learning theory (or just learning theory) is a subfield of artificial intelligence devoted to studying the design and analysis of machine learning algorithms. Overview Theoretical results in machine learning m ...
. It has been proposed from Michael Kearns,
Yishay Mansour Jesse () or Yishai ( he, יִשַׁי – ''Yīšay'', – ''ʾĪšay''. in pausa he, יִשָׁי – ''Yīšāy'', meaning "King" or "God's gift"; syr, ܐܝܫܝ – ''Eshai''; el, Ἰεσσαί – ''Iessaí''; la, Issai, Isai, Jesse), i ...
,
Dana Ron Dana Ron Goldreich ( he, דנה רון גולדרייך; b. 1964) is a computer scientist, a professor of electrical engineering at the Tel Aviv University, Israel. Prof. Ron is one of the pioneers of research in property testing, and a leading ...
,
Ronitt Rubinfeld Ronitt Rubinfeld is a professor of electrical engineering and computer science at MIT. Education Rubinfeld graduated from the University of Michigan with a BSE in Electrical and Computer Engineering. Following that, she received her PhD from th ...
,
Robert Schapire Robert Elias Schapire is an American computer scientist, former David M. Siegel '83 Professor in the computer science department at Princeton University, and has recently moved to Microsoft Research. His primary specialty is theoretical and app ...
and Linda Sellie in 1994 M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie ''On the Learnability of Discrete Distributions''. ACM Symposium on Theory of Computing, 199

/ref> and it was inspired from the PAC-learning, PAC-framework introduced by
Leslie Valiant Leslie Gabriel Valiant (born 28 March 1949) is a British American computer scientist and computational theorist. He was born to a chemical engineer father and a translator mother. He is currently the T. Jefferson Coolidge Professor of Comput ...
.L. Valiant ''A theory of the learnable''. Communications of ACM, 1984
/ref> In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
,
approximation algorithms In computer science and operations research, approximation algorithms are efficient algorithms that find approximate solutions to optimization problems (in particular NP-hard problems) with provable guarantees on the distance of the returned solut ...
,
applied probability Applied probability is the application of probability theory to statistical problems and other scientific and engineering domains. Scope Much research involving probability is done under the auspices of applied probability. However, while such res ...
and
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view.


Definitions

Let \textstyle X be the support of the distributions of interest. As in the original work of Kearns et al. if \textstyle X is finite it can be assumed without loss of generality that \textstyle X = \^n where \textstyle n is the number of bits that have to be used in order to represent any \textstyle y \in X. We focus in probability distributions over \textstyle X. There are two possible representations of a probability distribution \textstyle D over \textstyle X. * probability distribution function (or evaluator) an evaluator \textstyle E_D for \textstyle D takes as input any \textstyle y \in X and outputs a real number \textstyle E_D /math> which denotes the probability that of \textstyle y according to \textstyle D, i.e. \textstyle E_D = \Pr = y/math> if \textstyle Y \sim D. * generator a generator \textstyle G_D for \textstyle D takes as input a string of truly random bits \textstyle y and outputs \textstyle G_D \in X according to the distribution \textstyle D. Generator can be interpreted as a routine that simulates sampling from the distribution \textstyle D given a sequence of fair coin tosses. A distribution \textstyle D is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let \textstyle C_X a class of distribution over X, that is \textstyle C_X is a set such that every \textstyle D \in C_X is a probability distribution with support \textstyle X. The \textstyle C_X can also be written as \textstyle C for simplicity. Before defining learnability, it is necessary to define good approximations of a distribution \textstyle D. There are several ways to measure the distance between two distribution. The three more common possibilities are * Kullback-Leibler divergence *
Total variation distance of probability measures In probability theory, the total variation distance is a distance measure for probability distributions. It is an example of a statistical distance metric, and is sometimes called the statistical distance, statistical difference or variational dist ...
* Kolmogorov distance The strongest of these distances is the Kullback-Leibler divergence and the weakest is the Kolmogorov distance. This means that for any pair of distributions \textstyle D, \textstyle D' : : \text(D, D') \ge \text(D, D') \ge \text(D, D') Therefore, for example if \textstyle D and \textstyle D' are close with respect to Kullback-Leibler divergence then they are also close with respect to all the other distances. Next definitions hold for all the distances and therefore the symbol \textstyle d(D, D') denotes the distance between the distribution \textstyle D and the distribution \textstyle D' using one of the distances that we describe above. Although learnability of a class of distributions can be defined using any of these distances, applications refer to a specific distance. The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle \textstyle GEN(D) that returns a sample from the distribution \textstyle D. Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution \textstyle D in class of distributions \textstyle C. This quantity is called sample complexity of the learning algorithm. In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in.Lorenzo Rosasco, Tomaso Poggio, "A Regularization Tour of Machine Learning — MIT-9.520 Lectures Notes" Manuscript, Dec. 201

/ref> In this framework of
statistical learning theory Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on dat ...
a training set \textstyle S = \ and the goal is to find a target function \textstyle f : X \rightarrow Y that minimizes some loss function, e.g. the square loss function. More formally f = \arg \min_ \int V(y, g(x)) d\rho(x, y), where V(\cdot, \cdot) is the loss function, e.g. V(y, z) = (y - z)^2 and \rho(x, y) the probability distribution according to which the elements of the training set are sampled. If the
conditional probability distribution In probability theory and statistics, given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X is the probability distribution of Y when X is known to be a particular value; in some cases the co ...
\rho_x(y) is known then the target function has the closed form f(x) = \int_y y d\rho_x(y). So the set S is a set of samples from the
probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon i ...
\rho(x, y). Now the goal of distributional learning theory if to find \rho given S which can be used to find the target function f. Definition of learnability ''A class of distributions \textstyle C is called efficiently learnable if for every \textstyle \epsilon > 0 and \textstyle 0 < \delta \le 1 given access to \textstyle GEN(D) for an unknown distribution \textstyle D \in C, there exists a polynomial time algorithm \textstyle A, called learning algorithm of \textstyle C, that outputs a generator or an evaluator of a distribution \textstyle D' such that'' : \Pr d(D, D') \le \epsilon \ge 1 - \delta ''If we know that \textstyle D' \in C then \textstyle A is called proper learning algorithm, otherwise is called improper learning algorithm.'' In some settings the class of distributions \textstyle C is a class with well known distributions which can be described by a set of parameters. For instance \textstyle C could be the class of all the Gaussian distributions \textstyle N(\mu, \sigma^2). In this case the algorithm \textstyle A should be able to estimate the parameters \textstyle \mu, \sigma. In this case \textstyle A is called parameter learning algorithm. Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description.


First results

In their seminal work, Kearns et al. deal with the case where \textstyle A is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution. * \textstyle OR gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless \textstyle \#P \subseteq P/\text. On the other hand, this class is efficiently learnable with generator. * Parity gate distributions this class is efficiently learnable with both generator and evaluator. * Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator. * Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning framework.


\textstyle \epsilon-Covers

One very common technique in order to find a learning algorithm for a class of distributions \textstyle C is to first find a small \textstyle \epsilon-cover of \textstyle C. Definition ''A set \textstyle C_ is called \textstyle \epsilon-cover of \textstyle C if for every \textstyle D \in C there is a \textstyle D' \in C_ such that \textstyle d(D, D') \le \epsilon. An \textstyle \epsilon- cover is small if it has polynomial size with respect to the parameters that describe \textstyle D.'' Once there is an efficient procedure that for every \textstyle \epsilon > 0 finds a small \textstyle \epsilon-cover \textstyle C_ of C then the only left task is to select from \textstyle C_ the distribution \textstyle D' \in C_ that is closer to the distribution \textstyle D \in C that has to be learned. The problem is that given \textstyle D', D'' \in C_ it is not trivial how we can compare \textstyle d(D, D') and \textstyle d(D, D'') in order to decide which one is the closest to \textstyle D, because \textstyle D is unknown. Therefore, the samples from \textstyle D have to be used to do these comparisons. Obviously the result of the comparison always has a probability of error. So the task is similar with finding the minimum in a set of element using noisy comparisons. There are a lot of classical algorithms in order to achieve this goal. The most recent one which achieves the best guarantees was proposed by Daskalakis and
Kamath Kamat is a surname from Goa, Maharashtra and coastal Karnataka in India. It is found among Hindus of the Goud Saraswat Brahmin, Saraswat and Rajapur Saraswat Brahmin communities following Madhva Sampradaya of either Gokarna Matha or Kashi Mat ...
C. Daskalakis, G. Kamath ''Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians''. Annual Conference on Learning Theory, 201

/ref> This algorithm sets up a fast tournament between the elements of \textstyle C_ where the winner \textstyle D^* of this tournament is the element which is \textstyle \epsilon-close to \textstyle D (i.e. \textstyle d(D^*, D) \le \epsilon) with probability at least \textstyle 1 - \delta. In order to do so their algorithm uses \textstyle O(\log N / \epsilon^2) samples from \textstyle D and runs in \textstyle O(N \log N / \epsilon^2) time, where \textstyle N = , C_, .


Learning sums of random variables

Learning of simple well known distributions is a well studied field and there are a lot of estimators that can be used. One more complicated class of distributions is the distribution of a sum of variables that follow simple distributions. These learning procedure have a close relation with limit theorems like the central limit theorem because they tend to examine the same object when the sum tends to an infinite sum. Recently there are two results that described here include the learning Poisson binomial distributions and learning sums of independent integer random variables. All the results below hold using the
total variation In mathematics, the total variation identifies several slightly different concepts, related to the (local or global) structure of the codomain of a function or a measure. For a real-valued continuous function ''f'', defined on an interval 'a'' ...
distance as a distance measure.


Learning Poisson binomial distributions

Consider \textstyle n independent Bernoulli random variables \textstyle X_1, \dots, X_n with probabilities of success \textstyle p_1, \dots, p_n. A Poisson Binomial Distribution of order \textstyle n is the distribution of the sum \textstyle X = \sum_i X_i. For learning the class \textstyle PBD = \. The first of the following results deals with the case of improper learning of \textstyle PBD and the second with the proper learning of \textstyle PBD.C. Daskalakis, I. Diakonikolas, R. Servedio ''Learning Poisson Binomial Distributions''. ACM Symposium on Theory of Computing, 201

/ref> Theorem ''Let \textstyle D \in PBD then there is an algorithm which given \textstyle n, \textstyle \epsilon > 0, \textstyle 0 < \delta \le 1 and access to \textstyle GEN(D) finds a \textstyle D' such that \textstyle \Pr d(D, D') \le \epsilon \ge 1 - \delta. The sample complexity of this algorithm is \textstyle \tilde( ( 1 / \epsilon^3 ) \log (1 / \delta) ) and the running time is \textstyle \tilde( (1 / \epsilon^3) \log n \log^2 (1 / \delta) ).'' Theorem ''Let \textstyle D \in PBD then there is an algorithm which given \textstyle n, \textstyle \epsilon > 0, \textstyle 0 < \delta \le 1 and access to \textstyle GEN(D) finds a \textstyle D' \in PBD such that \textstyle \Pr d(D, D') \le \epsilon \ge 1 - \delta. The sample complexity of this algorithm is \textstyle \tilde( ( 1 / \epsilon^2 ) ) \log (1 / \delta) and the running time is \textstyle (1 / \epsilon)^ \tilde( \log n \log (1 / \delta) ).'' One part of the above results is that the sample complexity of the learning algorithm doesn't depend on \textstyle n, although the description of \textstyle D is linear in \textstyle n. Also the second result is almost optimal with respect to the sample complexity because there is also a lower bound of \textstyle O(1 / \epsilon^2). The proof uses a small \textstyle \epsilon-cover of \textstyle PBD that has been produced by Daskalakis and Papadimitriou,C. Daskalakis, C. Papadimitriou ''Sparse Covers for Sums of Indicators''. Probability Theory and Related Fields, 201

/ref> in order to get this algorithm.


Learning Sums of Independent Integer Random Variables

Consider \textstyle n independent random variables \textstyle X_1, \dots, X_n each of which follows an arbitrary distribution with support \textstyle \. A \textstyle k-sum of independent integer random variable of order \textstyle n is the distribution of the sum \textstyle X = \sum_i X_i. For learning the class \textstyle k-SIIRV = \ there is the following result Theorem ''Let \textstyle D \in k-SIIRV then there is an algorithm which given \textstyle n, \textstyle \epsilon > 0 and access to \textstyle GEN(D) finds a \textstyle D' such that \textstyle \Pr d(D, D') \le \epsilon \ge 1 - \delta. The sample complexity of this algorithm is \textstyle \text(k / \epsilon) and the running time is also \textstyle \text(k / \epsilon).'' Another part is that the sample and the time complexity does not depend on \textstyle n. Its possible to conclude this independence for the previous section if we set \textstyle k = 2.C. Daskalakis, I. Diakonikolas, R. O’Donnell, R. Servedio, L. Tan ''Learning Sums of Independent Integer Random Variables''. IEEE Symposium on Foundations of Computer Science, 201

/ref>


Learning mixtures of Gaussians

Let the random variables \textstyle X \sim N(\mu_1, \Sigma_1) and \textstyle Y \sim N(\mu_2, \Sigma_2). Define the random variable \textstyle Z which takes the same value as \textstyle X with probability \textstyle w_1 and the same value as \textstyle Y with probability \textstyle w_2 = 1 - w_1. Then if \textstyle F_1 is the density of \textstyle X and \textstyle F_2 is the density of \textstyle Y the density of \textstyle Z is \textstyle F = w_1 F_1 + w_2 F_2. In this case \textstyle Z is said to follow a mixture of Gaussians. Pearson K. Pearson ''Contribution to the Mathematical Theory of Evolution''. Philosophical Transactions of the Royal Society in London, 189

/ref> was the first who introduced the notion of the mixtures of Gaussians in his attempt to explain the probability distribution from which he got same data that he wanted to analyze. So after doing a lot of calculations by hand, he finally fitted his data to a mixture of Gaussians. The learning task in this case is to determine the parameters of the mixture \textstyle w_1, w_2, \mu_1, \mu_2, \Sigma_1, \Sigma_2. The first attempt to solve this problem was from S. Dasgupta, Dasgupta.S. Dasgupta ''Learning Mixtures of Gaussians''. IEEE Symposium on Foundations of Computer Science, 199

/ref> In this work S. Dasgupta, Dasgupta assumes that the two means of the Gaussians are far enough from each other. This means that there is a lower bound on the distance \textstyle , , \mu_1 - \mu_2, , . Using this assumption Dasgupta and a lot of scientists after him were able to learn the parameters of the mixture. The learning procedure starts with Cluster analysis, clustering the samples into two different clusters minimizing some metric. Using the assumption that the means of the Gaussians are far away from each other with high probability the samples in the first cluster correspond to samples from the first Gaussian and the samples in the second cluster to samples from the second one. Now that the samples are partitioned the \textstyle \mu_i, \Sigma_i can be computed from simple statistical estimators and \textstyle w_i by comparing the magnitude of the clusters. If \textstyle GM is the set of all the mixtures of two Gaussians, using the above procedure theorems like the following can be proved. Theorem ''Let \textstyle D \in GM with \textstyle , , \mu_1 - \mu_2, , \ge c \sqrt, where \textstyle c > 1/2 and \textstyle \lambda_(A) the largest eigenvalue of \textstyle A, then there is an algorithm which given \textstyle \epsilon > 0, \textstyle 0 < \delta \le 1 and access to \textstyle GEN(D) finds an approximation \textstyle w'_i, \mu'_i, \Sigma'_i of the parameters such that \textstyle \Pr , w_i - w'_i, , \le \epsilon \ge 1 - \delta (respectively for \textstyle \mu_i and \textstyle \Sigma_i. The sample complexity of this algorithm is \textstyle M = 2^ and the running time is \textstyle O(M^2 d + M d n).'' The above result could also be generalized in \textstyle k-mixture of Gaussians. For the case of mixture of two Gaussians there are learning results without the assumption of the distance between their means, like the following one which uses the total variation distance as a distance measure. Theorem ''Let \textstyle F \in GM then there is an algorithm which given \textstyle \epsilon > 0, \textstyle 0 < \delta \le 1 and access to \textstyle GEN(D) finds \textstyle w'_i, \mu'_i, \Sigma'_i such that if \textstyle F' = w'_1 F'_1 + w'_2 F'_2, where \textstyle F'_i = N(\mu'_i, \Sigma'_i) then \textstyle \Pr d(F, F') \le \epsilon \ge 1 - \delta. The sample complexity and the running time of this algorithm is \textstyle \text{poly}(n, 1 / \epsilon, 1 / \delta, 1 / w_1, 1 / w_2, 1 / d(F_1, F_2)).'' The distance between \textstyle F_1 and \textstyle F_2 doesn't affect the quality of the result of the algorithm but just the sample complexity and the running time.A. Kalai, A. Moitra, G. Valiant ''Efficiently Learning Mixtures of Two Gaussians'' ACM Symposium on Theory of Computing, 201

/ref>


References

Computational learning theory