Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning whose relevance and notability increased with the advent of large language models, owing to the large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data (of the kind used exclusively in the more expensive and time-consuming supervised learning paradigm) with a large amount of unlabeled data (of the kind used exclusively in the unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data; the remaining data is unlabeled or imprecisely labeled. Intuitively, the learning task can be seen as an exam and the labeled data as the sample problems that the teacher solves for the class as an aid in solving another set of problems. In the
transductive setting, these unsolved problems act as exam questions. In the
inductive setting, they become practice problems of the sort that will make up the exam.
Problem

The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render large, fully labeled training sets infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.
Technique

More formally, semi-supervised learning assumes a set of $l$ independent and identically distributed examples $x_1,\dots,x_l \in X$ with corresponding labels $y_1,\dots,y_l \in Y$, together with $u$ unlabeled examples $x_{l+1},\dots,x_{l+u} \in X$, is processed. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.
Semi-supervised learning may refer to either
transductive learning or
inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data $x_{l+1},\dots,x_{l+u}$ only. The goal of inductive learning is to infer the correct mapping from $X$ to $Y$.
It is unnecessary (and, according to
Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
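As a concrete illustration of this setup, the sketch below uses scikit-learn's LabelPropagation, a graph-based semi-supervised learner (assumed available here; the toy data and the library's convention of marking unlabeled points with -1 are illustrative choices, not part of the definitions above). Its transduction_ attribute returns labels for the given unlabeled points only, while predict applies the induced mapping to entirely new points.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy data: l = 4 labeled examples and u = 4 unlabeled examples in X = R^2.
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9],   # labeled
              [0.1, 0.2], [0.3, 0.0], [2.9, 3.1], [3.1, 3.0]])  # unlabeled
# scikit-learn marks unlabeled points with the label -1.
y = np.array([0, 0, 1, 1, -1, -1, -1, -1])

model = LabelPropagation(kernel="rbf", gamma=1.0)
model.fit(X, y)

# Transductive output: inferred labels for the given unlabeled points only.
print("labels for x_{l+1},...,x_{l+u}:", model.transduction_[4:])

# Inductive use: the fitted model also defines a mapping from X to Y,
# so it can label points that were never in the training set at all.
print("label for a new point:", model.predict([[0.05, 0.05]]))
```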
Assumptions
In order to make any use of unlabeled data, some relationship to the underlying distribution of data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions:
Continuity / smoothness assumption
''Points that are close to each other are more likely to share a label.'' This is also generally assumed in supervised learning and yields a preference for geometrically simple
decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that few points are close to each other but in different classes.
Cluster assumption
''The data tend to form discrete clusters, and points in the same cluster are more likely to share a label'' (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to
feature learning with clustering algorithms.
Manifold assumption
''The data lie approximately on a
manifold
of much lower dimension than the input space.'' In this case learning the manifold using both the labeled and unlabeled data can avoid the
curse of dimensionality
. Then learning can proceed using distances and densities defined on the manifold.
The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom. For instance, the human voice is controlled by a few vocal folds,
and images of various facial expressions are controlled by a few muscles. In these cases, it is better to consider distances and smoothness in the natural space of the generating problem, rather than in the space of all possible acoustic waves or images, respectively.
History
The heuristic approach of ''self-training'' (also known as ''self-learning'' or ''self-labeling'') is historically the oldest approach to semi-supervised learning, with examples of applications starting in the 1960s.
The transductive learning framework was formally introduced by
Vladimir Vapnik in the 1970s. Interest in inductive learning using generative models also began in the 1970s. A
''probably approximately correct'' learning bound for semi-supervised learning of a
Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995.
Methods
Generative models
Generative approaches to statistical learning first seek to estimate $p(x|y)$, the distribution of data points belonging to each class. The probability $p(y|x)$ that a given point $x$ has label $y$ is then proportional to $p(x|y)p(y)$ by Bayes' rule. Semi-supervised learning with generative models can be viewed either as an extension of supervised learning (classification plus information about $p(x)$) or as an extension of unsupervised learning (clustering plus some labels).
Generative models assume that the distributions take some particular form $p(x|y,\theta)$ parameterized by the vector $\theta$. If these assumptions are incorrect, the unlabeled data may actually decrease the accuracy of the solution relative to what would have been obtained from labeled data alone.
However, if the assumptions are correct, then the unlabeled data necessarily improves performance.
The unlabeled data are distributed according to a mixture of individual-class distributions. In order to learn the mixture distribution from the unlabeled data, it must be identifiable, that is, different parameters must yield different summed distributions. Gaussian mixture distributions are identifiable and commonly used for generative models.
The parameterized joint distribution can be written as $p(x,y|\theta) = p(y|\theta)\,p(x|y,\theta)$ by using the chain rule. Each parameter vector $\theta$ is associated with a decision function $f_\theta(x) = \underset{y}{\operatorname{argmax}}\ p(y|x,\theta)$. The parameter is then chosen based on fit to both the labeled and unlabeled data, weighted by $\lambda$:

$\underset{\theta}{\operatorname{argmax}}\left( \log p(\{x_i,y_i\}_{i=1}^{l} \mid \theta) + \lambda \log p(\{x_i\}_{i=l+1}^{l+u} \mid \theta)\right)$
[Zhu, Xiaojin, ''Semi-Supervised Learning'', University of Wisconsin-Madison.]
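The sketch below, assuming NumPy and SciPy are available, illustrates this weighted objective for a toy one-dimensional two-class Gaussian mixture: candidate parameters $\theta$ (the class means) are scored by the labeled log-likelihood plus $\lambda$ times the unlabeled log-likelihood, and the best-scoring candidate is selected. The grid search and the fixed priors and variances are simplifications for illustration; in practice this objective is usually maximized with expectation-maximization.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D data: two classes with true means -2 and +2, equal priors.
x_lab = np.array([-2.1, -1.8, 2.0, 2.3])          # labeled points
y_lab = np.array([0, 0, 1, 1])
x_unl = rng.normal(loc=np.repeat([-2.0, 2.0], 25), scale=1.0)  # unlabeled points

lam = 1.0          # weight on the unlabeled likelihood
prior = 0.5        # fixed equal class priors (simplification)
sigma = 1.0        # fixed unit variances (simplification)

def objective(mu0, mu1):
    # log p({x_i, y_i} | theta): each labeled point uses its own class density.
    means = np.where(y_lab == 0, mu0, mu1)
    lab_ll = np.sum(norm.logpdf(x_lab, loc=means, scale=sigma) + np.log(prior))
    # log p({x_i} | theta): unlabeled points use the mixture density.
    mix = prior * norm.pdf(x_unl, mu0, sigma) + prior * norm.pdf(x_unl, mu1, sigma)
    unl_ll = np.sum(np.log(mix))
    return lab_ll + lam * unl_ll

# Crude grid search over candidate parameter vectors theta = (mu0, mu1).
grid = np.linspace(-4, 4, 81)
best = max(((objective(m0, m1), m0, m1) for m0 in grid for m1 in grid if m0 < m1))
print("selected theta (class means):", best[1:])
```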
Low-density separation
Another major class of methods attempts to place boundaries in regions with few data points (labeled or unlabeled). One of the most commonly used algorithms is the
transductive support vector machine, or TSVM (which, despite its name, may be used for inductive learning as well). Whereas
support vector machines
for supervised learning seek a decision boundary with maximal
margin
over the labeled data, the goal of TSVM is a labeling of the unlabeled data such that the decision boundary has maximal margin over all of the data. In addition to the standard
hinge loss $(1 - yf(x))_+$ for labeled data, a loss function $(1 - |f(x)|)_+$ is introduced over the unlabeled data by letting $y = \operatorname{sign} f(x)$. TSVM then selects $f^*(x) = h^*(x) + b$ from a reproducing kernel Hilbert space $\mathcal{H}$ by minimizing the regularized empirical risk:

$f^* = \underset{f}{\operatorname{argmin}}\left( \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda_1 \|h\|_{\mathcal{H}}^2 + \lambda_2 \sum_{i=l+1}^{l+u} (1 - |f(x_i)|)_+ \right)$
An exact solution is intractable due to the non-convex term $(1 - |f(x)|)_+$, so research focuses on useful approximations.
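As a concrete illustration, the sketch below (a simplification assuming a linear decision function $f(x) = w \cdot x + b$ rather than a general RKHS element, using NumPy only) evaluates the three terms of this regularized empirical risk; it only scores candidate parameters and does not attempt to optimize the non-convex objective.

```python
import numpy as np

def tsvm_risk(w, b, X_lab, y_lab, X_unl, lam1=1.0, lam2=0.5):
    """Regularized empirical risk of a linear TSVM candidate f(x) = w.x + b.

    Hinge loss on labeled points, a norm penalty on w, and the symmetric
    hinge (1 - |f(x)|)_+ on unlabeled points, which pushes the boundary
    away from unlabeled data (low-density separation).
    """
    f_lab = X_lab @ w + b
    f_unl = X_unl @ w + b
    labeled_hinge = np.sum(np.maximum(0.0, 1.0 - y_lab * f_lab))
    norm_penalty = lam1 * np.dot(w, w)
    unlabeled_hinge = lam2 * np.sum(np.maximum(0.0, 1.0 - np.abs(f_unl)))
    return labeled_hinge + norm_penalty + unlabeled_hinge

# Two candidate boundaries for the same data: one cutting through the
# unlabeled clusters, one passing between them (lower risk).
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([-1, 1])
X_unl = np.array([[-2.2, 0.5], [-1.8, -0.4], [1.9, 0.3], [2.1, -0.2]])

print(tsvm_risk(np.array([0.2, 1.0]), 0.0, X_lab, y_lab, X_unl))  # cuts the clusters
print(tsvm_risk(np.array([1.0, 0.0]), 0.0, X_lab, y_lab, X_unl))  # separates them
```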
Other approaches that implement low-density separation include Gaussian process models, information regularization, and entropy minimization (of which TSVM is a special case).
Laplacian regularization
Laplacian regularization has historically been approached through the graph Laplacian.
Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labeled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its $k$ nearest neighbors or to examples within some distance $\epsilon$. The weight $W_{ij}$ of an edge between $x_i$ and $x_j$ is then set to $e^{-\|x_i - x_j\|^2 / \epsilon}$.
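A minimal sketch of this graph construction, assuming NumPy and the $\epsilon$-ball connectivity rule (the $k$-nearest-neighbor variant would only change which pairs are connected):

```python
import numpy as np

def build_graph(X, eps=1.0):
    """Weighted similarity graph over all labeled and unlabeled points.

    Points within distance eps are connected with Gaussian weight
    exp(-||x_i - x_j||^2 / eps); the unnormalized graph Laplacian
    L = D - W is returned alongside W.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / eps)
    W[np.sqrt(sq_dists) > eps] = 0.0   # keep only edges within distance eps
    np.fill_diagonal(W, 0.0)           # no self-loops
    D = np.diag(W.sum(axis=1))
    return W, D - W

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
W, L = build_graph(X, eps=1.0)
print(np.round(L, 2))
```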
Within the framework of
manifold regularization
, the graph serves as a proxy for the manifold. A term is added to the standard
Tikhonov regularization problem to enforce smoothness of the solution relative to the manifold (in the intrinsic space of the problem) as well as relative to the ambient input space. The minimization problem becomes
$\underset{f \in \mathcal{H}}{\operatorname{argmin}}\left( \frac{1}{l}\sum_{i=1}^{l} V(f(x_i), y_i) + \lambda_A \|f\|_{\mathcal{H}}^2 + \lambda_I \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f(x)\|^2 \, dp(x) \right)$

where $V$ is a loss function on the labeled examples, $\mathcal{H}$ is a reproducing kernel Hilbert space and $\mathcal{M}$ is the manifold on which the data lie. The regularization parameters $\lambda_A$ and $\lambda_I$ control smoothness in the ambient and intrinsic spaces respectively. The graph is used to approximate the intrinsic regularization term. Defining the graph Laplacian $L = D - W$, where $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$, and letting $\mathbf{f}$ denote the vector of values $[f(x_1), \dots, f(x_{l+u})]$, the intrinsic term can be approximated by the quadratic form $\mathbf{f}^{\mathsf{T}} L \mathbf{f}$.
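The sketch below (a simplified, assumed setup with a linear model $f(x) = w \cdot x$, squared loss, and NumPy only) evaluates this discretized objective, replacing the intrinsic integral with the graph-Laplacian quadratic form $\mathbf{f}^{\mathsf{T}} L \mathbf{f}$ and standing in for the RKHS norm with $\|w\|^2$.

```python
import numpy as np

def laplacian_regularized_risk(w, X, y_lab, L, lam_A=0.1, lam_I=0.1):
    """Discretized manifold-regularization objective for a linear f(x) = w.x.

    X holds all l + u points with the l labeled ones first; the intrinsic
    smoothness integral is approximated by the quadratic form f^T L f.
    """
    l = len(y_lab)
    f = X @ w                                   # predictions on all points
    data_fit = np.mean((f[:l] - y_lab) ** 2)    # (1/l) * sum of V(f(x_i), y_i), squared loss
    ambient = lam_A * np.dot(w, w)              # ambient-space penalty, stand-in for ||f||_H^2
    intrinsic = lam_I * f @ L @ f               # graph approximation of the manifold term
    return data_fit + ambient + intrinsic

# All points, labeled first: x_1, x_2 labeled; x_3..x_5 unlabeled.
X = np.array([[0.0, 0.0], [3.0, 3.0], [0.1, 0.2], [0.2, 0.1], [3.1, 2.9]])
y_lab = np.array([-1.0, 1.0])

# Fully connected Gaussian-weight graph and its Laplacian L = D - W.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-sq_dists)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

print(laplacian_regularized_risk(np.array([0.3, 0.1]), X, y_lab, L))
```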