One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.


Motivation

The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. It is estimated that a child learns almost all of the 10,000 to 30,000 object categories in the world by the age of six. This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories, one an unknown object composed of familiar shapes and the other an unknown, amorphous shape, it is much easier for humans to recognize the former than the latter. This suggests that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects.


Background

As with most classification schemes, one-shot learning involves three main challenges:
*Representation: How should objects and categories be described?
*Learning: How can such descriptions be created?
*Recognition: How can a known object be filtered from enveloping clutter, irrespective of occlusion, viewpoint, and lighting?

One-shot learning differs from single object recognition and standard category recognition algorithms in its emphasis on knowledge transfer, which makes use of previously learned categories.
*Model parameters: Reuses model parameters, based on the similarity between old and new categories. Categories are first learned on numerous training examples, then new categories are learned using transformations of model parameters from those initial categories, or by selecting relevant parameters for a classifier.
*Feature sharing: Shares parts or features of objects across categories. One algorithm extracts "diagnostic information" in patches from already learned categories by maximizing the patches' mutual information, and then applies these features to the learning of a new category. A dog category, for example, may be learned in one shot from previous knowledge of horse and cow categories, because dog objects may contain similar distinguishing patches.
*Contextual information: Appeals to global knowledge of the scene in which the object appears. Such global information can be used as frequency distributions in a conditional random field framework to recognize objects. Alternatively, context can consider camera height and scene geometry. Algorithms of this type have two advantages. First, they learn object categories that are relatively dissimilar; second, they perform well in ad hoc situations where an image has not been hand-cropped and aligned.


Theory

The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized by a mixture of constellation models. During the learning phase, the parameters of these models are learned using a conjugate density parameter posterior and variational Bayesian expectation–maximization (VBEM). In this stage the previously learned object categories inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of ''p(object | test, train)'' to ''p(background clutter | test, train)'', where ''p'' denotes a probability.


Bayesian framework

Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that the object is present with the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence; otherwise it reports the object's absence. To compute these probabilities, the object class must be modeled from a set of (1 ~ 5) training images containing examples.

To formalize these ideas, let I be the query image, which contains either an example of the foreground category O_{fg} or only background clutter of a generic background category O_{bg}. Also let I_t be the set of training images used as the foreground category. The decision of whether I contains an object from the foreground category, or only clutter from the background category, is:

: R = \frac{p(O_{fg} \mid I, I_t)}{p(O_{bg} \mid I, I_t)} = \frac{p(I \mid I_t, O_{fg})\, p(O_{fg})}{p(I \mid I_t, O_{bg})\, p(O_{bg})},

where the class posteriors p(O_{fg} \mid I, I_t) and p(O_{bg} \mid I, I_t) have been expanded by Bayes' theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image I contains an object from the foreground class if R exceeds a certain threshold T.

We next introduce parametric models for the foreground and background categories with parameters \theta and \theta_{bg} respectively. The foreground parametric model is learned during the learning stage from I_t, as well as from prior information about previously learned categories. The background model is assumed to be uniform across images. Omitting the constant ratio of category priors, \frac{p(O_{fg})}{p(O_{bg})}, and parametrizing over \theta and \theta_{bg} yields

: R \propto \frac{\int p(I \mid \theta, O_{fg})\, p(\theta \mid I_t, O_{fg})\, d\theta}{\int p(I \mid \theta_{bg}, O_{bg})\, p(\theta_{bg} \mid I_t, O_{bg})\, d\theta_{bg}} = \frac{\int p(I \mid \theta)\, p(\theta \mid I_t, O_{fg})\, d\theta}{\int p(I \mid \theta_{bg})\, p(\theta_{bg} \mid I_t, O_{bg})\, d\theta_{bg}},

having simplified p(I \mid \theta, O_{fg}) and p(I \mid \theta_{bg}, O_{bg}) to p(I \mid \theta) and p(I \mid \theta_{bg}).

The posterior distribution of model parameters given the training images, p(\theta \mid I_t, O_{fg}), is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models, which approximate the integral by collapsing the posterior to a delta function \delta(\theta^{ML}); instead, it uses a variational approach that exploits prior information from previously learned categories. However, the traditional maximum likelihood estimation of the model parameters is used for the background model and for the categories learned in advance through training.
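The decision rule above reduces to thresholding a likelihood ratio. A minimal Python sketch (not the original implementation) of that final step, assuming the two marginal likelihoods have already been evaluated in log space, might look like:

def detect_object(log_lik_fg, log_lik_bg, log_T=0.0):
    """Bayesian one-shot decision rule (sketch).

    log_lik_fg : log of the foreground marginal likelihood of the query image
    log_lik_bg : log of the (uniform) background likelihood of the query image
    log_T      : log of the decision threshold T (category priors omitted)

    Returns True if the foreground object is reported as present.
    """
    log_R = log_lik_fg - log_lik_bg   # log of the likelihood ratio R
    return log_R > log_T

# Hypothetical values, purely for illustration:
print(detect_object(log_lik_fg=-120.3, log_lik_bg=-135.8))  # True -> object present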


Object category model

For each query image I and set of training images I_t, a constellation model is used for representation. To obtain this model for a given image I, a set of N interesting regions is first detected in the image using the Kadir–Brady saliency detector. Each region selected is represented by a location in the image, X_i, and a description of its appearance, A_i. Letting X = \sum_{i=1}^{N} X_i, A = \sum_{i=1}^{N} A_i, and letting X_t and A_t be the analogous representations for the training images, the expression for R becomes:

: R \propto \frac{\int p(X, A \mid \theta, O_{fg})\, p(\theta \mid X_t, A_t, O_{fg})\, d\theta}{\int p(X, A \mid \theta_{bg}, O_{bg})\, p(\theta_{bg} \mid X_t, A_t, O_{bg})\, d\theta_{bg}} = \frac{\int p(X, A \mid \theta)\, p(\theta \mid X_t, A_t, O_{fg})\, d\theta}{\int p(X, A \mid \theta_{bg})\, p(\theta_{bg} \mid X_t, A_t, O_{bg})\, d\theta_{bg}}

The likelihoods p(X, A \mid \theta) and p(X, A \mid \theta_{bg}) are represented as mixtures of constellation models. A typical constellation model has P (3 ~ 7) parts, with N (~100) interest regions. A P-dimensional vector \mathbf{h} assigns one region of interest (out of the N regions) to each model part (for all P parts); \mathbf{h} thus denotes a ''hypothesis'' (an assignment of interest regions to model parts) for the model, and a full constellation model is represented by summing over all possible hypotheses \mathbf{h} in the hypothesis space H. Finally the likelihood is written

: p(X, A \mid \theta) = \sum_{\omega=1}^{\Omega} \sum_{\mathbf{h} \in H} p(X, A, \mathbf{h}, \omega \mid \theta).

The different \omega's represent different configurations of parts, whereas the different hypotheses \mathbf{h} represent different assignments of regions to parts, given a part model \omega. The assumption that the shape of the model (as represented by X, the collection of part locations) and its appearance are independent allows one to consider the likelihood expression p(X, A, \mathbf{h}, \omega \mid \theta) as two separate likelihoods of appearance and shape.
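As a rough illustration of the sum over hypotheses, the following Python sketch enumerates assignments of N regions to P parts by brute force and accumulates an unnormalized, appearance-only likelihood. The Gaussian parameters in part_models are hypothetical placeholders for the appearance and shape terms described in the next sections, and the enumeration is only feasible for tiny N and P:

import itertools
import numpy as np

def constellation_log_likelihood(X, A, part_models):
    """Brute-force sketch of p(X, A | theta) = sum_h p(X, A, h | theta)
    for a single mixture component (Omega = 1), appearance terms only.

    X : (N, 2) array of interest-region locations (unused in this sketch)
    A : (N, d) array of region appearance descriptors
    part_models : list of P dicts with hypothetical diagonal-Gaussian
                  parameters {'mu_app', 'var_app'} for each part.
    """
    N, P = len(A), len(part_models)
    lik = 0.0
    # A hypothesis h assigns one distinct region to each of the P parts.
    for h in itertools.permutations(range(N), P):
        log_p = 0.0
        for part, region in zip(part_models, h):
            diff = A[region] - part['mu_app']
            log_p += -0.5 * np.sum(diff ** 2 / part['var_app'])  # unnormalized
        lik += np.exp(log_p)
    return np.log(lik)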


Appearance

The appearance of each feature is represented by a point in appearance space (discussed below under implementation). "Each part p in the constellation model has a Gaussian density within this space with mean and precision parameters \theta_{p,\omega}^{app}." From these the appearance likelihood described above is computed as a product of Gaussians over the model parts for a given hypothesis h and mixture component \omega.
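A minimal sketch of evaluating one such part density under the mean-and-precision parameterization (the symbol names are illustrative, not those of the original paper):

import numpy as np

def part_appearance_log_density(a, mu, precision):
    """Log of the Gaussian appearance density of one constellation part.

    a         : appearance descriptor of the region assigned to the part
    mu        : the part's mean appearance (hypothetical parameter)
    precision : the part's precision matrix (inverse covariance)
    """
    d = len(mu)
    diff = a - mu
    sign, logdet = np.linalg.slogdet(precision)
    return 0.5 * (logdet - d * np.log(2 * np.pi) - diff @ precision @ diff)

# The appearance likelihood for a hypothesis h is the sum (in log space)
# of this quantity over the P parts, as described above.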


Shape

The shape of the model for a given mixture component \omega and hypothesis h is represented as a joint Gaussian density of the locations of features. These features are transformed into a scale- and translation-invariant space before modelling the relative location of the parts by a 2(P - 1)-dimensional Gaussian. From this, we obtain the shape likelihood, completing our representation of p(X, A, \mathbf{h}, \omega \mid \theta). In order to reduce the number of hypotheses in the hypothesis space H, only those hypotheses that satisfy the ordering constraint that the x-coordinate of each part is monotonically increasing are considered. This reduces the size of the hypothesis space by a factor of P!.
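The ordering constraint can be pictured as a simple filter over hypotheses, as in this illustrative sketch:

import itertools

def ordered_hypotheses(x_coords, P):
    """Enumerate hypotheses that satisfy the ordering constraint: the
    x-coordinate of the region assigned to each successive part must be
    monotonically increasing.

    x_coords : x-coordinates of the N detected interest regions
    P        : number of parts in the constellation model
    Returns a list of index tuples (one region index per part).
    """
    return [h for h in itertools.permutations(range(len(x_coords)), P)
            if all(x_coords[h[i]] < x_coords[h[i + 1]] for i in range(P - 1))]

# For each unordered choice of P regions, only one of its P! orderings
# survives, so the hypothesis space shrinks by a factor of P!.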


Conjugate densities

In order to compute R, the integral \int p(X, A \mid \theta)\, p(\theta \mid X_t, A_t, O)\, d\theta must be evaluated, but it is analytically intractable. The object category model above gives information about p(X, A \mid \theta), so what remains is to examine p(\theta \mid X_t, A_t, O), the posterior of \theta, and find an approximation that renders the integral tractable. Previous work approximates the posterior by a \delta function centered at \theta^{*}, collapsing the integral in question into p(X, A \mid \theta^{*}). This \theta^{*} is normally estimated using a maximum likelihood ( \theta^{*} = \theta^{ML} ) or maximum a posteriori ( \theta^{*} = \theta^{MAP} ) procedure. However, because few training examples are used in one-shot learning, the distribution will not be well-peaked, as is assumed in a \delta function approximation. Thus, instead of this traditional approximation, the Bayesian one-shot learning algorithm seeks to "find a parametric form of p(\theta) such that the learning of p(\theta \mid X_t, A_t, O_{fg}) is feasible". The algorithm employs a Normal–Wishart distribution as the conjugate prior of p(\theta \mid X_t, A_t, O_{fg}), and in the learning phase, variational Bayesian methods with the same computational complexity as maximum likelihood methods are used to learn the hyperparameters of the distribution. Then, since p(X, A \mid \theta) is a product of Gaussians, as chosen in the object category model, the integral reduces to a multivariate Student's t-distribution, which can be evaluated.
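As an illustration of why conjugacy helps, the following sketch evaluates the Student's t predictive density obtained by integrating a Gaussian likelihood against a conjugate Normal-inverse-Wishart posterior (the equivalent inverse form of the Normal–Wishart prior is used here because its predictive is easiest to state). The hyperparameter names are generic textbook ones, not those of the original paper, and SciPy >= 1.6 is assumed:

import numpy as np
from scipy.stats import multivariate_t

def predictive_log_density(x, mu_n, kappa_n, Lambda_n, nu_n):
    """Posterior-predictive log density under a Normal-inverse-Wishart
    conjugate posterior: the integral over Gaussian parameters reduces to
    a multivariate Student's t, which can be evaluated in closed form.

    mu_n, kappa_n, Lambda_n, nu_n : hypothetical posterior hyperparameters
    (mean, mean strength, scale matrix, degrees of freedom).
    """
    d = len(mu_n)
    df = nu_n - d + 1
    shape = Lambda_n * (kappa_n + 1) / (kappa_n * df)
    return multivariate_t(loc=mu_n, shape=shape, df=df).logpdf(x)

# Toy usage with made-up hyperparameters:
mu_n, Lambda_n = np.zeros(2), np.eye(2)
print(predictive_log_density(np.array([0.5, -0.2]), mu_n, kappa_n=6.0,
                             Lambda_n=Lambda_n, nu_n=10.0))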


Implementation


Feature detection and representation

To detect features in an image so that it can be represented by a constellation model, the Kadir–Brady feature detector is used on grey-scale images, finding salient regions of the image. These regions are then clustered, yielding a number of features (the clusters) and the shape parameter X, composed of the cluster centers. The Kadir–Brady detector was chosen because it produces fewer, more salient regions, as opposed to feature detectors such as multiscale Harris, which produce numerous, less significant regions. The regions are then taken from the image and rescaled to a small patch of 11 x 11 pixels, allowing each patch to be represented in 121-dimensional space. This dimensionality is reduced using principal component analysis (PCA), and A, the appearance parameter, is then formed from the first 10 principal components of each patch.
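A minimal sketch of this appearance pipeline, using hypothetical patch data (the original work clusters Kadir–Brady regions first, which is omitted here):

import numpy as np
from sklearn.decomposition import PCA

# Sketch: turn a stack of salient regions, already resized to 11x11 pixels,
# into 10-dimensional appearance descriptors via PCA (made-up data).
rng = np.random.default_rng(0)
patches = rng.random((200, 11, 11))          # 200 rescaled grey-scale patches
flat = patches.reshape(len(patches), -1)     # each patch -> 121-dim vector

pca = PCA(n_components=10)
A = pca.fit_transform(flat)                  # appearance parameters, shape (200, 10)
print(A.shape)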


Learning

To obtain shape and appearance priors, three categories (spotted cats, faces, and airplanes) are learned using maximum likelihood estimation. These object category model parameters are then used to estimate the hyper-parameters of the desired priors. Given a set of training examples, the algorithm runs the feature detector on these images, and determines model parameters from the salient regions. The hypothesis index h assigning features to parts prevents a closed-form solution of the linear model, so the posterior p(\theta, X_t, A_t, O_) is estimated by variational Bayesian expectation-maximization, which is run until parameter convergence after ~ 100 iterations. Learning a category in this fashion takes under a minute on a 2.8 GHz machine with a 4-part model and < 10 training images.
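The learning phase is essentially an iterate-until-convergence loop over variational updates. A generic, heavily simplified driver for such a loop is sketched below; the e_step and m_step callables are hypothetical placeholders for the constellation-model updates, which are not reproduced here:

import numpy as np

def vbem(init_hyperparams, e_step, m_step, tol=1e-4, max_iter=100):
    """Generic variational-EM driver of the kind described above (sketch).

    e_step(hp)        -> sufficient statistics / responsibilities (model-specific)
    m_step(stats, hp) -> updated hyperparameters (model-specific)
    The loop simply runs until the hyperparameters stop moving, or until
    max_iter (about 100 in the description above) is reached.
    """
    hp = np.asarray(init_hyperparams, dtype=float)
    for it in range(max_iter):
        stats = e_step(hp)
        new_hp = m_step(stats, hp)
        if np.max(np.abs(new_hp - hp)) < tol:
            return new_hp, it + 1
        hp = new_hp
    return hp, max_iter

# Toy usage with dummy updates that converge towards zero:
hp, iters = vbem([1.0, -2.0], e_step=lambda hp: hp, m_step=lambda s, hp: 0.5 * hp)
print(hp, iters)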


Experimental results


Motorbike example

To learn the motorbike category:
*Six training images are selected from the motorbike category of the Caltech 4 data set and the Kadir–Brady detector is applied, giving X_t and, through PCA, A_t.
*Next, the prior model parameters are computed from 30 models \theta_t, 10 from each of the three previously learned categories: spotted cats, faces, and airplanes. This prior encodes the knowledge that "models lacking visual consistency [i.e. background clutter] occupy a different part of the parameter space [from] coherent models."
*In learning, which is performed next, the prior biases the posterior p(\theta \mid X_t, A_t, O_{fg}) towards parts of the parameter space corresponding to coherent models. Only one mixture component is used, letting \Omega = 1.
*The learned motorbike model consists of the shape and appearance of its parts and the corresponding image features.
*For recognition tests, the model above is applied to 50 images which contain motorbikes and 50 which do not. Performance is summarized by an ROC curve, which plots the probability of detection against the probability of false detection; a minimal sketch of such an evaluation is given after this list.
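The sketch below computes ROC points from made-up detection scores standing in for the model's likelihood ratios:

import numpy as np

def roc_points(scores_pos, scores_neg):
    """Sweep the threshold T over the detection scores of positive
    (motorbike) and negative (background) test images and record the
    detection rate and false-alarm rate at each threshold."""
    thresholds = np.sort(np.concatenate([scores_pos, scores_neg]))[::-1]
    tpr = np.array([(scores_pos >= t).mean() for t in thresholds])
    fpr = np.array([(scores_neg >= t).mean() for t in thresholds])
    return fpr, tpr

# Hypothetical scores for 50 motorbike and 50 background images:
rng = np.random.default_rng(1)
fpr, tpr = roc_points(rng.normal(2.0, 1.0, 50), rng.normal(0.0, 1.0, 50))
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal area
print(round(float(auc), 3))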


Shared densities on transforms

Another algorithm uses knowledge transfer by model parameters to learn a new object category that is similar in appearance to previously learned categories. An image is represented as either a texture and shape, or as a latent image that has been transformed, denoted by I = T(I_L). A Siamese neural network works in tandem on two different input vectors to compute comparable output vectors.


Congealing

In this context, congealing is "the simultaneous vectorization of each of a set of images to each other". For a set of training images of a certain category, congealing iteratively transforms each image to minimize the images' joint pixelwise entropies E, where

: E = \sum_{p=1}^{P} H(\nu(p)),

"where \nu(p) is the binary random variable defined by the values of a particular pixel p across all of the images, H(\cdot) is the discrete entropy function of that variable, and 1 \leq p \leq P is the set of pixel indices for the image."

The congealing algorithm begins with a set of images I_i and a corresponding transform matrix U_i, which at the end of the algorithm will represent the transformation of I_i into its latent image I_{L,i}. These latents I_{L,i} minimize the joint pixelwise entropies. Thus the task of the congealing algorithm is to estimate the transformations U_i.

Sketch of the algorithm:
* Initialize the U_i's to the identity.
* Compute the joint pixelwise entropies of the current set of images.
* For each image I_i, iterate through all possible affine transformations A (rotation, x-translation, y-translation, x-scale, y-scale, x-shear, y-shear) and test whether A U_i decreases the joint pixelwise entropies. If so, set U_i = A U_i.
* Repeat the previous step until convergence.

At the end of the algorithm, U_i(I_i) = I_{L,i}, and T_i = U_i^{-1} transforms the latent image back into the originally observed image. A simplified sketch of this loop is shown below.
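The following Python sketch implements a stripped-down, translation-only version of this loop on binary images; the full algorithm also searches rotations, scales, and shears, and accumulates the result as an affine matrix U_i rather than a pixel shift:

import numpy as np

def joint_entropy(images):
    """Sum over pixels of the entropy of the binary pixel values across images."""
    stack = np.stack(images).astype(bool)      # (n_images, height, width)
    p1 = stack.mean(axis=0)                    # P(pixel = 1) at each location
    p0 = 1.0 - p1
    with np.errstate(divide='ignore', invalid='ignore'):
        h = -(np.where(p1 > 0, p1 * np.log2(p1), 0.0)
              + np.where(p0 > 0, p0 * np.log2(p0), 0.0))
    return h.sum()

def congeal(images, n_iters=10):
    """Translation-only congealing sketch: greedily shift each binary image
    so that the joint pixelwise entropy of the whole set decreases."""
    latents = [img.copy() for img in images]
    shifts = [(0, 0)] * len(images)            # stands in for the transforms U_i
    for _ in range(n_iters):
        improved = False
        for i in range(len(latents)):
            best_e = joint_entropy(latents)
            for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                candidate = np.roll(latents[i], (dy, dx), axis=(0, 1))
                trial = latents[:i] + [candidate] + latents[i + 1:]
                e = joint_entropy(trial)
                if e < best_e:
                    best_e, latents[i] = e, candidate
                    shifts[i] = (shifts[i][0] + dy, shifts[i][1] + dx)
                    improved = True
        if not improved:
            break
    return latents, shifts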


Classification

To use this model for classification, it must be estimated with the maximum posterior probability given an observed image I. Applying Bayes' rule to P(c_j \mid I) and parametrizing by the transformation T gives a difficult integral that must be approximated; the best transform T (the one that maps the test image to its latent image) must then be found. Once this transformation is found, the test image can be transformed into its latent image, and a nearest-neighbour classifier based on Hausdorff distance between images can classify the latent image (and thus the test image) as belonging to a particular class c_j. To find T, the test image I is inserted into the training ensemble for the congealing process. Since the test image is drawn from one of the categories c_j, congealing provides a corresponding T_j = U_j^{-1} that maps I to its latent image. The latent image can then be classified.
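A minimal sketch of the nearest-neighbour step, representing each latent image by the coordinates of its foreground pixels and using SciPy's directed Hausdorff distance:

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets of shape (N, 2) and (M, 2)."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def classify_latent(test_points, training_sets):
    """Nearest-neighbour classification of a latent image, represented here
    as the (row, col) coordinates of its foreground pixels.

    training_sets : dict mapping class label -> list of point sets
    """
    best_label, best_d = None, np.inf
    for label, examples in training_sets.items():
        for pts in examples:
            d = hausdorff(test_points, pts)
            if d < best_d:
                best_label, best_d = label, d
    return best_label

# Toy usage with made-up point sets:
train = {"A": [np.array([[0, 0], [0, 1], [1, 0]])],
         "B": [np.array([[5, 5], [5, 6], [6, 5]])]}
print(classify_latent(np.array([[0, 0], [1, 1]]), train))   # -> "A"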


Single-example classification

Given a set of transformations B_i obtained from congealing many images of a certain category, the classifier can be extended to the case where only one training example I_t of a new category c is allowed. Applying all the transformations B_i sequentially to I_t creates an artificial training set for c. This artificial data set can be made larger by borrowing transformations from many already-known categories. Once this data set is obtained, I, a test instance of c, can be classified as in the normal classification procedure. The key assumption is that categories are similar enough that the transforms from one can be applied to another.
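Continuing the translation-only simplification used in the congealing sketch above, generating such an artificial training set might look like:

import numpy as np

def expand_single_example(example, borrowed_shifts):
    """Build an artificial training set for a new category from one example
    by applying transforms borrowed from already learned categories.

    example         : binary image of the single training example I_t
    borrowed_shifts : list of (dy, dx) transforms gathered by congealing
                      other, similar categories (translation-only sketch).
    """
    return [np.roll(example, (dy, dx), axis=(0, 1))
            for dy, dx in borrowed_shifts]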


In natural language processing

With the introduction of the practice of scaling up language models such as OpenAI's GPT-2 and GPT-3, few-shot performance of such models has been shown to achieve competitive results on natural language processing (NLP) tasks, sometimes surpassing prior state-of-the-art fine-tuning approaches. Examples of such NLP tasks are translation, question answering, cloze tasks, unscrambling words, and using a novel word in a sentence. In these cases, one-shot and few-shot learning have been demonstrated purely via text interaction with the model. This is done by prepending examples directly to the prompt that is fed into the model for inference. The creation and optimisation of such prompts is called prompt engineering and is now an active field of study.
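As an illustration (not tied to any particular model or API), constructing such a few-shot prompt is just string assembly; the task description and examples below are hypothetical:

def few_shot_prompt(examples, query):
    """Build a few-shot prompt by prepending worked examples to the query;
    the resulting string is what would be sent to a language model."""
    lines = ["Translate English to French:"]    # hypothetical task description
    for en, fr in examples:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

print(few_shot_prompt([("cheese", "fromage"), ("house", "maison")], "dog"))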


See also

* Variational Bayesian methods
* Variational message passing
* Expectation–maximization algorithm
* Bayesian inference
* Feature detection
* Association rule learning
* Hopfield network
* Zero-shot learning


References

* Kadir, T.; Brady, M. (2001). "Scale, Saliency, and Image Description". International Journal of Computer Vision. 45 (2): 83–105. doi:10.1023/A:1012460413855. https://www.researchgate.net/publication/220660282