Curriculum learning is a technique in

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

in which a

model A model is an informative representation of an object, person, or system. The term originally denoted the plans of a building in late 16th-century English, and derived via French and Italian ultimately from Latin , . Models can be divided in ...

is trained on examples of increasing difficulty, where the definition of "difficulty" may be provided externally or discovered as part of the training process. This is intended to attain good performance more quickly, or to converge to a better local optimum if the

global optimum In mathematical analysis, the maximum and minimum of a function are, respectively, the greatest and least value taken by the function. Known generically as extremum, they may be defined either within a given range (the ''local'' or ''relative' ...

is not found.

Approach

Most generally, curriculum learning is the technique of successively increasing the difficulty of examples in the

training set In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...

that is presented to a model over multiple training iterations. This can produce better results than exposing the model to the full training set immediately under some circumstances; most typically, when the model is able to learn general principles from easier examples, and then gradually incorporate more complex and nuanced information as harder examples are introduced, such as

edge case An edge case is a problem or situation that occurs only at an extreme (maximum or minimum) operating parameter. For example, a stereo speaker might noticeably distort audio when played at maximum volume, even in the absence of any other extreme s ...

s. This has been shown to work in many domains, most likely as a form of

regularization Regularization may refer to: * Regularization (linguistics) * Regularization (mathematics) * Regularization (physics) * Regularization (solid modeling) * Regularization Law, an Israeli law intended to retroactively legalize settlements See also ...

. There are several major variations in how the technique is applied: * A concept of "difficulty" must be defined. This may come from human annotation or an external

heuristic A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...

; for example in

language modeling A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)"S ...

, shorter sentences might be classified as easier than longer ones. Another approach is to use the performance of another model, with examples accurately predicted by that model being classified as easier (providing a connection to boosting). * Difficulty can be increased steadily or in distinct epochs, and in a deterministic schedule or according to a

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

. This may also be moderated by a requirement for diversity at each stage, in cases where easier examples are likely to be disproportionately similar to each other. * Applications must also decide the schedule for increasing the difficulty. Simple approaches may use a fixed schedule, such as training on easy examples for half of the available iterations and then all examples for the second half. Other approaches use self-paced learning to increase the difficulty in proportion to the performance of the model on the current set. Since curriculum learning only concerns the selection and ordering of training data, it can be combined with many other techniques in machine learning. The success of the method assumes that a model trained for an easier version of the problem can generalize to harder versions, so it can be seen as a form of

transfer learning Transfer learning (TL) is a technique in machine learning (ML) in which knowledge learned from a task is re-used in order to boost performance on a related task. For example, for image classification, knowledge gained while learning to recogniz ...

. Some authors also consider curriculum learning to include other forms of progressively increasing complexity, such as increasing the number of model parameters. It is frequently combined with

reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...

, such as learning a simplified version of a game first. Some domains have shown success with anti-curriculum learning: training on the most difficult examples first. One example is the ACCAN method for

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

, which trains on the examples with the lowest

signal-to-noise ratio Signal-to-noise ratio (SNR or S/N) is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. SNR is defined as the ratio of signal power to noise power, often expressed in deci ...

first.

History

The term "curriculum learning" was introduced by

Yoshua Bengio Yoshua Bengio (born March 5, 1964) is a Canadian-French computer scientist, and a pioneer of artificial neural networks and deep learning. He is a professor at the Université de Montréal and scientific director of the AI institute Montreal In ...

et al in 2009, with reference to the

psychological Psychology is the scientific study of mind and behavior. Its subject matter includes the behavior of humans and nonhumans, both consciousness, conscious and Unconscious mind, unconscious phenomena, and mental processes such as thoughts, feel ...

technique of shaping in animals and structured education for humans: beginning with the simplest concepts and then building on them. The authors also note that the application of this technique in machine learning has its roots in the early study of

neural networks A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either Cell (biology), biological cells or signal pathways. While individual neurons are simple, many of them together in a netwo ...

such as

Jeffrey Elman Jeffrey Locke Elman (January 22, 1948 – June 28, 2018) was an American psycholinguist and professor of cognitive science at the University of California, San Diego (UCSD). He specialized in the field of neural networks. In 1990, he introduced t ...

's 1993 paper ''Learning and development in neural networks: the importance of starting small''. Bengio et al showed good results for problems in

image classification Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form o ...

, such as identifying

geometric shape A shape is a graphical representation of an object's form or its external boundary, outline, or external surface. It is distinct from other object properties, such as color, texture, or material type. In geometry, ''shape'' excludes informat ...

s with progressively more complex forms, and

, such as training with a gradually expanding

vocabulary A vocabulary (also known as a lexicon) is a set of words, typically the set in a language or the set known to an individual. The word ''vocabulary'' originated from the Latin , meaning "a word, name". It forms an essential component of languag ...

. They conclude that, for curriculum strategies, "their beneficial effect is most pronounced on the test set", suggesting good generalization. The technique has since been applied to many other domains: *

Natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

: **

Part-of-speech tagging In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up a word in a text ( corpus) as corresponding to a particular part of speech, based on both its defini ...

** Intent detection **

Sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subje ...

Machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

Speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

Language model A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)"S ...

pre-training *

Image recognition Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form o ...

: **

Facial recognition Facial recognition or face recognition may refer to: *Face detection, often a step done before facial recognition *Face perception, the process by which the human brain understands and interprets the face *Pareidolia, which involves, in part, seein ...

Object detection Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched ...

Reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...

: ** Game-playing * Graph learning *

Matrix factorization In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of ...

References

{{Reflist

Approach

History

References

Further reading