History Of Artificial Neural Networks

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by biological neural circuitry. While some of the computational implementations of ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron. Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling that period an "AI winter". Later, advances in hardware and the development of the backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s saw the development of a deep neural network (a neural network with many layers) called AlexNet. It greatly outperformed other image recognition models, and is thought to have launched the ongoing AI spring, further increasing interest in ANNs. The transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language, and is the predominant architecture used by large language models such as GPT-4. Diffusion models were first described in 2015 and began to be used by image generation models such as DALL-E in the 2020s.


Perceptrons and other early neural networks

The simplest feedforward network consists of a single weight layer without activation functions. It is just a linear map, and training it amounts to linear regression. Linear regression by the method of least squares was used by Legendre (1805) and Gauss (1795) for the prediction of planetary movement. (Merriman, Mansfield. ''A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes''. Vol. 4. Academy, 1877.)

Warren McCulloch and Walter Pitts (1943) considered a non-learning computational model for neural networks. This model paved the way for research to split into two approaches: one focused on biological processes, the other on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata. In the early 1940s, D. O. Hebb created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning, a form of unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Clark (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956).
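
To make the opening point of this section concrete, the following is a minimal NumPy sketch (illustrative only, not taken from the sources cited above): a single weight layer with no activation is fitted both in closed form by least squares and by gradient descent on the squared error, and the two procedures recover the same linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: targets generated by a linear map plus noise.
X = rng.normal(size=(200, 3))                 # 200 samples, 3 input features
true_W = np.array([[2.0], [-1.0], [0.5]])     # ground-truth weights
y = X @ true_W + 0.1 * rng.normal(size=(200, 1))

# Closed-form least squares (the Legendre/Gauss solution).
W_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same single-layer "network" trained by gradient descent on squared error.
W = np.zeros((3, 1))
lr = 0.05
for _ in range(2000):
    grad = 2 * X.T @ (X @ W - y) / len(X)     # gradient of the mean squared error
    W -= lr * grad

print(np.allclose(W, W_ls, atol=1e-3))        # True: both recover the same map
```
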
Rosenblatt (1958) created the perceptron, an algorithm for pattern recognition. With mathematical notation, Rosenblatt described circuitry beyond the basic perceptron, such as the exclusive-or circuit that could not be processed by neural networks at the time. In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells. Some say that research stagnated following Minsky and Papert's ''Perceptrons'' (1969). Frank Rosenblatt's (1958) perceptron was a multilayer perceptron (MLP) with three layers: an input layer, a hidden layer with randomized weights that did not learn, and an output layer. He later published a 1962 book that introduced variants and computer experiments, including a version with four-layer perceptrons where the last two layers have learned weights (and thus a proper multilayer perceptron). Some consider that the 1962 book developed and explored all of the basic ingredients of the deep learning systems of today.

The group method of data handling, a method to train arbitrarily deep neural networks, was published by Alexey Ivakhnenko and Lapa in 1967; they regarded it as a form of polynomial regression, or a generalization of Rosenblatt's perceptron. A 1971 paper described a deep network with eight layers trained by this method. The first deep learning multilayer perceptron trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned internal representations to classify non-linearly separable pattern classes. Subsequent developments in hardware and hyperparameter tunings have made end-to-end stochastic gradient descent the currently dominant training technique.
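
A minimal sketch of Rosenblatt-style perceptron learning, assuming the standard textbook formulation (a threshold unit whose weights are nudged on each misclassified example); the helper name and toy data are mine, and this is not Rosenblatt's original implementation.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Perceptron learning rule: y must be +1/-1, X a 2-D array of features."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi           # nudge the weights toward the example
                b += lr * yi
    return w, b

# Linearly separable toy problem: the class is the sign of x0 + x1.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w, b = train_perceptron(X, y)
print(np.mean(np.sign(X @ w + b) == y))     # expect 1.0 on this separable data
```
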


Backpropagation

Backpropagation is an efficient application of the chain rule, derived by Gottfried Wilhelm Leibniz in 1673, to networks of differentiable nodes. The terminology "back-propagating errors" was actually introduced in 1962 by Rosenblatt, but he did not know how to implement it, although Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of control theory. The modern form of backpropagation was developed multiple times in the early 1970s. The earliest published instance was Seppo Linnainmaa's master's thesis (1970). Paul Werbos developed it independently in 1971, but had difficulty publishing it until 1982. In 1986, David E. Rumelhart et al. popularized backpropagation.
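
To illustrate the idea of applying the chain rule through a network of differentiable nodes, here is a minimal NumPy sketch of a two-layer network whose gradients are computed by hand in a backward pass; it is illustrative only and not any of the historical implementations mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                       # inputs
Y = rng.normal(size=(64, 1))                       # targets
W1, W2 = rng.normal(size=(4, 8)) * 0.5, rng.normal(size=(8, 1)) * 0.5

for step in range(500):
    # Forward pass through differentiable nodes.
    H = np.tanh(X @ W1)                            # hidden layer
    P = H @ W2                                     # prediction
    loss = np.mean((P - Y) ** 2)

    # Backward pass: the chain rule applied node by node, from the loss outward.
    dP = 2 * (P - Y) / len(X)                      # dLoss/dP
    dW2 = H.T @ dP                                 # dLoss/dW2
    dH = dP @ W2.T                                 # dLoss/dH
    dW1 = X.T @ (dH * (1 - H ** 2))                # through tanh: d tanh(a)/da = 1 - tanh(a)^2
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2

print(round(float(loss), 4))                       # loss decreases over training
```
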


Recurrent network architectures

One origin of RNNs was statistical mechanics. The Ising model was developed by Wilhelm Lenz and Ernst Ising in the 1920s as a simple statistical mechanical model of magnets at equilibrium. Glauber (1963) studied the Ising model evolving in time, as a process towards equilibrium (Glauber dynamics), adding in the component of time. Shun'ichi Amari (1972) proposed to modify the weights of an Ising model by the Hebbian learning rule as a model of associative memory, adding in the component of learning. This was popularized as the Hopfield network (1982). Another origin of RNNs was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Cajal observed "recurrent semicircles" in the cerebellar cortex. In 1933, Lorente de Nó discovered "recurrent, reciprocal connections" by Golgi's method, and proposed that excitatory loops explain certain aspects of the vestibulo-ocular reflex. Hebb considered the "reverberating circuit" as an explanation for short-term memory. The McCulloch and Pitts paper (1943) considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past. Two early influential works were the Jordan network (1986) and the Elman network (1990), which applied RNNs to study cognitive psychology. In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time. (Page 150 ff. demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.)
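
A minimal sketch of the associative-memory idea described above: an Ising-like network whose weights are set by a Hebbian rule so that stored patterns become stable states. This follows the standard Hopfield formulation rather than any specific historical code, and the helper names are mine.

```python
import numpy as np

def store(patterns):
    """Hebbian weights for a Hopfield network; patterns are +/-1 vectors."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0)          # no self-connections
    return W

def recall(W, state, steps=10):
    """Asynchronous threshold updates drive the state toward a stored pattern."""
    state = state.copy()
    for _ in range(steps):
        for i in range(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(2, 64))         # two random 64-unit patterns
W = store(patterns)

noisy = patterns[0].copy()
noisy[:8] *= -1                                       # corrupt 8 of the 64 units
print(np.array_equal(recall(W, noisy), patterns[0]))  # typically True: the memory is restored
```
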


LSTM

Sepp Hochreiter's diploma thesis (1991) (S. Hochreiter, ''Untersuchungen zu dynamischen neuronalen Netzen''. Diploma thesis, Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991) proposed the neural history compressor, and identified and analyzed the vanishing gradient problem. In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time. Hochreiter proposed recurrent residual connections to solve the vanishing gradient problem. This led to long short-term memory (LSTM), published by Hochreiter and Schmidhuber in 1995. LSTM can learn "very deep learning" tasks with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. That LSTM was not yet the modern architecture, which required a "forget gate", introduced in 1999; this became the standard RNN architecture. LSTM set accuracy records in multiple application domains and became the default choice of RNN architecture. Around 2006, LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications. LSTM also improved large-vocabulary speech recognition and text-to-speech synthesis, and was used in Google voice search and dictation on Android devices. LSTM broke records for improved machine translation, language modeling and multilingual language processing. LSTM combined with convolutional neural networks (CNNs) improved automatic image captioning.
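
A minimal sketch of a single LSTM step with input, forget, and output gates, to make the gating described above concrete. This is an illustrative reconstruction of the standard formulation with a forget gate, not the original 1995 or 1999 code, and the function name and weight layout are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM cell with a forget gate.

    x: input vector, h: previous hidden state, c: previous cell state.
    W: weights of shape (4*hidden, input+hidden), b: bias of shape (4*hidden,).
    """
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate cell update
    c_new = f * c + i * g                          # gated, additive cell update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

hidden, inputs = 8, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, inputs + hidden)) * 0.1
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for t in range(5):                                  # run a few steps on random inputs
    h, c = lstm_step(rng.normal(size=inputs), h, c, W, b)
print(h.shape)                                      # (8,)
```

The additive update of the cell state is what carries information across many time steps without the gradient being repeatedly squashed.
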


Convolutional neural networks (CNNs)

The origin of the CNN architecture is the "neocognitron" introduced by Kunihiko Fukushima in 1980. It was inspired by work of Hubel and Wiesel in the 1950s and 1960s, which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted. In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function. The rectifier has become the most popular activation function for CNNs and deep neural networks in general. The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance. It did so by utilizing weight sharing in combination with backpropagation training. (Alexander Waibel et al., "Phoneme Recognition Using Time-Delay Neural Networks", ''IEEE Transactions on Acoustics, Speech, and Signal Processing'', vol. 37, no. 3, pp. 328–339, March 1989.)
Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one. In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system. In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days. (LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition", ''Neural Computation'', 1, pp. 541–551, 1989.) Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang et al. modified their model by removing the last fully connected layer and applied it to medical image object segmentation in 1991 and breast cancer detection in mammograms in 1994. In 1990, Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker-independent isolated word recognition system. In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling, where a downsampling unit computes the maximum of the activations of the units in its patch. (J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively", ''Proc. International Joint Conference on Neural Networks'', Baltimore, Maryland, vol. I, pp. 576–581, June 1992; J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images", ''Proc. 4th International Conf. on Computer Vision'', Berlin, Germany, pp. 121–128, May 1993; J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron", ''International Journal of Computer Vision'', vol. 25, no. 2, pp. 105–139, Nov. 1997.)
Max-pooling is often used in modern CNNs. LeNet-5, a 7-level CNN by Yann LeCun et al. (1998) that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32x32-pixel images. The ability to process higher-resolution images requires larger and deeper CNNs, so this technique is constrained by the availability of computing resources. In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants. (Dominik Scherer, Andreas C. Müller, and Sven Behnke, "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition", ''20th International Conference on Artificial Neural Networks (ICANN)'', pp. 92–101, 2010.) Behnke (2003) relied only on the sign of the gradient (Rprop) on problems such as image reconstruction and face localization. Rprop is a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992. (Martin Riedmiller and Heinrich Braun, "Rprop – A Fast Adaptive Learning Algorithm", ''Proceedings of the International Symposium on Computer and Information Science VII'', 1992.)
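
A minimal NumPy sketch of the three ingredients discussed above: a shared-filter convolution, the ReLU nonlinearity, and 2x2 max-pooling for downsampling. It is illustrative only (real CNN libraries implement these far more efficiently), and the helper names and the example filter are mine.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution with a single shared filter (no flipping, i.e. cross-correlation)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)          # rectified linear unit

def max_pool(x, size=2):
    """Non-overlapping max-pooling: keep the maximum of each size x size patch."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
edge_filter = np.array([[1.0, -1.0],
                        [1.0, -1.0]])              # responds to vertical edges
feature_map = max_pool(relu(conv2d(image, edge_filter)))
print(feature_map.shape)                           # (3, 3): 8x8 -> 7x7 conv -> 3x3 pooled
```

Because the filter is shared across positions and pooling keeps only patch maxima, a small shift of the object produces nearly the same pooled feature map, which is the shift invariance discussed above.
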


Deep learning

The deep learning revolution started around CNN- and GPU-based computer vision. Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs, including CNNs, for years, faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning became widespread, specialized hardware and algorithmic optimizations were developed specifically for it. A key advance for the deep learning revolution was hardware, especially GPUs, with some early work dating back to 2004. In 2009, Raina, Madhavan, and Andrew Ng reported a deep belief network with 100M parameters trained on 30 Nvidia GeForce GTX 280 GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training. In 2011, a CNN named ''DanNet'' by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3. It then won more contests. They also showed how max-pooling CNNs on GPUs improved performance significantly. Many discoveries were empirical and focused on engineering. For example, in 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU worked better than the activation functions widely used prior to 2011. In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman and Google's Inceptionv3. The success in image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs. In 2014, the state of the art was training "very deep neural networks" with 20 to 30 layers. Stacking too many layers led to a steep reduction in training accuracy, known as the "degradation" problem. In 2015, two techniques were developed concurrently to train very deep networks: the highway network and the residual neural network (ResNet). The ResNet research team attempted to train deeper networks by empirically testing various tricks until they discovered the deep residual network architecture.
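
A minimal sketch of the residual idea behind ResNet: each block adds its learned transformation to an identity shortcut, so stacking many blocks does not have to degrade what earlier layers already computed. This is illustrative only (not the ResNet reference implementation), and the function name and toy weights are mine.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the block only has to learn a correction to the identity."""
    h = np.maximum(x @ W1, 0)        # ReLU
    return x + h @ W2                # identity shortcut plus learned residual

rng = np.random.default_rng(0)
dim = 16
x = rng.normal(size=(4, dim))        # a small batch of feature vectors
for _ in range(50):                  # stack 50 blocks with small random weights
    W1 = rng.normal(size=(dim, dim)) * 0.01
    W2 = rng.normal(size=(dim, dim)) * 0.01
    x = residual_block(x, W1, W2)
print(np.round(np.std(x), 2))        # the signal is preserved rather than vanishing

# For contrast, the same stack without the shortcut drives activations toward zero.
y = rng.normal(size=(4, dim))
for _ in range(50):
    W1 = rng.normal(size=(dim, dim)) * 0.01
    W2 = rng.normal(size=(dim, dim)) * 0.01
    y = np.maximum(y @ W1, 0) @ W2
print(np.round(np.std(y), 6))        # essentially zero after 50 layers
```
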


Generative adversarial networks

In 1991, Juergen Schmidhuber published "artificial curiosity": two neural networks in a zero-sum game. The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. GANs can be regarded as a case where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. The idea was extended to "predictability minimization" to create disentangled representations of input patterns. Other people had similar ideas but did not develop them similarly. An idea involving adversarial networks was published in a 2010 blog post by Olli Niemitalo. This idea was never implemented and did not involve stochasticity in the generator, and thus was not a generative model; it is now known as a conditional GAN or cGAN. An idea similar to GANs was used to model animal behavior by Li, Gauci and Gross in 2013. Another inspiration for GANs was noise-contrastive estimation, which uses the same loss function as GANs and which Goodfellow studied during his PhD in 2010–2014. The generative adversarial network (GAN) (Ian Goodfellow et al., 2014) became state of the art in generative modeling during the 2014–2018 period. Excellent image quality was achieved by Nvidia's StyleGAN (2018), based on the Progressive GAN by Tero Karras et al., in which the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GANs reached popular success and provoked discussions concerning deepfakes. Diffusion models (2015) have since eclipsed GANs in generative modeling, with systems such as DALL·E 2 (2022) and Stable Diffusion (2022).
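
As a worked equation, the adversarial setup described above is usually written as the two-player, zero-sum objective of Goodfellow et al. (2014), in which the discriminator D plays the role of the "environmental reaction" (close to 1 for real data and close to 0 for generated samples) while the generator G is trained to fool it:

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].
\]
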


Attention mechanism and Transformer

Human selective attention had been studied in neuroscience and cognitive psychology. Selective attention in audition was studied in the cocktail party effect (Colin Cherry, 1953). Donald Broadbent (1958) proposed the filter model of attention. Selective attention in vision was studied in the 1960s through George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, in that the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve the entire visual field at once; saccade control allows the eye to quickly scan important features of a scene. This research inspired algorithms such as a variant of the neocognitron. Conversely, developments in neural networks inspired circuit models of biological visual attention. A key aspect of the attention mechanism is the use of multiplicative operations, which had been studied under the names of ''higher-order neural networks'', ''multiplication units'', ''sigma-pi units'', ''fast weight controllers'', and ''hyper-networks''.


Recurrent attention

During the deep learning era, the attention mechanism was developed to solve similar problems in encoding-decoding. The idea of encoder-decoder sequence transduction was developed in the early 2010s; the papers most commonly cited as the originators of seq2seq are two papers from 2014. A seq2seq architecture employs two RNNs, typically LSTMs, an "encoder" and a "decoder", for sequence transduction such as machine translation. Seq2seq models became state of the art in machine translation and were instrumental in the development of the attention mechanism and the Transformer. An image captioning model was proposed in 2015, citing inspiration from the seq2seq model, that would encode an input image into a fixed-length vector. Xu et al. (2015), citing Bahdanau et al. (2014), applied the attention mechanism as used in the seq2seq model to image captioning.


Transformer

One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable, as both the encoder and the decoder process the sequence token by token. The ''decomposable attention'' model attempted to solve this problem by processing the input sequence in parallel before computing a "soft alignment matrix" ("alignment" is the terminology used by Bahdanau et al. (2014)). This allowed parallel processing. The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, for example in differentiable neural computers and neural Turing machines. It was termed ''intra-attention'' where an LSTM is augmented with a memory network as it encodes an input sequence. These strands of development were combined in the Transformer architecture, published in ''Attention Is All You Need'' (2017); attention mechanisms were subsequently extended within the framework of the Transformer architecture. Seq2seq models with attention still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' applied the attention mechanism to feedforward networks, which are easy to parallelize. One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence is sufficient for language translation, hence the title "Attention is all you need". In 2017, the original (100M-sized) encoder-decoder Transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence so as to process all tokens in parallel, while preserving its dot-product attention mechanism to keep its text processing performance. Its parallelizability was an important factor in its widespread use in large neural networks.
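
A minimal NumPy sketch of the dot-product self-attention kept by the Transformer: every token's query is compared against every token's key in a single matrix product, which is why all positions can be processed in parallel. It is illustrative only, omitting multiple heads, masking, and positional information; the helper names and toy dimensions are mine.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every query against every key
    return softmax(scores) @ V                # attention-weighted mix of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))       # one token embedding per row
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8): one output per token
```
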


Unsupervised and self-supervised learning


Self-organizing maps

Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982. SOMs are neurophysiologically inspired artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning. SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions for different parts of the body.
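
A minimal sketch of SOM training by competitive learning: for each input, the best-matching unit and its neighbours on a 1-D grid are pulled toward the input, so nearby units end up representing nearby regions of the data. Kohonen's formulation also anneals the learning rate and neighbourhood width over time; the schedule and variable names here are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(size=(1000, 2))                 # inputs in the unit square
n_units = 20
weights = rng.uniform(size=(n_units, 2))           # one prototype per map unit
grid = np.arange(n_units)                          # 1-D map topology

for lr, sigma in [(0.5, 5.0), (0.2, 2.0), (0.05, 1.0)]:            # shrinking neighbourhood
    for x in data:
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))    # best-matching unit
        h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))     # neighbourhood function
        weights += lr * h[:, None] * (x - weights)                 # pull winner and neighbours

# Neighbouring units should now hold nearby prototypes (topology preservation).
print(np.round(np.linalg.norm(np.diff(weights, axis=0), axis=1).mean(), 3))
```
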


Boltzmann machines

During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski, Peter Dayan, Geoffrey Hinton, and others, including the Boltzmann machine, the restricted Boltzmann machine, the Helmholtz machine, and the wake-sleep algorithm. These were designed for unsupervised learning of deep generative models; however, they were more computationally expensive than backpropagation. The Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986 (p. 112). Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables, with a restricted Boltzmann machine (RBM) to model each layer. An RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.
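
A minimal sketch of training a binary restricted Boltzmann machine with one step of contrastive divergence, the kind of layer-wise building block referred to above. It is illustrative only: Hinton's actual procedure adds mini-batching, momentum, and weight decay, and the toy data and variable names here are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 4
W = rng.normal(size=(n_visible, n_hidden)) * 0.1
a, b = np.zeros(n_visible), np.zeros(n_hidden)       # visible and hidden biases

# Toy binary data: two repeating 6-bit patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

lr = 0.1
for epoch in range(200):
    for v0 in data:
        ph0 = sigmoid(v0 @ W + b)                     # hidden probabilities given data
        h0 = (rng.uniform(size=n_hidden) < ph0) * 1.0 # sampled hidden states
        v1 = sigmoid(h0 @ W.T + a)                    # one (mean-field) reconstruction step
        ph1 = sigmoid(v1 @ W + b)
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))   # CD-1 weight update
        a += lr * (v0 - v1)
        b += lr * (ph0 - ph1)

# Mean-field reconstruction error after training (small if learning succeeded).
print(np.round(np.mean((data - sigmoid(sigmoid(data @ W + b) @ W.T + a)) ** 2), 3))
```
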


Deep learning

In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.


Other aspects


Knowledge distillation

Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. The idea of using the output of one neural network to train another neural network was studied as the teacher-student network configuration. In 1992, several papers studied the statistical mechanics of the teacher-student network configuration, where both networks are committee machines or both are parity machines. Another early example of network distillation was also published in 1992, in the field of recurrent neural networks (RNNs). The problem was sequence prediction, and it was solved by two RNNs: one of them (the "automatizer") predicted the sequence, and the other (the "chunker") predicted the errors of the automatizer. Simultaneously, the automatizer predicted the internal states of the chunker. Once the automatizer managed to predict the chunker's internal states well, it would start fixing the errors, and soon the chunker was made obsolete, leaving just one RNN in the end. A related methodology was ''model compression'' or ''pruning'', where a trained network is reduced in size. It was inspired by neurobiological studies showing that the human brain is resistant to damage, and was studied in the 1980s via methods such as Biased Weight Decay and Optimal Brain Damage.
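
A minimal sketch of the teacher-student idea in its simplest form: a smaller model is fitted only to the outputs of a larger trained one on unlabeled inputs. This is illustrative only; it is not any of the 1992 systems described above, and modern knowledge distillation typically matches softened class probabilities rather than the raw regression targets used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "teacher": a larger, nonlinear network assumed to be already trained.
W1_t, W2_t = rng.normal(size=(4, 32)), rng.normal(size=(32, 1))
teacher = lambda X: np.tanh(X @ W1_t) @ W2_t

# The "student" is a much smaller model (a single linear layer) fitted
# only to the teacher's outputs on unlabeled inputs, not to any ground truth.
X = rng.normal(size=(2000, 4))
soft_targets = teacher(X)
W_student, *_ = np.linalg.lstsq(X, soft_targets, rcond=None)

held_out = rng.normal(size=(500, 4))
err = np.mean((teacher(held_out) - held_out @ W_student) ** 2)
print(round(float(err), 2))   # imitation error of the small student on held-out inputs
```
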


Hardware-based designs

The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s. Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices for very-large-scale principal components analyses and convolution may create a new class of neural computing, because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).


Notes


References


External links

* Lecun 2019-7-11 ACM Tech Talk: https://drive.google.com/file/d/1f0sPHv7ozHafASPwIOfuvF_RvP3FDPY0 (Google Docs, accessed 2020-02-13)