In
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
, the perceptron (or McCulloch-Pitts neuron) is an algorithm for
supervised learning
Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
of
binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of
linear classifier
In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the val ...
, i.e. a classification algorithm that makes its predictions based on a
linear predictor function In statistics and in machine learning, a linear predictor function is a linear function ( linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent varia ...
combining a set of
weights with the
feature vector
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern r ...
.
History
The perceptron was invented in 1943 by McCulloch and Pitts. The first implementation was a machine built in 1958 at the
Cornell Aeronautical Laboratory
Calspan Corporation is a science and technology company founded in 1943 as part of the Research Laboratory of the Curtiss-Wright Airplane Division at Buffalo, New York. Calspan consists of four primary operating units: Flight Research, Transportati ...
by
Frank Rosenblatt
Frank Rosenblatt (July 11, 1928July 11, 1971) was an American psychologist notable in the field of artificial intelligence. He is sometimes called the father of deep learning.
Life and career
Rosenblatt was born in New Rochelle, New York as son o ...
, funded by the United States
Office of Naval Research
The Office of Naval Research (ONR) is an organization within the United States Department of the Navy responsible for the science and technology programs of the U.S. Navy and Marine Corps. Established by Congress in 1946, its mission is to plan ...
.
The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the
IBM 704
The IBM 704 is a large digital mainframe computer introduced by IBM in 1954. It was the first mass-produced computer with hardware for floating-point arithmetic. The IBM 704 ''Manual of operation'' states:
The type 704 Electronic Data-Pro ...
, it was subsequently implemented in custom-built hardware as the "Mark 1 perceptron". This machine was designed for
image recognition
Computer vision is an interdisciplinary scientific field that deals with how computers can gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to understand and automate tasks that the hum ...
: it had an array of 400
photocell
Photodetectors, also called photosensors, are sensors of light or other electromagnetic radiation. There is a wide variety of photodetectors which may be classified by mechanism of detection, such as photoelectric or photochemical effects, or by ...
s, randomly connected to the "neurons". Weights were encoded in
potentiometer
A potentiometer is a three-terminal resistor with a sliding or rotating contact that forms an adjustable voltage divider. If only two terminals are used, one end and the wiper, it acts as a variable resistor or rheostat.
The measuring instrume ...
s, and weight updates during learning were performed by electric motors.
In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling
AI community; based on Rosenblatt's statements, ''
The New York Times
''The New York Times'' (''the Times'', ''NYT'', or the Gray Lady) is a daily newspaper based in New York City with a worldwide readership reported in 2020 to comprise a declining 840,000 paid print subscribers, and a growing 6 million paid ...
'' reported the perceptron to be "the embryo of an electronic computer that
he Navy
He or HE may refer to:
Language
* He (pronoun), an English pronoun
* He (kana), the romanization of the Japanese kana へ
* He (letter), the fifth letter of many Semitic alphabets
* He (Cyrillic), a letter of the Cyrillic script called ''He'' ...
expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This caused the field of neural network research to stagnate for many years, before it was recognised that a
feedforward neural network
A feedforward neural network (FNN) is an artificial neural network wherein connections between the nodes do ''not'' form a cycle. As such, it is different from its descendant: recurrent neural networks.
The feedforward neural network was the ...
with two or more layers (also called a
multilayer perceptron
A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean ''any'' feedforward ANN, sometimes strictly to refer to networks composed of mu ...
) had greater processing power than perceptrons with one layer (also called a
single-layer perceptron).
Single-layer perceptrons are only capable of learning
linearly separable
In Euclidean geometry, linear separability is a property of two sets of points. This is most easily visualized in two dimensions (the Euclidean plane) by thinking of one set of points as being colored blue and the other set of points as being colo ...
patterns. For a classification task with some step activation function, a single node will have a single line dividing the data points forming the patterns. More nodes can create more dividing lines, but those lines must somehow be combined to form more complex classifications. A second layer of perceptrons, or even linear nodes, are sufficient to solve a lot of otherwise non-separable problems.
In 1969, a famous book entitled ''
Perceptrons
In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...
'' by
Marvin Minsky
Marvin Lee Minsky (August 9, 1927 – January 24, 2016) was an American cognitive and computer scientist concerned largely with research of artificial intelligence (AI), co-founder of the Massachusetts Institute of Technology's AI laboratory, an ...
and
Seymour Papert
Seymour Aubrey Papert (; 29 February 1928 – 31 July 2016) was a South African-born American mathematician, computer scientist, and educator, who spent most of his career teaching and researching at MIT. He was one of the pioneers of artificial ...
showed that it was impossible for these classes of network to learn an
XOR
Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false).
It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...
function. It is often believed (incorrectly) that they also conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on ''
Perceptrons (book)
''Perceptrons: an introduction to computational geometry'' is a book written by Marvin Minsky and Seymour Papert and published in 1969. An edition with handwritten corrections and additions was released in the early 1970s. An expanded edition was ...
'' for more information.) Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until
neural network
A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
research experienced a resurgence in the 1980s. This text was reprinted in 1987 as "Perceptrons - Expanded Edition" where some errors in the original text are shown and corrected.
A 2022 article states that the Mark 1 Perceptron was "part of a previously secret four-year NPIC
he_US'_National_Photographic_Interpretation_Center.html" ;"title="National Photographic Interpretation Center">he US' National Photographic Interpretation Center">National Photographic Interpretation Center">he US' National Photographic Interpretation Centereffort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters".
The kernel perceptron algorithm was already introduced in 1964 by Aizerman et al. Margin bounds guarantees were given for the Perceptron algorithm in the general non-separable case first by Yoav Freund, Freund and Robert Schapire, Schapire (1998),
and more recently by
Mehryar Mohri, Mohri and Rostamizadeh (2013) who extend previous results and give new L1 bounds.
The perceptron is a simplified model of a biological
neuron
A neuron, neurone, or nerve cell is an electrically excitable cell that communicates with other cells via specialized connections called synapses. The neuron is the main component of nervous tissue in all animals except sponges and placozoa. N ...
. While the complexity of
biological neuron models
Biological neuron models, also known as a spiking neuron models, are mathematical descriptions of the properties of certain cells in the nervous system that generate sharp electrical potentials across their cell membrane, roughly one millisecon ...
is often required to fully understand neural behavior, research suggests a perceptron-like linear model can produce some behavior seen in real neurons.
Definition
In the modern sense, the perceptron is an algorithm for learning a binary classifier called a
threshold function: a function that maps its input
(a real-valued
vector
Vector most often refers to:
*Euclidean vector, a quantity with a magnitude and a direction
*Vector (epidemiology), an agent that carries and transmits an infectious pathogen into another living organism
Vector may also refer to:
Mathematic ...
) to an output value
(a single
binary
Binary may refer to:
Science and technology Mathematics
* Binary number, a representation of numbers using only two digits (0 and 1)
* Binary function, a function that takes two arguments
* Binary operation, a mathematical operation that t ...
value):
:
where
is a vector of real-valued weights,
is the
dot product
In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an algebra ...
, where is the number of inputs to the perceptron, and is the ''bias''. The bias shifts the decision boundary away from the origin and does not depend on any input value.
The value of
(0 or 1) is used to classify
as either a positive or a negative instance, in the case of a binary classification problem. If is negative, then the weighted combination of inputs must produce a positive value greater than
in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the
decision boundary
__NOTOC__
In a statistical-classification problem with two classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the point ...
. The perceptron learning algorithm does not terminate if the learning set is not
linearly separable
In Euclidean geometry, linear separability is a property of two sets of points. This is most easily visualized in two dimensions (the Euclidean plane) by thinking of one set of points as being colored blue and the other set of points as being colo ...
. If the vectors are not linearly separable learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean
exclusive-or
Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false).
It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , , ...
problem. The solution spaces of decision boundaries for all binary functions and learning behaviors are studied in the reference.
In the context of neural networks, a perceptron is an
artificial neuron
An artificial neuron is a mathematical function conceived as a model of biological neurons, a neural network. Artificial neurons are elementary units in an artificial neural network. The artificial neuron receives one or more inputs (representing e ...
using the
Heaviside step function
The Heaviside step function, or the unit step function, usually denoted by or (but sometimes , or ), is a step function, named after Oliver Heaviside (1850–1925), the value of which is zero for negative arguments and one for positive argume ...
as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a
multilayer perceptron
A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean ''any'' feedforward ANN, sometimes strictly to refer to networks composed of mu ...
, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest
feedforward neural network
A feedforward neural network (FNN) is an artificial neural network wherein connections between the nodes do ''not'' form a cycle. As such, it is different from its descendant: recurrent neural networks.
The feedforward neural network was the ...
.
Learning algorithm
Below is an example of a learning algorithm for a single-layer perceptron. For
multilayer perceptron
A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean ''any'' feedforward ANN, sometimes strictly to refer to networks composed of mu ...
s, where a hidden layer exists, more sophisticated algorithms such as
backpropagation
In machine learning, backpropagation (backprop, BP) is a widely used algorithm for training feedforward neural network, feedforward artificial neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANN ...
must be used. If the activation function or the underlying process being modeled by the perceptron is
nonlinear
In mathematics and science, a nonlinear system is a system in which the change of the output is not proportional to the change of the input. Nonlinear problems are of interest to engineers, biologists, physicists, mathematicians, and many other ...
, alternative learning algorithms such as the
delta rule
In machine learning, the delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network. It is a special case of the more general backpropagation algorithm. For a n ...
can be used as long as the activation function is
differentiable
In mathematics, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. In other words, the graph of a differentiable function has a non-vertical tangent line at each interior point in its ...
. Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions.
When multiple perceptrons are combined in an artificial neural network, each output neuron operates independently of all the others; thus, learning each output can be considered in isolation.
Definitions
We first define some variables:
*r is the learning rate of the perceptron. Learning rate is between 0 and 1. Larger values make the weight changes more volatile.
*
denotes the ''output'' from the perceptron for an input vector
.
*
is the ''training set'' of
samples, where:
**
is the
-dimensional input vector.
**
is the desired output value of the perceptron for that input.
We show the values of the features as follows:
*
is the value of the
th feature of the
th training ''input vector''.
*
.
To represent the weights:
*
is the
th value in the ''weight vector'', to be multiplied by the value of the
th input feature.
*Because
, the
is effectively a bias that we use instead of the bias constant
.
To show the time-dependence of
, we use:
*
is the weight
at time
.
Steps
The algorithm updates the weights after steps 2a and 2b. These weights are immediately applied to a pair in the training set, and subsequently updated, rather than waiting until all pairs in the training set have undergone these steps.
Convergence
The perceptron is a
linear classifier
In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the val ...
, therefore it will never get to the state with all the input vectors classified correctly if the training set is not
linearly separable
In Euclidean geometry, linear separability is a property of two sets of points. This is most easily visualized in two dimensions (the Euclidean plane) by thinking of one set of points as being colored blue and the other set of points as being colo ...
, i.e. if the positive examples cannot be separated from the negative examples by a hyperplane. In this case, no "approximate" solution will be gradually approached under the standard learning algorithm, but instead, learning will fail completely. Hence, if linear separability of the training set is not known a priori, one of the training variants below should be used.
If the training set ''is'' linearly separable, then the perceptron is guaranteed to converge. Furthermore, there is an upper bound on the number of times the perceptron will adjust its weights during the training.
Suppose that the input vectors from the two classes can be separated by a hyperplane with a margin
, i.e. there exists a weight vector
, and a bias term such that
for all
with
and
for all
with
, where
is the desired output value of the perceptron for input
. Also, let denote the maximum norm of an input vector. Novikoff (1962) proved that in this case the perceptron algorithm converges after making
updates. The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction with which it has a negative
dot product
In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an algebra ...
, and thus can be bounded above by , where is the number of changes to the weight vector. However, it can also be bounded below by because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector.
While the perceptron algorithm is guaranteed to converge on ''some'' solution in the case of a linearly separable training set, it may still pick ''any'' solution and problems may admit many solutions of varying quality. The ''perceptron of optimal stability'', nowadays better known as the linear
support-vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
, was designed to solve this problem (Krauth and Mezard, 1987).
Variants
The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can be used also for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications. However, these solutions appear purely stochastically and hence the pocket algorithm neither approaches them gradually in the course of learning, nor are they guaranteed to show up within a given number of learning steps.
The Maxover algorithm (Wendemuth, 1995) is
"robust" in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set. In the linearly separable case, it will solve the training problem – if desired, even with optimal stability (
maximum margin between the classes). For non-separable data sets, it will return a solution with a small number of misclassifications. In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps. Convergence is to global optimality for separable data sets and to local optimality for non-separable data sets.
The Voted Perceptron (Freund and Schapire, 1999), is a variant using multiple weighted perceptrons. The algorithm starts a new perceptron every time an example is wrongly classified, initializing the weights vector with the final weights of the last perceptron. Each perceptron will also be given another weight corresponding to how many examples do they correctly classify before wrongly classifying one, and at the end the output will be a weighted vote on all perceptrons.
In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987)
or the AdaTron (Anlauf and Biehl, 1989)). AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the
kernel trick
In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example ...
, are the conceptual foundations of the
support-vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
.
The
-perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify
analogue patterns, by projecting them into a
binary space. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable.
Another way to solve nonlinear problems without using multiple layers is to use higher order networks (sigma-pi unit). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an ''n''-order network.
It should be kept in mind, however, that the best classifier is not necessarily that which classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and the nonlinear solution is
overfitted.
Other linear classification algorithms include
Winnow,
support-vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
, and
logistic regression
In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...
.
Multiclass perceptron
Like most other techniques for training linear classifiers, the perceptron generalizes naturally to
multiclass classification
In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...
. Here, the input
and the output
are drawn from arbitrary sets. A feature representation function
maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector
, but now the resulting score is used to choose among many possible outputs:
:
Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes:
:
This multiclass feedback formulation reduces to the original perceptron when
is a real-valued vector,
is chosen from
, and
.
For certain problems, input/output representations and features can be chosen so that
can be found efficiently even though
is chosen from a very large or even infinite set.
Since 2002, perceptron training has become popular in the field of
natural language processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...
for such tasks as
part-of-speech tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definitio ...
and
syntactic parsing
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from Lat ...
(Collins, 2002). It has also been applied to large-scale machine learning problems in a
distributed computing
A distributed system is a system whose components are located on different computer network, networked computers, which communicate and coordinate their actions by message passing, passing messages to one another from any system. Distributed com ...
setting.
References
Further reading
* Aizerman, M. A. and Braverman, E. M. and Lev I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
*
Rosenblatt, Frank (1958), The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, pp. 386–408. .
*
Rosenblatt, Frank (1962), Principles of Neurodynamics. Washington, DC: Spartan Books.
*
Minsky, M. L. and Papert, S. A. 1969. ''Perceptrons''. Cambridge, MA: MIT Press.
* Gallant, S. I. (1990)
Perceptron-based learning algorithms.IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191.
* Mohri, Mehryar and Rostamizadeh, Afshin (2013)
Perceptron Mistake BoundsarXiv:1305.0208, 2013.
* Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615–622. Polytechnic Institute of Brooklyn.
*
Widrow, B., Lehr, M.A.,
30 years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation" ''Proc. IEEE'', vol 78, no 9, pp. 1415–1442, (1990).
*
Collins, M. 2002
Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithmin Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02).
* Yin, Hongfeng (1996), Perceptron-Based Algorithms and Analysis, Spectrum Library, Concordia University, Canada
External links
A Perceptron implemented in MATLAB to learn binary NAND function* Chapter
Weighted networks - the perceptronand chapter
Perceptron learningo
by
Raúl Rojas
Raúl Rojas González (born 1955, in Mexico City) is an emeritus professor of Computer Science and Mathematics at the Free University of Berlin, and a renowned specialist in artificial neural networks. The FU-Fighters, football-playing robots ...
()
History of perceptrons* Applying a perceptron model using
scikit-learn
scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support-vector ...
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
{{Authority control
Classification algorithms
Artificial neural networks
Articles with example Python (programming language) code