Perceptron

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. The perceptron is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.


History

The artificial neural network was introduced in 1943 by Warren McCulloch and Walter Pitts in ''A Logical Calculus of the Ideas Immanent in Nervous Activity''. In 1957, Frank Rosenblatt was at the Cornell Aeronautical Laboratory, where he simulated the perceptron on an IBM 704. Later, he obtained funding from the Information Systems Branch of the United States Office of Naval Research and the Rome Air Development Center to build a custom-made computer, the Mark I Perceptron. It was first publicly demonstrated on 23 June 1960. The machine was "part of a previously secret four-year NPIC [the US' National Photographic Interpretation Center] effort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters".

Rosenblatt described the details of the perceptron in a 1958 paper. His organization of a perceptron is constructed of three kinds of cells ("units"): AI, AII and R, which stand for "projection", "association" and "response". He presented the perceptron at the first international symposium on AI, ''Mechanisation of Thought Processes'', which took place in November 1958.

Rosenblatt's project was funded under Contract Nonr-401(40), the "Cognitive Systems Research Program", which lasted from 1959 to 1970, and Contract Nonr-2381(00), "Project PARA" ("Perceiving and Recognition Automata"), which lasted from 1957 to 1963. In 1959, the Institute for Defense Analysis awarded his group a $10,000 contract. By September 1961, the ONR had awarded a further $153,000 worth of contracts, with $108,000 committed for 1962. The ONR research manager, Marvin Denicoff, stated that ONR, rather than ARPA, funded the Perceptron project, because the project was unlikely to produce technological results in the near or medium term; funding from ARPA was on the order of millions of dollars, while funding from ONR was on the order of 10,000 dollars. Meanwhile, the head of the Information Processing Techniques Office (IPTO) at ARPA, J. C. R. Licklider, was interested in "self-organizing", "adaptive" and other biologically inspired methods in the 1950s; but by the mid-1960s he was openly critical of these, including the perceptron, and instead strongly favored the logical AI approach of Herbert Simon and Allen Newell.


Mark I Perceptron machine

The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the Mark I Perceptron, under the project name "Project PARA", designed for image recognition. The machine is currently in the Smithsonian National Museum of American History.

The Mark I Perceptron had three layers. One version was implemented as follows:
* An array of 400 photocells arranged in a 20x20 grid, named "sensory units" (S-units), or "input retina". Each S-unit can connect to up to 40 A-units.
* A hidden layer of 512 perceptrons, named "association units" (A-units).
* An output layer of eight perceptrons, named "response units" (R-units).

Rosenblatt called this three-layered perceptron network the ''alpha-perceptron'', to distinguish it from other perceptron models he experimented with. The S-units are connected to the A-units randomly (according to a table of random numbers) via a plugboard, to "eliminate any particular intentional bias in the perceptron". The connection weights are fixed, not learned. Rosenblatt was adamant about the random connections, as he believed the retina was randomly connected to the visual cortex, and he wanted his perceptron machine to resemble human visual perception. The A-units are connected to the R-units with adjustable weights encoded in potentiometers, and weight updates during learning were performed by electric motors. The hardware details are given in an operators' manual.

In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, ''The New York Times'' reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The Photo Division of the Central Intelligence Agency, from 1960 to 1964, studied the use of the Mark I Perceptron machine for recognizing militarily interesting silhouetted targets (such as planes and ships) in aerial photos.


''Principles of Neurodynamics'' (1962)

Rosenblatt described his experiments with many variants of the perceptron machine in the book ''Principles of Neurodynamics'' (1962), a published version of his 1961 report. Among the variants are:
* "cross-coupling" (connections between units within the same layer), with possibly closed loops,
* "back-coupling" (connections from units in a later layer to units in a previous layer),
* four-layer perceptrons where the last two layers have adjustable weights (and thus a proper multilayer perceptron),
* incorporating time-delays into perceptron units, to allow for processing sequential data,
* analyzing audio (instead of images).
The machine was shipped from Cornell to the Smithsonian in 1967, under a government transfer administered by the Office of Naval Research.


''Perceptrons'' (1969)

Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This caused the field of neural network research to stagnate for many years, before it was recognised that a feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron).

Single-layer perceptrons are only capable of learning linearly separable patterns. For a classification task with some step activation function, a single node will have a single line dividing the data points forming the patterns. More nodes can create more dividing lines, but those lines must somehow be combined to form more complex classifications. A second layer of perceptrons, or even linear nodes, is sufficient to solve many otherwise non-separable problems.

In 1969, the famous book ''Perceptrons'' by Marvin Minsky and Seymour Papert showed that it was impossible for these classes of network to learn an XOR function. It is often incorrectly believed that they also conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on ''Perceptrons (book)'' for more information.) Nevertheless, the often-miscited Minsky and Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s. The text was reprinted in 1987 as ''Perceptrons: Expanded Edition'', where some errors in the original text are shown and corrected.


Subsequent work

Rosenblatt continued working on perceptrons despite diminishing funding. The last attempt was Tobermory, built between 1961 and 1967 for speech recognition (Nagy, George. 1963. ''System and circuit designs for the Tobermory perceptron''. Technical report number 5, Cognitive Systems Research Program, Cornell University, Ithaca, New York). It occupied an entire room and had four layers, with 12,000 weights implemented by toroidal magnetic cores. By the time of its completion, simulation on digital computers had become faster than purpose-built perceptron machines. Rosenblatt died in a boating accident in 1971.

The kernel perceptron algorithm was already introduced in 1964 by Aizerman et al. Margin bound guarantees were given for the perceptron algorithm in the general non-separable case first by Freund and Schapire (1998), and more recently by Mohri and Rostamizadeh (2013), who extend previous results and give new and more favorable L1 bounds.

The perceptron is a simplified model of a biological neuron. While the complexity of biological neuron models is often required to fully understand neural behavior, research suggests a perceptron-like linear model can produce some behavior seen in real neurons. The solution spaces of decision boundaries for all binary functions and learning behaviors have also been studied.


Definition

In the modern sense, the perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input \mathbf{x} (a real-valued vector) to an output value f(\mathbf{x}) (a single binary value):
f(\mathbf{x}) = h(\mathbf{w} \cdot \mathbf{x} + b)
where h is the Heaviside step function (which outputs 1 for inputs greater than 0 and outputs 0 otherwise), \mathbf{w} is a vector of real-valued weights, \mathbf{w} \cdot \mathbf{x} is the dot product \sum_{i=1}^m w_i x_i, where m is the number of inputs to the perceptron, and b is the ''bias''. The bias shifts the decision boundary away from the origin and does not depend on any input value.

Equivalently, since \mathbf{w} \cdot \mathbf{x} + b = (\mathbf{w}, b) \cdot (\mathbf{x}, 1), we can add the bias term b as another weight w_{m+1} and append a coordinate 1 to each input \mathbf{x}, and then write the perceptron as a linear classifier that passes through the origin:
f(\mathbf{x}) = h(\mathbf{w} \cdot \mathbf{x})
The binary value of f(\mathbf{x}) (0 or 1) is used to classify \mathbf{x} as either a positive or a negative instance. Spatially, the bias shifts the position (though not the orientation) of the planar decision boundary.

In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network.
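As an illustration (not part of the original description), the threshold function above can be written as a short Python sketch; the names heaviside and perceptron_predict are placeholders, not a library API:

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 1 if z > 0, else 0."""
    return np.where(z > 0, 1, 0)

def perceptron_predict(w, b, x):
    """Threshold classifier f(x) = h(w . x + b)."""
    return heaviside(np.dot(w, x) + b)

# Example: a perceptron computing logical AND on {0, 1} inputs.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_predict(w, b, np.array(x)))  # outputs 0, 0, 0, 1
```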


Power of representation


Information theory

From an information theory point of view, a single perceptron with K inputs has a capacity of 2K bits of information. This result is due to Thomas Cover. Specifically, let T(N, K) be the number of ways to linearly separate N points in K dimensions; then
T(N, K) = \begin{cases} 2^N & K \geq N \\ 2 \sum_{k=0}^{K-1}\binom{N-1}{k} & K < N \end{cases}
When K is large, T(N, K)/2^N is very close to one when N \leq 2K, but very close to zero when N > 2K. In words, one perceptron unit can almost certainly memorize a random assignment of binary labels on N points when N \leq 2K, but almost certainly not when N > 2K.
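As an illustration of Cover's counting function (the code and names below are a sketch, not from the source), one can compute T(N, K) directly and observe the sharp transition at N = 2K:

```python
from math import comb

def cover_count(n_points, k_dims):
    """Number of binary labelings of n_points points in general position
    in k_dims dimensions that are realizable by a linear separator
    (Cover's function-counting formula)."""
    if k_dims >= n_points:
        return 2 ** n_points
    return 2 * sum(comb(n_points - 1, k) for k in range(k_dims))

# Fraction of all 2^N labelings that a single perceptron can realize:
K = 100
for N in (150, 200, 250):
    frac = cover_count(N, K) / 2 ** N
    print(N, round(frac, 4))  # close to 1 below 2K, exactly 0.5 at N = 2K, close to 0 above
```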


Boolean function

When operating on only binary inputs, a perceptron computes a linearly separable Boolean function, also called a threshold Boolean function. The sequence of numbers of threshold Boolean functions on n inputs is OEIS A000609. The value is known exactly only up to the n = 9 case, but the order of magnitude is known quite precisely: it has upper bound 2^{n^2 - n \log_2 n + O(n)} and lower bound 2^{n^2 - n \log_2 n - O(n)}. Any Boolean linear threshold function can be implemented with only integer weights. Furthermore, the number of bits necessary and sufficient for representing a single integer weight parameter is \Theta(n \ln n).
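To make this concrete, a brute-force sketch (illustrative; the chosen weight range is an assumption that suffices for n = 2 but not for large n) enumerates integer-weight threshold units and counts the distinct Boolean functions they realize:

```python
from itertools import product

def count_threshold_functions(n, weight_range=2):
    """Brute-force count of Boolean functions on n inputs realizable as
    h(w . x + b) with integer weights and bias in [-weight_range, weight_range].
    For n = 2 this recovers 14, matching OEIS A000609; larger n would need a
    wider weight range to capture every threshold function."""
    points = list(product((0, 1), repeat=n))
    realized = set()
    grid = range(-weight_range, weight_range + 1)
    for wb in product(grid, repeat=n + 1):
        w, b = wb[:n], wb[n]
        truth_table = tuple(
            1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            for x in points
        )
        realized.add(truth_table)
    return len(realized)

print(count_threshold_functions(2))  # 14 (all 16 functions except XOR and XNOR)
```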


Universal approximation theorem

* A single perceptron can learn to classify any half-space. It cannot solve any linearly nonseparable problem, such as the Boolean exclusive-or problem (the famous "XOR problem"); a brute-force illustration follows this list.
* A perceptron network with one hidden layer can learn to classify any compact subset arbitrarily closely. Similarly, it can also approximate any compactly supported continuous function arbitrarily closely. This is essentially a special case of the universal approximation theorems proved by George Cybenko and Kurt Hornik.
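As a brute-force illustration of the XOR problem (a demonstration over a finite grid of candidate weights, not a proof; all names below are placeholders), AND admits a separating threshold unit while XOR does not:

```python
from itertools import product

def grid_separable(truth_table, grid=tuple(x / 4 for x in range(-8, 9))):
    """Search a finite grid of (w1, w2, b) for a threshold unit that
    reproduces the given 2-input truth table (order: 00, 01, 10, 11)."""
    points = list(product((0, 1), repeat=2))
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in zip(points, truth_table)):
            return True
    return False

print(grid_separable((0, 0, 0, 1)))  # AND -> True
print(grid_separable((0, 1, 1, 0)))  # XOR -> False
```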


Conjunctively local perceptron

''Perceptrons'' (Minsky and Papert, 1969) studied the kind of perceptron networks necessary to learn various Boolean functions. Consider a perceptron network with n input units, one hidden layer, and one output, similar to the Mark I Perceptron machine. It computes a Boolean function of type f: 2^n \to 2. They call a function ''conjunctively local of order k'' if and only if there exists a perceptron network computing it such that each unit in the hidden layer connects to at most k input units.

Theorem (Theorem 3.1.1): The parity function is conjunctively local of order n.

Theorem (Section 5.5): The connectedness function is conjunctively local of order \Omega(n^{1/2}).


Learning algorithm for a single-layer perceptron

Below is an example of a learning algorithm for a single-layer perceptron with a single output unit. For a single-layer perceptron with multiple output units, since the weights of one output unit are completely separate from all the others', the same algorithm can be run for each output unit.

For multilayer perceptrons, where a hidden layer exists, more sophisticated algorithms such as backpropagation must be used. If the activation function or the underlying process being modeled by the perceptron is nonlinear, alternative learning algorithms such as the delta rule can be used, as long as the activation function is differentiable. Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions.

When multiple perceptrons are combined in an artificial neural network, each output neuron operates independently of all the others; thus, learning each output can be considered in isolation.


Definitions

We first define some variables:
* r is the learning rate of the perceptron. The learning rate is a positive number usually chosen to be less than 1. The larger the value, the greater the chance of volatility in the weight changes.
* y = f(\mathbf{z}) denotes the ''output'' from the perceptron for an input vector \mathbf{z}.
* D = \{(\mathbf{x}_1,d_1),\dots,(\mathbf{x}_s,d_s)\} is the ''training set'' of s samples, where:
** \mathbf{x}_j is the n-dimensional input vector.
** d_j is the desired output value of the perceptron for that input.
We show the values of the features as follows:
* x_{j,i} is the value of the ith feature of the jth training ''input vector''.
* x_{j,0} = 1.
To represent the weights:
* w_i is the ith value in the ''weight vector'', to be multiplied by the value of the ith input feature.
* Because x_{j,0} = 1, the weight w_0 is effectively a bias, used instead of the bias constant b.
To show the time-dependence of \mathbf{w}, we use:
* w_i(t) is the weight i at time t.


Steps

1. Initialize the weights. Weights may be initialized to 0 or to a small random value.
2. For each example j in the training set D, perform the following steps over the input \mathbf{x}_j and desired output d_j:
   a. Calculate the actual output: y_j(t) = f[\mathbf{w}(t) \cdot \mathbf{x}_j] = f[w_0(t)x_{j,0} + w_1(t)x_{j,1} + \cdots + w_n(t)x_{j,n}].
   b. Update the weights: w_i(t+1) = w_i(t) + r \cdot (d_j - y_j(t)) x_{j,i}, for all features 0 \leq i \leq n.
For offline learning, the second step may be repeated until the iteration error \frac{1}{s}\sum_{j=1}^s |d_j - y_j(t)| is less than a user-specified error threshold, or a predetermined number of iterations has been completed, where s is again the size of the sample set.

The algorithm updates the weights after every training sample in step 2b.
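The steps above translate directly into code. The following Python/NumPy sketch is illustrative (the function name train_perceptron and the NAND example are assumptions, not from the text); the bias is folded into w_0 via the constant feature x_{j,0} = 1, as in the definitions above.

```python
import numpy as np

def train_perceptron(X, d, r=1.0, max_epochs=100):
    """Single-layer perceptron learning.
    X: (s, n) array of inputs; d: (s,) array of desired 0/1 outputs.
    Returns the weight vector, with w[0] acting as the bias."""
    s, n = X.shape
    X1 = np.hstack([np.ones((s, 1)), X])         # prepend x_{j,0} = 1
    w = np.zeros(n + 1)                          # initialize weights to 0
    for _ in range(max_epochs):
        errors = 0
        for xj, dj in zip(X1, d):
            yj = 1 if np.dot(w, xj) > 0 else 0   # step 2a: actual output
            w += r * (dj - yj) * xj              # step 2b: weight update
            errors += int(yj != dj)
        if errors == 0:                          # all samples classified correctly
            break
    return w

# Example: learning the (linearly separable) NAND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([1, 1, 1, 0])
w = train_perceptron(X, d)
print(w, [1 if np.dot(w, np.r_[1, x]) > 0 else 0 for x in X])
```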


Convergence of one perceptron on a linearly separable dataset

A single perceptron is a linear classifier. It can only reach a stable state if all input vectors are classified correctly. In case the training set is ''not'' linearly separable, i.e. if the positive examples cannot be separated from the negative examples by a hyperplane, then the algorithm would not converge, since there is no solution. Hence, if linear separability of the training set is not known a priori, one of the training variants below should be used. Detailed analysis and extensions of the convergence theorem are in Chapter 11 of ''Perceptrons'' (1969).

Linear separability is testable in time \min(O(n^{d/2}), O(d^{2n}), O(n^{d-1} \ln n)), where n is the number of data points and d is the dimension of each point.

If the training set ''is'' linearly separable, then the perceptron is guaranteed to converge after making finitely many mistakes. The theorem is proved by Rosenblatt et al. The following simple proof is due to Novikoff (1962). The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction with which it has a negative dot product, and thus its norm can be bounded above by O(\sqrt{t}), where t is the number of changes to the weight vector. However, it can also be bounded below by a quantity proportional to t, because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector.

While the perceptron algorithm is guaranteed to converge on ''some'' solution in the case of a linearly separable training set, it may still pick ''any'' solution, and problems may admit many solutions of varying quality. The ''perceptron of optimal stability'', nowadays better known as the linear support-vector machine, was designed to solve this problem (Krauth and Mezard, 1987).
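To make the bound explicit, here is a standard formulation of Novikoff's result (stated with labels recoded to \pm 1; the symbols R and \gamma are introduced here for illustration and are not defined above). Suppose every training input satisfies \|\mathbf{x}_j\| \leq R and some unit-norm weight vector \mathbf{w}^* separates the data with margin \gamma > 0, i.e. y_j(\mathbf{w}^* \cdot \mathbf{x}_j) \geq \gamma for all j with y_j \in \{-1, +1\}. After t updates, each mistake increases \mathbf{w}_t \cdot \mathbf{w}^* by at least \gamma while increasing \|\mathbf{w}_t\|^2 by at most R^2, so
t\gamma \leq \mathbf{w}_t \cdot \mathbf{w}^* \leq \|\mathbf{w}_t\| \leq R\sqrt{t},
and therefore the number of updates satisfies t \leq (R/\gamma)^2.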


Perceptron cycling theorem

When the dataset is not linearly separable, there is no way for a single perceptron to converge. However, we still have the perceptron cycling theorem: if the training set is finite and not linearly separable, the sequence of weight vectors produced by the perceptron algorithm remains bounded, eventually revisiting a finite set of values rather than diverging. This was first proved by Bradley Efron.


Learning a Boolean function

Consider a dataset where the inputs \mathbf{x} are drawn from \{-1, +1\}^n, that is, the vertices of an n-dimensional hypercube centered at the origin, and y = \theta(x_i) for some fixed coordinate i; that is, all data points with positive x_i have y = 1, and vice versa. By the perceptron convergence theorem, a perceptron would converge after making at most n mistakes.

If we were to write a logical program to perform the same task, each positive example shows that one of the coordinates is the right one, and each negative example shows that its ''complement'' is a positive example. By collecting all the known positive examples, we eventually eliminate all but one coordinate, at which point the dataset is learned.

This bound is asymptotically tight in the worst case. In the worst case, the first presented example is entirely new and gives n bits of information, but each subsequent example differs minimally from previous examples and gives 1 bit each. After n+1 examples, there are 2n bits of information, which matches the information capacity of the perceptron. However, the bound is not tight in expectation if the examples are presented uniformly at random, since the first example would give n bits, the second n/2 bits, and so on, taking O(\ln n) examples in total.
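A small simulation (illustrative; the dataset size, random seed, and chosen coordinate are arbitrary) confirms that the number of online mistakes stays at or below n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.choice([-1, 1], size=(500, n))   # random hypercube vertices
y = (X[:, 3] > 0).astype(int)            # label = theta(x_i) for coordinate i = 3

w, mistakes = np.zeros(n), 0
for xj, dj in zip(X, y):                  # single online pass
    yj = 1 if np.dot(w, xj) > 0 else 0
    if yj != dj:
        w += (dj - yj) * xj               # perceptron update on a mistake
        mistakes += 1
print(f"{mistakes} mistakes; the convergence-theorem bound here is n = {n}")
```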


Variants

The pocket algorithm with ratchet (Gallant, 1990) solves the stability problem of perceptron learning by keeping the best solution seen so far "in its pocket". The pocket algorithm then returns the solution in the pocket, rather than the last solution. It can also be used for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications. However, these solutions appear purely stochastically, and hence the pocket algorithm neither approaches them gradually in the course of learning, nor are they guaranteed to show up within a given number of learning steps. (A sketch of the pocket idea appears at the end of this section.)

The Maxover algorithm (Wendemuth, 1995) is "robust" in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set. In the linearly separable case, it will solve the training problem, if desired even with optimal stability (maximum margin between the classes). For non-separable data sets, it will return a solution with a computable small number of misclassifications. In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps. Convergence is to global optimality for separable data sets and to local optimality for non-separable data sets.

The Voted Perceptron (Freund and Schapire, 1999) is a variant using multiple weighted perceptrons. The algorithm starts a new perceptron every time an example is wrongly classified, initializing the weight vector with the final weights of the last perceptron. Each perceptron is also given another weight corresponding to how many examples it correctly classifies before wrongly classifying one, and at the end the output is a weighted vote over all perceptrons.

In separable problems, perceptron training can also aim at finding the largest separating margin between the classes. The so-called perceptron of optimal stability can be determined by means of iterative training and optimization schemes, such as the Min-Over algorithm (Krauth and Mezard, 1987) or the AdaTron (Anlauf and Biehl, 1989). AdaTron uses the fact that the corresponding quadratic optimization problem is convex. The perceptron of optimal stability, together with the kernel trick, are the conceptual foundations of the support-vector machine.

The \alpha-perceptron further used a pre-processing layer of fixed random weights, with thresholded output units. This enabled the perceptron to classify analogue patterns, by projecting them into a binary space. In fact, for a projection space of sufficiently high dimension, patterns can become linearly separable.

Another way to solve nonlinear problems without using multiple layers is to use higher-order networks (sigma-pi units). In this type of network, each element in the input vector is extended with each pairwise combination of multiplied inputs (second order). This can be extended to an ''n''-order network. It should be kept in mind, however, that the best classifier is not necessarily the one that classifies all the training data perfectly. Indeed, if we had the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and the nonlinear solution is overfitted.

Other linear classification algorithms include Winnow, the support-vector machine, and logistic regression.
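The following Python sketch illustrates the pocket idea referenced above. It is a simplified variant (evaluating training error once per epoch rather than tracking run lengths of correct classifications), and the names are placeholders rather than Gallant's published pseudocode:

```python
import numpy as np

def pocket_perceptron(X, d, r=1.0, max_epochs=100):
    """Run ordinary perceptron updates, but return the weights that achieved
    the fewest training misclassifications seen so far ("in the pocket")."""
    s, n = X.shape
    X1 = np.hstack([np.ones((s, 1)), X])          # constant feature for the bias
    w = np.zeros(n + 1)
    pocket_w, pocket_errors = w.copy(), s + 1
    for _ in range(max_epochs):
        for xj, dj in zip(X1, d):
            yj = 1 if np.dot(w, xj) > 0 else 0
            w += r * (dj - yj) * xj               # ordinary perceptron update
        errors = int(np.sum((X1 @ w > 0).astype(int) != d))
        if errors < pocket_errors:                # ratchet: keep the best weights
            pocket_w, pocket_errors = w.copy(), errors
    return pocket_w, pocket_errors
```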


Multiclass perceptron

Like most other techniques for training linear classifiers, the perceptron generalizes naturally to multiclass classification. Here, the input x and the output y are drawn from arbitrary sets. A feature representation function f(x,y) maps each possible input/output pair to a finite-dimensional real-valued feature vector. As before, the feature vector is multiplied by a weight vector w, but now the resulting score is used to choose among many possible outputs:
\hat y = \operatorname{argmax}_y f(x,y) \cdot w.
Learning again iterates over the examples, predicting an output for each, leaving the weights unchanged when the predicted output matches the target, and changing them when it does not. The update becomes:
w_{t+1} = w_t + f(x, y) - f(x,\hat y).
This multiclass feedback formulation reduces to the original perceptron when x is a real-valued vector, y is chosen from \{0,1\}, and f(x,y) = yx.

For certain problems, input/output representations and features can be chosen so that \operatorname{argmax}_y f(x,y) \cdot w can be found efficiently even though y is chosen from a very large or even infinite set. Since 2002, perceptron training has become popular in the field of natural language processing, for such tasks as part-of-speech tagging and syntactic parsing (Collins, 2002). It has also been applied to large-scale machine learning problems in a distributed computing setting.
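A minimal sketch of the multiclass update, for the common special case in which f(x, y) places a copy of x in the block of a weight matrix corresponding to class y (i.e. one weight vector per class); the function names are illustrative, not from any particular library:

```python
import numpy as np

def train_multiclass_perceptron(X, y, n_classes, epochs=20):
    """Multiclass perceptron with one weight vector per class.
    Prediction: argmax_c  w_c . x.  On a mistake, add x to the true class's
    weights and subtract it from the predicted class's weights, which is the
    update w <- w + f(x, y) - f(x, y_hat) for the block feature map."""
    n_features = X.shape[1]
    W = np.zeros((n_classes, n_features))
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            y_hat = int(np.argmax(W @ xj))
            if y_hat != yj:
                W[yj] += xj
                W[y_hat] -= xj
    return W

def predict(W, x):
    return int(np.argmax(W @ x))
```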


References


Further reading

* Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. (1964). "Theoretical foundations of the potential function method in pattern recognition learning." ''Automation and Remote Control'', 25:821–837.
* Rosenblatt, Frank (1958). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." ''Psychological Review'', v65, No. 6, pp. 386–408. Cornell Aeronautical Laboratory.
* Rosenblatt, Frank (1962). ''Principles of Neurodynamics''. Washington, DC: Spartan Books.
* Minsky, M. L. and Papert, S. A. (1969). ''Perceptrons''. Cambridge, MA: MIT Press.
* Gallant, S. I. (1990). "Perceptron-based learning algorithms." ''IEEE Transactions on Neural Networks'', vol. 1, no. 2, pp. 179–191.
* Olazaran Rodriguez, Jose Miguel (1991). ''A Historical Sociology of Neural Network Research''. PhD Dissertation, University of Edinburgh.
* Mohri, Mehryar and Rostamizadeh, Afshin (2013). "Perceptron Mistake Bounds." arXiv:1305.0208.
* Novikoff, A. B. (1962). "On convergence proofs on perceptrons." ''Symposium on the Mathematical Theory of Automata'', 12, 615–622. Polytechnic Institute of Brooklyn.
* Widrow, B. and Lehr, M. A. (1990). "30 years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation." ''Proc. IEEE'', vol. 78, no. 9, pp. 1415–1442.
* Collins, M. (2002). "Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithm." ''Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02)''.
* Yin, Hongfeng (1996). ''Perceptron-Based Algorithms and Analysis''. Spectrum Library, Concordia University, Canada.


External links


* A Perceptron implemented in MATLAB to learn the binary NAND function.
* Chapters "Weighted networks - the perceptron" and "Perceptron learning" of ''Neural Networks: A Systematic Introduction'' by Raúl Rojas.
* History of perceptrons.
* Applying a perceptron model using scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html