A capsule neural network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization. The idea is to add structures called “capsules” to a convolutional neural network (CNN), and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher capsules. The output is a vector consisting of the probability of an observation and a pose for that observation, similar to the output produced by ''classification with localization'' in CNNs. Among other benefits, capsnets address the "Picasso problem" in image recognition: images that have all the right parts but that are not in the correct spatial relationship (e.g., in a "face", the positions of the mouth and one eye are switched). For image recognition, capsnets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level. This can be compared to inverting the rendering of an object of multiple parts.


History

In 2000, Geoffrey Hinton et al. described an imaging system that combined segmentation and recognition into a single inference process using parse trees. So-called credibility networks described the joint distribution over the latent variables and over the possible parse trees. That system proved useful on the MNIST handwritten digit database.

A dynamic routing mechanism for capsule networks was introduced by Hinton and his team in 2017. The approach was claimed to reduce error rates on MNIST and to reduce training set sizes. Results were claimed to be considerably better than a CNN on highly overlapped digits.

In Hinton's original idea one minicolumn would represent and detect one multidimensional entity.


Transformations

An invariant is an object property that does not change as a result of some transformation. For example, the area of a circle does not change if the circle is shifted to the left.

Informally, an equivariant is a property that changes predictably under transformation. For example, the center of a circle moves by the same amount as the circle when shifted.

A nonequivariant is a property whose value does not change predictably under a transformation. For example, transforming a circle into an ellipse means that its perimeter can no longer be computed as π times the diameter.

In computer vision, the class of an object is expected to be an invariant over many transformations; i.e., a cat is still a cat if it is shifted, turned upside down or shrunk in size. However, many other properties are instead equivariant. The volume of a cat changes when it is scaled.

Equivariant properties such as a spatial relationship are captured in a ''pose'', data that describes an object's translation, rotation, scale and reflection. Translation is a change in location in one or more dimensions. Rotation is a change in orientation. Scale is a change in size. Reflection is a mirror image.

Unsupervised capsnets learn a global linear manifold between an object and its pose as a matrix of weights. In other words, capsnets can identify an object independently of its pose, rather than having to learn to recognize the object while including its spatial relationships as part of the object. In capsnets, the pose can incorporate properties other than spatial relationships, e.g., color (cats can be of various colors). Multiplying the object by the manifold poses the object (for an object, in space).
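
To make the linearity claim concrete, here is a minimal sketch (my own construction, with made-up coordinates): composing a viewpoint change with a part's pose is a single matrix product, so the effect of the viewpoint change on the pose is linear even though its effect on pixels is not.

    import numpy as np

    # Illustrative sketch (my own construction, not from the capsule papers).
    # A part's pose relative to the whole, as a 2-D homogeneous transform:
    # the part sits 2 units right and 1 unit up of the whole's origin.
    part_in_whole = np.array([[1.0, 0.0, 2.0],
                              [0.0, 1.0, 1.0],
                              [0.0, 0.0, 1.0]])

    def viewpoint(theta, tx, ty):
        """Homogeneous 2-D rotation plus translation applied to the whole."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, tx],
                         [s,  c, ty],
                         [0.0, 0.0, 1.0]])

    # Under a new viewpoint, the part's pose is a single matrix product:
    # linear in the pose, however nonlinear the pixel-level effect may be.
    part_pose = viewpoint(np.pi / 4, 5.0, 0.0) @ part_in_whole
    print(part_pose)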


Pooling

Capsnets reject the pooling layer strategy of conventional CNNs that reduces the amount of detail to be processed at the next higher layer. Pooling allows a degree of translational invariance (it can recognize the same object in a somewhat different location) and allows a larger number of feature types to be represented. Capsnet proponents argue that pooling:
* violates biological shape perception in that it has no intrinsic coordinate frame;
* provides invariance (discarding positional information) instead of equivariance (disentangling that information);
* ignores the linear manifold that underlies many variations among images;
* routes statically instead of communicating a potential "find" to the feature that can appreciate it;
* damages nearby feature detectors, by deleting the information they rely upon.
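
A minimal sketch of the invariance-versus-equivariance argument (illustrative only): max pooling reports that a feature fired somewhere in its window, but discards where.

    import numpy as np

    # Illustrative: max pooling is invariant (keeps "what"), not equivariant
    # (drops "where"). The same feature response, shifted by one position:
    a = np.array([0.1, 0.9, 0.2, 0.3])
    b = np.array([0.3, 0.2, 0.9, 0.1])

    print(a.max(), b.max())        # 0.9 0.9 -> identical: position discarded
    print(a.argmax(), b.argmax())  # 1 2     -> the position pooling threw away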


Capsules

A capsule is a set of neurons that individually activate for various properties of a type of object, such as position, size and hue. Formally, a capsule is a set of neurons that collectively produce an ''activity vector'' with one element for each neuron to hold that neuron's instantiation value (e.g., hue). Graphics programs use instantiation values to draw an object. Capsnets attempt to derive these from their input. The probability of the entity's presence in a specific input is the vector's length, while the vector's orientation quantifies the capsule's properties.

Artificial neurons traditionally output a scalar, real-valued activation that loosely represents the probability of an observation. Capsnets replace scalar-output feature detectors with vector-output capsules, and max-pooling with routing-by-agreement. Because capsules are independent, when multiple capsules agree, the probability of correct detection is much higher. A minimal cluster of two capsules considering a six-dimensional entity would agree within 10% by chance only once in a million trials. As the number of dimensions increases, the likelihood of a chance agreement across a larger cluster with higher dimensions decreases exponentially.

Capsules in higher layers take outputs from capsules at lower layers, and accept those whose outputs cluster. A cluster causes the higher capsule to output a high probability of observation that an entity is present and also output a high-dimensional (20-50+) pose. Higher-level capsules ignore outliers, concentrating on clusters. This is similar to the Hough transform, the RHT (randomized Hough transform) and RANSAC from classic digital image processing.
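
The million-to-one figure can be sanity-checked with a quick simulation (my own illustrative reading of "agree within 10%": each of six independent dimensions matches with probability 0.1, so full agreement has probability 0.1^6 = 1e-6):

    import numpy as np

    # Sanity check (illustrative assumption: each of 6 dimensions
    # independently matches with probability 0.1).
    rng = np.random.default_rng(0)
    trials = 1_000_000
    a, b = rng.random((2, trials, 6))
    d = np.abs(a - b)
    d = np.minimum(d, 1.0 - d)            # wrap-around distance: P(match) = 0.1
    agreements = np.all(d < 0.05, axis=1).sum()
    print(agreements, "chance agreements in", trials, "trials")  # expect ~1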


Routing by agreement

The outputs from one capsule (child) are routed to capsules in the next layer (parent) according to the child's ability to predict the parents' outputs. Over the course of a few iterations, each parent's outputs may converge with the predictions of some children and diverge from those of others, meaning that that parent is present or absent from the scene.

For each possible parent, each child computes a prediction vector by multiplying its output by a weight matrix (trained by backpropagation). Next, the output of the parent is computed as the sum of the children's predictions, each scaled by a coefficient representing the probability that the child belongs to that parent. A child whose predictions are relatively close to the resulting output successively increases the coefficient between that parent and child and decreases it for parents that it matches less well. This increases the contribution that that child makes to that parent, thus increasing the scalar product of the capsule's prediction with the parent's output. After a few iterations, the coefficients strongly connect a parent to its most likely children, indicating that the presence of the children implies the presence of the parent in the scene. The more children whose predictions are close to a parent's output, the more quickly the coefficients grow, driving convergence. The pose of the parent (reflected in its output) progressively becomes compatible with that of its children.

The coefficients' initial logits are the log prior probabilities that a child belongs to a parent. The priors can be trained discriminatively along with the weights. The priors depend on the location and type of the child and parent capsules, but not on the current input.

At each iteration, the coefficients are adjusted via a "routing" softmax so that they continue to sum to 1 (to express the probability that a given capsule is the parent of a given child). Softmax amplifies larger values and diminishes smaller values beyond their proportion of the total. Similarly, the probability that a feature is present in the input is exaggerated by a nonlinear "squashing" function that reduces values (smaller ones drastically and larger ones such that they are less than 1).

This dynamic routing mechanism provides the necessary deprecation of alternatives ("explaining away") that is needed for segmenting overlapped objects. This learned routing of signals has no clear biological equivalent. Some operations can be found in cortical layers, but they do not seem to relate to this technique.


Math/code

The pose vector \mathbf{u}_i is rotated and translated by a matrix \mathbf{W}_{ij} into a vector \hat{\mathbf{u}}_{j|i} that predicts the output of the parent capsule:

\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \mathbf{u}_i

Capsules \mathbf{s}_j in the next higher level are fed the sum of the predictions from all capsules in the lower layer, each with a coupling coefficient c_{ij}:

\mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}
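
A numpy sketch of these two equations (the shapes are my own choice, loosely following the MNIST configuration described later):

    import numpy as np

    # Sketch of the two equations above. Shapes are my own choice:
    # 32 child capsules of dimension 8, 10 parent capsules of dimension 16.
    n_child, d_child, n_parent, d_parent = 32, 8, 10, 16

    u = np.random.randn(n_child, d_child)                      # child outputs u_i
    W = np.random.randn(n_child, n_parent, d_parent, d_child)  # learned W_ij

    # Prediction vectors: u_hat[j|i] = W_ij u_i
    u_hat = np.einsum('ijkl,il->ijk', W, u)       # (n_child, n_parent, d_parent)

    # Parent inputs: s_j = sum_i c_ij u_hat[j|i]; uniform coefficients here
    c = np.full((n_child, n_parent), 1.0 / n_parent)
    s = np.einsum('ij,ijk->jk', c, u_hat)         # (n_parent, d_parent)
    print(s.shape)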


Procedure softmax

The coupling coefficients from a capsule i in layer l to all capsules in layer l+1 sum to one, and are defined by a "routing softmax". The initial logits b_{ij} are prior log probabilities for the routing, that is, the prior probability that capsule i in layer l should connect to capsule j in layer l+1. Normalization of the coupling coefficients:

\begin{array}{l}
1: \textbf{procedure} ~ \mathrm{softmax}(\mathbf{b}, i) \\
2: \quad \textbf{for all} ~ \text{capsules} ~ j ~ \text{in layer} ~ (l+1) ~ \textbf{do} \\
3: \qquad c_{ij} \leftarrow \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \\
4: \quad \textbf{return} ~ \mathbf{c}_i \\
\end{array}

For this procedure to be optimal it would have to memorize several values, and reset those values on each iteration. That is, if the vector \mathbf{b} changes, then the memorized values must be updated. It is not shown how this should be done, nor how the divisor should be memorized.


Procedure squash

Because the length of the vectors represents probabilities, they should be between zero and one, and to do that a squashing function is applied:

\mathbf{v}_j = \frac{\|\mathbf{s}_j\|^2}{1 + \|\mathbf{s}_j\|^2} \, \frac{\mathbf{s}_j}{\|\mathbf{s}_j\|}

A vector squashed to zero has a vanishing gradient.
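
The squashing function in numpy (a sketch):

    import numpy as np

    # The squashing nonlinearity: preserves direction, maps length to (0, 1).
    def squash(s, axis=-1, eps=1e-9):
        sq_norm = np.sum(s * s, axis=axis, keepdims=True)
        return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

    print(np.linalg.norm(squash(np.array([0.1, 0.0]))))   # ~0.0099: toward zero
    print(np.linalg.norm(squash(np.array([10.0, 0.0]))))  # ~0.99: toward one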


Procedure routing

One approach to routing is the following:

\begin{array}{l}
~~1: \textbf{procedure} ~ \mathrm{routing}(\hat{\mathbf{u}}_{j|i}, r, l) \\
~~2: \quad \triangleright ~ \text{argument: the prediction vectors} ~ \hat{\mathbf{u}}_{j|i} \\
~~3: \quad \triangleright ~ \text{argument: the number of iterations} ~ r \\
~~4: \quad \triangleright ~ \text{argument: the index of the source capsule layer} ~ l \\
~~5: \quad \triangleright ~ \text{returns: the outputs} ~ \mathbf{v}_j ~ \text{of the capsules in layer} ~ (l+1) \\
~~6: \quad \textbf{for all}~\text{capsule}~i~\text{in layer}~l,~\text{capsule}~j~\text{in layer}~(l+1)\textbf{:}~ b_{ij} \leftarrow 0 \\
~~7: \quad \textbf{for}~r~\text{iterations}~\textbf{do} \\
~~8: \qquad \textbf{for all}~\text{capsule}~i~\text{in layer}~l\textbf{:}~ \mathbf{c}_i \leftarrow \operatorname{softmax}(\mathbf{b}, i) \\
~~9: \qquad \textbf{for all}~\text{capsule}~j~\text{in layer}~(l+1)\textbf{:}~ \mathbf{s}_j \leftarrow \sum_i c_{ij} \hat{\mathbf{u}}_{j|i} \\
10: \qquad \textbf{for all}~\text{capsule}~j~\text{in layer}~(l+1)\textbf{:}~ \mathbf{v}_j \leftarrow \operatorname{squash}(\mathbf{s}_j) \\
11: \qquad \textbf{for all}~\text{capsule}~i~\text{in layer}~l,~j~\text{in layer}~(l+1)\textbf{:}~ b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j \\
12: \quad \textbf{return}~\mathbf{v}_j \\
\end{array}

At line 8, the softmax function can be replaced by any type of winner-take-all network. Biologically this somewhat resembles chandelier cells, but they can also be involved in calculation of coupling coefficients (line 9) or calculation of agreements (line 11). At line 9, the weight matrix for the coupling coefficients and the hidden prediction matrix are shown. The structure in layers I and II is somewhat similar to the cerebral cortex if stellate cells are assumed to be involved in transposing input vectors. Whether both types of stellate cells have the same function is not clear, as layer I has excitatory spiny cells and layer II has inhibitory aspiny cells. The latter indicates a much different network. At line 10, the squash function can be replaced by other functions and network topologies that retain the vector direction. The procedure conducts r iterations, usually 4-5, with l the index for the source capsule layer (the primary layer, where the routing goes ''from'') and l+1 the next higher capsule layer.
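
The whole procedure, sketched in numpy (illustrative; published implementations differ in framework and detail):

    import numpy as np

    # Compact numpy sketch of the routing procedure above.
    def squash(s, axis=-1, eps=1e-9):
        sq = np.sum(s * s, axis=axis, keepdims=True)
        return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

    def routing(u_hat, r=3):
        """u_hat: (n_child, n_parent, d_parent) prediction vectors."""
        n_child, n_parent, _ = u_hat.shape
        b = np.zeros((n_child, n_parent))              # line 6: zero logits
        for _ in range(r):                             # line 7
            e = np.exp(b - b.max(axis=1, keepdims=True))
            c = e / e.sum(axis=1, keepdims=True)       # line 8: routing softmax
            s = np.einsum('ij,ijk->jk', c, u_hat)      # line 9: weighted sums
            v = squash(s)                              # line 10
            b += np.einsum('ijk,jk->ij', u_hat, v)     # line 11: agreement
        return v                                       # line 12

    u_hat = np.random.randn(32 * 6 * 6, 10, 16)
    print(routing(u_hat).shape)                        # (10, 16)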


Training

Learning is supervised. The network is trained by minimizing the Euclidean distance between the image and the output of a CNN that reconstructs the input from the output of the terminal capsules. The network is discriminatively trained, using iterative routing-by-agreement. The activity vectors of all but the correct parent are masked.


Margin loss

The length of the instantiation vector represents the probability that a capsule's entity is present in the scene. A top-level capsule has a long vector if and only if its associated entity is present. To allow for multiple entities, a separate margin loss is computed for each capsule. Downweighting the loss for absent entities stops the learning from shrinking activity vector lengths for all entities. The total loss is the sum of the losses of all entities. In Hinton's example the loss function is:

L_k = \underbrace{T_k \max(0, m^+ - \|\mathbf{v}_k\|)^2}_{\text{class present}} + \lambda \underbrace{(1 - T_k) \max(0, \|\mathbf{v}_k\| - m^-)^2}_{\text{class absent}}, \qquad T_k = \begin{cases} 1, & \text{if a digit of class} ~ k ~ \text{is present} \\ 0, & \text{otherwise} \end{cases}

This type of loss function is common in ANNs. The parameters m^+ and m^- are set so the length does not max out or collapse, m^+ = 0.9 and m^- = 0.1. Down-weighting of initial weights for absent classes is controlled by \lambda, with \lambda = 0.5 as a reasonable choice.
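
The margin loss in numpy (a sketch; variable names are mine):

    import numpy as np

    # The margin loss above.
    def margin_loss(v, T, m_plus=0.9, m_minus=0.1, lam=0.5):
        """v: (n_class, d) digit-capsule outputs; T: (n_class,) one-hot labels."""
        lengths = np.linalg.norm(v, axis=-1)
        present = T * np.maximum(0.0, m_plus - lengths) ** 2
        absent = lam * (1.0 - T) * np.maximum(0.0, lengths - m_minus) ** 2
        return (present + absent).sum()   # total loss: sum over all classes

    v = np.random.randn(10, 16) * 0.05    # short vectors: nothing detected yet
    T = np.eye(10)[3]                     # the digit "3" is present
    print(margin_loss(v, T))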


Reconstruction loss

An additional reconstruction loss encourages entities to encode their inputs' instantiation parameters. The final activity vector is then used to reconstruct the input image via a CNN decoder consisting of 3 fully connected layers. The reconstruction minimizes the sum of squared differences between the outputs of the logistic units and the pixel intensities. This reconstruction loss is scaled down by 0.0005 so that it does not dominate the margin loss during training.
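
A sketch of the masked reconstruction described above; the single-layer `decoder` here is a hypothetical stand-in for the paper's three fully connected layers:

    import numpy as np

    # Masked reconstruction, sketched.
    def reconstruction_loss(v, label, decoder, image, scale=0.0005):
        masked = np.zeros_like(v)
        masked[label] = v[label]          # keep only the correct digit capsule
        recon = decoder(masked.ravel())   # (784,) logistic pixel outputs
        return scale * np.sum((recon - image.ravel()) ** 2)

    # hypothetical one-layer stand-in for the trained 3-layer decoder
    W = np.random.randn(784, 160) * 0.01
    decoder = lambda x: 1.0 / (1.0 + np.exp(-(W @ x)))  # sigmoid pixels

    v = np.random.randn(10, 16)
    image = np.random.rand(28, 28)
    print(reconstruction_loss(v, label=3, decoder=decoder, image=image))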


Example configuration

The first convolutional layers perform feature extraction. For the 28x28 pixel MNIST image test, an initial 256 9x9 pixel convolutional kernels (using stride 1 and rectified linear unit (ReLU) activation, defining 20x20 receptive fields) convert the pixel input into 1D feature activations and induce nonlinearity.

The primary (lowest) capsule layer divides the 256 kernels into 32 capsules of 8 9x9 kernels each (using stride 2, defining 6x6 receptive fields). Capsule activations effectively invert the graphics rendering process, going from pixels to features. A single weight matrix is used by each capsule across all receptive fields. Each primary capsule sees all of the lower-layer outputs whose fields overlap with the center of the field in the primary layer. Each primary capsule output (for a particular field) is an 8-dimensional vector.

A second, digit capsule layer has one 16-dimensional capsule for each digit (0-9). Dynamic routing connects (only) the primary and digit capsule layers. A [32 x 6 x 6] x 10 weight matrix controls the mapping between layers.

Capsnets are hierarchical, in that each lower-level capsule contributes significantly to only one higher-level capsule. However, replicating learned knowledge remains valuable. To achieve this, a capsnet's lower layers are convolutional, including hidden capsule layers. Higher layers thus cover larger regions, while retaining information about the precise position of each object within the region. For low-level capsules, location information is "place-coded" according to which capsule is active. Higher up, more and more of the positional information is rate-coded in the capsule's output vector. This shift from place-coding to rate-coding, combined with the fact that higher-level capsules represent more complex objects with more degrees of freedom, suggests that capsule dimensionality increases with level.
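
The arithmetic behind these shapes can be checked directly (my own walk-through of the figures quoted above):

    # Shape walk-through of the configuration above.
    conv1 = (28 - 9 + 1, 28 - 9 + 1)           # 9x9 kernels, stride 1 -> (20, 20)
    primary_grid = ((20 - 9) // 2 + 1,) * 2    # 9x9 kernels, stride 2 -> (6, 6)
    n_primary = 32 * 6 * 6                     # 1152 primary capsules, 8-D each
    n_digit = 10                               # 16-D digit capsules
    print(conv1, primary_grid, n_primary)      # (20, 20) (6, 6) 1152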


Human vision

Human vision examines a sequence of focal points (directed by saccades), processing only a fraction of the scene at its highest resolution.

Capsnets build on inspirations from cortical minicolumns (also called cortical microcolumns) in the cerebral cortex. A minicolumn is a structure containing 80-120 neurons, with a diameter of about 28-40 µm, spanning all layers in the cerebral cortex. All neurons in the larger minicolumns have the same receptive field, and they output their activations as action potentials or spikes. Neurons within the microcolumn receive common inputs, have common outputs, are interconnected and may constitute a fundamental computational unit of the cerebral cortex.

Capsnets explore the intuition that the human visual system creates a tree-like structure for each focal point and coordinates these trees to recognize objects. However, with capsnets each tree is "carved" from a fixed network (by adjusting coefficients) rather than assembled on the fly.


Alternatives

CapsNets are claimed to have four major conceptual advantages over convolutional neural networks (CNNs):
* Viewpoint invariance: the use of pose matrices allows capsule networks to recognize objects regardless of the perspective from which they are viewed.
* Fewer parameters: because capsules group neurons, the connections between layers require fewer parameters.
* Better generalization to new viewpoints: CNNs, when trained to understand rotations, often learn that an object can be viewed similarly from several different rotations. However, capsule networks generalize better to new viewpoints because pose matrices can capture these characteristics as linear transformations.
* Defense against white-box adversarial attacks: the Fast Gradient Sign Method (FGSM) is a typical method for attacking CNNs. It evaluates the gradient of each pixel against the loss of the network, and changes each pixel by at most epsilon (the error term) to maximize the loss (see the sketch after this list). Although this method can drop the accuracy of CNNs dramatically (e.g., to below 20%), capsule networks maintain accuracy above 70%.
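
A generic FGSM sketch (not tied to any framework; `grad_wrt_pixels` stands in for a real backpropagated gradient of the loss with respect to the input):

    import numpy as np

    # FGSM as described in the last bullet above.
    def fgsm(image, grad_wrt_pixels, epsilon=0.1):
        """Move each pixel by at most epsilon in the loss-increasing direction."""
        adversarial = image + epsilon * np.sign(grad_wrt_pixels)
        return np.clip(adversarial, 0.0, 1.0)   # keep valid pixel intensities

    image = np.random.rand(28, 28)
    grad = np.random.randn(28, 28)              # placeholder for dL/dx
    print(np.abs(fgsm(image, grad) - image).max() <= 0.1)   # True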
Purely convolutional nets cannot generalize to unlearned viewpoints (other than translation). For other affine transformations, either feature detectors must be repeated on a grid that grows exponentially with the number of transformation dimensions, or the size of the labelled training set must (exponentially) expand to encompass those viewpoints. These exponential explosions make them unsuitable for larger problems.

Capsnet's transformation matrices learn the (viewpoint-independent) spatial relationship between a part and a whole, allowing the latter to be recognized based on such relationships. However, capsnets assume that each location displays at most one instance of a capsule's object. This assumption allows a capsule to use a distributed representation (its activity vector) of an object to represent that object at that location.

Capsnets use neural activities that vary with viewpoint. They do not have to normalize objects (as in spatial transformer networks) and can even recognize multiply transformed objects. Capsnets can also process segmented objects.


See also

* Convolutional neural network
* Geoffrey Hinton
* MNIST database


Notes

In Hinton's own words this is "wild speculation".


References


External links

* Sun, Weiwei; Tagliasacchi, Andrea; Deng, Boyang; Sabour, Sara; Yazdani, Soroosh; Hinton, Geoffrey; Yi, Kwang Moo (2020-12-08). "Canonical Capsules: Unsupervised Capsules in Canonical Pose". arXiv:2012.04718 [cs.CV].