Convolutional layer

In artificial neural networks, a convolutional layer is a type of network layer that applies a convolution operation to the input. Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry. The convolution operation in a convolutional layer involves sliding a small window (called a kernel or filter) across the input data and computing the dot product between the values in the kernel and the input at each position. This process creates a feature map that represents detected features in the input.


Concepts


Kernel

Kernels, also known as filters, are small matrices of weights that are learned during the training process. Each kernel is responsible for detecting a specific feature in the input data. The size of the kernel is a hyperparameter that determines how large a local region of the input each output value depends on (the layer's receptive field).


Convolution

For a 2D input x and a 2D kernel w, the 2D convolution operation can be expressed as: y[i,j] = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} x[i+m, j+n] \cdot w[m,n] where k_h and k_w are the height and width of the kernel, respectively. This generalizes immediately to nD convolutions. Commonly used convolutions are 1D (for audio and text), 2D (for images), and 3D (for spatial objects and videos).
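As an illustration, here is a minimal NumPy sketch of this operation; like most deep learning libraries, it does not flip the kernel (so it is technically cross-correlation), and the function name conv2d is our own:

import numpy as np

def conv2d(x, w):
    # Valid 2D convolution (no padding, stride 1); the kernel is not flipped,
    # matching the convention used in most CNN libraries
    k_h, k_w = w.shape
    out_h = x.shape[0] - k_h + 1
    out_w = x.shape[1] - k_w + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the input window at (i, j)
            y[i, j] = np.sum(x[i:i + k_h, j:j + k_w] * w)
    return y

x = np.arange(16.0).reshape(4, 4)
w = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(x, w))  # 3x3 feature map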


Stride

Stride determines how the kernel moves across the input data. A stride of 1 means the kernel shifts by one pixel at a time, while a larger stride (e.g., 2 or 3) results in less overlap between convolutions and produces smaller output feature maps.
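The output size follows directly from the input size, kernel size, and stride. A minimal sketch of the standard formula for the unpadded case (output_size is our own helper name):

def output_size(n, k, s):
    # Spatial output size of an unpadded convolution: floor((n - k) / s) + 1
    return (n - k) // s + 1

print(output_size(7, 3, 1))  # 5: the kernel visits every position
print(output_size(7, 3, 2))  # 3: the kernel skips every other position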


Padding

Padding involves adding extra pixels around the edges of the input data. It serves two main purposes:

* Preserving spatial dimensions: Without padding, each convolution reduces the size of the feature map.
* Handling border pixels: Padding ensures that border pixels are given equal importance in the convolution process.

Common padding strategies include:

* No padding/valid padding: This strategy typically causes the output to shrink.
* Same padding: Any method that ensures the output size is the same as the input size is a same padding strategy.
* Full padding: Any method that ensures each input entry is convolved over the same number of times is a full padding strategy.

Common padding algorithms include:

* Zero padding: Add zero entries to the borders of the input.
* Mirror/reflect/symmetric padding: Reflect the input array at the border.
* Circular padding: Cycle the input array back to the opposite border, like a torus.

The exact convolution arithmetic is complicated; we refer to (Dumoulin and Visin, 2018) for details.
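The three padding algorithms above map directly onto modes of numpy.pad; a minimal 1D sketch:

import numpy as np

x = np.array([1, 2, 3, 4])
print(np.pad(x, 2, mode='constant'))   # zero padding:     [0 0 1 2 3 4 0 0]
print(np.pad(x, 2, mode='reflect'))    # mirror padding:   [3 2 1 2 3 4 3 2]
print(np.pad(x, 2, mode='symmetric'))  # symmetric:        [2 1 1 2 3 4 4 3]
print(np.pad(x, 2, mode='wrap'))       # circular (torus): [3 4 1 2 3 4 1 2]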


Variants


Standard

The basic form of convolution as described above, where each kernel is applied to the entire input volume.


Depthwise separable

Depthwise separable convolution decomposes a single standard convolution into two steps: a depthwise convolution that filters each input channel independently, and a pointwise (1 \times 1) convolution that combines the outputs of the depthwise convolution across channels. This factorization significantly reduces computational cost. It was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size.
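To see where the savings come from, compare multiplication counts per output position: a standard k \times k convolution with C_in input and C_out output channels costs k^2 C_in C_out, while the depthwise-plus-pointwise factorization costs k^2 C_in + C_in C_out. A back-of-the-envelope sketch (the helper names are ours):

def standard_cost(k, c_in, c_out):
    # Multiplications per output position for a standard k x k convolution
    return k * k * c_in * c_out

def separable_cost(k, c_in, c_out):
    # Depthwise (k x k per input channel) plus pointwise (1 x 1) multiplications
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 128
print(standard_cost(k, c_in, c_out))   # 147456
print(separable_cost(k, c_in, c_out))  # 17536, roughly an 8x reduction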


Dilated

Dilated convolution, or atrous convolution, introduces gaps between kernel elements, allowing the network to capture a larger receptive field without increasing the kernel size.
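One way to picture dilation is as zero-insertion into the kernel: a k \times k kernel with dilation d covers a window of size d(k-1)+1 per side while keeping only k^2 weights. A minimal sketch (dilate_kernel is our own name):

import numpy as np

def dilate_kernel(w, d):
    # Insert d-1 zeros between kernel elements; the covered window grows
    # to d*(k-1)+1 per side while the number of weights stays k*k
    k = w.shape[0]
    out = np.zeros((d * (k - 1) + 1, d * (k - 1) + 1))
    out[::d, ::d] = w
    return out

print(dilate_kernel(np.ones((3, 3)), 2).shape)  # (5, 5): a 3x3 kernel sees a 5x5 window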


Transposed

Transposed convolution, also known as deconvolution, fractionally strided convolution, or upsampling convolution, is a convolution whose output tensor is larger than its input tensor. It is often used in encoder-decoder architectures for upsampling, with applications in image generation, semantic segmentation, and super-resolution tasks.
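Ignoring kernel-flipping conventions, a stride-s transposed convolution can be sketched as zero-insertion upsampling followed by a fully padded ordinary convolution, which yields the usual output size (n-1)s + k; transposed_conv2d is our own name:

import numpy as np

def transposed_conv2d(x, w, s):
    k = w.shape[0]
    # Upsample by inserting s-1 zeros between neighboring input entries
    up = np.zeros((s * (x.shape[0] - 1) + 1, s * (x.shape[1] - 1) + 1))
    up[::s, ::s] = x
    # Full padding: pad by k-1 on every side, then slide the kernel everywhere
    up = np.pad(up, k - 1)
    out_h, out_w = up.shape[0] - k + 1, up.shape[1] - k + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(up[i:i + k, j:j + k] * w)
    return y

print(transposed_conv2d(np.ones((2, 2)), np.ones((3, 3)), 2).shape)  # (5, 5) from a 2x2 input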


History

The concept of convolution in neural networks was inspired by the visual cortex in biological brains. Early work by Hubel and Wiesel in the 1960s on the cat's visual system laid the groundwork for artificial convolutional networks. An early convolutional neural network was developed by Kunihiko Fukushima in 1969. It had mostly hand-designed kernels inspired by convolutions in mammalian vision. In 1979 he improved it to the Neocognitron, which ''learns'' all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'").

During the 1988 to 1998 period, a series of CNNs was introduced by Yann LeCun et al., ending with LeNet-5 in 1998. It was an early influential CNN architecture for handwritten digit recognition, trained on the MNIST dataset, and was deployed in ATMs. (Olshausen & Field, 1996) discovered that simple cells in the mammalian primary visual cortex implement localized, oriented, bandpass receptive fields, which could be recreated by fitting sparse linear codes for natural scenes. This was later found to also occur in the lowest-level kernels of trained CNNs.

The field saw a resurgence in the 2010s with the development of deeper architectures and the availability of large datasets and powerful GPUs. AlexNet, developed by Alex Krizhevsky et al. in 2012, was a catalytic event in modern deep learning. In that year's ImageNet competition, the AlexNet model achieved a 16% top-five error rate, significantly outperforming the next best entry, which had a 26% error rate. The network used eight trainable layers, approximately 650,000 neurons, and around 60 million parameters, highlighting the impact of deeper architectures and GPU acceleration on image recognition performance. From the 2013 ImageNet competition onward, most entries adopted deep convolutional neural networks, building on the success of AlexNet. Over the following years, performance steadily improved, with the top-five error rate falling from 16% in 2012 and 12% in 2013 to below 3% by 2017, as networks grew increasingly deep.


See also

* Convolutional neural network
* Pooling layer
* Feature learning
* Deep learning
* Computer vision


References

* Dumoulin, Vincent; Visin, Francesco (2018). "A guide to convolution arithmetic for deep learning". arXiv:1603.07285.
* Olshausen, Bruno A.; Field, David J. (1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". ''Nature''. 381: 607–609.