An autoencoder is a type of

artificial neural network In machine learning, a neural network (also artificial neural network or neural net, abbreviated ANN or NN) is a computational model inspired by the structure and functions of biological neural networks. A neural network consists of connected ...

used to learn efficient codings of unlabeled data (

unsupervised learning Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, wh ...

). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for

dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

, to generate lower-dimensional embeddings for subsequent use by other

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (''sparse'', ''denoising'' and ''contractive'' autoencoders), which are effective in learning representations for subsequent

classification Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

tasks, and ''variational'' autoencoders, which can be used as

generative model In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsiste ...

s. Autoencoders are applied to many problems, including

facial recognition Facial recognition or face recognition may refer to: *Face detection, often a step done before facial recognition *Face perception, the process by which the human brain understands and interprets the face *Pareidolia, which involves, in part, seein ...

, feature detection,

anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of ...

, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data.

Mathematical principles

Definition

An autoencoder is defined by the following components:

Two sets: the space of decoded messages $\mathcal X$ ; the space of encoded messages $\mathcal Z$ . Typically $\mathcal X$ and $\mathcal Z$ are
Euclidean space Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are ''Euclidean spaces ...
s, that is, $\mathcal X = \R^m, \mathcal Z = \R^n$ with $m > n.$

Two parametrized families of functions: the encoder family $E_\phi:\mathcal \rightarrow \mathcal$ , parametrized by $\phi$ ; the decoder family $D_\theta:\mathcal \rightarrow \mathcal$ , parametrized by $\theta$ .

For any

x\in \mathcal X

, we usually write

z = E_\phi(x)

, and refer to it as the code, the

latent variable In statistics, latent variables (from Latin: present participle of ) are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or measured. Such '' latent va ...

, latent representation, latent vector, etc. Conversely, for any

z\in \mathcal Z

, we usually write

x' = D_\theta(z)

, and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as

multilayer perceptron In deep learning, a multilayer perceptron (MLP) is a name for a modern feedforward neural network consisting of fully connected neurons with nonlinear activation functions, organized in layers, notable for being able to distinguish data that is ...

s (MLPs). For example, a one-layer-MLP encoder

E_\phi

is: :

E_\phi(\mathbf x) = \sigma(Wx+b)

where

\sigma

is an element-wise

activation function The activation function of a node in an artificial neural network is a function that calculates the output of the node based on its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation f ...

W

is a "weight" matrix, and

b

is a "bias" vector.

Training an autoencoder

An autoencoder, by itself, is simply a tuple of two functions. To judge its ''quality'', we need a ''task''. A task is defined by a reference probability distribution

\mu_

over

\mathcal X

, and a "reconstruction quality" function

d: \mathcal X \times  \mathcal X \to, \infty /math>, such that d(x, x') measures how much x' differs from x .

With those, we can define the loss function for the autoencoder as L(\theta, \phi) := \mathbb \mathbb E_(x, D_\theta(E_\phi(x))) /math>The ''optimal'' autoencoder for the given task (\mu_, d) is then \arg\min_L(\theta, \phi) . The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by

gradient descent Gradient descent is a method for unconstrained mathematical optimization. It is a first-order iterative algorithm for minimizing a differentiable multivariate function. The idea is to take repeated steps in the opposite direction of the gradi ...

. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset

\ \subset \mathcal X

, so that

\mu_ = \frac\sum_^N \delta_

where

\delta_

is the

Dirac measure In mathematics, a Dirac measure assigns a size to a set based solely on whether it contains a fixed element ''x'' or not. It is one way of formalizing the idea of the Dirac delta function, an important tool in physics and other technical fields. ...

, the quality function is just L2 loss:

d(x, x') = \, x - x'\, _2^2

, and

\, \cdot\, _2

is the

Euclidean norm Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's ''Elements'', it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are ''Euclidean spaces'' ...

. Then the problem of searching for the optimal autoencoder is just a

least-squares The method of least squares is a mathematical optimization technique that aims to determine the best fit function by minimizing the sum of the squares of the differences between the observed values and the predicted values of the model. The me ...

optimization:

\min_ L(\theta, \phi),\qquad \text L(\theta, \phi) = \frac\sum_^N \, x_i - D_\theta(E_\phi(x_i))\, _2^2

Interpretation

An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function

d

. The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space

\mathcal Z

usually has fewer dimensions than the message space

\mathcal

. Such an autoencoder is called ''undercomplete''. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code

z

in the code space is used to encode a message

x

that really appears in the distribution

\mu_

, and the decoder is also perfect:

D_\theta(E_\phi(x)) = x

. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code

z

and obtaining

D_\theta(z)

, which is a message that really appears in the distribution

\mu_

. If the code space

\mathcal Z

has dimension larger than (''overcomplete''), or equal to, the message space

\mathcal

, or the hidden units are given enough capacity, an autoencoder can learn the

identity function Graph of the identity function on the real numbers In mathematics, an identity function, also called an identity relation, identity map or identity transformation, is a function that always returns the value that was used as its argument, unc ...

and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.

Variations

Variational autoencoder (VAE)

Variational autoencoder In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian metho ...

s (VAEs) belong to the families of

variational Bayesian methods Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually ...

. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. Given an input dataset

x

characterized by an unknown probability function

P(x)

and a multivariate latent encoding vector

z

, the objective is to model the data as a distribution

p_\theta(x)

, with

\theta

defined as the set of the network parameters so that

p_\theta(x) = \int_p_\theta(x,z)dz

Sparse autoencoder (SAE)

Inspired by the

sparse coding Neural coding (or neural representation) is a neuroscience field concerned with characterising the hypothetical relationship between the Stimulus (physiology), stimulus and the neuronal responses, and the relationship among the Electrophysiology, e ...

hypothesis in neuroscience, ''sparse autoencoders'' (SAE) are variants of autoencoders, such that the codes

E_\phi(x)

for messages tend to be ''sparse codes'', that is,

E_\phi(x)

is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. Autoencoder sparso

There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder:

f_k(x_1, ..., x_n) = (x_1 b_1, ..., x_n b_n)

where

b_i = 1

, x_i,

ranks in the top k, and 0 otherwise. Backpropagating through

f_k

is simple: set gradient to 0 for

b_i = 0

entries, and keep gradient for

b_i=1

entries. This is essentially a generalized

ReLU In the context of Neural network (machine learning), artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function ...

function. The other way is a relaxed version of the k-sparse autoencoder. Instead of forcing sparsity, we add a sparsity regularization loss, then optimize for

\min_L(\theta, \phi) + \lambda L_ (\theta, \phi)

where

\lambda > 0

measures how much sparsity we want to enforce. Let the autoencoder architecture have

K

layers. To define a sparsity regularization loss, we need a "desired" sparsity

\hat \rho_k

for each layer, a weight

w_k

for how much to enforce each sparsity, and a function

s:

, 1 The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...

times

\to , \infty/math> to measure how much two sparsities differ. For each input

x

, let the actual sparsity of activation in each layer

k

\rho_k(x) = \frac 1n \sum_^n a_(x)

where

a_(x)

is the activation in the

i

-th neuron of the

k

-th layer upon input

x

. The sparsity loss upon input

x

for one layer is

s(\hat\rho_k, \rho_k(x))

, and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:

L_(\theta, \phi) = \mathbb \mathbb E_\left sum_ w_k s(\hat\rho_k, \rho_k(x)) \right /math>Typically, the function s is either the Kullback-Leibler (KL) divergence, as Ng, A. (2011)

Sparse autoencoder
''CS294A Lecture notes'', ''72''(2011), 1-19. ::

s(\rho, \hat\rho) = KL(\rho , ,  \hat) = \rho \log \frac+(1- \rho)\log \frac

or the L1 loss, as

s(\rho, \hat\rho) = , \rho- \hat\rho,

, or the L2 loss, as

s(\rho, \hat\rho) = , \rho- \hat\rho, ^2

. Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as

L_(\theta, \phi) = \mathbb \mathbb E_\left h_k\, 
\right /math>where h_k is the activation vector in the k -th layer of the autoencoder. The norm \, \cdot\, is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).

Denoising autoencoder (DAE)

''Denoising autoencoders'' (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''. A DAE, originally called a "robust autoassociative network" by Mark A. Kramer, is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution

\mu_T

over functions

T:\mathcal X \to \mathcal X

. That is, the function

T

takes a message

x\in \mathcal X

, and corrupts it to a noisy version

T(x)

. The function

T

is selected randomly, with a probability distribution

\mu_T

. Given a task

(\mu_, d)

, the problem of training a DAE is the optimization problem:

\min_L(\theta, \phi) = \mathbb \mathbb E_(x, (D_\theta\circ E_\phi \circ T)(x)) /math>That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising"''.''

Usually, the noise process T is applied only during training and testing, not during downstream use.

The use of DAE depends on two assumptions:
* There exist representations to the messages that are relatively stable and robust to the type of noise we are likely to encounter;
* The said representations capture structures in the input distribution that are useful for our purposes. Example noise processes include:

* additive isotropic

Gaussian noise Carl Friedrich Gauss (1777–1855) is the eponym of all of the topics listed below. There are over 100 topics all named after this German mathematician and scientist, all in the fields of mathematics, physics, and astronomy. The English eponymo ...

, * masking noise (a fraction of the input is randomly chosen and set to 0) * salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).

Contractive autoencoder (CAE)

A ''contractive autoencoder'' (CAE) adds the contractive regularization loss to the standard autoencoder loss:

\min_L(\theta, \phi) + \lambda L_ (\theta, \phi)

where

\lambda > 0

measures how much contractive-ness we want to enforce. The contractive regularization loss itself is defined as the expected square of

Frobenius norm In the field of mathematics, norms are defined for elements within a vector space. Specifically, when the vector space comprises matrices, such norms are referred to as matrix norms. Matrix norms differ from vector norms in that they must also ...

of the

Jacobian matrix In vector calculus, the Jacobian matrix (, ) of a vector-valued function of several variables is the matrix of all its first-order partial derivatives. If this matrix is square, that is, if the number of variables equals the number of component ...

of the encoder activations with respect to the input:

L_(\theta, \phi) = \mathbb E_ \, \nabla_x E_\phi(x) \, _F^2

To understand what

L_

measures, note the fact

\, E_\phi(x + \delta x)  - E_\phi(x)\, _2 \leq \, \nabla_x E_\phi(x) \, _F \, \delta x\, _2

for any message

x\in \mathcal X

, and small variation

\delta x

in it. Thus, if

\, \nabla_x E_\phi(x) \, _F^2

is small, it means that a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means small variation in the message leads to small, perhaps even zero, variation in its code, like how two pictures may look the same even if they are not exactly the same. The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.

Minimum description length autoencoder (MDL-AE)

A ''minimum description length autoencoder'' (MDL-AE) is an advanced variation of the traditional autoencoder, which leverages principles from information theory, specifically the Minimum Description Length (MDL) principle. The MDL principle posits that the best model for a dataset is the one that provides the shortest combined encoding of the model and the data. In the context of

autoencoders An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function ...

, this principle is applied to ensure that the learned representation is not only compact but also interpretable and efficient for reconstruction. The MDL-AE seeks to minimize the total description length of the data, which includes the size of the latent representation (code length) and the error in reconstructing the original data. The objective can be expressed as

L_ + L_

, where

L_

represents the length of the compressed latent representation and

L_

denotes the reconstruction error.

Concrete autoencoder (CAE)

The ''concrete autoencoder'' is designed for discrete feature selection. A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the

categorical distribution In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can ...

to allow gradients to pass through the feature selector layer, which makes it possible to use standard

backpropagation In machine learning, backpropagation is a gradient computation method commonly used for training a neural network to compute its parameter updates. It is an efficient application of the chain rule to neural networks. Backpropagation computes th ...

to learn an optimal subset of input features that minimize reconstruction loss.

Advantages of depth

Autoencoders are often trained with a single-layer encoder and a single-layer decoder, but using many-layered (deep) encoders and decoders offers many advantages. * Depth can exponentially reduce the computational cost of representing some functions. * Depth can exponentially decrease the amount of training data needed to learn some functions. * Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.

Training

Geoffrey Hinton Geoffrey Everest Hinton (born 1947) is a British-Canadian computer scientist, cognitive scientist, and cognitive psychologist known for his work on artificial neural networks, which earned him the title "the Godfather of AI". Hinton is Univer ...

developed the

deep belief network In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a class of deep neural network, composed of multiple layers of latent variables ("hidden units"), with connections between the layers but not b ...

technique for training many-layered deep autoencoders. His method involves treating each neighboring set of two layers as a

restricted Boltzmann machine A restricted Boltzmann machine (RBM) (also called a restricted Sherrington–Kirkpatrick model with external field or restricted stochastic Ising–Lenz–Little model) is a generative stochastic artificial neural network that can learn a prob ...

so that pretraining approximates a good solution, then using backpropagation to fine-tune the results. Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders. A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method. However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.

History

(Oja, 1982) noted that PCA is equivalent to a neural network with one hidden layer with identity activation function. In the language of autoencoding, the input-to-hidden module is the encoder, and the hidden-to-output module is the decoder. Subsequently, in (Baldi and Hornik, 1989) and (Kramer, 1991) generalized PCA to autoencoders, which they termed as "nonlinear PCA". Immediately after the resurgence of neural networks in the 1980s, it was suggested in 1986 that a neural network be put in "auto-association mode". This was then implemented in (Harrison, 1987) and (Elman, Zipser, 1988) for speech and in (Cottrell, Munro, Zipser, 1987) for images. In (Hinton, Salakhutdinov, 2006),

s were developed. These train a pair

s as encoder-decoder pairs, then train another pair on the latent representation of the first pair, and so on. The first applications of AE date to early 1990s. Their most traditional application was

feature learning In machine learning (ML), feature learning or representation learning is a set of techniques that allow a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual fea ...

, but the concept became widely used for learning

s of data.Generating Faces with Torch, Boesen A., Larsen L. and Sonderby S.K., 2015 Some of the most powerful AIs in the 2010s involved autoencoder modules as a component of larger AI systems, such as VAE in

Stable Diffusion Stable Diffusion is a deep learning, text-to-image model released in 2022 based on Diffusion model, diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of ...

, discrete VAE in Transformer-based image generators like DALL-E 1, etc. During the early days, when the terminology was uncertain, the autoencoder has also been called identity mapping, auto-associating, self-supervised

, or Diabolo network.

Applications

The two main applications of autoencoders are

and

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

(or associative memory), but modern variations have been applied to other tasks.

Dimensionality reduction

Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...

was one of the first

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

applications. For Hinton's 2006 study, he pretrained a multi-layer autoencoder with a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters. Reducing dimensions can improve performance on tasks such as classification. Indeed, the hallmark of dimensionality reduction is to place semantically related examples near each other.

Principal component analysis

If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to

principal component analysis Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data is linearly transformed onto a new coordinate system such that th ...

(PCA). The weights of an autoencoder with a single hidden layer of size

p

(where

p

is less than the size of the input) span the same vector subspace as the one spanned by the first

p

principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the

singular value decomposition In linear algebra, the singular value decomposition (SVD) is a Matrix decomposition, factorization of a real number, real or complex number, complex matrix (mathematics), matrix into a rotation, followed by a rescaling followed by another rota ...

. However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.

Information retrieval and Search engine optimization

Information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

benefits particularly from

in that search can become more efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to semantic hashing, proposed by Salakhutdinov and Hinton in 2007. By training the algorithm to produce a low-dimensional binary code, all database entries could be stored in a

hash table In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps Unique key, keys to Value (computer science), values. ...

mapping binary code vectors to entries. This table would then support information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits from the query encoding. The encoder-decoder architecture, often used in natural language processing and neural networks, can be scientifically applied in the field of SEO (Search Engine Optimization) in various ways: # Text Processing: By using an autoencoder, it's possible to compress the text of web pages into a more compact vector representation. This can help reduce page loading times and improve indexing by search engines. # # Noise Reduction: Autoencoders can be used to remove noise from the textual data of web pages. This can lead to a better understanding of the content by search engines, thereby enhancing ranking in search engine result pages. # # Meta Tag and Snippet Generation: Autoencoders can be trained to automatically generate meta tags, snippets, and descriptions for web pages using the page content. This can optimize the presentation in search results, increasing the Click-Through Rate (CTR). # # Content Clustering: Using an autoencoder, web pages with similar content can be automatically grouped together. This can help organize the website logically and improve navigation, potentially positively affecting user experience and search engine rankings. # # Generation of Related Content: An autoencoder can be employed to generate content related to what is already present on the site. This can enhance the website's attractiveness to search engines and provide users with additional relevant information. # # Keyword Detection: Autoencoders can be trained to identify keywords and important concepts within the content of web pages. This can assist in optimizing keyword usage for better indexing. # # Semantic Search: By using autoencoder techniques, semantic representation models of content can be created. These models can be used to enhance search engines' understanding of the themes covered in web pages. In essence, the encoder-decoder architecture or autoencoders can be leveraged in SEO to optimize web page content, improve their indexing, and enhance their appeal to both search engines and users.

Anomaly detection

Another application for autoencoders is

.An, J., & Cho, S. (2015)
Variational Autoencoder based Anomaly Detection using Reconstruction Probability
''Special Lecture on IE'', ''2'', 1-18. By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model should worsen its reconstruction performance. In most cases, only data with normal instances are used to train the autoencoder; in others, the frequency of anomalies is small compared to the observation set so that its contribution to the learned representation could be ignored. After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data. Reconstruction error (the error between the original data and its low dimensional reconstruction) is used as an anomaly score to detect anomalies. Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection.

Image processing

The characteristics of autoencoders are useful in image processing. One example can be found in lossy

image compression Image compression is a type of data compression applied to digital images, to reduce their cost for computer data storage, storage or data transmission, transmission. Algorithms may take advantage of visual perception and the statistical properti ...

, where autoencoders outperformed other approaches and proved competitive against

JPEG 2000 JPEG 2000 (JP2) is an image compression standard and coding system. It was developed from 1997 to 2000 by a Joint Photographic Experts Group committee chaired by Touradj Ebrahimi (later the JPEG president), with the intention of superseding their ...

. Another useful application of autoencoders in image preprocessing is image denoising. Autoencoders found use in more demanding contexts such as

medical imaging Medical imaging is the technique and process of imaging the interior of a body for clinical analysis and medical intervention, as well as visual representation of the function of some organs or tissues (physiology). Medical imaging seeks to revea ...

where they have been used for image denoising as well as super-resolution. In image-assisted diagnosis, experiments have applied autoencoders for

breast cancer Breast cancer is a cancer that develops from breast tissue. Signs of breast cancer may include a Breast lump, lump in the breast, a change in breast shape, dimpling of the skin, Milk-rejection sign, milk rejection, fluid coming from the nipp ...

detection and for modelling the relation between the cognitive decline of

Alzheimer's disease Alzheimer's disease (AD) is a neurodegenerative disease and the cause of 60–70% of cases of dementia. The most common early symptom is difficulty in remembering recent events. As the disease advances, symptoms can include problems wit ...

and the latent features of an autoencoder trained with

MRI Magnetic resonance imaging (MRI) is a medical imaging technique used in radiology to generate pictures of the anatomy and the physiological processes inside the body. MRI scanners use strong magnetic fields, magnetic field gradients, and rad ...

Drug discovery

In 2019 molecules generated with variational autoencoders were validated experimentally in mice.

Popularity prediction

Recently, a stacked autoencoder framework produced promising results in predicting popularity of social media posts, which is helpful for online advertising strategies.

Machine translation

Autoencoders have been applied to

machine translation Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages. Early approaches were mostly rule-based or statisti ...

, which is usually referred to as

neural machine translation Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. It is the dominant a ...

(NMT). Unlike traditional autoencoders, the output does not match the input - it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated.

Language Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which humans convey meaning, both in spoken and signed language, signed forms, and may also be conveyed through writing syste ...

-specific autoencoders incorporate further

linguistic Linguistics is the scientific study of language. The areas of linguistic analysis are syntax (rules governing the structure of sentences), semantics (meaning), Morphology (linguistics), morphology (structure of words), phonetics (speech sounds ...

features into the learning procedure, such as Chinese decomposition features. Machine translation is rarely still done with autoencoders, due to the availability of more effective

transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...

networks.

Communication Systems

Autoencoders in communication systems are important because they help in encoding data into a more resilient representation for channel impairments, which is crucial for transmitting information while minimizing errors. In Addition, AE-based systems can optimize end-to-end communication performance. This approach can solve the several limitations of designing communication systems such as the inherent difficulty in accurately modeling the complex behavior of real-world channels.

References

{{Noise Neural network architectures Unsupervised learning Dimension reduction

Mathematical principles

Definition

Training an autoencoder

Interpretation

Variations

Variational autoencoder (VAE)

Sparse autoencoder (SAE)

Denoising autoencoder (DAE)

Contractive autoencoder (CAE)

Minimum description length autoencoder (MDL-AE)

Concrete autoencoder (CAE)

Advantages of depth

Training

History

Applications

Dimensionality reduction

Principal component analysis

Information retrieval and Search engine optimization

Anomaly detection

Image processing

Drug discovery

Popularity prediction

Machine translation

Communication Systems

See also

Further reading

References