Extreme learning machines are feedforward neural networks for classification, regression, clustering, sparse approximation, compression and feature learning with a single layer or multiple layers of hidden nodes, where the parameters of the hidden nodes (not just the weights connecting inputs to hidden nodes) need not be tuned. These hidden nodes can be randomly assigned and never updated (i.e. they are random projections but with nonlinear transforms), or can be inherited from their ancestors without being changed. In most cases, the output weights of hidden nodes are learned in a single step, which essentially amounts to learning a linear model. The name "extreme learning machine" (ELM) was given to such models by their main inventor, Guang-Bin Huang. According to their creators, these models are able to produce good generalization performance and learn thousands of times faster than networks trained using backpropagation. The literature also shows that these models can outperform support vector machines in both classification and regression applications.


History

From 2001 to 2010, ELM research mainly focused on the unified learning framework for "generalized" single-hidden-layer feedforward neural networks (SLFNs), including but not limited to sigmoid networks, RBF networks, threshold networks, trigonometric networks, fuzzy inference systems, Fourier series, Laplacian transforms, wavelet networks, etc. One significant achievement of those years was the rigorous proof of the universal approximation and classification capabilities of ELM. From 2010 to 2015, ELM research extended to the unified learning framework for kernel learning, SVMs and a few typical feature learning methods such as principal component analysis (PCA) and non-negative matrix factorization (NMF). It has been shown that SVM actually provides suboptimal solutions compared to ELM, and that ELM can provide a whitebox kernel mapping, implemented by ELM random feature mapping, instead of the blackbox kernel used in SVM. PCA and NMF can be considered special cases in which linear hidden nodes are used in ELM. From 2015 to 2017, an increased focus was placed on hierarchical implementations of ELM. Additionally, since 2011, significant biological studies have been made that support certain ELM theories. From 2017 onwards, to overcome the low-convergence problem during training, approaches based on LU decomposition, Hessenberg decomposition and QR decomposition with regularization have begun to attract attention. In a 2017 announcement from Google Scholar, "Classic Papers: Articles That Have Stood The Test of Time", two ELM papers were listed in the "Top 10 in Artificial Intelligence for 2006", taking positions 2 and 7.


Algorithms

Given a single hidden layer of ELM, suppose that the output function of the i-th hidden node is h_i(\mathbf{x}) = G(\mathbf{a}_i, b_i, \mathbf{x}), where \mathbf{a}_i and b_i are the parameters of the i-th hidden node. The output function of the ELM for single hidden layer feedforward networks (SLFN) with L hidden nodes is:

:f_L(\mathbf{x}) = \sum_{i=1}^L \boldsymbol\beta_i h_i(\mathbf{x})

where \boldsymbol\beta_i is the output weight of the i-th hidden node.

\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), \ldots, h_L(\mathbf{x})] is the hidden layer output mapping of ELM. Given N training samples, the hidden layer output matrix \mathbf{H} of ELM is given as:

:\mathbf{H} = \begin{bmatrix} \mathbf{h}(\mathbf{x}_1)\\ \vdots\\ \mathbf{h}(\mathbf{x}_N) \end{bmatrix} = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_1)\\ \vdots & \ddots & \vdots\\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_L, b_L, \mathbf{x}_N) \end{bmatrix}

and \mathbf{T} is the training data target matrix:

:\mathbf{T} = \begin{bmatrix} \mathbf{t}_1\\ \vdots\\ \mathbf{t}_N \end{bmatrix}

Generally speaking, ELM is a kind of regularization neural network but with non-tuned hidden layer mappings (formed by either random hidden nodes, kernels or other implementations); its objective function is:

:\text{Minimize: } \|\boldsymbol\beta\|_p^{\sigma_1} + C\|\mathbf{H}\boldsymbol\beta - \mathbf{T}\|_q^{\sigma_2}

where \sigma_1 > 0, \sigma_2 > 0, and p, q = 0, \frac{1}{2}, 1, 2, \cdots, +\infty. Different combinations of \sigma_1, \sigma_2, p and q can be used and result in different learning algorithms for regression, classification, sparse coding, compression, feature learning and clustering.

As a special case, the simplest ELM training algorithm learns a model of the form (for single hidden layer sigmoid neural networks):

:\hat{\mathbf{Y}} = \mathbf{W}_2 \sigma(\mathbf{W}_1 x)

where \mathbf{W}_1 is the matrix of input-to-hidden-layer weights, \sigma is an activation function, and \mathbf{W}_2 is the matrix of hidden-to-output-layer weights. The algorithm proceeds as follows:

# Fill \mathbf{W}_1 with random values (e.g., Gaussian random noise);
# estimate \mathbf{W}_2 by least-squares fit to a matrix of response variables \mathbf{Y}, computed using the pseudoinverse, given a design matrix \mathbf{X}:
#:\mathbf{W}_2 = \sigma(\mathbf{W}_1 \mathbf{X})^+ \mathbf{Y}
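
As an illustration, the following is a minimal NumPy sketch of this special case (random input weights, least-squares output weights via the Moore–Penrose pseudoinverse). The function names, the sigmoid choice, the toy data, and the row-per-sample convention H = σ(XW₁ + b) (the transpose of the column convention written above) are illustrative assumptions, not part of any official ELM implementation.

```python
import numpy as np

def elm_fit(X, Y, n_hidden=100, seed=0):
    """Minimal ELM sketch: random hidden layer + least-squares output weights."""
    rng = np.random.default_rng(seed)
    # Step 1: fill the input-to-hidden weights W1 (and biases b) with random values
    W1 = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    # Hidden-layer output matrix H = sigma(X W1 + b), here with a sigmoid activation
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b)))
    # Step 2: estimate the hidden-to-output weights W2 by least squares,
    # using the Moore-Penrose pseudoinverse of H
    W2 = np.linalg.pinv(H) @ Y
    return W1, b, W2

def elm_predict(X, W1, b, W2):
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b)))
    return H @ W2

# Toy usage: fit a noisy 1-D regression target (illustrative data only)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
Y = np.sin(X) + 0.05 * np.random.randn(*X.shape)
W1, b, W2 = elm_fit(X, Y, n_hidden=50)
Y_hat = elm_predict(X, W1, b, W2)
```

Because only the linear read-out weights are solved, and in closed form, training reduces to a single pseudoinverse computation rather than an iterative gradient-based search.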


Architectures

In most cases, ELM is used as a single-hidden-layer feedforward network (SLFN), including but not limited to sigmoid networks, RBF networks, threshold networks, fuzzy inference networks, complex neural networks, wavelet networks, Fourier transforms, Laplacian transforms, etc. Due to its different learning algorithm implementations for regression, classification, sparse coding, compression, feature learning and clustering, multiple ELMs have been used to form multi-hidden-layer networks, deep learning, or hierarchical networks. A hidden node in ELM is a computational element, which need not be considered a classical neuron: it can be a classical artificial neuron, a basis function, or a subnetwork formed by some hidden nodes.
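
As a rough illustration of how multiple randomly assigned layers can be composed, the following hedged NumPy sketch stacks two random hidden layers and solves only the final read-out weights in closed form. It is an informal stacking example under those assumptions, not the specific hierarchical ELM algorithms from the literature.

```python
import numpy as np

def stacked_elm_fit(X, Y, n_hidden1=100, n_hidden2=50, seed=0):
    """Informal two-layer stacking sketch: the first random layer acts as a
    feature extractor; the read-out is solved on the second layer's output."""
    rng = np.random.default_rng(seed)
    # First random hidden layer (never trained)
    W1 = rng.normal(size=(X.shape[1], n_hidden1))
    b1 = rng.normal(size=n_hidden1)
    H1 = np.tanh(X @ W1 + b1)
    # Second random hidden layer on top of the first layer's output
    W2 = rng.normal(size=(n_hidden1, n_hidden2))
    b2 = rng.normal(size=n_hidden2)
    H2 = np.tanh(H1 @ W2 + b2)
    # Only the final read-out weights are learned, in closed form
    beta = np.linalg.pinv(H2) @ Y
    return (W1, b1, W2, b2, beta)

def stacked_elm_predict(X, params):
    W1, b1, W2, b2, beta = params
    H1 = np.tanh(X @ W1 + b1)
    H2 = np.tanh(H1 @ W2 + b2)
    return H2 @ beta
```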


Theories

Both the universal approximation and classification capabilities of ELM have been proved in the literature. In particular, Guang-Bin Huang and his team spent almost seven years (2001-2008) on the rigorous proofs of ELM's universal approximation capability.


Universal approximation capability

In theory, any nonconstant piecewise continuous function can be used as the activation function in ELM hidden nodes; such an activation function need not be differentiable. If tuning the parameters of hidden nodes could make SLFNs approximate any target function f(\mathbf{x}), then hidden node parameters can be randomly generated according to any continuous probability distribution, and

:\lim_{L\rightarrow\infty}\left\|\sum_{i=1}^L \boldsymbol\beta_i h_i(\mathbf{x}) - f(\mathbf{x})\right\| = 0

holds with probability one with appropriate output weights \boldsymbol\beta.


Classification capability

Given any nonconstant piecewise continuous function as the activation function in SLFNs, if tuning the parameters of hidden nodes can make SLFNs approximate any target function f(\mathbf{x}), then SLFNs with random hidden layer mapping \mathbf{h}(\mathbf{x}) can separate arbitrary disjoint regions of any shapes.


Neurons

A wide range of nonlinear piecewise continuous functions G(\mathbf{a}, b, \mathbf{x}) can be used in the hidden neurons of ELM, for example:


Real domain

Sigmoid function:
:G(\mathbf{a}, b, \mathbf{x}) = \frac{1}{1 + \exp(-(\mathbf{a}\cdot\mathbf{x} + b))}

Fourier function:
:G(\mathbf{a}, b, \mathbf{x}) = \sin(\mathbf{a}\cdot\mathbf{x} + b)

Hardlimit function:
:G(\mathbf{a}, b, \mathbf{x}) = \begin{cases} 1, & \text{if } \mathbf{a}\cdot\mathbf{x} - b \geq 0\\ 0, & \text{otherwise} \end{cases}

Gaussian function:
:G(\mathbf{a}, b, \mathbf{x}) = \exp(-b\|\mathbf{x} - \mathbf{a}\|^2)

Multiquadrics function:
:G(\mathbf{a}, b, \mathbf{x}) = (\|\mathbf{x} - \mathbf{a}\|^2 + b^2)^{1/2}

Wavelet:
:G(a, b, \mathbf{x}) = \|a\|^{-1/2}\Psi\left(\frac{\mathbf{x} - b}{a}\right)

where \Psi is a single mother wavelet function.
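
For concreteness, here is a brief, hedged NumPy sketch of a few of these real-domain node functions; the function names and per-sample signatures are illustrative choices rather than a standard API.

```python
import numpy as np

def sigmoid_node(a, b, x):
    """Sigmoid hidden node: G(a, b, x) = 1 / (1 + exp(-(a.x + b)))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, x) + b)))

def fourier_node(a, b, x):
    """Fourier hidden node: G(a, b, x) = sin(a.x + b)."""
    return np.sin(np.dot(a, x) + b)

def hardlimit_node(a, b, x):
    """Hardlimit hidden node: 1 if a.x - b >= 0, else 0."""
    return float(np.dot(a, x) - b >= 0)

def gaussian_node(a, b, x):
    """Gaussian (RBF) hidden node: G(a, b, x) = exp(-b * ||x - a||^2)."""
    return np.exp(-b * np.sum((x - a) ** 2))

def multiquadric_node(a, b, x):
    """Multiquadrics hidden node: G(a, b, x) = sqrt(||x - a||^2 + b^2)."""
    return np.sqrt(np.sum((x - a) ** 2) + b ** 2)
```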


Complex domain

Circular functions:
:\tan(z) = \frac{e^{iz} - e^{-iz}}{i(e^{iz} + e^{-iz})}
:\sin(z) = \frac{e^{iz} - e^{-iz}}{2i}

Inverse circular functions:
:\arctan(z) = \int_0^z \frac{dt}{1 + t^2}
:\arccos(z) = \frac{\pi}{2} - \int_0^z \frac{dt}{(1 - t^2)^{1/2}}

Hyperbolic functions:
:\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
:\sinh(z) = \frac{e^z - e^{-z}}{2}

Inverse hyperbolic functions:
:\text{arctanh}(z) = \int_0^z \frac{dt}{1 - t^2}
:\text{arcsinh}(z) = \int_0^z \frac{dt}{(1 + t^2)^{1/2}}


Reliability

The black-box character of neural networks in general, and of extreme learning machines (ELM) in particular, is one of the major concerns that deters engineers from applying them in safety-critical automation tasks. This particular issue has been approached by means of several different techniques. One approach is to reduce the dependence on the random input. Another approach focuses on the incorporation of continuous constraints into the learning process of ELMs, derived from prior knowledge about the specific task. This is reasonable, because machine learning solutions have to guarantee safe operation in many application domains. These studies revealed that the special form of ELMs, with their functional separation and linear read-out weights, is particularly well suited for the efficient incorporation of continuous constraints in predefined regions of the input space.


Controversy

There are two main complaints from the academic community concerning this work: the first is about "reinventing and ignoring previous ideas", the second is about "improper naming and popularizing", as shown in some debates in 2008 and 2015. In particular, it was pointed out in a letter to the editor of ''IEEE Transactions on Neural Networks'' that the idea of using a hidden layer connected to the inputs by random untrained weights was already suggested in the original papers on RBF networks in the late 1980s; Guang-Bin Huang replied by pointing out subtle differences. In a 2015 paper, Huang responded to complaints about his invention of the name ELM for already-existing methods, complaining of "very negative and unhelpful comments on ELM in neither academic nor professional manner due to various reasons and intentions" and an "irresponsible anonymous attack which intends to destroy harmony research environment", arguing that his work "provides a unifying learning platform" for various types of neural nets, including hierarchical structured ELM. In 2015, Huang also gave a formal rebuttal to what he considered "malign and attack". Recent research replaces the random weights with constrained random weights.


Open sources


* Matlab Library
* Python Library


See also

* Reservoir computing
* Random projection
* Random matrix

