Symbolic regression (SR) is a type of
regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity.
No particular model is provided as a starting point for symbolic regression. Instead, initial expressions are formed by randomly combining mathematical building blocks such as
mathematical operators
Mathematical Operators is a Unicode block containing characters for mathematical, logical, and set notation.
Notably absent are the plus sign (+), greater than sign (>) and less than sign (<), due to them already appearing in the Basi ...
,
analytic function
In mathematics, an analytic function is a function that is locally given by a convergent power series. There exist both real analytic functions and complex analytic functions. Functions of each type are infinitely differentiable, but complex an ...
s,
constants, and
state variable
A state variable is one of the set of variables that are used to describe the mathematical "state" of a dynamical system. Intuitively, the state of a system describes enough about the system to determine its future behaviour in the absence of a ...
s. Usually, a subset of these primitives will be specified by the person operating it, but that's not a requirement of the technique. The symbolic regression problem for mathematical functions has been tackled with a variety of methods, including recombining equations most commonly using
genetic programming
In artificial intelligence, genetic programming (GP) is a technique of evolving programs, starting from a population of unfit (usually random) programs, fit for a particular task by applying operations analogous to natural genetic processes to t ...
,
as well as more recent methods utilizing
Bayesian methods
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and e ...
and
neural networks
A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
.
Another non-classical alternative method to SR is called Universal Functions Originator (UFO), which has a different mechanism, search-space, and building strategy.
Further methods such as Exact Learning attempt to transform the fitting problem into a
moments problem in a natural function space, usually built around generalizations of the
Meijer-G function.
By not requiring ''a priori'' specification of a model, symbolic regression isn't affected by human bias, or unknown gaps in
domain knowledge
Domain knowledge is knowledge of a specific, specialized discipline or field, in contrast to general (or domain-independent) knowledge. The term is often used in reference to a more general discipline—for example, in describing a software engin ...
. It attempts to uncover the intrinsic relationships of the dataset, by letting the patterns in the data itself reveal the appropriate models, rather than imposing a model structure that is deemed mathematically tractable from a human perspective. The
fitness function {{no footnotes, date=May 2015
A fitness function is a particular type of objective function that is used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims. Fitness functions are used in genetic ...
that drives the evolution of the models takes into account not only
error metrics (to ensure the models accurately predict the data), but also special complexity measures,
thus ensuring that the resulting models reveal the data's underlying structure in a way that's understandable from a human perspective. This facilitates reasoning and favors the odds of getting insights about the data-generating system, as well as improving generalisability and extrapolation behaviour by preventing
overfitting
mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
. Accuracy and simplicity may be left as two separate objectives of the regression -- in which case the optimum solutions form a
Pareto front
In multi-objective optimization, the Pareto front (also called Pareto frontier or Pareto curve) is the set of all Pareto efficient solutions. The concept is widely used in engineering. It allows the designer to restrict attention to the set of eff ...
-- or they may be combined into a single objective by means of a model selection principle such as
minimum description length
Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occam ...
.
It has been proven that symbolic regression is an
NP-hard
In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
problem, in the sense that one cannot always find the best possible mathematical expression to fit to a given dataset in
polynomial time
In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
. Nevertheless, if the sought-for equation is not too complex it is possible to solve the symbolic regression problem exactly by generating every possible function (built from some predefined set of operators) and evaluating them on the dataset in question.
Difference from classical regression
While conventional regression techniques seek to optimize the parameters for a pre-specified model structure, symbolic regression avoids imposing prior assumptions, and instead infers the model from the data. In other words, it attempts to discover both model structures and model parameters.
This approach has the disadvantage of having a much larger space to search, because not only the search space in symbolic regression is infinite, but there are an infinite number of models which will perfectly fit a finite data set (provided that the model complexity isn't artificially limited). This means that it will possibly take a symbolic regression algorithm longer to find an appropriate model and parametrization, than traditional regression techniques. This can be attenuated by limiting the set of building blocks provided to the algorithm, based on existing knowledge of the system that produced the data; but in the end, using symbolic regression is a decision that has to be balanced with how much is known about the underlying system.
Nevertheless, this characteristic of symbolic regression also has advantages: because the
evolutionary algorithm
In computational intelligence (CI), an evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduc ...
requires diversity in order to effectively explore the search space, the result is likely to be a selection of high-scoring models (and their corresponding set of parameters). Examining this collection could provide better insight into the underlying process, and allows the user to identify an approximation that better fits their needs in terms of accuracy and simplicity.
Benchmarking
SRBench
In 2021
SRBenchwas proposed as a large benchmark for symbolic regression.
In its inception, SRBench featured 14 symbolic regression methods, 7 other ML methods, and 252 datasets fro
PMLB
The benchmark intends to be a living project: it encourages the submission of improvements, new datasets, and new methods, to keep track of the state of the art in SR.
SRBench Competition 2022
In 2022, SRBench announced the competition Interpretable Symbolic Regression for Data Science, which was held at the
GECCO conference in Boston, MA. The competition pitted nine leading symbolic regression algorithms against each other on a novel set of data problems and considered different evaluation criteria. The competition was organized in two tracks, a synthetic track and a real-world data track.
Synthetic Track
In the synthetic track, methods were compared according to five properties: re-discovery of exact expressions; feature selection; resistance to local optima; extrapolation; and sensitivity to noise. Rankings of the methods were:
#
QLattice
The QLattice is a software library which provides a framework for symbolic regression in Python. It works on Linux, Windows, and macOS. The QLattice algorithm is developed by the Danish/Spanish AI research company Abzu. Since its creation, the QLa ...
PySR (Python Symbolic Regression)uDSR (Deep Symbolic Optimization)
Real-world Track
In the real-world track, methods were trained to build interpretable predictive models for 14-day forecast counts of COVID-19 cases, hospitalizations, and deaths in New York State. These models were reviewed by a subject expert and assigned trust ratings and evaluated for accuracy and simplicity. The ranking of the methods was:
uDSR (Deep Symbolic Optimization)#
QLattice
The QLattice is a software library which provides a framework for symbolic regression in Python. It works on Linux, Windows, and macOS. The QLattice algorithm is developed by the Danish/Spanish AI research company Abzu. Since its creation, the QLa ...
geneticengine (Genetic Engine)
Non-Standard Methods
Most symbolic regression algorithms prevent
combinatorial explosion
In mathematics, a combinatorial explosion is the rapid growth of the complexity of a problem due to how the combinatorics of the problem is affected by the input, constraints, and bounds of the problem. Combinatorial explosion is sometimes used ...
by implementing evolutionary algorithms that iteratively improve the best-fit expression over many generations. Recently, researchers have proposed algorithms utilizing other tactics in
AI.
Silviu-Marian Udrescu and
Max Tegmark
Max Erik Tegmark (born 5 May 1967) is a Swedish-American physicist, cosmologist and machine learning researcher. He is a professor at the Massachusetts Institute of Technology and the president of the Future of Life Institute. He is also a scienti ...
developed the "AI Feynman" algorithm, which attempts symbolic regression by training a neural network to represent the mystery function, then runs tests against the neural network to attempt to break up the problem into smaller parts. For example, if
, tests against the neural network can recognize the separation and proceed to solve for
and
separately and with different variables as inputs. This is an example of
divide and conquer, which reduces the size of the problem to be more manageable. AI Feynman also transforms the inputs and outputs of the mystery function in order to produce a new function which can be solved with other techniques, and performs
dimensional analysis
In engineering and science, dimensional analysis is the analysis of the relationships between different physical quantities by identifying their base quantities (such as length, mass, time, and electric current) and units of measure (such as m ...
to reduce the number of independent variables involved. The algorithm was able to "discover" 100 equations from
The Feynman Lectures on Physics
''The Feynman Lectures on Physics'' is a physics textbook based on some lectures by Richard Feynman, a Nobel laureate who has sometimes been called "The Great Explainer". The lectures were presented before undergraduate students at the Californ ...
, while a leading software using evolutionary algorithms,
Eureqa
Eureqa is a proprietary modeling engine originally created by Cornell's Artificial Intelligence Lab and later commercialized by Nutonian, Inc. The software uses evolutionary search to determine mathematical equations that describe sets of data in ...
, solved only 71. AI Feynman, in contrast to classic symbolic regression methods, requires a very large dataset in order to first train the neural network and is naturally biased towards equations that are common in elementary physics.
Software
End-user software
*
QLattice
The QLattice is a software library which provides a framework for symbolic regression in Python. It works on Linux, Windows, and macOS. The QLattice algorithm is developed by the Danish/Spanish AI research company Abzu. Since its creation, the QLa ...
is a quantum-inspired simulation and machine learning technology that helps search through an infinite list of potential mathematical models to solve a problem.
uDSRis a deep learning framework for symbolic optimization tasks
dCGP differentiable Cartesian Genetic Programming in python (free, open source)
*
HeuristicLab
HeuristicLab is a software environment for heuristic and evolutionary algorithms, developed by members of the Heuristic and Evolutionary Algorithm Laboratory (HEAL) at the University of Applied Sciences Upper Austria, in Hagenberg im Mühlkrei ...
, a software environment for heuristic and evolutionary algorithms, including symbolic regression (free, open source)
*
GeneXProTools, - an implementation of
Gene expression programming
In computer programming, gene expression programming (GEP) is an evolutionary algorithm that creates computer programs or models. These computer programs are complex tree structures that learn and adapt by changing their sizes, shapes, and compos ...
technique for various problems including symbolic regression (commercial)
*
Multi Expression Programming X, an implementation of
Multi expression programming
Multi Expression Programming (MEP) is an evolutionary algorithm for generating mathematical functions describing a given set of data. MEP is a Genetic Programming variant encoding multiple solutions in the same chromosome. MEP representation is not ...
for symbolic regression and classification (free, open source)
*
Eureqa
Eureqa is a proprietary modeling engine originally created by Cornell's Artificial Intelligence Lab and later commercialized by Nutonian, Inc. The software uses evolutionary search to determine mathematical equations that describe sets of data in ...
, evolutionary symbolic regression software (commercial), and
software library
In computer science, a library is a collection of non-volatile resources used by computer programs, often for software development. These may include configuration data, documentation, help data, message templates, pre-written code and subr ...
TuringBot symbolic regression software based on simulated annealing (commercial)
PySR symbolic regression environment written in
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
and
Julia
Julia is usually a feminine given name. It is a Latinate feminine form of the name Julio and Julius. (For further details on etymology, see the Wiktionary entry "Julius".) The given name ''Julia'' had been in use throughout Late Antiquity (e.g. ...
, using regularized evolution,
simulated annealing
Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It ...
, and
gradient
In vector calculus, the gradient of a scalar-valued differentiable function of several variables is the vector field (or vector-valued function) \nabla f whose value at a point p is the "direction and rate of fastest increase". If the gradi ...
-free optimization (free, open source)
GP-GOMEA fast (
C++
C++ (pronounced "C plus plus") is a high-level general-purpose programming language created by Danish computer scientist Bjarne Stroustrup as an extension of the C programming language, or "C with Classes". The language has expanded significan ...
back-end)
evolutionary
Evolution is change in the heredity, heritable Phenotypic trait, characteristics of biological populations over successive generations. These characteristics are the Gene expression, expressions of genes, which are passed on from parent to ...
symbolic regression with
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
scikit-learn
scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support-vector ...
-compatible interface, achieved one of the best trade-offs between accuracy and simplicity of discovered models o
SRBenchin 2021 (free, open source)
See also
*
Closed-form expression § Conversion from numerical forms
*
Genetic programming
In artificial intelligence, genetic programming (GP) is a technique of evolving programs, starting from a population of unfit (usually random) programs, fit for a particular task by applying operations analogous to natural genetic processes to t ...
*
Gene expression programming
In computer programming, gene expression programming (GEP) is an evolutionary algorithm that creates computer programs or models. These computer programs are complex tree structures that learn and adapt by changing their sizes, shapes, and compos ...
*
Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
*
Linear genetic programming
:''"Linear genetic programming" is unrelated to "linear programming".''
Linear genetic programming (LGP) is a particular subset of genetic programming wherein computer programs in a population are represented as a sequence of instructions from i ...
*
Mathematical optimization
Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
*
Multi expression programming
Multi Expression Programming (MEP) is an evolutionary algorithm for generating mathematical functions describing a given set of data. MEP is a Genetic Programming variant encoding multiple solutions in the same chromosome. MEP representation is not ...
*
Regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
*
Reverse mathematics
Reverse mathematics is a program in mathematical logic that seeks to determine which axioms are required to prove theorems of mathematics. Its defining method can briefly be described as "going backwards from the theorems to the axioms", in cont ...
*
Discovery system (AI research)
A discovery system is an artificial intelligence system that attempts to discover new scientific concepts or laws. The aim of discovery systems is to automate scientific data analysis and the scientific discovery process. Ideally, an artificial in ...
References
Further reading
*
*
*
External links
*
* (Java applet) — approximates a function by evolving combinations of simple arithmetic operators, using algorithms developed by
John Koza
John R. Koza is a computer scientist and a former adjunct professor at Stanford University, most notable for his work in pioneering the use of genetic programming for the optimization of complex problems. Koza co-founded Scientific Games Corporati ...
.
* {{cite web , title = Symbolic Regression: Function Discovery & More , url = http://www.symbolicregression.com , author = Katya Vladislavleva , archive-url = https://web.archive.org/web/20141218105301/http://symbolicregression.com/ , archive-date = 2014-12-18
Regression analysis
Genetic programming
Computer algebra