Bayesian Optimisation
Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not assume any functional form. It is usually employed to optimize expensive-to-evaluate functions. With the rise of artificial intelligence innovation in the 21st century, Bayesian optimization has found prominent use in machine learning problems for optimizing hyperparameter values.


History

The term is generally attributed to Jonas Mockus, who coined it in a series of publications on global optimization in the 1970s and 1980s.


Early mathematics foundations


From 1960s to 1980s

The earliest idea of Bayesian optimization dates back to 1964, to a paper by the American applied mathematician Harold J. Kushner, "A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise". Although not directly proposing Bayesian optimization, this paper first put forward a method for locating the maximum point of an arbitrary multipeak curve in a noisy environment, and it provided an important theoretical foundation for subsequent Bayesian optimization. By the 1980s, the framework we now use for Bayesian optimization was explicitly established. In 1978, the Lithuanian scientist Jonas Mockus, in his paper "The Application of Bayesian Methods for Seeking the Extremum", discussed how to use Bayesian methods to find the extremum of a function under various uncertain conditions. In that paper, Mockus first proposed the Expected Improvement (EI) principle, which is one of the core sampling strategies of Bayesian optimization. This criterion balances exploration and exploitation while optimizing the function efficiently by maximizing the expected improvement. Because of the usefulness and profound impact of this principle, Jonas Mockus is widely regarded as the founder of Bayesian optimization. Although Expected Improvement is one of the earliest core sampling strategies proposed for Bayesian optimization, it is not the only one; later work introduced further acquisition functions such as Probability of Improvement (PI) and the Upper Confidence Bound (UCB).


From theory to practice

In the 1990s, Bayesian optimization began to gradually transition from pure theory to real-world applications. In 1998, Donald R. Jones and his coworkers published the paper "Efficient Global Optimization of Expensive Black-Box Functions". In this paper, they used a Gaussian process (GP) surrogate model and elaborated on the Expected Improvement (EI) principle proposed by Jonas Mockus in 1978. Through the efforts of Donald R. Jones and his colleagues, Bayesian optimization began to shine in fields such as computer science and engineering. However, the computational complexity of Bayesian optimization, relative to the computing power available at the time, still limited its development to a large extent. In the 21st century, with the gradual rise of artificial intelligence and bionic robots, Bayesian optimization has been widely used in machine learning and deep learning, and has become an important tool for hyperparameter tuning. Companies such as Google, Facebook and OpenAI have added Bayesian optimization to their deep learning frameworks to improve search efficiency. However, Bayesian optimization still faces many challenges. For example, because a Gaussian process is used as the surrogate model, training the Gaussian process becomes very slow and computationally expensive when the number of observations is large. This makes it difficult for the method to work well in more complex settings such as drug development and medical experiments.


Strategy

Bayesian optimization is used on problems of the form \max_{x \in X} f(x), with X being the set of all possible parameters x, typically with at most 20 dimensions for optimal usage (X \subseteq \mathbb{R}^d, d \le 20), and whose membership can easily be evaluated. Bayesian optimization is particularly advantageous for problems where f(x) is difficult to evaluate due to its computational cost. The objective function, f, is continuous and has some unknown structure, referred to as a "black box". Upon its evaluation, only f(x) is observed; its derivatives are not evaluated. Since the objective function is unknown, the Bayesian strategy is to treat it as a random function and place a prior over it. The prior captures beliefs about the behavior of the function. After gathering the function evaluations, which are treated as data, the prior is updated to form the posterior distribution over the objective function. The posterior distribution, in turn, is used to construct an acquisition function (often also referred to as infill sampling criteria) that determines the next query point. There are several methods used to define the prior/posterior distribution over the objective function. The two most common methods use Gaussian processes, in a method called kriging. Another, less expensive, method uses the Tree-structured Parzen Estimator to construct two distributions for "high" and "low" points and then finds the location that maximizes the expected improvement. Standard Bayesian optimization relies upon each x \in X being easy to evaluate; problems that deviate from this assumption are known as ''exotic Bayesian optimization'' problems. Optimization problems can become exotic if it is known that there is noise, the evaluations are being done in parallel, the quality of evaluations relies upon a tradeoff between difficulty and accuracy, random environmental conditions are present, or the evaluation involves derivatives.
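The loop just described can be made concrete with a short sketch. The following is a minimal illustration, not a reference implementation: it assumes a toy one-dimensional objective, scikit-learn's Gaussian process regressor as the surrogate, and expected improvement as the acquisition function, all chosen for the example rather than taken from the text above.

# Minimal Bayesian optimization loop: GP surrogate + expected improvement.
# The objective f, its bounds, and all settings are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # Toy stand-in for an expensive black-box objective.
    return -np.sin(3 * x) - x**2 + 0.7 * x

bounds = (-1.0, 2.0)
rng = np.random.default_rng(0)

# Initial design: a few random evaluations of the black box.
X = rng.uniform(*bounds, size=(3, 1))
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

def expected_improvement(x_cand, gp, y_best, xi=0.01):
    # EI(x) = E[max(f(x) - y_best, 0)] under the GP posterior (maximization).
    mu, sigma = gp.predict(x_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

for _ in range(10):
    gp.fit(X, y)                                    # update the posterior
    x_grid = np.linspace(*bounds, 1000).reshape(-1, 1)
    ei = expected_improvement(x_grid, gp, y.max())  # acquisition over a grid
    x_next = x_grid[np.argmax(ei)].reshape(1, -1)   # next query point
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).ravel())

print("best x:", X[np.argmax(y)], "best f(x):", y.max())

Here the acquisition is maximized over a dense grid, which is only practical in low dimensions; the solution methods discussed below are used more generally.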


Acquisition functions

Examples of acquisition functions include:
* probability of improvement
* expected improvement
* Bayesian expected losses
* upper confidence bounds (UCB) or lower confidence bounds
* Thompson sampling
and hybrids of these. They all trade off exploration and exploitation so as to minimize the number of function queries. As such, Bayesian optimization is well suited for functions that are expensive to evaluate.
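Under a Gaussian process posterior with mean mu and standard deviation sigma at a candidate point, several of these criteria have simple closed forms. The sketch below gives illustrative implementations for maximization; the exploration parameters xi and kappa are assumptions chosen for the example, not prescribed values.

# Closed-form acquisition values at a candidate point, given the GP posterior
# mean mu and standard deviation sigma, and the best observed value y_best.
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_best, xi=0.01):
    # PI(x) = P(f(x) > y_best + xi)
    return norm.cdf((mu - y_best - xi) / sigma)

def expected_improvement(mu, sigma, y_best, xi=0.01):
    # EI(x) = E[max(f(x) - y_best - xi, 0)]
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # UCB(x) = mu(x) + kappa * sigma(x); larger kappa favours exploration.
    return mu + kappa * sigma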


Solution methods

The maximum of the acquisition function is typically found by resorting to discretization or by means of an auxiliary optimizer. Acquisition functions are maximized using a numerical optimization technique, such as Newton's method or quasi-Newton methods like the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm.
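As an illustration of the auxiliary-optimizer step, the sketch below maximizes a generic acquisition function by multi-start L-BFGS-B, a bounded quasi-Newton method (used here in place of plain BFGS because it handles box constraints); the helper name, bounds format, and restart count are assumptions made for the example.

# Maximize an acquisition function over box bounds with random restarts.
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acquisition, bounds, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T   # bounds as [(lo, hi), ...] per dimension
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(lo, hi)
        # Minimize the negated acquisition, i.e. maximize the acquisition.
        res = minimize(lambda x: -acquisition(x), x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val

Restarting from several random points guards against the many local optima that acquisition surfaces typically have.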


Applications

The approach has been applied to solve a wide range of problems, including learning to rank, computer graphics and visual design, robotics, sensor networks, automatic algorithm configuration, automatic machine learning toolboxes, reinforcement learning, planning, visual attention, architecture configuration in deep learning, static program analysis, experimental particle physics, quality-diversity optimization, chemistry, material design, and drug development (Griffiths et al., "Constrained Bayesian Optimization for Automatic Chemical Design using Variational Autoencoders", Chemical Science 11, 577–586, 2020).
Bayesian optimization has also been applied in the field of facial recognition (Mohammed Mehdi Bouchene, "Bayesian Optimization of Histogram of Oriented Gradients (Hog) Parameters for Facial Recognition", SSRN, 2023). The performance of the Histogram of Oriented Gradients (HOG) algorithm, a popular feature extraction method, heavily relies on its parameter settings. Optimizing these parameters can be challenging but crucial for achieving high accuracy. A novel approach to optimize the HOG algorithm parameters and image size for facial recognition using a Tree-structured Parzen Estimator (TPE) based Bayesian optimization technique has been proposed. This optimized approach has the potential to be adapted for other computer vision applications and contributes to the ongoing development of hand-crafted parameter-based feature extraction algorithms in computer vision.
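As a sketch of TPE-based optimization of the kind described above (not a reproduction of the cited HOG study), the following uses the hyperopt library's TPE sampler on a toy objective; the search space, objective, and evaluation budget are illustrative assumptions.

# Tree-structured Parzen Estimator (TPE) search with hyperopt.
from hyperopt import fmin, tpe, hp

def objective(params):
    # In a real application this would train and evaluate a model (e.g. a
    # HOG-based face recognizer) and return a loss to minimize; here it is
    # a simple quadratic used only for illustration.
    x, y = params["x"], params["y"]
    return (x - 0.5) ** 2 + (y - 0.25) ** 2

space = {
    "x": hp.uniform("x", -1.0, 1.0),
    "y": hp.uniform("y", -1.0, 1.0),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # best parameter values found within the evaluation budget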


See also

* Multi-armed bandit
* Kriging
* Thompson sampling
* Global optimization
* Bayesian experimental design
* Probabilistic numerics
* Pareto optimum
* Active learning (machine learning)
* Multi-objective optimization

