Hyperparameter optimization

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance.


Approaches


Grid search

The traditional way of performing hyperparameter optimization has been ''grid search'', or a ''parameter sweep'', which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set. Since the parameter space of a machine learner may include real-valued or unbounded value spaces for certain parameters, manually set bounds and discretization may be necessary before applying grid search. For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: a regularization constant ''C'' and a kernel hyperparameter γ. Both parameters are continuous, so to perform grid search, one selects a finite set of "reasonable" values for each, say C ∈ {10, 100, 1000} and γ ∈ {0.1, 0.2, 0.5, 1.0}. Grid search then trains an SVM with each pair (''C'', γ) in the Cartesian product of these two sets and evaluates their performance on a held-out validation set (or by internal cross-validation on the training set, in which case multiple SVMs are trained per pair). Finally, the grid search algorithm outputs the settings that achieved the highest score in the validation procedure. Grid search suffers from the curse of dimensionality, but is often embarrassingly parallel because the hyperparameter settings it evaluates are typically independent of each other.
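
The example below is a minimal sketch of this procedure using scikit-learn's GridSearchCV (listed among the open-source tools later in this article); the dataset and the value grids for C and γ are illustrative.

```python
# A minimal sketch of the SVM grid search described above,
# using scikit-learn; the value grids are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Finite, manually chosen grids for the two continuous hyperparameters.
param_grid = {
    "C": [10, 100, 1000],
    "gamma": [0.1, 0.2, 0.5, 1.0],
}

# Each (C, gamma) pair in the Cartesian product is evaluated by
# 5-fold cross-validation on the training data.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```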


Random search

Random search replaces the exhaustive enumeration of all combinations by selecting them randomly. This can be simply applied to the discrete setting described above, but it also generalizes to continuous and mixed spaces. Random search can outperform grid search, especially when only a small number of hyperparameters affects the final performance of the machine learning algorithm; in this case, the optimization problem is said to have a low intrinsic dimensionality. Random search is also embarrassingly parallel, and additionally allows the inclusion of prior knowledge by specifying the distribution from which to sample. Despite its simplicity, random search remains one of the important baselines against which to compare the performance of new hyperparameter optimization methods.
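
As a minimal sketch, the same SVM can be tuned with scikit-learn's RandomizedSearchCV, where prior knowledge enters through the sampling distributions; the ranges and evaluation budget below are illustrative.

```python
# A minimal sketch of random search for the SVM example,
# assuming scikit-learn and scipy are available.
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sample C and gamma from log-uniform distributions instead of a fixed grid.
param_distributions = {
    "C": loguniform(1e0, 1e3),
    "gamma": loguniform(1e-2, 1e1),
}

# Evaluate 20 randomly drawn configurations by 5-fold cross-validation.
search = RandomizedSearchCV(
    SVC(kernel="rbf"), param_distributions, n_iter=20, cv=5, random_state=0
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```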


Bayesian optimization

Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating the model, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected to be close to the optimum). In practice, Bayesian optimization has been shown to obtain better results in fewer evaluations compared to grid search and random search, due to its ability to reason about the quality of experiments before they are run.
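
A minimal sketch using the scikit-optimize package (listed among the open-source tools below) is shown here; the search ranges and the budget of 30 evaluations are illustrative assumptions.

```python
# A minimal sketch of Bayesian optimization for the SVM example,
# assuming the scikit-optimize (skopt) package is installed.
from skopt import gp_minimize
from skopt.space import Real
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(params):
    # The surrogate model is fit to (hyperparameters, validation loss) pairs,
    # so the objective returns a loss (negative cross-validated accuracy).
    C, gamma = params
    score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
    return -score

result = gp_minimize(
    objective,
    [Real(1e0, 1e3, prior="log-uniform", name="C"),
     Real(1e-2, 1e1, prior="log-uniform", name="gamma")],
    n_calls=30,        # each call is one experiment chosen by the acquisition function
    random_state=0,
)
print(result.x, -result.fun)
```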


Gradient-based optimization

For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. Since then, these methods have been extended to other models such as support vector machines or logistic regression. A different approach to obtaining a gradient with respect to hyperparameters consists in differentiating the steps of an iterative optimization algorithm using automatic differentiation. A more recent work along this direction uses the implicit function theorem to calculate hypergradients and proposes a stable approximation of the inverse Hessian; the method scales to millions of hyperparameters and requires constant memory. In a different approach, a hypernetwork is trained to approximate the best-response function. One of the advantages of this method is that it can handle discrete hyperparameters as well. Self-tuning networks offer a memory-efficient version of this approach by choosing a compact representation for the hypernetwork. More recently, Δ-STN has improved this method further by a slight reparameterization of the hypernetwork which speeds up training. Δ-STN also yields a better approximation of the best-response Jacobian by linearizing the network in the weights, hence removing unnecessary nonlinear effects of large changes in the weights. Apart from hypernetwork approaches, gradient-based methods can also be used to optimize discrete hyperparameters by adopting a continuous relaxation of the parameters. Such methods have been extensively used for the optimization of architecture hyperparameters in neural architecture search.
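
The toy example below sketches the implicit-function-theorem idea for ridge regression, where the closed-form inner solution makes the hypergradient with respect to the regularization strength easy to compute; the synthetic data, step size and iteration count are illustrative, and this is not any specific method from the literature.

```python
# A toy hypergradient computed via the implicit function theorem for ridge
# regression, with the regularization strength lam as the only hyperparameter.
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(80, 5)), rng.normal(size=80)
X_val, y_val = rng.normal(size=(40, 5)), rng.normal(size=40)

def hypergradient(lam):
    n_tr, d = X_tr.shape
    # Inner problem solved in closed form: w*(lam) minimizes
    # ||X_tr w - y_tr||^2 / (2 n_tr) + lam/2 * ||w||^2.
    A = X_tr.T @ X_tr / n_tr + lam * np.eye(d)      # Hessian of the training loss
    w_star = np.linalg.solve(A, X_tr.T @ y_tr / n_tr)
    # Gradient of the validation loss with respect to the weights.
    g_val = X_val.T @ (X_val @ w_star - y_val) / len(y_val)
    # Implicit function theorem: dw*/dlam = -A^{-1} w*, so
    # dL_val/dlam = g_val^T (-A^{-1} w*).
    return -g_val @ np.linalg.solve(A, w_star)

# Gradient descent on the hyperparameter itself.
lam = 1.0
for _ in range(100):
    lam = max(lam - 0.1 * hypergradient(lam), 1e-6)
print("tuned regularization strength:", lam)
```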


Evolutionary optimization

Evolutionary optimization is a methodology for the global optimization of noisy black-box functions. In hyperparameter optimization, evolutionary optimization uses evolutionary algorithms to search the space of hyperparameters for a given algorithm. Evolutionary hyperparameter optimization follows a process inspired by the biological concept of evolution:

1. Create an initial population of random solutions (i.e., randomly generate tuples of hyperparameters, typically 100+)
2. Evaluate the hyperparameter tuples and compute their fitness (e.g., 10-fold cross-validation accuracy of the machine learning algorithm with those hyperparameters)
3. Rank the hyperparameter tuples by their relative fitness
4. Replace the worst-performing hyperparameter tuples with new hyperparameter tuples generated through crossover and mutation
5. Repeat steps 2-4 until satisfactory algorithm performance is reached or performance is no longer improving

Evolutionary optimization has been used in hyperparameter optimization for statistical machine learning algorithms, automated machine learning, typical neural network and deep neural network architecture search, as well as training of the weights in deep neural networks.
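
The sketch below implements this loop for the SVM example from the grid search section; the population size, generation count and mutation scale are illustrative choices, and real implementations such as DEAP are far more elaborate.

```python
# A minimal sketch of the evolutionary loop described above, applied to the
# SVM's (C, gamma) hyperparameters.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(log_c, log_gamma):
    # Step 2: fitness is the cross-validated accuracy for this tuple.
    clf = SVC(kernel="rbf", C=10 ** log_c, gamma=10 ** log_gamma)
    return cross_val_score(clf, X, y, cv=5).mean()

# Step 1: initial random population, parameterized in log10 space.
population = rng.uniform([-1, -3], [3, 1], size=(10, 2))

for generation in range(10):
    # Step 3: rank tuples by fitness.
    scores = np.array([fitness(*ind) for ind in population])
    parents = population[np.argsort(scores)[::-1][:5]]
    # Step 4: replace the worst half with crossover + Gaussian mutation.
    children = []
    for _ in range(5):
        a, b = parents[rng.choice(5, size=2, replace=False)]
        children.append(np.where(rng.random(2) < 0.5, a, b) + rng.normal(0, 0.1, 2))
    population = np.vstack([parents, children])

best = max(population, key=lambda ind: fitness(*ind))
print("best C, gamma:", 10 ** best[0], 10 ** best[1])
```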


Population-based

Population Based Training (PBT) learns both hyperparameter values and network weights. Multiple learning processes operate independently, using different hyperparameters. As with evolutionary methods, poorly performing models are iteratively replaced with models that adopt modified hyperparameter values and weights based on the better performers. This warm starting of the replacement model is the primary differentiator between PBT and other evolutionary methods. PBT thus allows the hyperparameters to evolve and eliminates the need for manual hypertuning. The process makes no assumptions regarding model architecture, loss functions or training procedures. PBT and its variants are adaptive methods: they update hyperparameters during the training of the models. By contrast, non-adaptive methods use the sub-optimal strategy of assigning a constant set of hyperparameters for the whole training run.
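
The toy sketch below illustrates PBT's exploit/explore cycle on a trivial objective, where each worker holds both weights and a learning-rate hyperparameter; the worker count, update interval and perturbation factors are illustrative.

```python
# A conceptual sketch of Population Based Training on a toy problem
# (minimizing a quadratic), not a production implementation.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim = 8, 10

weights = rng.normal(size=(n_workers, dim))             # per-worker model parameters
learning_rates = 10 ** rng.uniform(-3, -1, n_workers)   # per-worker hyperparameter

def loss(w):
    return float(np.sum(w ** 2))

for step in range(200):
    # Each worker takes one gradient step with its own hyperparameter
    # (gradient of sum(w^2) is 2w).
    weights -= learning_rates[:, None] * 2 * weights
    if step % 20 == 0:
        scores = np.array([loss(w) for w in weights])
        top, bottom = np.argsort(scores)[:2], np.argsort(scores)[-2:]
        for b in bottom:
            t = rng.choice(top)
            # Exploit: warm start from a better worker's weights and hyperparameters.
            weights[b] = weights[t].copy()
            learning_rates[b] = learning_rates[t]
            # Explore: perturb the copied hyperparameter.
            learning_rates[b] *= rng.choice([0.8, 1.2])

print("best final loss:", min(loss(w) for w in weights))
```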


Early stopping-based

A class of early stopping-based hyperparameter optimization algorithms is purpose-built for large search spaces of continuous and discrete hyperparameters, particularly when the computational cost of evaluating the performance of a set of hyperparameters is high. Irace implements the iterated racing algorithm, which focuses the search around the most promising configurations, using statistical tests to discard the ones that perform poorly. Another early stopping hyperparameter optimization algorithm is successive halving (SHA), which begins as a random search but periodically prunes low-performing models, thereby focusing computational resources on more promising models. Asynchronous successive halving (ASHA) further improves upon SHA's resource utilization profile by removing the need to synchronously evaluate and prune low-performing models. Hyperband is a higher-level early stopping-based algorithm that invokes SHA or ASHA multiple times with varying levels of pruning aggressiveness, in order to be more widely applicable and to require fewer inputs.
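
The sketch below shows the successive halving schedule in isolation; train_and_score is a hypothetical stand-in for training a configuration for a given budget (e.g., a number of epochs) and reporting its validation score, and the budgets and population size are illustrative.

```python
# A minimal sketch of successive halving (SHA): sample many configurations,
# evaluate each with a small budget, keep the top half, and double the budget.
import numpy as np

rng = np.random.default_rng(0)

def train_and_score(config, budget):
    # Hypothetical objective: a noisy score that improves with budget and
    # peaks around learning_rate = 0.1 (purely illustrative).
    lr = config["learning_rate"]
    return -(np.log10(lr) + 1) ** 2 + 0.1 * np.log(budget) + rng.normal(0, 0.05)

# Start with 16 random configurations and a budget of 1 unit each.
configs = [{"learning_rate": 10 ** rng.uniform(-4, 0)} for _ in range(16)]
budget = 1

while len(configs) > 1:
    scores = [train_and_score(c, budget) for c in configs]
    # Keep the better half and give the survivors twice the budget.
    order = np.argsort(scores)[::-1]
    configs = [configs[i] for i in order[: len(configs) // 2]]
    budget *= 2

print("selected configuration:", configs[0], "final budget:", budget)
```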


Others

RBF and spectral approaches have also been developed.


Open-source software


Grid search


* Determined is a DL Training Platform that includes grid search for PyTorch and TensorFlow (Keras and Estimator) models.
* H2O AutoML provides grid search over algorithms in the H2O open source machine learning library.
* Katib is a Kubernetes-native system that includes grid search.
* scikit-learn is a Python package that includes grid search.
* Talos includes grid search for Keras.
* Tune is a Python library for distributed hyperparameter tuning and supports grid search.
* Oríon is an asynchronous framework for distributed black-box optimization and hyperparameter optimization that includes grid search.
* Syne Tune is a Python library for asynchronous hyperparameter and architecture optimization that supports grid search.


Random search


* Determined is a DL Training Platform that supports random search for PyTorch and TensorFlow (Keras and Estimator) models.
* hyperopt, also via hyperas and hyperopt-sklearn, are Python packages which include random search.
* Katib is a Kubernetes-native system that includes random search.
* scikit-learn is a Python package which includes random search.
* caret is an R package which includes grid and random search.
* Talos includes a customizable random search for Keras.
* Tune is a Python library for distributed hyperparameter tuning and supports random search over arbitrary parameter distributions.
* Oríon is an asynchronous framework for distributed black-box optimization and hyperparameter optimization that includes random search.
* Syne Tune is a Python library for asynchronous hyperparameter and architecture optimization that supports random search.


Bayesian


* Auto-sklearn is a Bayesian hyperparameter optimization layer on top of scikit-learn.
* Ax is a Python-based experimentation platform that supports Bayesian optimization and bandit optimization as exploration strategies.
* BOCS is a Matlab package which uses semidefinite programming for minimizing a black-box function over discrete inputs. A Python 3 implementation is also included.
* HpBandSter is a Python package which combines Bayesian optimization with bandit-based methods.
* Katib is a Kubernetes-native system which includes Bayesian optimization.
* mlrMBO, also with mlr, is an R package for model-based/Bayesian optimization of black-box functions.
* optuna is a Python package for black-box optimization, compatible with arbitrary functions that need to be optimized.
* scikit-optimize is a Python package for sequential model-based optimization with a scipy.optimize interface.
* SMAC is a Python/Java library implementing Bayesian optimization.
* tuneRanger is an R package for tuning random forests using model-based optimization.
* Open Source Vizier is a Python service which allows Bayesian optimization development, built upon TensorFlow Probability.
* Oríon is a Python package that provides a built-in implementation of the Bayesian optimization algorithm TPE, as well as integrations with Ax, scikit-optimize and HpBandSter for additional Bayesian optimization algorithms.
* Syne Tune is a Python library for asynchronous hyperparameter and architecture optimization that supports Bayesian optimization, as well as model-based multi-fidelity algorithms.


Gradient-based optimization


* FAR-HO is a Python package containing TensorFlow implementations and wrappers for gradient-based hyperparameter optimization with forward and reverse mode algorithmic differentiation.
* XGBoost is an open-source software library that provides a gradient boosting framework for C++, Java, Python, R, and Julia.


Evolutionary


* deap is a Python framework for general evolutionary computation which is flexible and integrates with parallelization packages like scoop and pyspark, and other Python frameworks like sklearn via sklearn-deap.
* Determined is a DL Training Platform that supports PBT for optimizing PyTorch and TensorFlow (Keras and Estimator) models.
* devol is a Python package that performs deep neural network architecture search using genetic programming.
* nevergrad is a Python package which includes differential evolution, evolution strategies, Bayesian optimization, population control methods for the noisy case, and particle swarm optimization.
* Tune is a Python library for distributed hyperparameter tuning and leverages nevergrad for evolutionary algorithm support.
* Open Source Vizier is a Python service that supports algorithms such as NSGA-II and integrates with PyGlove to allow evolutionary computing.
* Oríon is a Python package that includes the algorithms Population Based Training and Population Based Bandits, as well as an integration with nevergrad for additional evolutionary algorithms.
* Syne Tune is a Python library for asynchronous hyperparameter and architecture optimization that supports evolutionary algorithms.


Early Stopping


* Determined is a DL Training Platform that supports Hyperband for PyTorch and TensorFlow (Keras and Estimator) models.
* irace is an R package that implements the iterated racing algorithm.
* Katib is a Kubernetes-native system that includes Hyperband.
* Syne Tune is a Python library for asynchronous hyperparameter and architecture optimization that supports multi-fidelity algorithms (also known as early-stopping algorithms).


Other


* Determined is a DL Training Platform that supports random, grid, PBT, Hyperband and NAS approaches to hyperparameter optimization for PyTorch and TensorFlow (Keras and Estimator) models.
* dlib is a C++ package with a Python API which has a parameter-free optimizer based on LIPO and trust region optimizers working in tandem.
* Harmonica is a Python package for spectral hyperparameter optimization.
* hyperopt, also via hyperas and hyperopt-sklearn, are Python packages which include tree of Parzen estimators based distributed hyperparameter optimization.
* Katib is a Kubernetes-native system which includes grid search, random search, Bayesian optimization, Hyperband, and NAS based on reinforcement learning.
* nevergrad is a Python package for gradient-free optimization using techniques such as differential evolution, sequential quadratic programming, fastGA, covariance matrix adaptation, population control methods, and particle swarm optimization.
* Neural Network Intelligence (NNI) is a Python package which includes hyperparameter tuning for neural networks in local and distributed environments. Its techniques include TPE, random, anneal, evolution, SMAC, batch, grid, and Hyperband.
* parameter-sherpa is a similar Python package which includes several techniques: grid search, Bayesian and genetic optimization.
* photonai is a high-level Python API for designing and optimizing machine learning pipelines based on grid, random search and Bayesian optimization.
* pycma is a Python implementation of the Covariance Matrix Adaptation Evolution Strategy.
* rbfopt is a Python package that uses a radial basis function model.
* Tune is a Python library for hyperparameter tuning execution and integrates with/scales many existing hyperparameter optimization libraries such as hyperopt, nevergrad and scikit-optimize.
* Open Source Vizier is a Python-based service for black-box hyperparameter optimization and research, based on Google's internal Vizier service.
* Oríon is an asynchronous framework for distributed black-box optimization and hyperparameter optimization that includes built-in algorithms such as grid search, random search, MOFA, Hyperband, PBT and TPE, and several integrations with hyperparameter optimization libraries such as Ax, scikit-optimize, HpBandSter, HEBO, nevergrad and DEHB.

Commercial services


* Amazon SageMaker Automatic Model Tuning offers grid search, random search, Gaussian process-based Bayesian optimization and asynchronous successive halving.
* BigML OptiML supports mixed search domains.
* Google Cloud Vertex Vizier supports mixed search domains, multiobjective, multifidelity, and safety constraints.
* Indie Solver supports multiobjective, multifidelity and constraint optimization.
* Mind Foundry OPTaaS supports mixed search domains, multiobjective, constraints, parallel optimization and surrogate models.
* SigOpt supports mixed search domains, multiobjective, multisolution, multifidelity, constraint (linear and black-box), and parallel optimization.


See also

* Automated machine learning
* Neural architecture search
* Meta-optimization
* Model selection
* Self-tuning
* XGBoost

