Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.
Inherently, multi-task learning is a multi-objective optimization problem having trade-offs between different tasks.
Early versions of MTL were called "hints".
In a widely cited 1997 paper, Rich Caruana gave the following characterization:
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones; for example, an English speaker may find that all emails in Russian are spam, which is not so for Russian speakers. Yet there is a definite commonality in this classification task across users; for example, one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance. Further examples of settings for MTL include multiclass classification and multi-label classification.
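The shared-representation idea behind the spam example can be sketched in code. The following is a minimal illustration rather than any method from the cited literature: a network with one trunk shared by all users and one small classification head per user, so that every user's spam data helps train the common text representation. The class and parameter names are illustrative assumptions.
<syntaxhighlight lang="python">
# Minimal sketch of "hard parameter sharing" for the spam example (illustrative only).
import torch
import torch.nn as nn

class SharedSpamFilter(nn.Module):
    def __init__(self, n_features: int, n_users: int, hidden: int = 64):
        super().__init__()
        # Shared representation, trained jointly on all users' emails.
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        # One task-specific head per user captures individual differences.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_users)])

    def forward(self, x: torch.Tensor, user: int) -> torch.Tensor:
        return self.heads[user](self.shared(x))

model = SharedSpamFilter(n_features=300, n_users=5)
emails = torch.randn(8, 300)           # a batch of 8 email feature vectors
spam_logits = model(emails, user=2)    # spam scores under user 2's task head
</syntaxhighlight>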
Multi-task learning works because regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled.
However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.
[Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M. (2012). Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf]
Methods
The key challenge in multi-task learning is how to combine learning signals from multiple tasks into a single model. This depends strongly on how well the different tasks agree with, or contradict, each other. There are several ways to address this challenge:
Task grouping and overlap
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a
linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases. Task relatedness can be imposed a priori or learned from the data.
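As an illustration of the basis view just described, the following sketch (an assumption for exposition, not a published algorithm) represents each task's parameter vector as a sparse linear combination of a shared basis and reads task groups off the overlap of nonzero coefficients; all sizes and names are made up for the example.
<syntaxhighlight lang="python">
# Task parameters as sparse linear combinations of a shared basis (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 10, 4, 6                      # feature dim, number of basis elements, number of tasks
B = rng.standard_normal((d, k))         # shared basis (columns are basis elements)

S = np.zeros((k, T))                    # sparse combination coefficients, one column per task
S[:2, :3] = rng.standard_normal((2, 3)) # tasks 0-2 use only basis elements {0, 1}
S[2:, 3:] = rng.standard_normal((2, 3)) # tasks 3-5 use only basis elements {2, 3}

W = B @ S                               # column t is the parameter vector of task t

# Tasks whose nonzero coefficients overlap share basis elements, i.e. form a group.
support = [set(np.flatnonzero(S[:, t])) for t in range(T)]
related = [(s, t) for s in range(T) for t in range(s + 1, T) if support[s] & support[t]]
print(related)                          # pairs within {0, 1, 2} and within {3, 4, 5}
</syntaxhighlight>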
Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly.
[Hajiramezanali, E., Dadaneh, S. Z., Karbalayghareh, A., Zhou, Z., & Qian, X. (2018). Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.] For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains.
Exploiting unrelated tasks
One can attempt to learn a group of principal tasks using a group of auxiliary tasks unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods which build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.
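One simple way to encourage two group representations to be orthogonal is to add a penalty on the overlap of their subspaces to the joint training loss. The sketch below uses a Frobenius-norm penalty as an illustrative assumption; the exact penalty in the cited work may differ.
<syntaxhighlight lang="python">
# Illustrative orthogonality penalty between two group representations.
import numpy as np

def orthogonality_penalty(U1: np.ndarray, U2: np.ndarray) -> float:
    """||U1^T U2||_F^2 is zero exactly when the two subspaces are orthogonal."""
    return float(np.linalg.norm(U1.T @ U2, "fro") ** 2)

U1 = np.linalg.qr(np.random.randn(20, 3))[0]   # representation for task group 1
U2 = np.linalg.qr(np.random.randn(20, 3))[0]   # representation for task group 2
print(orthogonality_penalty(U1, U2))           # added as a term to the joint training loss
</syntaxhighlight>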
Transfer of knowledge
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet, an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.
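Both uses of a pre-trained network can be sketched briefly. The example below is a hedged illustration assuming torchvision (0.13 or later) provides a pre-trained GoogLeNet; the 10-class replacement head and the choice of what to freeze are arbitrary for the example.
<syntaxhighlight lang="python">
# Illustrative transfer-learning sketch with a pre-trained GoogLeNet (assumes torchvision >= 0.13).
import torch
import torch.nn as nn
from torchvision import models

backbone = models.googlenet(weights="IMAGENET1K_V1")   # weights pre-trained on ImageNet
backbone.eval()

# Use 1: fixed feature extractor -- freeze all pre-trained weights.
for p in backbone.parameters():
    p.requires_grad = False

# Use 2: fine-tuning -- replace the classifier head for a new task (here 10 classes);
# the freshly created head is trainable and can be fitted on the new data set.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

with torch.no_grad():
    logits = backbone(torch.randn(1, 3, 224, 224))      # new-task scores for one image
</syntaxhighlight>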
Multiple non-stationary tasks
Traditionally, multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed ''group online adaptive learning'' (GOAL). Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from the previous experience of another learner to adapt quickly to its new environment. Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents.
Multi-task optimization
Multi-task optimization focuses on solving multiple related optimization tasks simultaneously. The paradigm has been inspired by the well-established concepts of transfer learning and multi-task learning in predictive analytics.
The key motivation behind multi-task optimization is that if optimization tasks are related to each other in terms of their optimal solutions or the general characteristics of their function landscapes, the search progress on one task can be transferred to substantially accelerate the search on the others.
The success of the paradigm is not necessarily limited to one-way knowledge transfers from simpler to more complex tasks. In practice, one may intentionally attempt to solve a more difficult task, which may in turn unintentionally solve several smaller problems.
There is a direct relationship between multitask optimization and multi-objective optimization.
In some cases, the simultaneous training of seemingly related tasks may hinder performance compared to single-task models. Commonly, MTL models employ task-specific modules on top of a joint feature representation obtained using a shared module. Since this joint representation must capture useful features across all tasks, MTL may hinder individual task performance if the different tasks seek conflicting representations, i.e., if the gradients of different tasks point in opposing directions or differ significantly in magnitude. This phenomenon is commonly referred to as negative transfer. To mitigate this issue, various MTL optimization methods have been proposed. Commonly, the per-task gradients are combined into a joint update direction through various aggregation algorithms or heuristics.
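As a concrete illustration of gradient aggregation, the following sketch combines per-task gradients by removing mutually conflicting components before averaging, in the spirit of "gradient surgery" heuristics; it is an assumed, simplified example rather than any specific published algorithm.
<syntaxhighlight lang="python">
# Illustrative aggregation of per-task gradients into one joint update direction.
import numpy as np

def combine_gradients(grads):
    """Project away conflicting components, then average (simplified heuristic)."""
    adjusted = []
    for i, g in enumerate(grads):
        g = g.astype(float).copy()
        for j, h in enumerate(grads):
            if i != j and g @ h < 0:          # the two task gradients conflict
                g -= (g @ h) / (h @ h) * h    # remove the component opposing task j
        adjusted.append(g)
    return np.mean(adjusted, axis=0)          # joint update direction

g_task1 = np.array([1.0, 0.5])
g_task2 = np.array([-0.8, 1.0])               # partially opposes task 1
print(combine_gradients([g_task1, g_task2]))
</syntaxhighlight>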
There are several common approaches for multi-task optimization: Bayesian optimization, evolutionary computation, and approaches based on game theory.
Multi-task Bayesian optimization
Multi-task Bayesian optimization is a modern model-based approach that leverages the concept of knowledge transfer to speed up the automatic hyperparameter optimization process of machine learning algorithms.[Swersky, K., Snoek, J., & Adams, R. P. (2013). Multi-task Bayesian optimization. Advances in Neural Information Processing Systems (pp. 2004–2012).] The method builds a multi-task Gaussian process model on the data originating from different searches progressing in tandem. The captured inter-task dependencies are thereafter utilized to better inform the subsequent sampling of candidate solutions in respective search spaces.
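The covariance structure that lets one Gaussian process model several searches at once can be illustrated as follows. This sketch assumes a product ("intrinsic coregionalization") form of the joint kernel with a made-up task-similarity matrix; the cited method's exact kernel and its hyperparameter learning are not reproduced here.
<syntaxhighlight lang="python">
# Illustrative joint covariance for a multi-task Gaussian process.
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel between rows of X1 and X2."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

X = np.random.rand(5, 2)                 # 5 hyperparameter settings, 2 dimensions
tasks = np.array([0, 0, 1, 1, 1])        # which search each observation belongs to
B = np.array([[1.0, 0.7],                # assumed task-similarity (coregionalization) matrix
              [0.7, 1.0]])

K = rbf(X, X) * B[np.ix_(tasks, tasks)]  # joint covariance over all (input, task) pairs
print(K.shape)                           # (5, 5); one GP fit with K shares information across tasks
</syntaxhighlight>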
Evolutionary multi-tasking
Evolutionary multi-tasking has been explored as a means of exploiting the implicit parallelism of population-based search algorithms to simultaneously progress multiple distinct optimization tasks. By mapping all tasks to a unified search space, the evolving population of candidate solutions can harness the hidden relationships between them through continuous genetic transfer. This is induced when solutions associated with different tasks crossover.[Ong, Y. S., & Gupta, A. (2016). Evolutionary multitasking: a computer science view of cognitive multitasking. Cognitive Computation, 8(2), 125–142.] Recently, modes of knowledge transfer that are different from direct solution crossover have been explored.
Game-theoretic optimization
Game-theoretic approaches to multi-task optimization propose to view the optimization problem as a game, where each task is a player. All players compete through the reward matrix of the game, and try to reach a solution that satisfies all players (all tasks). This view provides insight about how to build efficient algorithms based on gradient descent optimization (GD), which is particularly important for training deep neural networks. In GD for MTL, the problem is that each task provides its own loss, and it is not clear how to combine all losses and create a single unified gradient, leading to several different aggregation strategies. This aggregation problem can be solved by defining a game matrix where the reward of each player is the agreement of its own gradient with the common gradient, and then setting the common gradient to be the Nash cooperative bargaining solution of that system.
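One way to make this concrete, offered here only as a rough numerical sketch rather than the published algorithm, is to pick the common update direction that maximizes the product of per-task gains (the Nash bargaining objective), e.g. by projected gradient ascent on the sum of log-gains over the unit ball.
<syntaxhighlight lang="python">
# Rough sketch: common update direction as a Nash bargaining solution over task gradients.
import numpy as np

def nash_bargaining_direction(grads, iters=500, step=0.01):
    """Pick d maximizing sum(log(g_i . d)) over the unit ball (illustrative)."""
    G = np.stack(grads)                        # one task gradient per row
    d = G.mean(axis=0)
    d = d / np.linalg.norm(d)
    for _ in range(iters):
        gains = G @ d                          # each task's "reward" for direction d
        if np.any(gains <= 0):
            d = d + step * G.sum(axis=0)       # nudge back toward positive gains for all tasks
        else:
            d = d + step * (G / gains[:, None]).sum(axis=0)  # ascent on sum of log-gains
        norm = np.linalg.norm(d)
        if norm > 1.0:
            d = d / norm                       # project back onto the unit ball
    return d

print(nash_bargaining_direction([np.array([1.0, 0.2]), np.array([-0.3, 1.0])]))
</syntaxhighlight>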
Applications
Algorithms for multi-task optimization span a wide array of real-world applications. Recent studies highlight the potential for speed-ups in the optimization of engineering design parameters by conducting related designs jointly in a multi-task manner. In machine learning, the transfer of optimized features across related data sets can enhance the efficiency of the training process as well as improve the generalization capability of learned models. In addition, the concept of multi-tasking has led to advances in automatic hyperparameter optimization of machine learning models and ensemble learning.
Applications have also been reported in cloud computing, with future developments geared towards cloud-based on-demand optimization services that can cater to multiple customers simultaneously. Recent work has additionally shown applications in chemistry. In addition, some recent works have applied multi-task optimization algorithms in industrial manufacturing.
Mathematics
Reproducing kernel Hilbert space of vector-valued functions (RKHSvv)
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.
RKHSvv concepts
Suppose the training data set is <math>\mathcal{S}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}</math>, with <math>x_i^t \in \mathcal{X}</math> and <math>y_i^t \in \mathcal{Y}</math>, where the index <math>t \in \{1,\dots,T\}</math> identifies the task and <math>n_t</math> is the number of examples for task <math>t</math>. Let <math>n = \sum_{t=1}^T n_t</math>. In this setting there is a consistent input and output space and the same loss function <math>\mathcal{L}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+</math> for each task. This results in the regularized machine learning problem:
<math>\min_{f \in \mathcal{H}} \sum_{t=1}^T \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\big(y_i^t, f_t(x_i^t)\big) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (1)</math>
where <math>\mathcal{H}</math> is a vector-valued reproducing kernel Hilbert space with functions <math>f: \mathcal{X} \to \mathcal{Y}^T</math> having components <math>f_t: \mathcal{X} \to \mathcal{Y}</math>.
The reproducing kernel for the space <math>\mathcal{H}</math> of functions <math>f: \mathcal{X} \to \mathbb{R}^T</math> is a symmetric matrix-valued function <math>\Gamma: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}</math>, such that <math>\Gamma(\cdot, x)\,c \in \mathcal{H}</math> and the following reproducing property holds:
<math>\langle f(x), c \rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)\,c \rangle_{\mathcal{H}} \qquad (2)</math>
The reproducing kernel gives rise to a representer theorem showing that any solution to equation (1) has the form:
<math>f(x) = \sum_{t=1}^T \sum_{i=1}^{n_t} \Gamma(x, x_i^t)\, c_i^t \qquad (3)</math>
Separable kernels
The form of the kernel <math>\Gamma</math> induces both the representation of the feature space and structures the output across tasks. A natural simplification is to choose a ''separable kernel'', which factors into separate kernels on the input space <math>\mathcal{X}</math> and on the tasks <math>\{1,\dots,T\}</math>. In this case the kernel relating scalar components <math>f_t</math> and <math>f_s</math> is given by <math>\gamma\big((x_i, t), (x_j, s)\big) = k(x_i, x_j)\, A_{s,t}</math>. For vector-valued functions <math>f \in \mathcal{H}</math> we can write <math>f(x) = \sum_{i=1}^n k(x, x_i)\, A c_i</math>, where <math>k</math> is a scalar reproducing kernel, and <math>A</math> is a symmetric positive semi-definite <math>T \times T</math> matrix. Henceforth denote <math>S_+^T = \{\text{PSD matrices}\} \subset \mathbb{R}^{T \times T}</math>.
This factorization property, separability, implies that the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by <math>A</math>. Methods for non-separable kernels <math>\Gamma</math> are a current field of research.
For the separable case, the representer theorem reduces to <math>f(x) = \sum_{i=1}^n k(x, x_i)\, A c_i</math>. The model output on the training data is then <math>KCA</math>, where <math>K</math> is the <math>n \times n</math> empirical kernel matrix with entries <math>K_{i,j} = k(x_i, x_j)</math>, and <math>C</math> is the <math>n \times T</math> matrix whose rows are the coefficient vectors <math>c_i</math>.
With the separable kernel, equation (1) can be rewritten as
<math>\min_{C \in \mathbb{R}^{n \times T}} V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) \qquad (4)</math>
where <math>V</math> is a (weighted) average of <math>\mathcal{L}</math> applied entry-wise to <math>Y</math> and <math>KCA</math>. (The weight is zero if <math>y_i^t</math> is a missing observation.)
Note that the second term in (4) can be derived as follows:
<math>\|f\|_{\mathcal{H}}^2 = \left\langle \sum_{i=1}^n k(\cdot, x_i)\, A c_i, \; \sum_{j=1}^n k(\cdot, x_j)\, A c_j \right\rangle_{\mathcal{H}} = \sum_{i,j=1}^n k(x_i, x_j)\, \langle A c_i, c_j \rangle_{\mathbb{R}^T} = \sum_{i,j=1}^n k(x_i, x_j)\, c_j^\top A c_i = \operatorname{tr}(KCAC^\top)</math>
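For readers who prefer a numerical view, the snippet below evaluates the separable-kernel objective (4) with a squared loss on randomly generated placeholder data; the kernel, task matrix and regularization strength are arbitrary assumptions for illustration.
<syntaxhighlight lang="python">
# Numerical sketch of objective (4): V(Y, KCA) + lambda * tr(K C A C^T), with squared loss.
import numpy as np

rng = np.random.default_rng(0)
n, T = 20, 3                                            # examples, tasks
X = rng.standard_normal((n, 2))                         # placeholder inputs
Y = rng.standard_normal((n, T))                         # one output column per task

K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # scalar kernel k on the inputs
A = np.eye(T)                                           # task-structure matrix (here: independent tasks)
C = rng.standard_normal((n, T))                         # coefficient matrix
lam = 0.1

model_out = K @ C @ A                                   # model evaluated on the training inputs
objective = np.mean((Y - model_out) ** 2) + lam * np.trace(K @ C @ A @ C.T)
print(objective)
</syntaxhighlight>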
Known task structure
Task structure representations
There are three largely equivalent ways to represent task structure: through a regularizer, through an output metric, or through an output mapping.
Task structure examples
Via the regularizer formulation, one can represent a variety of task structures easily.
* Letting <math>A^\dagger = \gamma I_T + (\gamma - \lambda)\tfrac{1}{T}\mathbf{1}\mathbf{1}^\top</math> (where <math>I_T</math> is the <math>T \times T</math> identity matrix, and <math>\mathbf{1}\mathbf{1}^\top</math> is the <math>T \times T</math> matrix of ones) is equivalent to letting <math>\gamma</math> control the variance <math>\sum_t \|f_t - \bar{f}\|_{\mathcal{H}_k}</math> of the tasks from their mean <math>\bar{f} = \tfrac{1}{T}\sum_t f_t</math>. For example, blood levels of some biomarker may be taken on <math>T</math> patients at <math>n_t</math> time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
* Letting <math>A^\dagger = \alpha I_T + (\alpha - \lambda) M</math>, where <math>M_{t,s} = \tfrac{1}{|G_r|}\mathbb{I}(t, s \in G_r)</math>, is equivalent to letting <math>\alpha</math> control the variance measured with respect to a group mean: <math>\sum_r \sum_{t \in G_r} \big\|f_t - \tfrac{1}{|G_r|}\sum_{s \in G_r} f_s\big\|</math>. (Here <math>|G_r|</math> is the cardinality of group <math>r</math>, and <math>\mathbb{I}</math> is the indicator function). For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
* Letting <math>A^\dagger = \delta I_T + (\delta - \lambda) L</math>, where <math>L = D - M</math> is the Laplacian for the graph with adjacency matrix <math>M</math> giving pairwise similarities of tasks. This is equivalent to giving a larger penalty to the distance separating tasks <math>t</math> and <math>s</math> when they are more similar (according to the weight <math>M_{t,s}</math>), i.e. <math>\delta</math> regularizes <math>\sum_{t,s} \|f_t - f_s\|_{\mathcal{H}_k}^2 M_{t,s}</math>.
* All of the above choices of <math>A</math> also induce the additional regularization term <math>\lambda \sum_t \|f_t\|_{\mathcal{H}_k}^2</math> which penalizes complexity in <math>f</math> more broadly.
Learning tasks together with their structure
The learning problem (4) can be generalized to admit learning the task matrix <math>A</math> as well:
<math>\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KCA) + \lambda \operatorname{tr}(KCAC^\top) + F(A) \qquad (5)</math>
The choice of the penalty <math>F: S_+^T \to \mathbb{R}_+</math> must be designed to learn matrices <math>A</math> of a given type. See "Special cases" below.
Optimization of problem (5)
Restricting to the case of convex losses and coercive penalties, Ciliberto ''et al.'' have shown that although (5) is not convex jointly in ''C'' and ''A'', a related problem is jointly convex.
Specifically, on the convex set <math>\mathcal{C} = \{(C, A) \in \mathbb{R}^{n \times T} \times S_+^T \mid \operatorname{Range}(C^\top K C) \subseteq \operatorname{Range}(A)\}</math>, the equivalent problem
<math>\min_{(C, A) \in \mathcal{C}} V(Y, KC) + \lambda \operatorname{tr}(A^\dagger C^\top K C) + F(A) \qquad (6)</math>
is convex with the same minimum value. And if <math>(C_R, A_R)</math> is a minimizer for (6), then <math>(C_R A_R^\dagger, A_R)</math> is a minimizer for (5).
Problem (6) may be solved by a barrier method on a closed set by introducing the following perturbation:
<math>\min_{C \in \mathbb{R}^{n \times T},\, A \in S_+^T} V(Y, KC) + \lambda \operatorname{tr}\big(A^\dagger (C^\top K C + \delta^2 I_T)\big) + F(A) \qquad (7)</math>
The perturbation via the barrier <math>\delta^2 \operatorname{tr}(A^\dagger)</math> forces the objective functions to be equal to <math>+\infty</math> on the boundary of <math>\mathbb{R}^{n \times T} \times S_+^T</math>.
Problem (7) can be solved with a block coordinate descent method, alternating in ''C'' and ''A''. This results in a sequence of minimizers <math>(C_m, A_m)</math> of (7) that converges to the solution of (6) as <math>\delta_m \to 0</math>, and hence gives the solution to (5).
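The alternating structure of such a scheme can be sketched as follows. This is a schematic illustration only: it uses a squared loss, a fixed regularization weight, and plain alternating gradient steps with a projection onto the positive semi-definite cone, rather than the exact barrier-method updates analysed by Ciliberto et al.
<syntaxhighlight lang="python">
# Schematic alternating updates in C and A for a separable-kernel objective (illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, T, lam, step = 20, 3, 0.1, 0.01
W = np.abs(rng.standard_normal((n, n)))
K = W @ W.T / n                              # placeholder positive semi-definite kernel matrix
Y = rng.standard_normal((n, T))              # one output column per task
C = np.zeros((n, T))                         # coefficient block
A = np.eye(T)                                # task-structure block

def project_psd(M):
    """Project a symmetric matrix onto the positive semi-definite cone."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T

# Objective (schematic): mean((K C A - Y)^2) + lam * tr(K C A C^T)
for _ in range(200):
    # C-step: gradient step on the objective with A held fixed.
    R = K @ C @ A - Y
    C -= step * (2 * K @ R @ A / Y.size + 2 * lam * K @ C @ A)
    # A-step: gradient step with C held fixed, then project back onto the PSD cone.
    R = K @ C @ A - Y
    grad_A = 2 * (K @ C).T @ R / Y.size + lam * C.T @ K @ C
    A = project_psd(A - step * grad_A)

print(np.mean((K @ C @ A - Y) ** 2))         # data-fit term after the alternating updates
</syntaxhighlight>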
Special cases
Spectral penalties - Dinuzzo ''et al.'' suggested setting ''F'' as the Frobenius norm <math>\sqrt{\operatorname{tr}(A^\top A)}</math>. They optimized (5) directly using block coordinate descent, not accounting for difficulties at the boundary of <math>\mathbb{R}^{n \times T} \times S_+^T</math>.
Clustered tasks learning - Jacob ''et al.'' suggested learning ''A'' in the setting where ''T'' tasks are organized in ''R'' disjoint clusters. In this case let <math>E \in \{0,1\}^{T \times R}</math> be the matrix with <math>E_{t,r} = \mathbb{I}(\text{task } t \in \text{group } r)</math>. Setting <math>M = E E^\dagger</math> and <math>U = \tfrac{1}{T}\mathbf{1}\mathbf{1}^\top</math>, the task matrix <math>A^\dagger</math> can be parameterized as a function of <math>M</math>: <math>A^\dagger(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon (I - M)</math>, with terms that penalize the average, the between-cluster variance, and the within-cluster variance of the task predictions, respectively. The set of feasible <math>M</math> is not convex, but there is a convex relaxation <math>\mathcal{S}_C = \{M \in S_+^T : I - M \in S_+^T,\ \operatorname{tr}(M) = R\}</math>. In this formulation, <math>F(A) = \mathbb{I}(A(M) \in \{A : M \in \mathcal{S}_C\})</math>.
Generalizations
Non-convex penalties - Penalties can be constructed such that A is constrained to be a graph Laplacian, or such that A has a low-rank factorization. However, these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.
Non-separable kernels - Separable kernels are limited, in particular they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.
Software package
A Matlab package called Multi-Task Learning via StructurAl Regularization (MALSAR) implements the following multi-task learning algorithms: Mean-Regularized Multi-Task Learning, Multi-Task Learning with Joint Feature Selection, Robust Multi-Task Feature Learning, Trace-Norm Regularized Multi-Task Learning, Alternating Structural Optimization, Incoherent Low-Rank and Sparse Learning, Robust Low-Rank Multi-Task Learning, Clustered Multi-Task Learning,[Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems.] and Multi-Task Learning with Graph Structures.
Literature
* Waegeman, W., Dembczynski, K., & Huellermeier, E. Multi-Target Prediction: A Unifying View on Problems and Methods. https://arxiv.org/abs/1809.02352v1
See also
* Artificial intelligence
* Artificial neural network
* Automated machine learning (AutoML)
* Evolutionary computation
* Foundation model
* General game playing
* Human-based genetic algorithm
* Kernel methods for vector output
* Multiple-criteria decision analysis
* Multi-objective optimization
* Multicriteria classification
* Robot learning
* Transfer learning
* James–Stein estimator
References
External links
The Biosignals Intelligence Group at UIUC
Software
* [https://web.archive.org/web/20131224113826/http://klcl.pku.edu.cn/member/sunxu/code.htm Online Multi-Task Learning Toolkit (OMT)] - A general-purpose online multi-task learning toolkit based on conditional random field models and stochastic gradient descent training (C#, .NET)