Last updated

Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately. [1] [2] [3] Early versions of MTL were called "hints" [4] [5] .

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.

## Contents

In a widely cited 1997 paper, Rich Caruana gave the following characterization:

Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better. [3]

In supervised learning applications in machine learning and statistical learning theory, generalization error is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error. As a result, measurements of prediction error on the current data may not provide much information about predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The performance of a machine learning algorithm is measured by plots of the generalization error values through the learning process, which are called learning curves.

The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered.

In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is a spam-filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones, for example an English speaker may find that all emails in Russian are spam, not so for Russian speakers. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance. [6] Further examples of settings for MTL include multiclass classification and multi-label classification. [7]

Not to be confused with multi-label classification.

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to.

Multi-task learning works because regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is if the tasks share significant commonalities and are generally slightly under sampled. [8] [6] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks. [8] [9]

In mathematics, statistics, and computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.

## Methods

Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example, with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases. [10] Task relatedness can be imposed a priori or learned from the data. [7] [11] Hierarchical task relatedness can also be exploited implicitly without assuming a priori knowledge or learning relations explicitly. [8] [12] . For example, the explicit learning of sample relevance across tasks can be done to guarantee the effectiveness of joint learning across multiple domains. [8]

In mathematics, a linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results. The concept of linear combinations is central to linear algebra and related fields of mathematics. Most of this article deals with linear combinations in the context of a vector space over a field, with some generalizations given at the end of the article.

One can attempt learning a group of principal tasks using a group of auxiliary tasks, unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods which builds on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods. [9]

### Transfer of knowledge

Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet, [13] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task. [14]

Traditionally Multi-task learning and transfer of knowledge are applied to stationary learning settings. Their extension to non-stationary environments is termed Group online adaptive learning (GOAL). [15] Sharing information could be particularly useful if learners operate in continuously changing environments, because a learner could benefit from previous experience of another learner to quickly adapt to their new environment. Such group-adaptive learning has numerous applications, from predicting financial time-series, through content recommendation systems, to visual understanding for adaptive autonomous agents.

## Mathematics

### Reproducing Hilbert space of vector valued functions (RKHSvv)

The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015. [7]

#### RKHSvv concepts

Suppose the training data set is ${\displaystyle {\mathcal {S}}_{t}=\{(x_{i}^{t},y_{i}^{t})\}_{i=1}^{n_{t}}}$, with ${\displaystyle x_{i}^{t}\in {\mathcal {X}}}$, ${\displaystyle y_{i}^{t}\in {\mathcal {Y}}}$, where t indexes task, and ${\displaystyle t\in 1,...,T}$. Let ${\displaystyle n=\sum _{t=1}^{T}n_{t}}$. In this setting there is a consistent input and output space and the same loss function ${\displaystyle {\mathcal {L}}:\mathbb {R} \times \mathbb {R} \rightarrow \mathbb {R} _{+}}$ for each task: . This results in the regularized machine learning problem:

${\displaystyle \min _{f\in {\mathcal {H}}}\sum _{t=1}^{T}{\frac {1}{n_{t}}}\sum _{i=1}^{n_{t}}{\mathcal {L}}(y_{i}^{t},f_{t}(x_{i}^{t}))+\lambda ||f||_{\mathcal {H}}^{2}}$

(1)

where ${\displaystyle {\mathcal {H}}}$ is a vector valued reproducing kernel Hilbert space with functions ${\displaystyle f:{\mathcal {X}}\rightarrow {\mathcal {Y}}^{T}}$ having components ${\displaystyle f_{t}:{\mathcal {X}}\rightarrow {\mathcal {Y}}}$.

The reproducing kernel for the space ${\displaystyle {\mathcal {H}}}$ of functions ${\displaystyle f:{\mathcal {X}}\rightarrow \mathbb {R} ^{T}}$ is a symmetric matrix-valued function ${\displaystyle \Gamma :{\mathcal {X}}\times {\mathcal {X}}\rightarrow \mathbb {R} ^{T\times T}}$ , such that ${\displaystyle \Gamma (\cdot ,x)c\in {\mathcal {H}}}$ and the following reproducing property holds:

${\displaystyle \langle f(x),c\rangle _{\mathbb {R} ^{T}}=\langle f,\Gamma (x,\cdot )c\rangle _{\mathcal {H}}}$

(2)

The reproducing kernel gives rise to a representer theorem showing that any solution to equation 1 has the form:

${\displaystyle f(x)=\sum _{t=1}^{T}\sum _{i=1}^{n_{t}}\Gamma (x,x_{i}^{t})c_{i}^{t}}$

(3)

#### Separable kernels

The form of the kernel Γ induces both the representation of the feature space and structures the output across tasks. A natural simplification is to choose a separable kernel, which factors into separate kernels on the input space X and on the tasks ${\displaystyle \{1,...,T\}}$. In this case the kernel relating scalar components ${\displaystyle f_{t}}$ and ${\displaystyle f_{s}}$ is given by ${\textstyle \gamma ((x_{i},t),(x_{j},s))=k(x_{i},x_{j})k_{T}(s,t)=k(x_{i},x_{j})A_{s,t}}$. For vector valued functions ${\displaystyle f\in {\mathcal {H}}}$ we can write ${\displaystyle \Gamma (x_{i},x_{j})=k(x_{i},x_{j})A}$, where k is a scalar reproducing kernel, and A is a symmetric positive semi-definite ${\displaystyle T\times T}$ matrix. Henceforth denote ${\displaystyle S_{+}^{T}=\{{\text{PSD matrices}}\}\subset \mathbb {R} ^{T\times T}}$ .

This factorization property, separability, implies the input feature space representation does not vary by task. That is, there is no interaction between the input kernel and the task kernel. The structure on tasks is represented solely by A. Methods for non-separable kernels Γ is an current field of research.

For the separable case, the representation theorem is reduced to ${\textstyle f(x)=\sum _{i=1}^{N}k(x,x_{i})Ac_{i}}$. The model output on the training data is then KCA , where K is the ${\displaystyle n\times n}$ empirical kernel matrix with entries ${\textstyle K_{i,j}=k(x_{i},x_{j})}$, and C is the ${\displaystyle n\times T}$ matrix of rows ${\displaystyle c_{i}}$.

With the separable kernel, equation 1 can be rewritten as

${\displaystyle \min _{C\in \mathbb {R} ^{n\times T}}V(Y,KCA)+\lambda tr(KCAC^{\top })}$

(P)

where V is a (weighted) average of L applied entry-wise to Y and KCA. (The weight is zero if ${\displaystyle Y_{i}^{t}}$ is a missing observation).

Note the second term in P can be derived as follows:

{\displaystyle {\begin{aligned}\|f\|_{\mathcal {H}}^{2}&=\left\langle \sum _{i=1}^{n}k(\cdot ,x_{i})Ac_{i},\sum _{j=1}^{n}k(\cdot ,x_{j})Ac_{j}\right\rangle _{\mathcal {H}}\\&=\sum _{i,j=1}^{n}\langle k(\cdot ,x_{i})Ac_{i},k(\cdot ,x_{j})Ac_{j}\rangle _{\mathcal {H}}&{\text{(bilinearity)}}\\&=\sum _{i,j=1}^{n}\langle k(x_{i},x_{j})Ac_{i},c_{j}\rangle _{\mathbb {R} ^{T}}&{\text{(reproducing property)}}\\&=\sum _{i,j=1}^{n}k(x_{i},x_{j})c_{i}^{\top }Ac_{j}=tr(KCAC^{\top })\end{aligned}}}

There are three largely equivalent ways to represent task structure: through a regularizer; through an output metric, and through an output mapping.

Regularizer  With the separable kernel, it can be shown (below) that ${\textstyle ||f||_{\mathcal {H}}^{2}=\sum _{s,t=1}^{T}A_{t,s}^{\dagger }\langle f_{s},f_{t}\rangle _{{\mathcal {H}}_{k}}}$, where ${\displaystyle A_{t,s}^{\dagger }}$ is the ${\displaystyle t,s}$ element of the pseudoinverse of ${\displaystyle A}$, and ${\displaystyle {\mathcal {H}}_{k}}$ is the RKHS based on the scalar kernel ${\displaystyle k}$, and ${\textstyle f_{t}(x)=\sum _{i=1}^{n}k(x,x_{i})A_{t}^{\top }c_{i}}$. This formulation shows that ${\displaystyle A_{t,s}^{\dagger }}$ controls the weight of the penalty associated with ${\textstyle \langle f_{s},f_{t}\rangle _{{\mathcal {H}}_{k}}}$. (Note that ${\textstyle \langle f_{s},f_{t}\rangle _{{\mathcal {H}}_{k}}}$ arises from ${\textstyle ||f_{t}||_{{\mathcal {H}}_{k}}=\langle f_{t},f_{t}\rangle _{{\mathcal {H}}_{k}}}$.)

Output metric  an alternative output metric on ${\displaystyle {\mathcal {Y}}^{T}}$ can be induced by the inner product ${\displaystyle \langle y_{1},y_{2}\rangle _{\Theta }=\langle y_{1},\Theta y_{2}\rangle _{\mathbb {R} ^{T}}}$. With the squared loss there is an equivalence between the separable kernels ${\displaystyle k(\cdot ,\cdot )I_{T}}$ under the alternative metric, and ${\displaystyle k(\cdot ,\cdot )\Theta }$, under the canonical metric.

Output mapping  Outputs can be mapped as ${\displaystyle L:{\mathcal {Y}}^{T}\rightarrow {\mathcal {\tilde {Y}}}}$ to a higher dimensional space to encode complex structures such as trees, graphs and strings. For linear maps L, with appropriate choice of separable kernel, it can be shown that ${\displaystyle A=L^{\top }L}$.

Via the regularizer formulation, one can represent a variety of task structures easily.

• Letting ${\textstyle A^{\dagger }=\gamma I_{T}+(\gamma -\lambda ){\frac {1}{T}}\mathbf {1} \mathbf {1} ^{\top }}$ (where ${\displaystyle I_{T}}$ is the TxT identity matrix, and ${\textstyle \mathbf {1} \mathbf {1} ^{\top }}$ is the TxT matrix of ones) is equivalent to letting Γ control the variance ${\textstyle \sum _{t}||f_{t}-{\bar {f}}||_{{\mathcal {H}}_{k}}}$ of tasks from their mean ${\textstyle {\frac {1}{T}}\sum _{t}f_{t}}$. For example, blood levels of some biomarker may be taken on T patients at ${\displaystyle n_{t}}$ time points during the course of a day and interest may lie in regularizing the variance of the predictions across patients.
• Letting ${\displaystyle A^{\dagger }=\alpha I_{T}+(\alpha -\lambda )M}$ , where ${\displaystyle M_{t,s}={\frac {1}{|G_{r}|}}\mathbb {I} (t,s\in G_{r})}$ is equivalent to letting ${\displaystyle \alpha }$ control the variance measured with respect to a group mean: ${\displaystyle \sum _{r}\sum _{t\in G_{r}}||f_{t}-{\frac {1}{|G_{r}|}}\sum _{s\in G_{r})}f_{s}||}$. (Here ${\displaystyle |G_{r}|}$ the cardinality of group r, and ${\displaystyle \mathbb {I} }$ is the indicator function). For example, people in different political parties (groups) might be regularized together with respect to predicting the favorability rating of a politician. Note that this penalty reduces to the first when all tasks are in the same group.
• Letting ${\displaystyle A^{\dagger }=\delta I_{T}+(\delta -\lambda )L}$, where ${\displaystyle L=D-M}$ is the Laplacian for the graph with adjacency matrix M giving pairwise similarities of tasks. This is equivalent to giving a larger penalty to the distance separating tasks t and s when they are more similar (according to the weight ${\displaystyle M_{t,s}}$,) i.e. ${\displaystyle \delta }$ regularizes ${\displaystyle \sum _{t,s}||f_{t}-f_{s}||_{{\mathcal {H}}_{k}}^{2}M_{t,s}}$.
• All of the above choices of A also induce the additional regularization term ${\textstyle \lambda \sum _{t}||f||_{{\mathcal {H}}_{k}}^{2}}$ which penalizes complexity in f more broadly.

#### Learning tasks together with their structure

Learning problem P can be generalized to admit learning task matrix A as follows:

${\displaystyle \min _{C\in \mathbb {R} ^{n\times T},A\in S_{+}^{T}}V(Y,KCA)+\lambda tr(KCAC^{\top })+F(A)}$

(Q)

Choice of ${\displaystyle F:S_{+}^{T}\rightarrow \mathbb {R} _{+}}$ must be designed to learn matrices A of a given type. See "Special cases" below.

##### Optimization of Q

Restricting to the case of convex losses and coercive penalties Ciliberto et al. have shown that although Q is not convex jointly in C and A, a related problem is jointly convex.

Specifically on the convex set ${\displaystyle {\mathcal {C}}=\{(C,A)\in \mathbb {R} ^{n\times T}\times S_{+}^{T}|Range(C^{\top }KC)\subseteq Range(A)\}}$, the equivalent problem

${\displaystyle \min _{C,A\in {\mathcal {C}}}V(Y,KC)+\lambda tr(A^{\dagger }C^{\top }KC)+F(A)}$

(R)

is convex with the same minimum value. And if ${\displaystyle (C_{R},A_{R})}$ is a minimizer for R then ${\displaystyle (C_{R}A_{R}^{\dagger },A_{R})}$ is a minimizer for Q .

R may be solved by a barrier method on a closed set by introducing the following perturbation:

${\displaystyle \min _{C\in \mathbb {R} ^{n\times T},A\in S_{+}^{T}}V(Y,KC)+\lambda tr(A^{\dagger }(C^{\top }KC+\delta ^{2}I_{T}))+F(A)}$

(S)

The perturbation via the barrier ${\displaystyle \delta ^{2}tr(A^{\dagger })}$ forces the objective functions to be equal to ${\displaystyle +\infty }$ on the boundary of ${\displaystyle R^{n\times T}\times S_{+}^{T}}$ .

S can be solved with a block coordinate descent method, alternating in C and A. This results in a sequence of minimizers ${\displaystyle (C_{m},A_{m})}$ in S that converges to the solution in R as ${\displaystyle \delta _{m}\rightarrow 0}$, and hence gives the solution to Q .

##### Special cases

Spectral penalties - Dinnuzo et al [16] suggested setting F as the Frobenius norm ${\displaystyle {\sqrt {tr(A^{\top }A)}}}$. They optimized Q directly using block coordinate descent, not accounting for difficulties at the boundary of ${\displaystyle \mathbb {R} ^{n\times T}\times S_{+}^{T}}$.

Clustered tasks learning - Jacob et al [17] suggested to learn A in the setting where T tasks are organized in R disjoint clusters. In this case let ${\displaystyle E\in \{0,1\}^{T\times R}}$ be the matrix with ${\displaystyle E_{t,r}=\mathbb {I} ({\text{task }}t\in {\text{group }}r)}$. Setting ${\displaystyle M=I-E^{\dagger }E^{T}}$, and ${\displaystyle U={\frac {1}{T}}\mathbf {11} ^{\top }}$, the task matrix ${\displaystyle A^{\dagger }}$ can be parameterized as a function of ${\displaystyle M}$: ${\displaystyle A^{\dagger }(M)=\epsilon _{M}U+\epsilon _{B}(M-U)+\epsilon (I-M)}$ , with terms that penalize the average, between clusters variance and within clusters variance respectively of the task predictions. M is not convex, but there is a convex relaxation ${\displaystyle {\mathcal {S}}_{c}=\{M\in S_{+}^{T}:I-M\in S_{+}^{T}\land tr(M)=r\}}$. In this formulation, ${\displaystyle F(A)=\mathbb {I} (A(M)\in \{A:M\in {\mathcal {S}}_{C}\})}$.

##### Generalizations

Non-convex penalties - Penalties can be constructed such that A is constrained to be a graph Laplacian, or that A has low rank factorization. However these penalties are not convex, and the analysis of the barrier method proposed by Ciliberto et al. does not go through in these cases.

Non-separable kernels - Separable kernels are limited, in particular they do not account for structures in the interaction space between the input and output domains jointly. Future work is needed to develop models for these kernels.

## Applications

### Spam filtering

Using the principles of MTL, techniques for collaborative spam filtering that facilitates personalization have been proposed. In large scale open membership email systems, most users do not label enough messages for an individual local classifier to be effective, while the data is too noisy to be used for a global filter across all users. A hybrid global/individual classifier can be effective at absorbing the influence of users who label emails very diligently from the general public. This can be accomplished while still providing sufficient quality to users with few labeled instances. [18]

Using boosted decision trees, one can enable implicit data sharing and regularization. This learning method can be used on web-search ranking data sets. One example is to use ranking data sets from several countries. Here, multitask learning is particularly helpful as data sets from different countries vary largely in size because of the cost of editorial judgments. It has been demonstrated that learning various tasks jointly can lead to significant improvements in performance with surprising reliability. [19]

## Software package

The Multi-Task Learning via StructurAl Regularization (MALSAR) Matlab package [20] implements the following multi-task learning algorithms:

• Mean-Regularized Multi-Task Learning [21] [22]
• Multi-Task Learning with Joint Feature Selection [23]
• Robust Multi-Task Feature Learning [24]
• Trace-Norm Regularized Multi-Task Learning [25]
• Alternating Structural Optimization [26] [27]
• Incoherent Low-Rank and Sparse Learning [28]
• Clustered Multi-Task Learning [29] [30]
• Multi-Task Learning with Graph Structures

## Related Research Articles

In functional analysis, a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions and in the RKHS are close in norm, i.e., is small, then and are also pointwise close, i.e., is small for all . The reverse needs not be true.

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations and elementary functions. By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.

In linear algebra, the Gram matrix of a set of vectors in an inner product space is the Hermitian matrix of inner products, whose entries are given by .

In machine learning, kernel methods are a class of algorithms for pattern analysis, whose best known member is the support vector machine (SVM). The general task of pattern analysis is to find and study general types of relations in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over pairs of data points in raw representation.

In operator theory, a branch of mathematics, a positive-definite kernel is a generalization of a positive-definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then positive definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in Fourier analysis, probability theory, operator theory, complex function-theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problem, information theory, and other areas.

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

The structured support vector machine is a machine learning algorithm that generalizes the Support Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

Coherent states have been introduced in a physical context, first as quasi-classical states in quantum mechanics, then as the backbone of quantum optics and they are described in that spirit in the article Coherent states. However, they have generated a huge variety of generalizations, which have led to a tremendous literature in mathematical physics. In this article, we sketch the main directions of research on this line. For further details, we refer to several existing surveys.

Regularization perspectives on support-vector machines provide a way of interpreting support-vector machines (SVMs) in the context of other machine-learning algorithms. SVM algorithms categorize multidimensional data, with the goal of fitting the training set data well, but also avoiding overfitting, so that the solution generalizes to new data points. Regularization algorithms also aim to fit training set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set, but also is not too complicated, where complicated functions are functions with high norms in some function space. Specifically, Tikhonov regularization algorithms choose a function that minimizes the sum of training-set error plus the function's norm. The training-set error can be calculated with different loss functions. For example, regularized least squares is a special case of Tikhonov regularization using the squared error loss as the loss function.

In machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.

In statistical learning theory, a representer theorem is any of several related results stating that a minimizer of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.

Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.

In mathematics, the Weil–Brezin map, named after André Weil and Jonathan Brezin, is a unitary transformation that maps a Schwartz function on the real line to a smooth function on the Heisenberg manifold. The Weil–Brezin map gives a geometric interpretation of the Fourier transform, the Plancherel theorem and the Poisson summation formula. The image of Gaussian functions under the Weil–Brezin map are nil-theta functions, which are related to theta functions. The Weil–Brezin map is sometimes referred to as the Zak transform, which is widely applied in the field of physics and signal processing; however, the Weil–Brezin Map is defined via Heisenberg group geometrically, whereas there is no direct geometric or group theoretic interpretation from the Zak transform.

In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over

In statistics, kernel-independent component analysis is an efficient algorithm for independent component analysis which estimates source components by optimizing a generalized variance contrast function, which is based on representations in a reproducing kernel Hilbert space. Those contrast functions use the notion of mutual information as a measure of statistical independence.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

Low-rank matrix approximations are essential tools in the application of kernel methods to large-scale learning problems.

In statistics, functional correlation is a dimensionality reduction technique used to quantify the correlation and dependence between two variables when the data is functional. Several approaches have been developed to quantify the relation between two functional variables.

## References

1. Baxter, J. (2000). A model of inductive bias learning" Journal of Artificial Intelligence Research 12:149--198, On-line paper
2. Thrun, S. (1996). Is learning the n-th thing any easier than learning the first?. In Advances in Neural Information Processing Systems 8, pp. 640--646. MIT Press. Paper at Citeseer
3. Caruana, R. (1997). "Multi-task learning" (PDF). Machine Learning. 28: 41–75. doi:10.1023/A:1007379606734.
4. Suddarth, S., Kergosien, Y. (1990). Rule-injection hints as a means of improving network performance and learning time. EURASIP Workshop. Neural Networks pp. 120-129. Lecture Notes in Computer Science. Springer.
5. Abu-Mostafa, Y. S. (1990). "Learning from hints in neural networks". Journal of Complexity. 6 (2): 192–198. doi:10.1016/0885-064x(90)90006-y.
7. Ciliberto, C. (2015). "Convex Learning of Multiple Tasks and their Structure". arXiv: [cs.LG].
8. Hajiramezanali, E. & Dadaneh, S. Z. & Karbalayghareh, A. & Zhou, Z. & Qian, X. Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. https://arxiv.org/pdf/1810.09433.pdf
9. Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., & Pontil, M., (2012) Exploiting Unrelated Tasks in Multi-Task Learning. http://jmlr.csail.mit.edu/proceedings/papers/v22/romera12/romera12.pdf
10. Kumar, A., & Daume III, H., (2012) Learning Task Grouping and Overlap in Multi-Task Learning. http://icml.cc/2012/papers/690.pdf
11. Jawanpuria, P., & Saketha Nath, J., (2012) A Convex Feature Learning Formulation for Latent Task Structure Discovery. http://icml.cc/2012/papers/90.pdf
12. Zweig, A. & Weinshall, D. Hierarchical Regularization Cascade for Joint Learning. Proceedings: of 30th International Conference on Machine Learning (ICML), Atlanta GA, June 2013. http://www.cs.huji.ac.il/~daphna/papers/Zweig_ICML2013.pdf
13. Szegedy, C. (2014). Going Deeper with Convolutions. Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. pp. 1–9. arXiv:. doi:10.1109/CVPR.2015.7298594. ISBN   978-1-4673-6964-0.
14. Roig, Gemma. "Deep Learning Overview" (PDF).
15. Zweig, A. & Chechik, G. Group online adaptive learning. Machine Learning, DOI 10.1007/s10994-017- 5661-5, August 2017. http://rdcu.be/uFSv
16. Dinuzzo, Francesco (2011). "Learning output kernels with block coordinate descent" (PDF). Proceedings of the 28th International Conference on Machine Learning (ICML-11).
17. Jacob, Laurent (2009). "Clustered multi-task learning: A convex formulation". Advances in Neural Information Processing Systems.
18. Attenberg, J., Weinberger, K., & Dasgupta, A. Collaborative Email-Spam Filtering with the Hashing-Trick. http://www.cse.wustl.edu/~kilian/papers/ceas2009-paper-11.pdf
19. Chappelle, O., Shivaswamy, P., & Vadrevu, S. Multi-Task Learning for Boosting with Application to Web Search Ranking. http://www.cse.wustl.edu/~kilian/papers/multiboost2010.pdf
20. Zhou, J., Chen, J. and Ye, J. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2012. http://www.public.asu.edu/~jye02/Software/MALSAR. On-line manual
21. Evgeniou, T., & Pontil, M. (2004). Regularized multi–task learning. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117).
22. Evgeniou, T.; Micchelli, C.; Pontil, M. (2005). "Learning multiple tasks with kernel methods" (PDF). Journal of Machine Learning Research. 6: 615.
23. Argyriou, A.; Evgeniou, T.; Pontil, M. (2008a). "Convex multi-task feature learning". Machine Learning. 73 (3): 243–272. doi:10.1007/s10994-007-5040-8.
24. Chen, J., Zhou, J., & Ye, J. (2011). Integrating low-rank and group-sparse structures for robust multi-task learning. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.
25. Ji, S., & Ye, J. (2009). An accelerated gradient method for trace norm minimization. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 457–464).
26. Ando, R.; Zhang, T. (2005). "A framework for learning predictive structures from multiple tasks and unlabeled data" (PDF). The Journal of Machine Learning Research. 6: 1817–1853.
27. Chen, J., Tang, L., Liu, J., & Ye, J. (2009). A convex formulation for learning shared structures from multiple tasks. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 137–144).
28. Chen, J., Liu, J., & Ye, J. (2010). Learning incoherent sparse and low-rank patterns from multiple tasks. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1179–1188).
29. Jacob, L., Bach, F., & Vert, J. (2008). Clustered multi-task learning: A convex formulation. Advances in Neural Information Processing Systems， 2008
30. Zhou, J., Chen, J., & Ye, J. (2011). Clustered multi-task learning via alternating structure optimization. Advances in Neural Information Processing Systems.