
In computer science, specifically in statistical learning theory, a representer theorem is any of several related results stating that a minimizer f^* of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel functions evaluated at the input points of the training data.


Formal statement

The following Representer Theorem and its proof are due to Schölkopf, Herbrich, and Smola:

Theorem: Consider a positive-definite real-valued kernel k \colon \mathcal{X} \times \mathcal{X} \to \R on a non-empty set \mathcal{X} with a corresponding reproducing kernel Hilbert space H_k. Let there be given
* a training sample (x_1, y_1), \dotsc, (x_n, y_n) \in \mathcal{X} \times \R,
* a strictly increasing real-valued function g \colon [0, \infty) \to \R, and
* an arbitrary error function E \colon (\mathcal{X} \times \R^2)^n \to \R \cup \lbrace \infty \rbrace,
which together define the regularized empirical risk functional
: f \mapsto E\left( (x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n)) \right) + g\left( \lVert f \rVert \right).
Then any minimizer of this functional,
: f^* = \operatorname{argmin}_{f \in H_k} \left\lbrace E\left( (x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n)) \right) + g\left( \lVert f \rVert \right) \right\rbrace, \quad (*)
admits a representation of the form
: f^*(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i),
where \alpha_i \in \R for all 1 \le i \le n.

Proof: Define a mapping \varphi \colon \mathcal{X} \to H_k by \varphi(x) = k(\cdot, x). Since k is a reproducing kernel,
: f(x) = \langle f, \varphi(x) \rangle \text{ for all } f \in H_k, \quad \text{and in particular} \quad \langle \varphi(x_i), \varphi(x_j) \rangle = k(x_i, x_j).
Given the training points x_1, \ldots, x_n, orthogonal projection decomposes any f \in H_k into a component lying in \operatorname{span} \lbrace \varphi(x_1), \ldots, \varphi(x_n) \rbrace and a component orthogonal to it:
: f = \sum_{i=1}^n \alpha_i \varphi(x_i) + v, \qquad \langle v, \varphi(x_i) \rangle = 0 \text{ for all } 1 \le i \le n.
This orthogonal decomposition and the reproducing property together show that applying f to any training point x_j produces
: f(x_j) = \left\langle \sum_{i=1}^n \alpha_i \varphi(x_i) + v, \varphi(x_j) \right\rangle = \sum_{i=1}^n \alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle,
which we observe is independent of v. Consequently, the value of the error function E in (*) is likewise independent of v. For the second term (the regularization term), since v is orthogonal to \sum_{i=1}^n \alpha_i \varphi(x_i) and g is strictly increasing, we have
: \begin{align} g\left( \lVert f \rVert \right) &= g\left( \Big\lVert \sum_{i=1}^n \alpha_i \varphi(x_i) + v \Big\rVert \right) \\ &= g\left( \sqrt{ \Big\lVert \sum_{i=1}^n \alpha_i \varphi(x_i) \Big\rVert^2 + \lVert v \rVert^2 } \right) \\ &\ge g\left( \Big\lVert \sum_{i=1}^n \alpha_i \varphi(x_i) \Big\rVert \right). \end{align}
Therefore setting v = 0 does not affect the first term of (*), while it strictly decreases the second term whenever v \ne 0. Consequently, any minimizer f^* in (*) must have v = 0, i.e., it must be of the form
: f^*(\cdot) = \sum_{i=1}^n \alpha_i \varphi(x_i) = \sum_{i=1}^n \alpha_i k(\cdot, x_i),
which is the desired result.
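Although H_k is typically infinite-dimensional, the theorem guarantees that the minimizer is fully determined by the n real coefficients \alpha_1, \ldots, \alpha_n. As a minimal illustration of the guaranteed form (not part of the theorem or its proof), the following sketch evaluates f^*(x) = \sum_{i=1}^n \alpha_i k(x, x_i) for an assumed Gaussian RBF kernel; the kernel, training points, and coefficients here are hypothetical placeholders.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2); an assumed, illustrative choice.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_expansion(x, train_X, alpha, kernel=rbf_kernel):
    # Evaluate f(x) = sum_i alpha_i * k(x, x_i), the representation guaranteed by the theorem.
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, train_X))

# Hypothetical training inputs and coefficients, purely for illustration.
train_X = np.array([[0.0], [1.0], [2.0]])
alpha = np.array([0.5, -1.0, 0.25])
print(kernel_expansion(np.array([0.5]), train_X, alpha))
```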



Generalizations

The Theorem stated above is a particular example of a family of results that are collectively referred to as "representer theorems"; here we describe several such.

The first statement of a representer theorem was due to Kimeldorf and Wahba for the special case in which
: \begin{align} E\left( (x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n)) \right) &= \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2, \\ g(\lVert f \rVert) &= \lambda \lVert f \rVert^2 \end{align}
for \lambda > 0. Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function g(\cdot) of the Hilbert space norm.

It is possible to generalize further by augmenting the regularized empirical risk functional through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization
: \tilde{f}^* = \operatorname{argmin} \left\lbrace E\left( (x_1, y_1, \tilde{f}(x_1)), \ldots, (x_n, y_n, \tilde{f}(x_n)) \right) + g\left( \lVert f \rVert \right) \mid \tilde{f} = f + h \in H_k \oplus \operatorname{span} \lbrace \psi_p \mid 1 \le p \le M \rbrace \right\rbrace, \quad (\dagger)
i.e., we consider functions of the form \tilde{f} = f + h, where f \in H_k and h is an unpenalized function lying in the span of a finite set of real-valued functions \lbrace \psi_p \colon \mathcal{X} \to \R \mid 1 \le p \le M \rbrace. Under the assumption that the n \times M matrix \left( \psi_p(x_i) \right)_{ip} has rank M, they show that the minimizer \tilde{f}^* in (\dagger) admits a representation of the form
: \tilde{f}^*(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i) + \sum_{p=1}^M \beta_p \psi_p(\cdot),
where \alpha_i, \beta_p \in \R and the \beta_p are all uniquely determined.

The conditions under which a representer theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:

Theorem: Let \mathcal{X} be a nonempty set, k a positive-definite real-valued kernel on \mathcal{X} \times \mathcal{X} with corresponding reproducing kernel Hilbert space H_k, and let R \colon H_k \to \R be a differentiable regularization function. Then given a training sample (x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \R and an arbitrary error function E \colon (\mathcal{X} \times \R^2)^n \to \R \cup \lbrace \infty \rbrace, a minimizer
: f^* = \underset{f \in H_k}{\operatorname{argmin}} \left\lbrace E\left( (x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n)) \right) + R(f) \right\rbrace \quad (\ddagger)
of the regularized empirical risk admits a representation of the form
: f^*(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i),
where \alpha_i \in \R for all 1 \le i \le n, if and only if there exists a nondecreasing function h \colon [0, \infty) \to \R for which
: R(f) = h(\lVert f \rVert).

Effectively, this result provides a necessary and sufficient condition on a differentiable regularizer R(\cdot) under which the corresponding regularized empirical risk minimization (\ddagger) will have a representer theorem. In particular, this shows that a broad class of regularized risk minimizations (much broader than those originally considered by Kimeldorf and Wahba) have representer theorems.


Applications

Representer theorems are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem (\ddagger). In most interesting applications, the search domain H_k for the minimization will be an infinite-dimensional subspace of L^2(\mathcal{X}), and therefore the search (as written) does not admit implementation on finite-memory and finite-precision computers. In contrast, the representation of f^*(\cdot) afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search for the optimal n-dimensional vector of coefficients \alpha = (\alpha_1, \ldots, \alpha_n) \in \R^n; \alpha can then be obtained by applying any standard function minimization algorithm. Consequently, representer theorems provide the theoretical basis for reducing the general machine learning problem to algorithms that can actually be implemented on computers in practice.

The following example shows how to solve for the minimizer whose existence is guaranteed by the representer theorem. This method works for any positive-definite kernel k, and transforms a complicated (possibly infinite-dimensional) optimization problem into a simple linear system that can be solved numerically; a short numerical sketch follows the derivation. Assume that we are using a least squares error function
E[(x_1, y_1, f(x_1)), \dots, (x_n, y_n, f(x_n))] := \sum_{i=1}^n (y_i - f(x_i))^2
and a regularization function g(x) = \lambda x^2 for some \lambda > 0. By the representer theorem, the minimizer
f^* = \underset{f \in H_k}{\operatorname{argmin}} \Big\lbrace E[(x_1, y_1, f(x_1)), \dots, (x_n, y_n, f(x_n))] + g(\lVert f \rVert_{H_k}) \Big\rbrace = \underset{f \in H_k}{\operatorname{argmin}} \left\lbrace \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \lVert f \rVert_{H_k}^2 \right\rbrace
has the form
f^*(x) = \sum_{i=1}^n \alpha_i^* k(x, x_i)
for some \alpha^* = (\alpha_1^*, \dots, \alpha_n^*) \in \R^n. Noting that
\lVert f \rVert_{H_k}^2 = \Big\langle \sum_{i=1}^n \alpha_i^* k(\cdot, x_i), \sum_{j=1}^n \alpha_j^* k(\cdot, x_j) \Big\rangle_{H_k} = \sum_{i=1}^n \sum_{j=1}^n \alpha_i^* \alpha_j^* \big\langle k(\cdot, x_i), k(\cdot, x_j) \big\rangle_{H_k} = \sum_{i=1}^n \sum_{j=1}^n \alpha_i^* \alpha_j^* k(x_i, x_j),
we see that \alpha^* has the form
\alpha^* = \underset{\alpha \in \R^n}{\operatorname{argmin}} \left\lbrace \sum_{i=1}^n \Big( y_i - \sum_{j=1}^n \alpha_j k(x_i, x_j) \Big)^2 + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \right\rbrace = \underset{\alpha \in \R^n}{\operatorname{argmin}} \left\lbrace \lVert y - A\alpha \rVert^2 + \lambda\, \alpha^\intercal A \alpha \right\rbrace,
where A_{ij} = k(x_i, x_j) and y = (y_1, \dots, y_n). Expanding the squared norm and dropping the constant term y^\intercal y, this simplifies to
\alpha^* = \underset{\alpha \in \R^n}{\operatorname{argmin}} \left\lbrace \alpha^\intercal (A^\intercal A + \lambda A)\alpha - 2\alpha^\intercal A y \right\rbrace.
Since A^\intercal A + \lambda A is positive definite, there is indeed a single global minimum of this expression. Let F(\alpha) = \alpha^\intercal (A^\intercal A + \lambda A)\alpha - 2\alpha^\intercal A y and note that F is convex. Then \alpha^*, the global minimum, can be found by setting \nabla_\alpha F = 0. Using \nabla_\alpha (\alpha^\intercal M \alpha) = 2 M \alpha for symmetric M and \nabla_\alpha (\alpha^\intercal b) = b, and recalling that all positive-definite matrices are invertible, we see that
\nabla_\alpha F = 2(A^\intercal A + \lambda A)\alpha^* - 2Ay = 0 \Longrightarrow \alpha^* = (A^\intercal A + \lambda A)^{-1} A y,
so the minimizer may be found via a linear solve.
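The derivation above translates directly into a few lines of numerical code. The following is a minimal sketch rather than a reference implementation: the Gaussian RBF kernel, the synthetic one-dimensional data, and the parameter values are all illustrative assumptions. The sketch builds the Gram matrix A and solves the linear system (A^\intercal A + \lambda A)\alpha^* = A y derived above.

```python
import numpy as np

def rbf_gram(X, Z, gamma=1.0):
    # Gram matrix of the Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2).
    # The RBF kernel is an assumed, illustrative choice; any positive-definite
    # kernel could be substituted.
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-gamma * sq_dists)

def fit_alpha(X, y, lam=0.1, gamma=1.0):
    # Solve (A^T A + lam * A) alpha = A y for the coefficient vector alpha.
    # When A is invertible this is equivalent to the usual kernel ridge
    # regression system (A + lam * I) alpha = y.
    A = rbf_gram(X, X, gamma)
    return np.linalg.solve(A.T @ A + lam * A, A @ y)

def predict(X_new, X_train, alpha, gamma=1.0):
    # Evaluate f*(x) = sum_i alpha_i * k(x, x_i) at the new points.
    return rbf_gram(X_new, X_train, gamma) @ alpha

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3.0, 3.0, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(50)

alpha = fit_alpha(X_train, y_train, lam=0.1, gamma=0.5)
X_test = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
print(predict(X_test, X_train, alpha, gamma=0.5))
```

In practice one would usually solve the equivalent, better-conditioned system (A + \lambda I)\alpha = y (for example via a Cholesky factorization) rather than forming A^\intercal A explicitly; the sketch keeps to the system as derived here for clarity.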


See also

* Mercer's theorem
* Kernel methods


References

* Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science, vol. 2111. pp. 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.