
In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing the learning of non-linear models. Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also at combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), the features correspond to logical conjunctions of input features.[Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.]


Definition

For degree-''d'' polynomials, the polynomial kernel is defined as

:K(x, y) = (x^\mathsf{T} y + c)^d

where ''x'' and ''y'' are vectors in the ''input space'', i.e. vectors of features computed from training or test samples, and ''c'' ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When ''c'' = 0, the kernel is called homogeneous. (A further generalized polykernel divides x^\mathsf{T} y by a user-specified scalar parameter ''a''.)

As a kernel, ''K'' corresponds to an inner product in a feature space based on some mapping ''φ'':

:K(x, y) = \langle \varphi(x), \varphi(y) \rangle

The nature of ''φ'' can be seen from an example. Let ''d'' = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice; the outermost application is the binomial theorem) and regrouping,

:K(x, y) = \left(\sum_{i=1}^n x_i y_i + c\right)^2 = \sum_{i=1}^n \left(x_i^2\right)\left(y_i^2\right) + \sum_{i=2}^n \sum_{j=1}^{i-1} \left(\sqrt{2}\, x_i x_j\right)\left(\sqrt{2}\, y_i y_j\right) + \sum_{i=1}^n \left(\sqrt{2c}\, x_i\right)\left(\sqrt{2c}\, y_i\right) + c^2

From this it follows that the feature map is given by:

:\varphi(x) = \langle x_n^2, \ldots, x_1^2, \sqrt{2}\, x_n x_{n-1}, \ldots, \sqrt{2}\, x_n x_1, \sqrt{2}\, x_{n-1} x_{n-2}, \ldots, \sqrt{2}\, x_{n-1} x_1, \ldots, \sqrt{2}\, x_2 x_1, \sqrt{2c}\, x_n, \ldots, \sqrt{2c}\, x_1, c \rangle

Generalizing for \left(\mathbf{x}^\mathsf{T}\mathbf{y} + c\right)^d, where \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^n, and applying the multinomial theorem:

:\begin{align} \left(\mathbf{x}^\mathsf{T}\mathbf{y} + c\right)^d & = \sum_{j_1 + j_2 + \dots + j_{n+1} = d} \sqrt{\frac{d!}{j_1! \cdots j_{n+1}!}}\, x_1^{j_1} \cdots x_n^{j_n} \sqrt{c}^{\,j_{n+1}} \; \sqrt{\frac{d!}{j_1! \cdots j_{n+1}!}}\, y_1^{j_1} \cdots y_n^{j_n} \sqrt{c}^{\,j_{n+1}} \\ & = \varphi(\mathbf{x})^\mathsf{T} \varphi(\mathbf{y}) \end{align}

The last summation has l_d = \tbinom{n+d}{d} elements, so that:

:\varphi(\mathbf{x}) = \left(a_1, \dots, a_l, \dots, a_{l_d}\right)

where each component corresponds to one multi-index with j_1 + j_2 + \dots + j_n + j_{n+1} = d:

:a_l = \sqrt{\frac{d!}{j_1! \cdots j_{n+1}!}}\, x_1^{j_1} \cdots x_n^{j_n} \sqrt{c}^{\,j_{n+1}}


Practical use

Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP). The most common degree is ''d'' = 2 (quadratic), since larger degrees tend to overfit on NLP problems. Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

* full expansion of the kernel prior to training/testing with a linear SVM, i.e. full computation of the mapping ''φ'' as in polynomial regression;
* basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion;
* inverted indexing of support vectors.

One problem with the polynomial kernel is that it may suffer from numerical instability: when x^\mathsf{T} y + c < 1, K(x, y) = (x^\mathsf{T} y + c)^d tends to zero with increasing ''d'', whereas when x^\mathsf{T} y + c > 1, K(x, y) tends to infinity.[Chih-Jen Lin (2012). Machine learning software: design and practical use. Machine Learning Summer School, Kyoto. http://www.csie.ntu.edu.tw/~cjlin/talks/mlss_kyoto.pdf]
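This instability is easy to reproduce numerically; in the sketch below (input values chosen purely for illustration), the same pair of inputs yields kernel values that either vanish or explode as the degree grows:

```python
import numpy as np

def poly_kernel(x, y, degree, c=0.0):
    return (np.dot(x, y) + c) ** degree

small = np.full(10, 0.1)  # x . x = 10 * 0.01 = 0.1 (base < 1)
large = np.full(10, 1.0)  # x . x = 10 * 1.0 = 10.0 (base > 1)

for d in (2, 5, 10):
    # Base < 1: the kernel value shrinks toward 0 as d grows.
    # Base > 1: the kernel value blows up as d grows.
    print(d, poly_kernel(small, small, d), poly_kernel(large, large, d))
# Rescaling the inputs (or choosing c) so that x . y + c stays near 1
# is a common way to keep the values in a numerically safe range.
```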

