In
machine learning
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
and
data mining
Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...
, a string kernel is a
kernel function that operates on
strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the
similarity of pairs of strings: the more similar two strings ''a'' and ''b'' are, the higher the value of a string kernel ''K''(''a'', ''b'') will be.
Using string kernels with
kernelized learning algorithms such as
support vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborato ...
s allow such algorithms to work with strings, without having to translate these to fixed-length, real-valued
feature vector
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a data set. Choosing informative, discriminating, and independent features is crucial to produce effective algorithms for pattern re ...
s.
String kernels are used in domains where sequence data are to be
clustered or
classified
Classified may refer to:
General
*Classified information, material that a government body deems to be sensitive
*Classified advertising or "classifieds"
Music
*Classified (rapper) (born 1977), Canadian rapper
* The Classified, a 1980s American ro ...
, e.g. in
text mining
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...
and
gene analysis.
Informal introduction
Suppose one wants to compare some text passages automatically and indicate their relative similarity.
For many applications, it might be sufficient to find some keywords which match exactly.
One example where exact matching is not always enough is found in
spam detection.
Another would be in computational gene analysis, where
homologous genes
In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA. There are two types of molecular genes: protei ...
have
mutated
In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA replication, DNA or viral rep ...
, resulting in common subsequences along with deleted, inserted or replaced symbols.
Motivation
Since several well-proven data clustering, classification and information retrieval
methods (for example support vector machines) are designed to work on vectors
(i.e. data are elements of a vector space), using a string kernel allows the extension of these methods to handle sequence data.
The string kernel method is to be contrasted with earlier approaches for text classification where feature vectors only indicated
the presence or absence of a word.
Not only does it improve on these approaches, but it is an example for a whole class of kernels adapted to data structures, which
began to appear at the turn of the 21st century. A survey of such methods has been compiled by Gärtner.
In bioinformatics string kernels are used especially to transform biological sequences such as proteins or DNA into vectors for further use in machine learning models. An example of a string kernel used for that purpose is the profile kernel.
Definition
A
kernel on a domain
is a function
satisfying some conditions (being
symmetric
Symmetry () in everyday life refers to a sense of harmonious and beautiful proportion and balance. In mathematics, the term has a more precise definition and is usually used to refer to an object that is invariant under some transformations ...
in the arguments,
continuous
Continuity or continuous may refer to:
Mathematics
* Continuity (mathematics), the opposing concept to discreteness; common examples include
** Continuous probability distribution or random variable in probability and statistics
** Continuous ...
and
positive semidefinite in a certain sense).
Mercer's theorem
In mathematics, specifically functional analysis, Mercer's theorem is a representation of a symmetric positive-definite function on a square as a sum of a convergent sequence of product functions. This theorem, presented in , is one of the most ...
asserts that
can then be expressed as
with
mapping the arguments into an
inner product space
In mathematics, an inner product space (or, rarely, a Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, ofte ...
.
We can now reproduce the definition of a string subsequence kernel
on strings over an
alphabet
An alphabet is a standard set of letter (alphabet), letters written to represent particular sounds in a spoken language. Specifically, letters largely correspond to phonemes as the smallest sound segments that can distinguish one word from a ...
. Coordinate-wise, the mapping is defined as follows:
:
The
are
multiindices and
is a string of length
:
subsequences can occur in a non-contiguous manner, but gaps are penalized.
The multiindex
gives the positions of the characters matching
in
.
is the difference between the first and last entry in
, that is: how far apart in
the subsequence matching
is.
The parameter
may be set to any value between
(gaps are not allowed, as only
is not
but
) and
(even widely-spread "occurrences" are weighted the same as appearances as a contiguous substring, as
).
For several relevant algorithms, data enters into the algorithm only in expressions involving an inner product of feature vectors,
hence the name
kernel methods
In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pa ...
. A desirable consequence of this is that one does not need to explicitly calculate the transformation
, only the inner product via the kernel, which may be a lot quicker, especially when
approximated.
[
]
References
{{Reflist
Algorithms on strings
Kernel methods for machine learning
Natural language processing
String metrics