A BK-tree is a
metric tree
A metric tree is any tree data structure specialized to index data in metric spaces. Metric trees exploit properties of metric spaces such as the triangle inequality to make accesses to the data more efficient. Examples include the M-tree, vp-t ...
suggested by Walter Austin Burkhard and
Robert M. Keller specifically adapted to discrete
metric space
In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general settin ...
s.
For simplicity, consider integer discrete metric
. Then, BK-tree is defined in the following way. An arbitrary element ''a'' is selected as root node. The root node may have zero or more subtrees. The ''k-th'' subtree is recursively built of all elements ''b'' such that
. BK-trees can be used for
approximate string matching
In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching i ...
in a dictionary.
Example
This picture depicts the BK-tree for the set
of words obtained by using the
Levenshtein distance
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
* each node
is labeled by a string of
;
* each
arc
ARC may refer to:
Business
* Aircraft Radio Corporation, a major avionics manufacturer from the 1920s to the '50s
* Airlines Reporting Corporation, an airline-owned company that provides ticket distribution, reporting, and settlement services
* ...
is labeled by
where
denotes the word assigned to
.
The BK-tree is built so that:
* for all node
of the BK-tree, the weight assigned to its egress arcs are distinct;
* for all arc
labeled by
, each descendant
of
satisfies the following equation:
:
** ''Example 1:'' Consider the arc from "book" to "books". The distance between "book" and any word in is equal to 1;
** ''Example 2:'' Consider the arc from "books" to "boo". The distance between "books" and any word in is equal to 2.
Insertion
The insertion primitive is used to populate a BK-tree
according to a discrete metric
.
Input:
*
: the BK-tree;
**
denotes the weight assigned to an arc
;
**
denotes word assigned to a node
);
*
: the discrete metric used by
(e.g. the
Levenshtein distance
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
);
*
: the element to be inserted into
;
Output:
* The node of
corresponding to
Algorithm:
* If the
is empty:
** Create a root node
in
**
** Return
* Set
to the root of
* While
exists:
**
** If
:
*** Return
** Find
the child of
such that
** If
is not found:
*** Create the node
***
*** Create the arc
***
*** Return
**
Lookup
Given a searched element
, the lookup primitive traverses the BK-tree to find the closest element of
. The key idea is to restrict the exploration of
to nodes that can only improve the best candidate found so far by taking advantage of the BK-tree organization and of the triangle inequality (cut-off criterion).
Input:
*
: the BK-tree;
*
: the corresponding discrete metric (e.g. the
Levenshtein distance
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
);
*
: the searched element;
*
: the maximum distance allowed between the best match and
, defaults to
;
Output:
*
: the closest element to
stored in
and according to
or
if not found;
Algorithm:
* If
is empty:
** Return
* Create
a set of nodes to process, and insert the root of
into
.
*
* While
:
** Pop an arbitrary node
from
**
** If
:
***
** For each egress-arc
:
*** If
: (cut-off criterion)
**** Insert
into
.
* Return
Example of the lookup algorithm
Consider the example 8-node B-K Tree shown above and set
"cool".
is initialized to contain the root of the tree, which is subsequently popped as the first value of
with
="book". Further
since the distance from "book" to "cool" is 2, and
as this is the best (i.e. smallest) distance found thus far. Next each outgoing arc from the root is considered in turn: the arc from "book" to "books" has weight 1, and since
is less than
, the node containing "books" is inserted into
for further processing.
The next arc, from "book" to "cake," has weight 4, and since
is not less than
, the node containing "cake" is not inserted into
. Therefore, the subtree rooted at "cake" will be pruned from the search, as the word closest to "cool" cannot appear in that subtree. To see why this pruning is correct, notice that a candidate word
appearing in "cake"s subtree having distance less than 2 to "cool" would violate the triangle inequality: the triangle inequality requires that for this set of three numbers (as sides of a triangle), no two can sum to less than the third, but here the distance from "cool" to "book" (which is 2) plus the distance from "cool" to
(which is less than 2) cannot reach or exceed the distance from "book" to "cake" (which is 4). Therefore, it is safe to disregard the entire subtree rooted at "cake".
Next the node containing "books" is popped from
and now
, the distance from "cool" to "books." As
,
remains set at 2 and the single outgoing arc from the node containing "books" is considered. Next, the node containing "boo" is popped from
and
, the distance from "cool" to "boo." This again does not improve upon
. Each outgoing arc from "boo" is now considered; the arc from "boo" to "boon" has weight 1, and since
, "boon" is added to
. Similarly, since
, "cook" is also added to
.
Finally each of the two last elements in
are considered in arbitrary order: suppose the node containing "cook" is popped first, improving
to distance 1, then the node containing "boon" is popped last, which has distance 2 from "cool" and therefore does not improve the best result. Finally, "cook" is returned as the answer
with
.
See also
*
Levenshtein distance
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
– the distance metric commonly used when building a BK-tree
*
Damerau–Levenshtein distance In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein.) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Lev ...
– a modified form of Levenshtein distance that allows transpositions
References
*
W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973* R. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed queries trees. In M. Crochemore and D. Gusfield, editors, 5th Combinatorial Pattern Matching, LNCS 807, pages 198–212, Asilomar, CA, June 1994.
*
Ricardo Baeza-Yates and Gonzalo Navarro. Fast Approximate String Matching in a Dictionary. Proc. SPIRE'98
External links
* A BK-tree implementation i
Common Lispwith test results and performance graphs.
* An explanation of BK-Trees and their relationship to metric space
* An explanation of BK-Trees with an implementation in C
* A BK-tree implementation in
Lua (programming language), Luabr>
* A BK-tree implementation in
Python (programming language), Pythonbr>
{{CS-Trees
Trees (data structures)