In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a ball tree, balltree
or metric tree, is a
space partitioning
In geometry, space partitioning is the process of dividing a space (usually a Euclidean space) into two or more disjoint subsets (see also partition of a set). In other words, space partitioning divides a space into non-overlapping regions. Any ...
data structure
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
for organizing points in a multi-dimensional space. A ball tree partitions data points into a nested set of
balls. The resulting data structure has characteristics that make it useful for a number of applications, most notably
nearest neighbor search
Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
.
Informal description
A ball tree is a
binary tree
In computer science, a binary tree is a k-ary k = 2 tree data structure in which each node has at most two children, which are referred to as the ' and the '. A recursive definition using just set theory notions is that a (non-empty) binary t ...
in which every node defines a D-dimensional ball containing a subset of the points to be searched. Each internal node of the tree partitions the data points into two
disjoint sets
In mathematics, two sets are said to be disjoint sets if they have no element in common. Equivalently, two disjoint sets are sets whose intersection is the empty set.. For example, and are ''disjoint sets,'' while and are not disjoint. A ...
which are associated with different balls. While the balls themselves may intersect, each point is assigned to one or the other ball in the partition according to its distance from the ball's center. Each leaf node in the tree defines a ball and enumerates all data points inside that ball.
Each node in the tree defines the smallest ball that contains all data points in its subtree. This gives rise to the useful property that, for a given test point outside the ball, the distance to any point in a ball in the tree is greater than or equal to the distance from to the surface of the ball. Formally:
:
Where
is the minimum possible distance from any point in the ball to some point .
Ball-trees are related to the
M-tree
In computer science, M-trees are tree data structures that are similar to R-trees and B-trees. It is constructed using a metric and relies on the triangle inequality for efficient range and k-nearest neighbor (k-NN) queries.
While M-trees can perf ...
, but only support binary splits, whereas in the M-tree each level splits
to
fold, thus leading to a shallower tree structure, therefore need fewer distance computations, which usually yields faster queries. Furthermore, M-trees can better be stored
on disk, which is organized in
pages
Page most commonly refers to:
* Page (paper), one side of a leaf of paper, as in a book
Page, PAGE, pages, or paging may also refer to:
Roles
* Page (assistance occupation), a professional occupation
* Page (servant), traditionally a young mal ...
. The M-tree also keeps the distances from the parent node precomputed to speed up queries.
Vantage-point tree A vantage-point tree (or VP tree) is a metric tree that segregates data in a metric space by choosing a position in the space (the "vantage point") and partitioning the data points into two parts: those points that are nearer to the vantage point th ...
s are also similar, but they binary split into one ball, and the remaining data, instead of using two balls.
Construction
A number of ball tree construction algorithms are available.
[Omohundro, Stephen M. (1989) tp://ftp.icsi.berkeley.edu/pub/techreports/1989/tr-89-063.pdf "Five Balltree Construction Algorithms"/ref> The goal of such an algorithm is to produce a tree that will efficiently support queries of the desired type (e.g. nearest-neighbor) efficiently in the average case. The specific criteria of an ideal tree will depend on the type of question being answered and the distribution of the underlying data. However, a generally applicable measure of an efficient tree is one that minimizes the total volume of its internal nodes. Given the varied distributions of real-world data sets, this is a difficult task, but there are several heuristics that partition the data well in practice. In general, there is a tradeoff between the cost of constructing a tree and the efficiency achieved by this metric.
]
This section briefly describes the simplest of these algorithms. A more in-depth discussion of five algorithms was given by Stephen Omohundro.[
]
''k''-d construction algorithm
The simplest such procedure is termed the "''k''-d Construction Algorithm", by analogy with the process used to construct ''k''-d trees. This is an offline algorithm
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
In contrast, an o ...
, that is, an algorithm that operates on the entire data set at once. The tree is built top-down by recursively splitting the data points into two sets. Splits are chosen along the single dimension with the greatest spread of points, with the sets partitioned by the median value of all points along that dimension. Finding the split for each internal node requires linear time in the number of samples contained in that node, yielding an algorithm with time complexity
In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
, where ''n'' is the number of data points.
Pseudocode
function construct_balltree is
input: ''D'', an array of data points.
output: ''B'', the root of a constructed ball tree.
if a single point remains then
create a leaf ''B'' containing the single point in ''D''
return ''B''
else
let ''c'' be the dimension of greatest spread
let ''p'' be the central point selected considering c
let ''L'', ''R'' be the sets of points lying to the left and right of the median along dimension ''c''
create ''B'' with two children:
''B''.pivot := ''p''
''B''.child1 := construct_balltree(L),
''B''.child2 := construct_balltree(R),
let ''B''.radius be maximum distance from ''p'' among children
return ''B''
end if
end function
Nearest-neighbor search
An important application of ball trees is expediting nearest neighbor search
Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function ...
queries, in which the objective is to find the k points in the tree that are closest to a given test point by some distance metric (e.g. Euclidean distance
In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.
It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefor ...
). A simple search algorithm, sometimes called KNS1, exploits the distance property of the ball tree. In particular, if the algorithm is searching the data structure with a test point ''t'', and has already seen some point ''p'' that is closest to ''t'' among the points encountered so far, then any subtree whose ball is further from ''t'' than ''p'' can be ignored for the rest of the search.
Description
The ball tree nearest-neighbor algorithm examines nodes in depth-first order, starting at the root. During the search, the algorithm
maintains a max-first priority queue
In computer science, a priority queue is an abstract data-type similar to a regular queue or stack data structure in which each element additionally has a ''priority'' associated with it. In a priority queue, an element with high priority is se ...
(often implemented with a heap
Heap or HEAP may refer to:
Computing and mathematics
* Heap (data structure), a data structure commonly used to implement a priority queue
* Heap (mathematics), a generalization of a group
* Heap (programming) (or free store), an area of memory f ...
), denoted ''Q'' here, of the k nearest points encountered so far. At each node ''B'', it may perform one of three operations, before finally returning an updated version of the priority queue:
# If the distance from the test point ''t'' to the current node ''B'' is greater than the furthest point in ''Q'', ignore ''B'' and return ''Q''.
# If ''B'' is a leaf node, scan through every point enumerated in ''B'' and update the nearest-neighbor queue appropriately. Return the updated queue.
# If ''B'' is an internal node, call the algorithm recursively on ''Bs two children, searching the child whose center is closer to ''t'' first. Return the queue after each of these calls has updated it in turn.
Performing the recursive search in the order described in point 3 above increases likelihood that the further child will be pruned
entirely during the search.
Pseudocode
function knn_search is
input:
t, the target point for the query
k, the number of nearest neighbors of t to search for
Q, max-first priority queue containing at most k points
B, a node, or ball, in the tree
output:
Q, containing the k nearest neighbors from within B
if distance(t, B.pivot) - B.radius ≥ distance(t, Q.first) then
return Q unchanged
else if B is a leaf node then
for each point p in B do
if distance(t, p) < distance(t, Q.first) then
add p to Q
if size(Q) > k then
remove the furthest neighbor from Q
end if
end if
repeat
else
let child1 be the child node closest to t
let child2 be the child node furthest from t
knn_search(t, k, Q, child1)
knn_search(t, k, Q, child2)
end if
return Q
end function
Performance
In comparison with several other data structures, ball trees have been shown to perform fairly well on
the nearest-neighbor search problem, particularly as their number of dimensions grows.
However, the best nearest-neighbor data structure for a given application will depend on the dimensionality, number of data points, and underlying structure of the data.
References
{{reflist
Trees (data structures)
Machine learning
Articles with example pseudocode