Optimal binary search tree

In computer science, an optimal binary search tree (optimal BST), sometimes called a weight-balanced binary tree, is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.

In the static optimality problem, the tree cannot be modified after it has been constructed. In this case, there exists some particular layout of the nodes of the tree which provides the smallest expected search time for the given access probabilities, and various algorithms exist to construct or approximate this statically optimal tree given the access probabilities of the elements.

In the dynamic optimality problem, the tree can be modified at any time, typically by permitting tree rotations. The tree is considered to have a cursor starting at the root which it can move or use to perform modifications. In this case, there exists some minimal-cost sequence of these operations which causes the cursor to visit every node in the target access sequence in order. The splay tree is conjectured to have a constant competitive ratio compared to the dynamically optimal tree in all cases, though this has not yet been proven.


Static optimality


Definition

In the static optimality problem as defined by Knuth, we are given a set of n ordered elements and a set of 2n+1 probabilities. We will denote the elements a_1 through a_n and the probabilities A_1 through A_n and B_0 through B_n. A_i is the probability of a search being done for element a_i (a ''successful search''). For 1 \le i < n, B_i is the probability of a search being done for an element between a_i and a_{i+1} (an ''unsuccessful search''), B_0 is the probability of a search for an element strictly less than a_1, and B_n is the probability of a search for an element strictly greater than a_n. These 2n+1 probabilities cover all possible searches, and therefore add up to one. The static optimality problem is the optimization problem of finding the binary search tree that minimizes the expected search time, given the 2n+1 probabilities. As the number of possible trees on a set of n elements is the Catalan number \frac{1}{n+1}\binom{2n}{n}, which is exponential in n, brute-force search is not usually a feasible solution.


Knuth's dynamic programming algorithm

In 1971, Knuth published a relatively straightforward dynamic programming algorithm capable of constructing the statically optimal tree in only ''O''(''n''2) time. In this work, Knuth extended and improved a dynamic programming algorithm introduced by Edgar Gilbert and Edward F. Moore in 1958. Gilbert and Moore's algorithm required O(n^3) time and O(n^2) space and was designed for a particular case of optimal binary search tree construction (known as the ''optimal alphabetic tree problem'') that considers only the probabilities of unsuccessful searches, that is, \sum_{i=1}^{n} A_i = 0.

Knuth's work relied upon the following insight: the static optimality problem exhibits optimal substructure; that is, if a certain tree is statically optimal for a given probability distribution, then its left and right subtrees must also be statically optimal for their appropriate subsets of the distribution (known as the monotonicity property of the roots).

To see this, consider what Knuth calls the "weighted path length" of a tree. The weighted path length of a tree of n elements is the sum of the lengths of all 2n+1 possible search paths, weighted by their respective probabilities. The tree with the minimal weighted path length is, by definition, statically optimal. But weighted path lengths have an interesting property. Let E be the weighted path length of a binary tree, E_L be the weighted path length of its left subtree, and E_R be the weighted path length of its right subtree. Also let W be the sum of all the probabilities in the tree. Observe that when either subtree is attached to the root, the depth of each of its elements (and thus each of its search paths) is increased by one. Also observe that the root itself has a depth of one. This means that the difference in weighted path length between a tree and its two subtrees is exactly the sum of every single probability in the tree, leading to the following recurrence:

:E = E_L + E_R + W

This recurrence leads to a natural dynamic programming solution. Let E_{i,j} be the weighted path length of the statically optimal search tree for all values between a_i and a_j, let W_{i,j} be the total weight of that tree, and let R_{i,j} be the index of its root. The algorithm can be built using the following formulas:

:\begin{align} E_{i,i-1} = W_{i,i-1} &= B_{i-1} \quad \text{for } 1 \leq i \leq n+1 \\ W_{i,j} &= W_{i,j-1} + A_j + B_j \\ E_{i,j} &= \min_{i \leq r \leq j}(E_{i,r-1} + E_{r+1,j} + W_{i,j}) \quad \text{for } 1 \leq i \leq j \leq n \end{align}

The naive implementation of this algorithm actually takes ''O''(''n''3) time, but Knuth's paper includes some additional observations which can be used to produce a modified algorithm taking only ''O''(''n''2) time.
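The recurrences above translate directly into code. The following sketch (array layout and names are mine) includes the speed-up from Knuth's additional observations, restricting the candidate roots for E_{i,j} to the range between the optimal roots of the two neighbouring subproblems:

```python
# A sketch of Knuth's dynamic programming algorithm, assuming success
# probabilities A[1..n] (A[0] unused, so indices match the text) and
# failure probabilities B[0..n].
def optimal_bst(A, B):
    n = len(A) - 1
    # E[i][j]: weighted path length of the optimal tree over a_i..a_j
    # W[i][j]: total probability weight of that tree
    # R[i][j]: index of its root (drives the monotonicity speed-up)
    E = [[0.0] * (n + 2) for _ in range(n + 2)]
    W = [[0.0] * (n + 2) for _ in range(n + 2)]
    R = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 2):            # empty trees: E_{i,i-1} = B_{i-1}
        E[i][i - 1] = W[i][i - 1] = B[i - 1]
    for length in range(1, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            W[i][j] = W[i][j - 1] + A[j] + B[j]
            # Knuth's observation: an optimal root for (i, j) lies
            # between the optimal roots for (i, j-1) and (i+1, j),
            # which brings the total running time down to O(n^2).
            lo = R[i][j - 1] if j > i else i
            hi = R[i + 1][j] if j > i else j
            best, best_r = float('inf'), lo
            for r in range(lo, hi + 1):
                cost = E[i][r - 1] + E[r + 1][j]
                if cost < best:
                    best, best_r = cost, r
            E[i][j] = best + W[i][j]
            R[i][j] = best_r
    return E, R
```

For example, with the probabilities A = (0.15, 0.10, 0.05, 0.10, 0.20) and B = (0.05, 0.10, 0.05, 0.05, 0.05, 0.10), the optimal tree has weighted path length E_{1,5} = 2.75 with a_2 at its root.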
In addition to its dynamic programming algorithm, Knuth proposed two heuristics (or rules) to produce ''nearly optimal'' binary search trees. Studying nearly optimal binary search trees was necessary since Knuth's algorithm's time and space complexity can be prohibitive when n is substantially large. Knuth's rules are the following:

* Rule I (Root-max): Place the most frequently occurring name at the root of the tree, then proceed similarly on the subtrees.
* Rule II (Bisection): Choose the root so as to equalize the total weight of the left and right subtrees as much as possible, then proceed similarly on the subtrees.

Knuth's heuristics build nearly optimal binary search trees in O(n \log n) time and O(n) space. The analysis of how far from the optimum Knuth's heuristics can be was later carried out by Kurt Mehlhorn.


Mehlhorn's approximation algorithm

While the ''O''(''n''2) time taken by Knuth's algorithm is substantially better than the exponential time required for a brute-force search, it is still too slow to be practical when the number of elements in the tree is very large. In 1975, Kurt Mehlhorn published a paper proving important properties regarding Knuth's rules. Mehlhorn's major result states that only one of Knuth's heuristics (Rule II) always produces nearly optimal binary search trees. The root-max rule, on the other hand, can often lead to very "bad" search trees, based on the following simple argument. Let

:n = 2^k - 1, \quad A_i = 2^{-k} + \varepsilon_i \quad \text{with} \quad \sum_{i=1}^{n} \varepsilon_i = 2^{-k}

and

:\varepsilon_i > 0 \quad \text{for } 1 \leq i \leq n, \qquad B_j = 0 \quad \text{for } 0 \leq j \leq n.

Considering the weighted path length P of the tree constructed based on the previous definition, we have the following:

:\begin{align} P &= \sum_{i=1}^{n} A_i(a_i + 1) + \sum_{j=0}^{n} B_j b_j \\ &= \sum_{i=1}^{n} A_i \, i \\ &\geq 2^{-k} \sum_{i=1}^{n} i = 2^{-k} \frac{n(n+1)}{2} \geq \frac{n}{2}. \end{align}

Thus, the tree resulting from the root-max rule will be a tree that grows only on the right side (except for the deepest level of the tree), and the left side will always have terminal nodes. This tree has a path length bounded by \Omega(\frac{n}{2}) and, when compared with a balanced search tree (with path bounded by O(2 \log n)), will perform substantially worse for the same frequency distribution.

In addition, Mehlhorn improved on Knuth's work and introduced a much simpler algorithm that uses Rule II and closely approximates the performance of the statically optimal tree in only O(n) time. The algorithm follows the same idea as the bisection rule by choosing the tree's root to balance the total weight (by probability) of the left and right subtrees most closely, and the strategy is then applied recursively on each subtree.
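The recursive bisection strategy can be sketched as follows, assuming for simplicity that all failure probabilities B_j are zero so only the success probabilities matter (function and variable names are mine, not from the sources):

```python
# A sketch of the bisection rule (Rule II): pick the root that splits
# the total probability weight as evenly as possible, then recurse.
def bisection_tree(A, keys):
    # prefix[i] = A[0] + ... + A[i-1]; subtree weights are prefix differences
    prefix = [0.0]
    for p in A:
        prefix.append(prefix[-1] + p)

    def build(i, j):
        # build a subtree over keys[i..j-1]; None for an empty range
        if i >= j:
            return None
        # choose the root index r minimising the weight imbalance between
        # the left and right subtrees (a linear scan for clarity; the
        # split point can be found faster to reach the stated O(n) bound)
        r = min(range(i, j),
                key=lambda r: abs((prefix[r] - prefix[i])
                                  - (prefix[j] - prefix[r + 1])))
        return (keys[r], build(i, r), build(r + 1, j))

    return build(0, len(keys))
```

For example, with weights (0.1, 0.2, 0.4, 0.2, 0.1) over keys 1–5, the heaviest key 3 ends up at the root because it splits the remaining weight 0.3/0.3.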
That this strategy produces a good approximation can be seen intuitively by noting that the weights of the subtrees along any path form something very close to a geometrically decreasing sequence. In fact, this strategy generates a tree whose weighted path length is at most

:2 + (1 - \log(\sqrt{5} - 1))^{-1} H = 2 + \frac{1}{1 - \log(\sqrt{5} - 1)} H

where H is the entropy of the probability distribution. Since no optimal binary search tree can ever do better than a weighted path length of

:(1/\log 3)H = \frac{1}{\log 3} H

this approximation is very close.
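The two bounds above are easy to evaluate numerically; the distribution below is an arbitrary example of mine, not one taken from the text:

```python
import math

# Numeric illustration of the entropy bounds (logs are base 2, as usual
# for entropy in bits).
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
H = -sum(p * math.log2(p) for p in probs)          # entropy of the distribution
upper = 2 + H / (1 - math.log2(math.sqrt(5) - 1))  # Mehlhorn's upper bound
lower = H / math.log2(3)                           # lower bound for any BST
print(lower, H, upper)
```

Here H is about 2.12 bits, so any BST for this distribution has weighted path length at least about 1.34, while the bisection tree is guaranteed a weighted path length of at most about 5.06.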


Hu–Tucker and Garsia–Wachs algorithms

In the special case that all of the A_i values are zero, the optimal tree can be found in time O(n \log n). This was first proved by T. C. Hu and Alan Tucker in a paper that they published in 1971. A later simplification by Garsia and Wachs, the Garsia–Wachs algorithm, performs the same comparisons in the same order. The algorithm works by using a greedy algorithm to build a tree that has the optimal height for each leaf, but is out of order, and then constructing another binary search tree with the same heights.


Dynamic optimality


Definition

There are several different definitions of dynamic optimality, all of which are effectively equivalent to within a constant factor in terms of running time. The problem was first introduced implicitly by Sleator and Tarjan in their paper on splay trees, but Demaine et al. give a very good formal statement of it.

In the dynamic optimality problem, we are given a sequence of accesses x_1, \dots, x_m on the keys 1, \dots, n. For each access, we are given a pointer to the root of our BST and may use the pointer to perform any of the following operations:

# Move the pointer to the left child of the current node.
# Move the pointer to the right child of the current node.
# Move the pointer to the parent of the current node.
# Perform a single rotation on the current node and its parent.

(It is the presence of the fourth operation, which rearranges the tree during the accesses, that makes this the ''dynamic'' optimality problem.) For each access, our BST algorithm may perform any sequence of the above operations as long as the pointer eventually ends up on the node containing the target value x_i. The time it takes a given dynamic BST algorithm to perform a sequence of accesses is equivalent to the total number of such operations performed during that sequence.

Given any sequence of accesses on any set of elements, there is some minimum total number of operations required to perform those accesses; we would like to come close to this minimum. While it is impossible to implement this "God's algorithm" without foreknowledge of exactly what the access sequence will be, we can define OPT(X) as the number of operations it would perform for an access sequence X, and we can say that an algorithm is dynamically optimal if, for any X, it performs X in time O(OPT(X)) (that is, it has a constant competitive ratio). There are several data structures conjectured to have this property, but none proven. It is an open problem whether there exists a dynamically optimal data structure in this model.
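The cost model above can be sketched for the degenerate case of an algorithm that never rotates, i.e. a static tree served from the root on every access (the tuple representation and names are mine):

```python
# Cost model sketch: nodes are (key, left, right) tuples, the pointer
# starts at the root before each access, and each move to a child
# counts as one operation. This static algorithm never uses operation 4
# (rotation), so its total cost is an upper bound on OPT(X) only up to
# the tree's height.
def access_cost(tree, x):
    cost, node = 0, tree
    while node is not None and node[0] != x:
        node = node[1] if x < node[0] else node[2]
        cost += 1
    if node is None:
        raise KeyError(x)  # the model assumes every access hits a key
    return cost

def sequence_cost(tree, accesses):
    # total number of operations for the whole access sequence x_1, ..., x_m
    return sum(access_cost(tree, x) for x in accesses)
```

On the balanced tree over keys 1–3, accessing the root costs 0 moves and each leaf costs 1, so the sequence 1, 2, 3 costs 2 operations in this model.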


Splay trees

The splay tree is a form of binary search tree invented in 1985 by Daniel Sleator and Robert Tarjan on which the standard search tree operations run in O(\log n) amortized time. It is conjectured to be dynamically optimal in the required sense. That is, a splay tree is believed to perform any sufficiently long access sequence X in time O(OPT(X)).
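The splay restructuring itself can be sketched as follows. This is a compact recursive variant rather than Sleator and Tarjan's original top-down code, and the class and helper names are mine:

```python
# A sketch of the splay operation: bring the node with key x (or the
# last node on its search path) to the root, restructuring along the
# way with zig, zig-zig, and zig-zag rotation steps.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rot_right(t):          # single right rotation at t
    l = t.left
    t.left, l.right = l.right, t
    return l

def rot_left(t):           # single left rotation at t
    r = t.right
    t.right, r.left = r.left, t
    return r

def splay(t, x):
    if t is None or t.key == x:
        return t
    if x < t.key:
        if t.left is None:
            return t
        if x < t.left.key:                     # zig-zig (left-left)
            t.left.left = splay(t.left.left, x)
            t = rot_right(t)
        elif x > t.left.key:                   # zig-zag (left-right)
            t.left.right = splay(t.left.right, x)
            if t.left.right is not None:
                t.left = rot_left(t.left)
        return rot_right(t) if t.left is not None else t
    else:
        if t.right is None:
            return t
        if x > t.right.key:                    # zig-zig (right-right)
            t.right.right = splay(t.right.right, x)
            t = rot_left(t)
        elif x < t.right.key:                  # zig-zag (right-left)
            t.right.left = splay(t.right.left, x)
            if t.right.left is not None:
                t.right = rot_right(t.right)
        return rot_left(t) if t.right is not None else t
```

Splaying the deepest key of a left spine 4→3→2→1 illustrates the key property of zig-zig steps: the accessed node becomes the root and the depth of every node on its former search path is roughly halved.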


Tango trees

The tango tree is a data structure proposed in 2004 by Erik Demaine and others which has been proven to perform any sufficiently long access sequence X in time O(\log\log n \cdot \operatorname{OPT}(X)). While this is not dynamically optimal, the competitive ratio of \log\log n is still very small for reasonable values of n.


Other results

In 2013, John Iacono published a paper which uses the geometry of binary search trees to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal. Nodes are interpreted as points in two dimensions, and the optimal access sequence is the smallest arborally satisfied superset of those points. Unlike splay trees and tango trees, Iacono's data structure is not known to be implementable in constant time per access sequence step, so even if it is dynamically optimal, it could still be slower than other search tree data structures by a non-constant factor.

The interleave lower bound is an asymptotic lower bound on dynamic optimality.


See also

* Trees
* Splay tree
* Tango tree
* Geometry of binary search trees
* Interleave lower bound

