The Garsia–Wachs algorithm is an efficient method for computers to construct

optimal binary search tree In computer science, an optimal binary search tree (Optimal BST), sometimes called a weight-balanced binary tree, is a binary search tree which provides the smallest possible search time (or Expected value, expected search time) for a given sequen ...

s and alphabetic Huffman codes, in

linearithmic In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations p ...

time. It is named after

Adriano Garsia Adriano Mario Garsia (20 August 1928 – 6 October 2024) was a Tunisian-born Italian American mathematician who worked in analysis, combinatorics, representation theory, and algebraic geometry. He was a student of Charles Loewner and published wo ...

and Michelle L. Wachs.

Problem description

The input to the problem, for an integer

n

, consists of a sequence of

n+1

non-negative weights

w_0,w_1,\dots, w_n

. The output is a rooted

binary tree In computer science, a binary tree is a tree data structure in which each node has at most two children, referred to as the ''left child'' and the ''right child''. That is, it is a ''k''-ary tree with . A recursive definition using set theor ...

with

n

internal nodes, each having exactly two children. Such a tree has exactly

n+1

leaf nodes, which can be identified (in the order given by the binary tree) with the

n+1

input weights. The goal of the problem is to find a tree, among all of the possible trees with

n

internal nodes, that minimizes the weighted sum of the ''external path lengths''. These path lengths are the numbers of steps from the root to each leaf. They are multiplied by the weight of the leaf and then summed to give the quality of the overall tree. This problem can be interpreted as a problem of constructing a

binary search tree In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a Rooted tree, rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left ...

for

n

ordered keys, with the assumption that the tree will be used only to search for values that are not already in the tree. In this case, the

n

keys partition the space of search values into

n+1

intervals, and the weight of one of these intervals can be taken as the probability of searching for a value that lands in that interval. The weighted sum of external path lengths controls the

expected time In computational complexity theory, the average-case complexity of an algorithm is the amount of some computational resource (typically time) used by the algorithm, averaged over all possible inputs. It is frequently contrasted with worst-case com ...

for searching the tree. Alternatively, the output of the problem can be used as a

Huffman code In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by ...

, a method for encoding

n+1

given values unambiguously by using variable-length sequences of binary values. In this interpretation, the code for a value is given by the sequence of left and right steps from a parent to the child on the path from the root to a leaf in the tree (e.g. with 0 for left and 1 for right). Unlike standard Huffman codes, the ones constructed in this way are ''alphabetical'', meaning that the sorted order of these binary codes is the same as the input ordering of the values. If the weight of a value is its frequency in a message to be encoded, then the output of the Garsia–Wachs algorithm is the alphabetical Huffman code that compresses the message to the shortest possible length.

Algorithm

Overall, the algorithm consists of three phases: # Build a binary tree having the values as leaves but possibly in the wrong order. # Compute each leaf's distance from the root in the resulting tree. # Build another binary tree with the leaves at the same distances but in the correct order. The first phase of the algorithm is easier to describe if the input is augmented with two

sentinel value In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically ...

\infty

(or any sufficiently large finite value) at the start and end of the sequence. The first phase maintains a forest of trees, initially a single-node tree for each non-sentinel input weight, which will eventually become the binary tree that it constructs. Each tree is associated with a value, the sum of the weights of its leaves makes a tree node for each non-sentinel input weight. The algorithm maintains a sequence of these values, with the two sentinel values at each end. The initial sequence is just the order in which the leaf weights were given as input. It then repeatedly performs the following steps, each of which reduces the length of the input sequence, until there is only one tree containing all the leaves: *Find the first three consecutive weights

x

y

, and

z

in the sequence for which

x\le z

. There always exists such a triple, because the final sentinel value is larger than any previous two finite values. *Remove

x

and

y

from the sequence, and make a new tree node to be the parent of the nodes for

x

and

y

. Its value is

x+y

. *Reinsert the new node immediately after the rightmost earlier position whose value is greater than or equal to

x+y

. There always exists such a position, because of the left sentinel value. To implement this phase efficiently, the algorithm can maintain its current sequence of values in any

self-balancing binary search tree In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.Donald ...

structure. Such a structure allows the removal of

x

and

y

, and the reinsertion of their new parent, in logarithmic time. In each step, the weights up to

y

in the even positions of the array form a decreasing sequence, and the weights in the odd positions form another decreasing sequence. Therefore, the position to reinsert

x+y

may be found in logarithmic time by using the balanced tree to perform two

binary search In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the m ...

es, one for each of these two decreasing sequences. The search for the first position for which

x\le z

can be performed in linear total time by using a

sequential search In computer science, linear search or sequential search is a method for finding an element within a list. It sequentially checks each element of the list until a match is found or the whole list has been searched. A linear search runs in linea ...

that begins at the

z

from the previous triple. It is nontrivial to prove that, in the third phase of the algorithm, another tree with the same distances exists and that this tree provides the optimal solution to the problem. But assuming this to be true, the second and third phases of the algorithm are straightforward to implement in linear time. Therefore, the total time for the algorithm, on an input of length

n

, is

O(n\log n)

History

The Garsia–Wachs algorithm is named after

and Michelle L. Wachs, who published it in 1977. Their algorithm simplified an earlier method of T. C. Hu and

Alan Tucker Alan Curtiss Tucker is an American mathematician. He is a professor of applied mathematics at Stony Brook University, and the author of a widely used textbook on combinatorics; he has also made research contributions to graph theory and coding theo ...

, and (although it is different in internal details) it ends up making the same comparisons in the same order as the Hu–Tucker algorithm. The original proof of correctness of the Garsia–Wachs algorithm was complicated, and was later simplified by and .

Problem description

Algorithm

History

References

Further reading