computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...

, the Hunt–Szymanski algorithm, also known as Hunt–McIlroy algorithm, is a solution to the

longest common subsequence problem The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring problem: unlike substrings, su ...

. It was one of the first non-heuristic algorithms used in

diff In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but i ...

which compares a pair of files each represented as a sequence of lines. To this day, variations of this algorithm are found in incremental

version control system In software engineering, version control (also known as revision control, source control, or source code management) is a class of systems responsible for managing changes to computer programs, documents, large web sites, or other collections o ...

s, wiki engines, and

molecular phylogenetics Molecular phylogenetics () is the branch of phylogeny that analyzes genetic, hereditary molecular differences, predominantly in DNA sequences, to gain information on an organism's evolutionary relationships. From these analyses, it is possible to ...

research software. The worst-case complexity for this algorithm is , but in practice is rather expected.

History

The algorithm was proposed by Harold S. Stone as a generalization of a special case solved by Thomas G. Szymanski.

James W. Hunt James Wayne Hunt (August 5, 1952 – March 21, 2021) was an American computer scientist and inventor. He invented the Hunt–Szymanski algorithm and Hunt–McIlroy algorithm algorithms. It was one of the first non-heuristic algorithms used in di ...

refined the idea, implemented the first version of the candidate-listing algorithm used by

and embedded it into an older framework of

Douglas McIlroy Malcolm Douglas McIlroy (born 1932) is a mathematician, engineer, and programmer. As of 2019 he is an Adjunct Professor of Computer Science at Dartmouth College. McIlroy is best known for having originally proposed Unix pipelines and developed se ...

. The description of the algorithm appeared as a technical report by Hunt and McIlroy in 1976. The following year, a variant of the algorithm was finally published in a joint paper by Hunt and Szymanski.

Algorithm

The Hunt–Szymanski algorithm is a modification to a basic solution for the longest common subsequence problem which has complexity . The solution is modified so that there are lower time and space requirements for the algorithm when it is working with typical inputs.

Basic longest common subsequence solution

Algorithm

Let be the th element of the first sequence. Let be the th element of the second sequence. Let be the length of the longest common subsequence for the first elements of and the first elements . :

P_ =
\begin
  0
& \text\ i = 0 \text  j = 0 \\
  1 + P_
& \text A_i = B_j \\
  \max(P_, P_)
& \text A_i \ne B_j
\end

Example

Consider the sequences and . contains three elements: :

\begin
 A_1 = a\\
 A_2 = b\\
 A_3 = c
\end

contains three elements: :

\begin
 B_1 = a\\
 B_2 = c\\
 B_3 = b
\end

The steps that the above algorithm would perform to determine the length of the longest common subsequence for both sequences are shown in the diagram. The algorithm correctly reports that the longest common subsequence of the two sequences is two elements long.

Complexity

The above algorithm has worst-case time and space complexities of (''see'' big O notation), where is the number of elements in sequence and is the number of elements in sequence . The Hunt–Szymanski algorithm modifies this algorithm to have a worst-case time complexity of and space complexity of , though it regularly beats the worst case with typical inputs.

Essential matches

-candidates

The Hunt–Szymanski algorithm only considers what the authors call essential matches, or -candidates. -candidates are pairs of indices such that: :

A_i = B_j

P_ > max(P_, P_)

The second point implies two properties of -candidates: * There is a common subsequence of length in the first elements of sequence and the first elements of sequence . * There are no common subsequences of length for any fewer than elements of sequence or elements of sequence .

Connecting -candidates

To create the longest common subsequence from a collection of -candidates, a grid with each sequence's contents on each axis is created. The -candidates are marked on the grid. A common subsequence can be created by joining marked coordinates of the grid such that any increase in is accompanied by an increase in . This is illustrated in the adjacent diagram. Black dots represent candidates that would have to be considered by the simple algorithm and the black lines are connections that create common subsequences of length 3. Red dots represent -candidates that are considered by the Hunt–Szymanski algorithm and the red line is the connection that creates a common subsequence of length 3.

References

{{DEFAULTSORT:Hunt-Szymanski algorithm Algorithms on strings Combinatorics Dynamic programming

History

Algorithm

Basic longest common subsequence solution

Algorithm

Example

Complexity

Essential matches

-candidates

Connecting -candidates

See also

References