Hirschberg's Algorithm
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, Hirschberg's algorithm, named after its inventor,
Dan Hirschberg Daniel S. Hirschberg is a full professor in Computer Science at University of California, Irvine. His research interests are in the theory of design and analysis of algorithms. He obtained his PhD in Computer Science from Princeton University ...
, is a
dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. I ...
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...
that finds the optimal
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alig ...
between two strings. Optimality is measured with the
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the
Needleman–Wunsch algorithm The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...
that uses divide and conquer. Hirschberg's algorithm is commonly used in
computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
to find maximal global alignments of DNA and
protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...
sequences.


Algorithm information

Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment.
BLAST Blast or The Blast may refer to: * Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film) ...
and
FASTA FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics. History The original FASTA program ...
are suboptimal
heuristics A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...
. If ''x'' and ''y'' are strings, where length(''x'') = ''n'' and length(''y'') = ''m'', the
Needleman–Wunsch algorithm The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...
finds an optimal alignment in O(''nm'') time, using O(''nm'') space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch Algorithm, which still takes O(''nm'') time, but needs only O(min) space and is much faster in practice. One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the
longest common subsequence A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...
between two sets of data such as with the common
diff In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it ...
tool. The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that: # one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix; # if (Z, W) = \operatorname(X, Y) is the optimal alignment of (X, Y), and X = X^l + X^r is an arbitrary partition of X, there exists a partition Y^l + Y^r of Y such that \operatorname(X, Y) = \operatorname(X^l, Y^l) + \operatorname(X^r, Y^r).


Algorithm description

X_i denotes the ''i''-th character of X, where 1 \leqslant i \leqslant \operatorname(X). X_ denotes a substring of size j - i + 1, ranging from the ''i''-th to the ''j''-th character of X. \operatorname(X) is the reversed version of X. X and Y are sequences to be aligned. Let x be a character from X, and y be a character from Y. We assume that \operatorname(x), \operatorname(y) and \operatorname(x,y) are well defined integer-valued functions. These functions represent the cost of deleting x, inserting y, and replacing x with y, respectively. We define \operatorname(X,Y), which returns the last line of the Needleman–Wunsch score matrix \mathrm(i, j): function NWScore(X, Y) Score(0, 0) = 0 // 2 * (length(Y) + 1) array for j = 1 to length(Y) Score(0, j) = Score(0, j - 1) + Ins(Yj) for i = 1 to length(X) // Init array Score(1, 0) = Score(0, 0) + Del(Xi) for j = 1 to length(Y) scoreSub = Score(0, j - 1) + Sub(Xi, Yj) scoreDel = Score(0, j) + Del(Xi) scoreIns = Score(1, j - 1) + Ins(Yj) Score(1, j) = max(scoreSub, scoreDel, scoreIns) end // Copy Score to Score Score(0, :) = Score(1, :) end for j = 0 to length(Y) LastLine(j) = Score(1, j) return LastLine Note that at any point, \operatorname only requires the two most recent rows of the score matrix. Thus, \operatorname is implemented in O(\min\) space. The Hirschberg algorithm follows: function Hirschberg(X, Y) Z = "" W = "" if length(X)

0 for i = 1 to length(Y) Z = Z + '-' W = W + Yi end else if length(Y)

0 for i = 1 to length(X) Z = Z + Xi W = W + '-' end else if length(X)

1 or length(Y)

1 (Z, W) = NeedlemanWunsch(X, Y) else xlen = length(X) xmid = length(X) / 2 ylen = length(Y) ScoreL = NWScore(X1:xmid, Y) ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y)) ymid =
arg max In mathematics, the arguments of the maxima (abbreviated arg max or argmax) are the points, or elements, of the domain of some function at which the function values are maximized.For clarity, we refer to the input (''x'') as ''points'' and the ...
ScoreL + rev(ScoreR) (Z,W) = Hirschberg(X1:xmid, y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen) end return (Z, W) In the context of observation (2), assume that X^l + X^r is a partition of X. Index \mathrm is computed such that Y^l = Y_ and Y^r = Y_.


Example

Let \begin X &= \text,\\ Y &= \text,\\ \operatorname(x) &= -2,\\ \operatorname(y) &= -2,\\ \operatorname(x,y) &= \begin +2, & \text x = y \\ -1, & \text x \neq y.\end \end The optimal alignment is given by W = AGTACGCA Z = --TATGC- Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix: T A T G C 0 -2 -4 -6 -8 -10 A -2 -1 0 -2 -4 -6 G -4 -3 -2 -1 0 -2 T -6 -2 -4 0 -2 -1 A -8 -4 0 -2 -1 -3 C -10 -6 -2 -1 -3 1 G -12 -8 -4 -3 1 -1 C -14 -10 -6 -5 -1 3 A -16 -12 -8 -7 -3 1 One starts with the top level call to \operatorname(\text, \text), which splits the first argument in half: X = \text + \text. The call to \operatorname(\text, Y) produces the following matrix: T A T G C 0 -2 -4 -6 -8 -10 A -2 -1 0 -2 -4 -6 G -4 -3 -2 -1 0 -2 T -6 -2 -4 0 -2 -1 A -8 -4 0 -2 -1 -3 Likewise, \operatorname(\operatorname(\text), \operatorname(Y)) generates the following matrix: C G T A T 0 -2 -4 -6 -8 -10 A -2 -1 -3 -5 -4 -6 C -4 0 -2 -4 -6 -5 G -6 -2 2 0 -2 -4 C -8 -4 0 1 -1 -3 Their last lines (after reversing the latter) and sum of those are respectively ScoreL = -8 -4 0 -2 -1 -3 rev(ScoreR) = -3 -1 1 0 -4 -8 Sum = 11 -5 1 -2 -5 -11 The maximum (shown in bold) appears at ymid = 2, producing the partition Y = \text + \text. The entire Hirschberg recursion (which we omit for brevity) produces the following tree: (AGTACGCA,TATGC) / \ (AGTA,TA) (CGCA,TGC) / \ / \ (AG, ) (TA,TA) (CG,TG) (CA,C) / \ / \ (T,T) (A,A) (C,T) (G,G) The leaves of the tree contain the optimal alignment.


See also

*
Longest common subsequence A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...


References

{{DEFAULTSORT:Hirschberg's Algorithm Sequence alignment algorithms Bioinformatics algorithms Articles with example pseudocode Dynamic programming