In
computer science
Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, Hirschberg's algorithm, named after its inventor,
Dan Hirschberg, is a
dynamic programming algorithm
In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
that finds the optimal
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...
between two
string
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
s. Optimality is measured with the
Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the
Needleman–Wunsch algorithm that uses
dynamic programming. Hirschberg's algorithm is commonly used in
computational biology
Computational biology refers to the use of techniques in computer science, data analysis, mathematical modeling and Computer simulation, computational simulations to understand biological systems and relationships. An intersection of computer sci ...
to find maximal global alignments of
DNA
Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
and
protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...
sequences.
Algorithm information
Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment.
BLAST and
FASTA are suboptimal
heuristics
A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...
. If
and
are strings, where
and
, the
Needleman–Wunsch algorithm finds an optimal alignment in
time, using
space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch Algorithm, which still takes
time, but needs only
space and is much faster in practice.
One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the
longest common subsequence between two sets of data such as with the common
diff
In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but i ...
tool.
The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:
# one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
# if
is the optimal alignment of
, and
is an arbitrary partition of
, there exists a partition
of
such that
.
Algorithm description
denotes the ''i''-th character of
, where
.
denotes a substring of size
, ranging from the ''i''-th to the ''j''-th character of
.
is the reversed version of
.
and
are sequences to be aligned. Let
be a character from
, and
be a character from
. We assume that
,
and
are well defined integer-valued functions. These functions represent the cost of deleting
, inserting
, and replacing
with
, respectively.
We define
, which returns the last line of the Needleman–Wunsch score matrix
:
function NWScore(X, Y)
Score(0, 0) = 0 // 2 * (length(Y) + 1) array
for j = 1 to length(Y)
Score(0, j) = Score(0, j - 1) + Ins(Y
j)
for i = 1 to length(X) // Init array
Score(1, 0) = Score(0, 0) + Del(X
i)
for j = 1 to length(Y)
scoreSub = Score(0, j - 1) + Sub(X
i, Y
j)
scoreDel = Score(0, j) + Del(X
i)
scoreIns = Score(1, j - 1) + Ins(Y
j)
Score(1, j) = max(scoreSub, scoreDel, scoreIns)
end
// Copy Score
to Score
Score(0, :) = Score(1, :)
end
for j = 0 to length(Y)
LastLine(j) = Score(1, j)
return LastLine
Note that at any point,
only requires the two most recent rows of the score matrix. Thus,
is implemented in
space.
The Hirschberg algorithm follows:
function Hirschberg(X, Y)
Z = ""
W = ""
if length(X) 0
for i = 1 to length(Y)
Z = Z + '-'
W = W + Y
i
end
else if length(Y) 0
for i = 1 to length(X)
Z = Z + X
i
W = W + '-'
end
else if length(X) 1 or length(Y) 1
(Z, W) = NeedlemanWunsch(X, Y)
else
xlen = length(X)
xmid = length(X) / 2
ylen = length(Y)
ScoreL = NWScore(X
1:xmid, Y)
ScoreR = NWScore(rev(X
xmid+1:xlen), rev(Y))
ymid =
arg max ScoreL + rev(ScoreR)
(Z,W) = Hirschberg(X
1:xmid, y
1:ymid) + Hirschberg(X
xmid+1:xlen, Y
ymid+1:ylen)
end
return (Z, W)
In the context of observation (2), assume that
is a partition of
. Index
is computed such that
and
.
Example
Let
The optimal alignment is given by
W = AGTACGCA
Z = --TATGC-
Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:
T A T G C
0 -2 -4 -6 -8 -10
A -2 -1 0 -2 -4 -6
G -4 -3 -2 -1 0 -2
T -6 -2 -4 0 -2 -1
A -8 -4 0 -2 -1 -3
C -10 -6 -2 -1 -3 1
G -12 -8 -4 -3 1 -1
C -14 -10 -6 -5 -1 3
A -16 -12 -8 -7 -3 1
One starts with the top level call to
, which splits the first argument in half:
. The call to
produces the following matrix:
T A T G C
0 -2 -4 -6 -8 -10
A -2 -1 0 -2 -4 -6
G -4 -3 -2 -1 0 -2
T -6 -2 -4 0 -2 -1
A -8 -4 0 -2 -1 -3
Likewise,
generates the following matrix:
C G T A T
0 -2 -4 -6 -8 -10
A -2 -1 -3 -5 -4 -6
C -4 0 -2 -4 -6 -5
G -6 -2 2 0 -2 -4
C -8 -4 0 1 -1 -3
Their last lines (after reversing the latter) and sum of those are respectively
ScoreL =
-8 -4 0 -2 -1 -3 rev(ScoreR) =
-3 -1 1 0 -4 -8 Sum =
11 -5 1 -2 -5 -11
The maximum (shown in bold) appears at
ymid = 2
, producing the partition
.
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:
(AGTACGCA,TATGC)
/ \
(AGTA,TA) (CGCA,TGC)
/ \ / \
(AG, ) (TA,TA) (CG,TG) (CA,C)
/ \ / \
(T,T) (A,A) (C,T) (G,G)
The leaves of the tree contain the optimal alignment.
See also
*
Longest common subsequence
References
{{DEFAULTSORT:Hirschberg's Algorithm
Sequence alignment algorithms
Bioinformatics algorithms
Articles with example pseudocode
Dynamic programming