The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM).

The algorithm has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs. It is now also commonly used in speech recognition, speech synthesis, diarization, keyword spotting, computational linguistics, and bioinformatics. For example, in speech-to-text (speech recognition), the acoustic signal is treated as the observed sequence of events, and a string of text is considered to be the "hidden cause" of the acoustic signal. The Viterbi algorithm finds the most likely string of text given the acoustic signal.

History

The Viterbi algorithm is named after Andrew Viterbi, who proposed it in 1967 as a decoding algorithm for convolutional codes over noisy digital communication links. It has, however, a history of multiple invention, with at least seven independent discoveries, including those by Viterbi, Needleman and Wunsch, and Wagner and Fischer. It was introduced to natural language processing as a method of part-of-speech tagging as early as 1987.

''Viterbi path'' and ''Viterbi algorithm'' have become standard terms for the application of dynamic programming algorithms to maximization problems involving probabilities. For example, in statistical parsing a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is commonly called the "Viterbi parse". Another application is in target tracking, where the track is computed that assigns a maximum likelihood to a sequence of observations.

Extensions

A generalization of the Viterbi algorithm, termed the ''max-sum algorithm'' (or ''max-product algorithm''), can be used to find the most likely assignment of all or some subset of latent variables in a large number of graphical models, e.g. Bayesian networks, Markov random fields and conditional random fields. The latent variables need, in general, to be connected in a way somewhat similar to a hidden Markov model (HMM), with a limited number of connections between variables and some type of linear structure among the variables. The general algorithm involves ''message passing'' and is substantially similar to the belief propagation algorithm (which is the generalization of the forward-backward algorithm).

With the algorithm called iterative Viterbi decoding, one can find the subsequence of an observation that matches best (on average) to a given hidden Markov model. This algorithm was proposed by Qi Wang et al. to deal with turbo codes. Iterative Viterbi decoding works by iteratively invoking a modified Viterbi algorithm, reestimating the score for a filler until convergence.

An alternative algorithm, the Lazy Viterbi algorithm, has been proposed. For many applications of practical interest, under reasonable noise conditions, the lazy decoder (using the Lazy Viterbi algorithm) is much faster than the original Viterbi decoder (using the Viterbi algorithm). While the original Viterbi algorithm calculates every node in the trellis of possible outcomes, the Lazy Viterbi algorithm maintains a prioritized list of nodes to evaluate in order, and the number of calculations required is typically fewer (and never more) than the ordinary Viterbi algorithm for the same result. However, it is not so easy to parallelize in hardware.
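The max-sum form mentioned above can be illustrated on an ordinary HMM by working with log-probabilities, so that products become sums. The following is a minimal sketch under stated assumptions, not a general graphical-model implementation: the function name `viterbi_log` and the dictionary layout of `start_p`, `trans_p` and `emit_p` are illustrative choices, and zero probabilities are mapped to negative infinity.

```python
import math


def viterbi_log(obs, states, start_p, trans_p, emit_p):
    """Max-sum Viterbi: add log-probabilities instead of multiplying probabilities."""

    def log(p):
        # Treat log(0) as -infinity so impossible moves never win the max.
        return math.log(p) if p > 0 else float("-inf")

    # score[s] = best log-probability of any path that ends in state s.
    score = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t - 1][s] = best predecessor of state s at time t

    for y in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda k: prev[k] + log(trans_p[k][s]))
            score[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][y])
            ptr[s] = best
        back.append(ptr)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path, score[last]
```

Working in the log domain gives the same path as the product form, and it tends to avoid floating-point underflow on long observation sequences. The returned score is a log-probability; `math.exp` converts it back.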

Pseudocode

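This algorithm generates a path <math>X=(x_1,x_2,\ldots,x_T)</math>, which is a sequence of states <math>x_n \in S=\{s_1,s_2,\dots,s_K\}</math> that generate the observations <math>Y=(y_1,y_2,\ldots,y_T)</math> with <math>y_n \in O=\{o_1,o_2,\dots,o_N\}</math>, where <math>N</math> is the number of possible observations in the observation space <math>O</math>.

Two 2-dimensional tables of size <math>K \times T</math> are constructed:
* Each element <math>T_1[i,j]</math> of <math>T_1</math> stores the probability of the most likely path so far <math>\hat{X}=(\hat{x}_1,\hat{x}_2,\ldots,\hat{x}_j)</math> with <math>\hat{x}_j=s_i</math> that generates <math>Y=(y_1,y_2,\ldots,y_j)</math>.
* Each element <math>T_2[i,j]</math> of <math>T_2</math> stores <math>\hat{x}_{j-1}</math> of the most likely path so far <math>\hat{X}=(\hat{x}_1,\hat{x}_2,\ldots,\hat{x}_{j-1},\hat{x}_j=s_i)</math>, <math>\forall j, 2\leq j \leq T</math>.

The table entries <math>T_1[i,j], T_2[i,j]</math> are filled by increasing order of <math>K\cdot j+i</math>:

: <math>T_1[i,j]=\max_{k}{(T_1[k,j-1]\cdot A_{ki}\cdot B_{iy_j})}</math>,
: <math>T_2[i,j]=\operatorname{argmax}_{k}{(T_1[k,j-1]\cdot A_{ki}\cdot B_{iy_j})}</math>,

with <math>A_{ki}</math> and <math>B_{iy_j}</math> as defined below. Note that <math>B_{iy_j}</math> does not need to appear in the latter expression, as it is non-negative and independent of <math>k</math> and thus does not affect the argmax.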

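;Input:
* The observation space <math>O=\{o_1,o_2,\dots,o_N\}</math>,
* the state space <math>S=\{s_1,s_2,\dots,s_K\}</math>,
* an array of initial probabilities <math>\Pi = (\pi_1,\pi_2,\dots,\pi_K)</math> such that <math>\pi_i</math> stores the probability that <math>x_1 = s_i</math>,
* a sequence of observations <math>Y=(y_1,y_2,\ldots,y_T)</math> such that <math>y_t=o_i</math> if the observation at time <math>t</math> is <math>o_i</math>,
* transition matrix <math>A</math> of size <math>K\times K</math> such that <math>A_{ij}</math> stores the probability of transitioning from state <math>s_i</math> to state <math>s_j</math>,
* emission matrix <math>B</math> of size <math>K\times N</math> such that <math>B_{ij}</math> stores the probability of observing <math>o_j</math> from state <math>s_i</math>.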

;Output:
* The most likely hidden state sequence <math>X=(x_1,x_2,\ldots,x_T)</math>

 function ''VITERBI''(O, S, Π, Y, A, B) : X
     for each state i = 1, 2, …, K do
         T_1[i,1] ← π_i · B[i,y_1]
         T_2[i,1] ← 0
     end for
     for each observation j = 2, 3, …, T do
         for each state i = 1, 2, …, K do
             T_1[i,j] ← max_k (T_1[k,j-1] · A[k,i] · B[i,y_j])
             T_2[i,j] ← argmax_k (T_1[k,j-1] · A[k,i] · B[i,y_j])
         end for
     end for
     z_T ← argmax_k (T_1[k,T])
     x_T ← s_{z_T}
     for j = T, T-1, …, 2 do
         z_{j-1} ← T_2[z_j,j]
         x_{j-1} ← s_{z_{j-1}}
     end for
     return X
 end function
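The table-filling pseudocode can be sketched in plain Python as follows. This is an illustrative translation rather than a reference implementation: the function name `viterbi_tables` is hypothetical, indices are 0-based instead of 1-based, states and observations are represented by their integer indices, and the emission factor is left out of the argmax as the text permits.

```python
def viterbi_tables(Y, Pi, A, B):
    """Fill the K x T tables T1 and T2, then backtrack.

    Y  : list of observation indices (length T)
    Pi : list of K initial probabilities
    A  : K x K nested list, A[k][i] = P(next state i | current state k)
    B  : K x N nested list, B[i][o] = P(observation o | state i)
    """
    K, T = len(Pi), len(Y)
    T1 = [[0.0] * T for _ in range(K)]  # best path probability ending in state i at step j
    T2 = [[0] * T for _ in range(K)]    # best predecessor of state i at step j

    for i in range(K):
        T1[i][0] = Pi[i] * B[i][Y[0]]

    for j in range(1, T):
        for i in range(K):
            # The emission term does not depend on k, so it is applied after the argmax.
            k_best = max(range(K), key=lambda k: T1[k][j - 1] * A[k][i])
            T2[i][j] = k_best
            T1[i][j] = T1[k_best][j - 1] * A[k_best][i] * B[i][Y[j]]

    # Backtrack from the most probable final state.
    z = [0] * T
    z[T - 1] = max(range(K), key=lambda k: T1[k][T - 1])
    for j in range(T - 1, 0, -1):
        z[j - 1] = T2[z[j]][j]
    return z, T1, T2
```

The returned list `z` holds state indices; mapping it through the state space gives the path X.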

Restated in a succinct near-Python:

 function ''viterbi''(O, S, Π, Tm, Em) : best_path
     # O: observation sequence   Tm: transition matrix   Em: emission matrix
     trellis ← matrix(length(S), length(O))    # To hold probability of each state given each observation
     pointers ← matrix(length(S), length(O))   # To hold backpointer to best prior state
     for s in range(length(S)):                # Determine each hidden state's probability at time 0…
         trellis[s, 0] ← Π[s] · Em[s, O[0]]
     for o in range(1, length(O)):             # …and after, tracking each state's most likely prior state, k
         for s in range(length(S)):
             k ← argmax(k in trellis[k, o-1] · Tm[k, s] · Em[s, O[o]])
             trellis[s, o] ← trellis[k, o-1] · Tm[k, s] · Em[s, O[o]]
             pointers[s, o] ← k
     best_path ← list()
     k ← argmax(k in trellis[k, length(O)-1])  # Find k of best final state
     for o in range(length(O)-1, -1, -1):      # Backtrack from last observation
         best_path.insert(0, S[k])             # Insert previous state on most likely path
         k ← pointers[k, o]                    # Use backpointer to find best previous state
     return best_path

;Explanation:
Suppose we are given a hidden Markov model (HMM) with state space <math>S</math>, initial probabilities <math>\pi_i</math> of being in state <math>i</math> and transition probabilities <math>a_{i,j}</math> of transitioning from state <math>i</math> to state <math>j</math>. Say we observe outputs <math>y_1,\dots,y_T</math>. The most likely state sequence <math>x_1,\dots,x_T</math> that produces the observations is given by the recurrence relations (Xing E, slide 11):

: <math>\begin{align}
 V_{1,k} &= \mathrm{P}\big( y_1 \mid k \big) \cdot \pi_k, \\
 V_{t,k} &= \max_{x \in S} \left( \mathrm{P}\big( y_t \mid k \big) \cdot a_{x,k} \cdot V_{t-1,x}\right).
\end{align}</math>

Here <math>V_{t,k}</math> is the probability of the most probable state sequence <math>\mathrm{P}\big(x_1,\dots,x_t,y_1,\dots,y_t\big)</math> responsible for the first <math>t</math> observations that have <math>k</math> as its final state. The Viterbi path can be retrieved by saving back pointers that remember which state <math>x</math> was used in the second equation. Let <math>\mathrm{Ptr}(k,t)</math> be the function that returns the value of <math>x</math> used to compute <math>V_{t,k}</math> if <math>t > 1</math>, or <math>k</math> if <math>t=1</math>. Then:

: <math>\begin{align}
 x_T &= \arg\max_{x \in S} (V_{T,x}), \\
 x_{t-1} &= \mathrm{Ptr}(x_t,t).
\end{align}</math>

Here we are using the standard definition of arg max. The complexity of this implementation is <math>O(T\times\left|{S}\right|^2)</math>. A better estimation exists if the maximum in the internal loop is instead found by iterating only over states that directly link to the current state (i.e. there is an edge from <math>k</math> to <math>j</math>). Then using amortized analysis one can show that the complexity is <math>O(T\times(\left|{S}\right| + \left|{E}\right|))</math>, where <math>E</math> is the number of edges in the graph.
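The edge-restricted inner loop can be sketched as below. This is an illustration under assumptions, not a tuned decoder: the name `viterbi_sparse` and the `edges` format (a dictionary from a `(k, j)` state pair to its transition probability, with absent pairs meaning probability zero) are hypothetical. Incoming edges are grouped once per destination, so each time step touches every state and every edge once.

```python
def viterbi_sparse(obs, states, start_p, edges, emit_p):
    """Viterbi over a sparse transition graph.

    edges: dict mapping (k, j) -> P(j | k); pairs that are absent have probability 0.
    If every path has probability 0, the returned path may contain None.
    """
    # Group incoming edges by destination state: one pass over the edge set.
    incoming = {s: [] for s in states}
    for (k, j), p in edges.items():
        incoming[j].append((k, p))

    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    ptr = [{}]
    for t in range(1, len(obs)):
        V.append({})
        ptr.append({})
        for j in states:
            best_k, best = None, 0.0
            # Only predecessors with an edge into j are examined.
            for k, p in incoming[j]:
                cand = V[t - 1][k] * p
                if cand > best:
                    best_k, best = k, cand
            V[t][j] = best * emit_p[j][obs[t]]
            ptr[t][j] = best_k

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, ptr[t][path[0]])
    return path, V[-1][last]
```

When the graph is fully connected the edge count equals the square of the state count, so this costs the same as the dense version; the saving appears only for sparse models such as left-to-right HMMs.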

Example

Consider a village where all villagers are either healthy or have a fever, and only the village doctor can determine whether each has a fever. The doctor diagnoses fever by asking patients how they feel. The villagers may only answer that they feel normal, dizzy, or cold.

The doctor believes that the health condition of the patients operates as a discrete Markov chain. There are two states, "Healthy" and "Fever", but the doctor cannot observe them directly; they are ''hidden'' from the doctor. On each day, there is a certain chance that a patient will tell the doctor "I feel normal", "I feel cold", or "I feel dizzy", depending on the patient's health condition.

The ''observations'' (normal, cold, dizzy) along with a ''hidden'' state (healthy, fever) form a hidden Markov model (HMM), and can be represented as follows in the Python programming language:

 obs = ("normal", "cold", "dizzy")
 states = ("Healthy", "Fever")
 start_p = {"Healthy": 0.6, "Fever": 0.4}
 trans_p = {
     "Healthy": {"Healthy": 0.7, "Fever": 0.3},
     "Fever": {"Healthy": 0.4, "Fever": 0.6},
 }
 emit_p = {
     "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
     "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
 }

In this piece of code, start_p represents the doctor's belief about which state the HMM is in when the patient first visits (all the doctor knows is that the patient tends to be healthy). The particular probability distribution used here is not the equilibrium one, which is (given the transition probabilities) approximately {"Healthy": 0.57, "Fever": 0.43}. The trans_p represents the change of the health condition in the underlying Markov chain. In this example, a patient who is healthy today has only a 30% chance of having a fever tomorrow. The emit_p represents how likely each possible observation (normal, cold, or dizzy) is, given the underlying condition (healthy or fever). A patient who is healthy has a 50% chance of feeling normal; one who has a fever has a 60% chance of feeling dizzy.

Figure: an example of an HMM.

A patient visits three days in a row, and the doctor discovers that the patient feels normal on the first day, cold on the second day, and dizzy on the third day. The doctor has a question: what is the most likely sequence of health conditions of the patient that would explain these observations? This is answered by the Viterbi algorithm.

 def viterbi(obs, states, start_p, trans_p, emit_p):
     # V[t][st] holds the best path probability ending in st at time t
     # and the predecessor state on that path.
     V = [{}]
     for st in states:
         V[0][st] = {"prob": start_p[st] * emit_p[st][obs[0]], "prev": None}
     # Run Viterbi when t > 0
     for t in range(1, len(obs)):
         V.append({})
         for st in states:
             max_tr_prob = V[t - 1][states[0]]["prob"] * trans_p[states[0]][st] * emit_p[st][obs[t]]
             prev_st_selected = states[0]
             for prev_st in states[1:]:
                 tr_prob = V[t - 1][prev_st]["prob"] * trans_p[prev_st][st] * emit_p[st][obs[t]]
                 if tr_prob > max_tr_prob:
                     max_tr_prob = tr_prob
                     prev_st_selected = prev_st
             max_prob = max_tr_prob
             V[t][st] = {"prob": max_prob, "prev": prev_st_selected}
 
     for line in dptable(V):
         print(line)
 
     opt = []
     max_prob = 0.0
     best_st = None
     # Get most probable state and its backtrack
     for st, data in V[-1].items():
         if data["prob"] > max_prob:
             max_prob = data["prob"]
             best_st = st
     opt.append(best_st)
     previous = best_st
 
     # Follow the backtrack till the first observation
     for t in range(len(V) - 2, -1, -1):
         opt.insert(0, V[t + 1][previous]["prev"])
         previous = V[t + 1][previous]["prev"]
 
     print("The steps of states are " + " ".join(opt) + " with highest probability of %s" % max_prob)
 
 def dptable(V):
     # Print a table of steps from dictionary
     yield " " * 5 + "     ".join(("%3d" % i) for i in range(len(V)))
     for state in V[0]:
         yield "%.7s: " % state + " ".join("%.7s" % ("%lf" % v[state]["prob"]) for v in V)

The function viterbi takes the following arguments: obs is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; start_p is the start probability; trans_p are the transition probabilities; and emit_p are the emission probabilities. For simplicity of code, we assume that the observation sequence obs is non-empty and that trans_p[i][j] and emit_p[i][j] are defined for all states i, j.

In the running example, the Viterbi algorithm is used as follows:


viterbi(obs,
        states,
        start_p,
        trans_p,
        emit_p)



The output of the script is


$ python viterbi_example.py
         0          1          2
Healthy: 0.30000 0.08400 0.00588
Fever: 0.04000 0.02700 0.01512
The steps of states are Healthy Healthy Fever with highest probability of 0.01512


This reveals that the observations ['normal', 'cold', 'dizzy'] were most likely generated by states ['Healthy', 'Healthy', 'Fever']. In other words, given the observations, the patient was most likely to have been healthy on the first day and also on the second day (despite feeling cold that day), and only to have contracted a fever on the third day.
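With only two states and three observations, the result can be cross-checked by scoring all eight candidate state sequences directly. The sketch below is such a brute-force check, not part of the algorithm; the probabilities are restated so the block runs on its own, they reproduce the table printed above, and the helper name `joint_prob` is illustrative.

```python
from itertools import product

obs = ("normal", "cold", "dizzy")
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}


def joint_prob(path):
    """Joint probability of one state sequence together with the observations."""
    p = start_p[path[0]] * emit_p[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans_p[path[t - 1]][path[t]] * emit_p[path[t]][obs[t]]
    return p


# Score every possible state sequence and keep the most probable one.
scores = {path: joint_prob(path) for path in product(states, repeat=len(obs))}
best = max(scores, key=scores.get)
print(best, round(scores[best], 5))
```

Enumeration grows exponentially with the number of observations, which is exactly the cost that the dynamic programming table avoids.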

The operation of Viterbi's algorithm can be visualized by means of a trellis diagram. The Viterbi path is essentially the shortest path through this trellis.

  Soft output Viterbi algorithm 

The soft output Viterbi algorithm (SOVA) is a variant of the classical Viterbi algorithm.

SOVA differs from the classical Viterbi algorithm in that it uses a modified path metric which takes into account the  ''a priori probabilities'' of the input symbols, and produces a ''soft'' output indicating the ''reliability'' of the decision.

The first step in the SOVA is the selection of the survivor path, passing through one unique node at each time instant, ''t''. Since each node has 2 branches converging at it (with one branch being chosen to form the ''Survivor Path'', and the other being discarded), the difference in the branch metrics (or ''cost'') between the chosen and discarded branches indicates the ''amount of error'' in the choice.

This ''cost'' is accumulated over the entire sliding window (usually equal to ''at least'' five constraint lengths), to indicate the ''soft output'' measure of reliability of the ''hard bit decision'' of the Viterbi algorithm.
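The metric difference at each merge point can be sketched on a small HMM as follows. This is only the raw ingredient of the soft output and not a full SOVA: it omits the a priori terms and the sliding-window accumulation described above, it uses log-probabilities as path metrics, and it assumes at least two states with strictly positive probabilities. The name `survivor_deltas` is hypothetical.

```python
import math


def survivor_deltas(obs, states, start_p, trans_p, emit_p):
    """Return the survivor path and, for each step after the first,
    the gap between the chosen and the best discarded incoming metric."""
    # Path metrics are log-probabilities, so a higher metric is better.
    metric = {s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}
    history = []  # per step: {state: (chosen predecessor, metric gap)}

    for y in obs[1:]:
        new_metric, step = {}, {}
        for s in states:
            # Rank the branches converging on s by accumulated metric.
            ranked = sorted(
                states,
                key=lambda k: metric[k] + math.log(trans_p[k][s]),
                reverse=True,
            )
            cands = [metric[k] + math.log(trans_p[k][s]) for k in ranked]
            new_metric[s] = cands[0] + math.log(emit_p[s][y])
            step[s] = (ranked[0], cands[0] - cands[1])
        history.append(step)
        metric = new_metric

    # Trace the survivor path backwards, collecting the gap at each node.
    state = max(states, key=lambda s: metric[s])
    path, deltas = [state], []
    for step in reversed(history):
        prev, gap = step[state]
        deltas.insert(0, gap)
        path.insert(0, prev)
        state = prev
    return path, deltas
```

A small gap means the discarded branch was almost as good as the chosen one, so the hard decision at that step is less reliable; a large gap indicates a confident decision.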

  See also 

* Expectation–maximization algorithm
* Baum–Welch algorithm
* Forward-backward algorithm
* Forward algorithm
* Error-correcting code
* Viterbi decoder
* Hidden Markov model
* Part-of-speech tagging
* A* search algorithm

  General references 

* Rabiner LR (February 1989). "A tutorial on hidden Markov models and selected applications in speech recognition". ''Proceedings of the IEEE''. 77 (2): 257–286. doi:10.1109/5.18626. CiteSeerX 10.1.1.381.3454. S2CID 13618539. (Describes the forward algorithm and Viterbi algorithm for HMMs).
* Shinghal, R. and Godfried T. Toussaint, "Experiments in text recognition with the modified Viterbi algorithm," ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', vol. PAMI-1, April 1979, pp. 184–193.
* Shinghal, R. and Godfried T. Toussaint, "The sensitivity of the modified Viterbi algorithm to the source statistics," ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', vol. PAMI-2, March 1980, pp. 181–185.

  External links 

* Implementations in Java, F#, Clojure, C# on Wikibooks
* Tutorial on convolutional coding with Viterbi decoding, by Chip Fleming
* A tutorial for a Hidden Markov Model toolkit (implemented in C) that contains a description of the Viterbi algorithm
* Viterbi algorithm by Dr. Andrew J. Viterbi (scholarpedia.org).

  Implementations 


* Mathematica has an implementation as part of its support for stochastic processes
* Susa signal processing framework provides the C++ implementation for forward error correction codes and channel equalization
* C++
* C#
* Java
* Java 8
* Julia (HMMBase.jl)
* Perl
* Prolog
* Go
* SFIHMM includes code for Viterbi decoding.
