Baum–Welch Algorithm

In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step.


History

The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences.


Description

A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the ''i''-th hidden variable given the (''i'' − 1)-th hidden variable is independent of previous hidden variables, and that the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well-known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors.

Let X_t be a discrete hidden random variable with N possible values (i.e. we assume there are N states in total). We assume that P(X_t\mid X_{t-1}) is independent of time t, which leads to the definition of the time-independent stochastic transition matrix
:A=\{a_{ij}\}=P(X_t=j\mid X_{t-1}=i).
The initial state distribution (i.e. when t=1) is given by
:\pi_i=P(X_1 = i).
The observation variables Y_t can take one of K possible values. We also assume that the observation given the "hidden" state is time independent. The probability of a certain observation y_i at time t for state X_t = j is given by
:b_j(y_i)=P(Y_t=y_i\mid X_t=j).
Taking into account all the possible values of Y_t and X_t, we obtain the N \times K matrix B=\{b_j(y_i)\}, where j ranges over all the possible states and y_i ranges over all the possible observations. An observation sequence is given by Y= (Y_1=y_1,Y_2=y_2,\ldots,Y_T=y_T).

Thus we can describe a hidden Markov chain by \theta = (A,B,\pi). The Baum–Welch algorithm finds a local maximum for \theta^* = \operatorname{arg\,max}_\theta P(Y\mid\theta) (i.e. the HMM parameters \theta that maximize the probability of the observation).
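
As a rough illustration of the quantities defined above, the following Python sketch stores \theta = (A, B, \pi) as nested lists and evaluates P(Y\mid\theta) by summing the joint probability over every hidden path. The numeric values, the 0-based indices for states and symbols, and the function names are assumptions made for this example; the brute-force sum costs N^T terms and is only meant to make the definition concrete.

```python
from itertools import product

# Hypothetical two-state, two-symbol model (values chosen for illustration only).
A = [[0.5, 0.5],
     [0.3, 0.7]]   # A[i][j] = P(X_t = j | X_{t-1} = i)
B = [[0.3, 0.7],
     [0.8, 0.2]]   # B[j][k] = P(Y_t = k | X_t = j)
pi = [0.2, 0.8]    # pi[i]   = P(X_1 = i)


def joint_probability(states, obs, A, B, pi):
    """P(X = states, Y = obs | theta) under the HMM independence assumptions."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


def likelihood_brute_force(obs, A, B, pi):
    """P(Y = obs | theta), summing the joint over all N**T hidden paths."""
    n_states = len(pi)
    return sum(joint_probability(path, obs, A, B, pi)
               for path in product(range(n_states), repeat=len(obs)))


print(likelihood_brute_force([0, 0, 1], A, B, pi))
```

Baum–Welch searches for the \theta that makes this likelihood locally as large as possible, without ever enumerating the paths explicitly.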


Algorithm

Set \theta = (A, B, \pi) with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum.
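
One possible way to draw random initial conditions is sketched below in Python. The helper names (`random_stochastic_row`, `random_hmm`) and the small positive offset that keeps every entry away from zero are assumptions of this sketch, not part of the algorithm as described.

```python
import random


def random_stochastic_row(n, rng):
    """Return n strictly positive numbers that sum to 1."""
    weights = [rng.random() + 1e-3 for _ in range(n)]  # offset avoids exact zeros
    total = sum(weights)
    return [w / total for w in weights]


def random_hmm(n_states, n_symbols, seed=None):
    """Draw a random theta = (A, B, pi) with row-stochastic A and B."""
    rng = random.Random(seed)
    A = [random_stochastic_row(n_states, rng) for _ in range(n_states)]
    B = [random_stochastic_row(n_symbols, rng) for _ in range(n_states)]
    pi = random_stochastic_row(n_states, rng)
    return A, B, pi


A, B, pi = random_hmm(n_states=2, n_symbols=2, seed=0)
```

Avoiding exact zeros matters because a transition or emission probability that starts at zero stays at zero under the update formulas.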


Forward procedure

Let \alpha_i(t)=P(Y_1=y_1,\ldots,Y_t=y_t,X_t=i\mid\theta), the probability of seeing the observations y_1,y_2,\ldots,y_t and being in state i at time t. This is found recursively:
#\alpha_i(1)=\pi_i b_i(y_1),
#\alpha_i(t+1)=b_i(y_{t+1}) \sum_{j=1}^N \alpha_j(t) a_{ji}.
Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling \alpha in the forward and \beta in the backward procedure below.
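
The recursion and the scaling idea can be sketched in Python as follows. This is one common scaling scheme (normalizing \alpha at every step and accumulating the logarithms of the normalizers), assumed here for illustration; the function name and the 0-based time index are also assumptions. It presumes every observation has non-zero probability under the model.

```python
import math


def forward_scaled(obs, A, B, pi):
    """Scaled forward pass.

    Returns (alpha_hat, scales, log_likelihood), where alpha_hat[t][i] is
    alpha_i(t) renormalized to sum to 1 over i, scales[t] is the normalizer
    used at step t, and log_likelihood = sum of log(scales) = log P(Y | theta).
    """
    n = len(pi)
    alpha_hat, scales = [], []
    prev = None
    for t, y in enumerate(obs):
        if t == 0:
            raw = [pi[i] * B[i][y] for i in range(n)]
        else:
            raw = [B[i][y] * sum(prev[j] * A[j][i] for j in range(n))
                   for i in range(n)]
        c = sum(raw)                 # assumed non-zero
        prev = [a / c for a in raw]  # rescale so the row sums to 1
        alpha_hat.append(prev)
        scales.append(c)
    log_likelihood = sum(math.log(c) for c in scales)
    return alpha_hat, scales, log_likelihood
```

Because only the per-step normalizers are multiplied together (in log space), the result stays finite even for sequences long enough that the unscaled \alpha values would underflow to zero.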


Backward procedure

Let \beta_i(t)=P(Y_{t+1}=y_{t+1},\ldots,Y_T=y_T\mid X_t=i,\theta), that is, the probability of the ending partial sequence y_{t+1},\ldots,y_T given starting state i at time t. We calculate \beta_i(t) as
#\beta_i(T)=1,
#\beta_i(t)=\sum_{j=1}^N \beta_j(t+1) a_{ij} b_j(y_{t+1}).
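
A minimal Python sketch of the backward recursion, without scaling and therefore only suitable for short sequences, might look like this. The function name and the 0-based time index (so `beta[T - 1]` plays the role of \beta_i(T)) are assumptions of the example.

```python
def backward(obs, A, B):
    """Unscaled backward pass: beta[t][i] = P(y_{t+1..T} | X_t = i, theta)."""
    n = len(A)
    T = len(obs)
    beta = [[1.0] * n for _ in range(T)]  # beta_i(T) = 1 for every state i
    for t in range(T - 2, -1, -1):        # fill from the end backwards
        for i in range(n):
            beta[t][i] = sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]]
                             for j in range(n))
    return beta


# Hypothetical model, used only to exercise the function.
A = [[0.5, 0.5], [0.3, 0.7]]
B = [[0.3, 0.7], [0.8, 0.2]]
pi = [0.2, 0.8]
obs = [0, 1]
beta = backward(obs, A, B)
# Summing pi_i * b_i(y_1) * beta_i(1) over i recovers P(Y | theta).
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```

The last line is a useful consistency check: it should agree with the likelihood obtained from the forward procedure.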


Update

We can now calculate the temporary variables, according to Bayes' theorem:
:\gamma_i(t)=P(X_t=i\mid Y,\theta) = \frac{P(X_t=i,Y\mid\theta)}{P(Y\mid\theta)} = \frac{\alpha_i(t)\beta_i(t)}{\sum_{j=1}^N \alpha_j(t)\beta_j(t)},
which is the probability of being in state i at time t given the observed sequence Y and the parameters \theta, and
:\xi_{ij}(t)=P(X_t=i,X_{t+1}=j\mid Y,\theta) = \frac{P(X_t=i,X_{t+1}=j,Y\mid\theta)}{P(Y\mid\theta)} = \frac{\alpha_i(t) a_{ij} \beta_j(t+1) b_j(y_{t+1})}{\sum_{k=1}^N \sum_{w=1}^N \alpha_k(t) a_{kw} \beta_w(t+1) b_w(y_{t+1})},
which is the probability of being in state i and j at times t and t+1 respectively given the observed sequence Y and parameters \theta.

The denominators of \gamma_i(t) and \xi_{ij}(t) are the same; they represent the probability of making the observation Y given the parameters \theta.

The parameters of the hidden Markov model \theta can now be updated:
*\pi_i^* = \gamma_i(1), which is the expected frequency spent in state i at time 1.
*a_{ij}^*=\frac{\sum_{t=1}^{T-1}\xi_{ij}(t)}{\sum_{t=1}^{T-1}\gamma_i(t)}, which is the expected number of transitions from state ''i'' to state ''j'' compared to the expected total number of transitions away from state ''i''. To clarify, the number of transitions away from state ''i'' does not mean transitions to a different state ''j'', but to any state including itself. This is equivalent to the number of times state ''i'' is observed in the sequence from ''t'' = 1 to ''t'' = ''T'' − 1.
*b_i^*(v_k)=\frac{\sum_{t=1}^T 1_{y_t=v_k} \gamma_i(t)}{\sum_{t=1}^T \gamma_i(t)}, where
:1_{y_t=v_k}= \begin{cases} 1 & \text{if } y_t=v_k,\\ 0 & \text{otherwise} \end{cases}
is an indicator function, and b_i^*(v_k) is the expected number of times the output observations have been equal to v_k while in state i over the expected total number of times in state i.

These steps are now repeated iteratively until a desired level of convergence.

Note: It is possible to over-fit a particular data set. That is, P(Y\mid\theta_\text{final}) > P(Y \mid \theta_\text{true}). The algorithm also does not guarantee a global maximum.
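
Putting the forward pass, backward pass and update formulas together, one Baum–Welch iteration could be sketched in Python as below. This is an unscaled version intended for short sequences with at least two observations and strictly positive parameters; the function names, the 0-based indices and the starting values are assumptions of the sketch.

```python
def forward(obs, A, B, pi):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[i][y] * sum(prev[j] * A[j][i] for j in range(n))
                      for i in range(n)])
    return alpha


def backward(obs, A, B):
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]]
                       for j in range(n)) for i in range(n)]
    return beta


def baum_welch_step(obs, A, B, pi):
    """One EM iteration; returns (new_A, new_B, new_pi, P(Y | old theta))."""
    n, k, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
    likelihood = sum(alpha[-1])  # shared denominator of gamma and xi
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == v) /
              sum(gamma[t][i] for t in range(T))
              for v in range(k)] for i in range(n)]
    return new_A, new_B, new_pi, likelihood


# Hypothetical starting point and data, for illustration only.
A = [[0.5, 0.5], [0.3, 0.7]]
B = [[0.3, 0.7], [0.8, 0.2]]
pi = [0.2, 0.8]
obs = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
history = []
for _ in range(10):
    A, B, pi, lik = baum_welch_step(obs, A, B, pi)
    history.append(lik)
```

Tracking the likelihood in `history` gives a simple convergence check, since each EM iteration should not decrease P(Y\mid\theta); a practical implementation would stop once the improvement falls below a chosen tolerance and would use scaled or log-space quantities.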


Multiple sequences

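The algorithm described thus far assumes a single observed sequence Y = y_1, \ldots, y_T. However, in many situations, there are several sequences observed: Y_1, \ldots, Y_R. In this case, the information from all of the observed sequences must be used in the update of the parameters A, \pi, and B. Assuming that you have computed \gamma_{ir}(t) and \xi_{ijr}(t) for each sequence y_{1,r},\ldots,y_{T_r,r}, where T_r is the length of the ''r''-th sequence, the parameters can now be updated:
*\pi_i^* = \frac{\sum_{r=1}^R \gamma_{ir}(1)}{R},
*a_{ij}^*=\frac{\sum_{r=1}^R \sum_{t=1}^{T_r-1}\xi_{ijr}(t)}{\sum_{r=1}^R \sum_{t=1}^{T_r-1}\gamma_{ir}(t)},
*b_i^*(v_k)=\frac{\sum_{r=1}^R \sum_{t=1}^{T_r} 1_{y_{tr}=v_k} \gamma_{ir}(t)}{\sum_{r=1}^R \sum_{t=1}^{T_r} \gamma_{ir}(t)}, where
:1_{y_{tr}=v_k}= \begin{cases} 1 & \text{if } y_{t,r}=v_k,\\ 0 & \text{otherwise} \end{cases}
is an indicator function.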
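
The pooled update can be sketched in Python by accumulating numerators and denominators over all sequences before dividing. The helper names (`posteriors`, `pooled_update`), the unscaled forward-backward computation and the 0-based indices are assumptions of this example; sequences may have different lengths, each at least two.

```python
def posteriors(obs, A, B, pi):
    """Return (gamma, xi) for one sequence via unscaled forward-backward."""
    n, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([B[i][obs[t]] * sum(alpha[t - 1][j] * A[j][i] for j in range(n))
                      for i in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(n))
                   for i in range(n)]
    lik = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / lik
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    return gamma, xi


def pooled_update(sequences, A, B, pi):
    """One parameter update using the statistics of all R sequences."""
    n, k, R = len(pi), len(B[0]), len(sequences)
    pi_num = [0.0] * n
    a_num = [[0.0] * n for _ in range(n)]
    a_den = [0.0] * n
    b_num = [[0.0] * k for _ in range(n)]
    b_den = [0.0] * n
    for obs in sequences:
        gamma, xi = posteriors(obs, A, B, pi)
        for i in range(n):
            pi_num[i] += gamma[0][i]
            for t in range(len(obs)):
                b_num[i][obs[t]] += gamma[t][i]  # indicator selects column obs[t]
                b_den[i] += gamma[t][i]
            for t in range(len(obs) - 1):
                a_den[i] += gamma[t][i]
                for j in range(n):
                    a_num[i][j] += xi[t][i][j]
    new_pi = [p / R for p in pi_num]
    new_A = [[a_num[i][j] / a_den[i] for j in range(n)] for i in range(n)]
    new_B = [[b_num[i][v] / b_den[i] for v in range(k)] for i in range(n)]
    return new_A, new_B, new_pi
```

Note that the sums are pooled before the division; averaging the per-sequence estimates of A and B instead would weight short and long sequences equally, which is not what the formulas above prescribe.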


Example

Suppose we have a chicken from which we collect eggs at noon every day. Whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can however (for simplicity) assume that the chicken is always in one of two states that influence whether the chicken lays eggs, and that this state only depends on the state on the previous day. We don't know the state at the initial starting point, we don't know the transition probabilities between the two states, and we don't know the probability that the chicken lays an egg given a particular state. To start, we first guess the transition and emission matrices.

We then take a set of observations (E = eggs, N = no eggs): N, N, N, N, N, E, E, N, N, N

This gives us a set of observed transitions between days: NN, NN, NN, NN, NE, EE, EN, NN, NN

The next step is to estimate a new transition matrix. For example, the probability of the sequence NN and the state being S_1 then S_2 is given by the following: P(S_1) * P(N \mid S_1) * P(S_1 \rightarrow S_2) * P(N \mid S_2). Thus the new estimate for the S_1 to S_2 transition is now 0.0908 (referred to as "Pseudo probabilities" in the following tables). We then calculate the S_2 to S_1, S_2 to S_2 and S_1 to S_1 transition probabilities and normalize so they add to 1. This gives us the updated transition matrix.

Next, we want to estimate a new emission matrix. The new estimate for the E coming from S_1 emission is now 0.8769. This allows us to calculate the emission matrix as described above in the algorithm, by adding up the probabilities for the respective observed sequences. We then repeat for if N came from S_1 and for if N and E came from S_2, and normalize.

To estimate the initial probabilities we assume all sequences start with the hidden state S_1 and calculate the highest probability, and then repeat for S_2. Again we then normalize to give an updated initial vector.

Finally we repeat these steps until the resulting probabilities converge satisfactorily.
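
The first steps of this example can be reproduced with a short Python sketch: it derives the day-to-day pairs from the observation sequence and evaluates the four-factor product for a chosen pair of hidden states. The initial guesses in `initial`, `transition` and `emission` are hypothetical values picked for illustration, since the text above does not state the guessed matrices.

```python
from collections import Counter

observations = list("NNNNNEENNN")  # E = eggs, N = no eggs
pairs = ["".join(p) for p in zip(observations, observations[1:])]
pair_counts = Counter(pairs)

# Hypothetical initial guesses (assumed for this sketch, not given in the text).
initial = {"S1": 0.2, "S2": 0.8}
transition = {("S1", "S1"): 0.5, ("S1", "S2"): 0.5,
              ("S2", "S1"): 0.3, ("S2", "S2"): 0.7}
emission = {("S1", "N"): 0.3, ("S1", "E"): 0.7,
            ("S2", "N"): 0.8, ("S2", "E"): 0.2}


def pair_probability(pair, first, second):
    """P(first) * P(o1 | first) * P(first -> second) * P(o2 | second)."""
    o1, o2 = pair
    return (initial[first] * emission[(first, o1)]
            * transition[(first, second)] * emission[(second, o2)])


print(pairs)
print(pair_counts)
print(pair_probability("NN", "S1", "S2"))
```

Repeating `pair_probability` for every observed pair and every ordered pair of hidden states yields the raw quantities that the example then sums and normalizes into the updated matrices.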


Applications


Speech recognition

Hidden Markov models were first applied to speech recognition by James K. Baker in 1975. Continuous speech recognition occurs by the following steps, modeled by an HMM. Feature analysis is first undertaken on temporal and/or spectral features of the speech signal. This produces an observation vector. The feature is then compared to all sequences of the speech recognition units. These units could be phonemes, syllables, or whole-word units. A lexicon decoding system is applied to constrain the paths investigated, so only words in the system's lexicon (word dictionary) are investigated. Similar to the lexicon decoding, the system path is further constrained by the rules of grammar and syntax. Finally, semantic analysis is applied and the system outputs the recognized utterance. A limitation of many HMM applications to speech recognition is that the current state only depends on the state at the previous time-step, which is unrealistic for speech as dependencies are often several time-steps in duration. The Baum–Welch algorithm also has extensive applications in solving HMMs used in the field of speech synthesis.


Cryptanalysis

The Baum–Welch algorithm is often used to estimate the parameters of HMMs in deciphering hidden or noisy information and consequently is often used in cryptanalysis. In data security an observer would like to extract information from a data stream without knowing all the parameters of the transmission. This can involve reverse engineering a channel encoder. HMMs, and as a consequence the Baum–Welch algorithm, have also been used to identify spoken phrases in encrypted VoIP calls. In addition, HMM cryptanalysis is an important tool for automated investigations of cache-timing data. It allows for the automatic discovery of critical algorithm state, for example key values.


Applications in bioinformatics


Finding genes


Prokaryotic

The GLIMMER (Gene Locator and Interpolated Markov ModelER) software was an early gene-finding program used for the identification of coding regions in prokaryotic DNA. GLIMMER uses Interpolated Markov Models (IMMs) to identify the coding regions and distinguish them from the noncoding DNA. The latest release (GLIMMER3) has been shown to have increased specificity and accuracy compared with its predecessors with regard to predicting translation initiation sites, demonstrating an average 99% accuracy in locating 3' locations compared to confirmed genes in prokaryotes.


Eukaryotic

The GENSCAN webserver is a gene locator capable of analyzing eukaryotic sequences up to one million base-pairs (1 Mbp) long. GENSCAN utilizes a general inhomogeneous, three periodic, fifth order Markov model of DNA coding regions. Additionally, this model accounts for differences in gene density and structure (such as intron lengths) that occur in different isochores. While most integrated gene-finding software (at the time of GENSCAN's release) assumed input sequences contained exactly one gene, GENSCAN solves a general case where partial, complete, or multiple genes (or even no gene at all) are present. GENSCAN was shown to exactly predict exon location with 90% accuracy and 80% specificity compared to an annotated database.


Copy-number variation detection

Copy-number variations (CNVs) are an abundant form of genome structure variation in humans. A discrete-valued bivariate HMM (dbHMM) was used to assign chromosomal regions to seven distinct states: unaffected regions, deletions, duplications and four transition states. Solving this model using Baum–Welch demonstrated the ability to predict the location of CNV breakpoints to approximately 300 bp from micro-array experiments. This magnitude of resolution enables more precise correlations between different CNVs and across populations than previously possible, allowing the study of CNV population frequencies. It also demonstrated a direct inheritance pattern for a particular CNV.


Implementations


* Accord.NET in C#
* ghmm, a C library with Python bindings that supports both discrete and continuous emissions.
* Jajapy, a Python library that implements Baum–Welch on various kinds of Markov models (HMM, MC, MDP, CTMC, GOHMM and MGOHMM).
* HMMBase package for Julia.
* HMMFit function in the RHmm package for R.
* hmmtrain in MATLAB
* rustbio in Rust

See also

* Viterbi algorithm
* Hidden Markov model
* EM algorithm
* Maximum likelihood
* Speech recognition
* Bioinformatics
* Cryptanalysis



External links

* A comprehensive review of HMM methods and software in bioinformatics – Profile Hidden Markov Models
* Early HMM publications by Baum:
** A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains
** An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology
** Statistical Inference for Probabilistic Functions of Finite State Markov Chains
* The Shannon Lecture by Welch, which speaks to how the algorithm can be implemented efficiently:
** Hidden Markov Models and the Baum–Welch Algorithm, IEEE Information Theory Society Newsletter, Dec. 2003.
* An alternative to the Baum–Welch algorithm, the Viterbi Path Counting algorithm:
** Davis, Richard I. A.; Lovell, Brian C. "Comparing and evaluating HMM ensemble training algorithms using train and test and condition number criteria", Pattern Analysis and Applications, vol. 6, no. 4, pp. 327–336, 2003.
* An Interactive Spreadsheet for Teaching the Forward-Backward Algorithm (spreadsheet and article with step-by-step walkthrough)


