A profile HMM modelling a multiple sequence alignment

HMMER is a free and commonly used software package for sequence analysis written by

Sean Eddy Sean Roberts Eddy is Professor of Molecular & Cellular Biology and of Applied Mathematics at Harvard University. Previously he was based at the Janelia Research Campus from 2006 to 2015 in Virginia. His research interests are in bioinformatics, ...

. Its general usage is to identify homologous

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...

nucleotide Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecules wi ...

sequences, and to perform sequence alignments. It detects homology by comparing a ''profile-HMM'' (a

Hidden Markov model A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...

constructed explicitly for a particular search) to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

in the HMMER package using the ''hmmbuild'' program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a

console Console may refer to: Computing and video games * System console, a physical device to operate a computer ** Virtual console, a user interface for multiple computer consoles on one device ** Command-line interface, a method of interacting with ...

utility ported to every major

operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs. Time-sharing operating systems schedule tasks for efficient use of the system and may also in ...

, including different versions of

Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which ...

Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for serv ...

, and

Mac OS Two major famlies of Mac operating systems were developed by Apple Inc. In 1984, Apple debuted the operating system that is now known as the "Classic" Mac OS with its release of the original Macintosh System Software. The system, rebranded "M ...

. HMMER is the core utility that protein family databases such as

Pfam Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 35.0, was released in November 2021 and contains 19,632 families. Uses ...

and

InterPro InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them. The contents of InterPro ...

are based upon. Some other bioinformatics tools such as

UGENE UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2. UGENE helps biolo ...

also use HMMER. HMMER3 also makes extensive use of vector instructions for increasing computational speed. This work is based upon earlier publication showing a significant acceleration of the Smith-Waterman algorithm for aligning two sequences.

Profile HMMs

A profile HMM is a variant of an HMM relating specifically to biological sequences. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system, which can be used to align sequences and search databases for remotely homologous sequences. They capitalise on the fact that certain positions in a sequence alignment tend to have biases in which residues are most likely to occur, and are likely to differ in their probability of containing an insertion or a deletion. Capturing this information gives them a better ability to detect true homologs than traditional

BLAST Blast or The Blast may refer to: *Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film), ...

-based approaches, which penalise substitutions, insertions and deletions equally, regardless of where in an alignment they occur. Hmmer_coreProfileHMM

Profile HMMs center around a linear set of match (M) states, with one state corresponding to each consensus column in a sequence alignment. Each M state emits a single residue (amino acid or nucleotide). The probability of emitting a particular residue is determined largely by the frequency at which that residue has been observed in that column of the alignment, but also incorporates prior information on patterns of residues that tend to co-occur in the same columns of sequence alignments. This string of match states emitting amino acids at particular frequencies are analogous to position specific score matrices or weight matrices. A profile HMM takes this modelling of sequence alignments further by modelling insertions and deletions, using I and D states, respectively. D states do not emit a residue, while I states do emit a residue. Multiple I states can occur consecutively, corresponding to multiple residues between consensus columns in an alignment. M, I and D states are connected by state transition probabilities, which also vary by position in the sequence alignment, to reflect the different frequencies of insertions and deletions across sequence alignments. The HMMER2 and HMMER3 releases used an architecture for building profile HMMs called the Plan 7 architecture, named after the seven states captured by the model. In addition to the three major states (M, I and D), six additional states capture non-homologous flanking sequence in the alignment. These 6 states collectively are important for controlling how sequences are aligned to the model e.g. whether a sequence can have multiple consecutive hits to the same model (in the case of sequences with multiple instances of the same domain).

Programs in the HMMER package

The HMMER package consists of a collection of programs for performing functions using profile hidden Markov models. The programs include:

Profile HMM building

*hmmbuild - construct profile HMMs from multiple sequence alignments

Homology searching

*hmmscan - search protein sequences against a profile HMM database *hmmsearch - search profile HMMs against a sequence database *jackhmmer - iteratively search sequences against a protein database *nhmmer - search DNA/RNA queries against a DNA/RNA sequence database *nhmmscan - search nucleotide sequences against a nucleotide profile *phmmer - search protein sequences against a protein database

Other functions

*hmmalign - align sequences to a profile HMM *hmmemit - produce sample sequences from a profile HMM *hmmlogo - produce data for an HMM logo from an HMM file The package contains numerous other specialised functions.

The HMMER web server

In addition to the software package, the HMMER search function is available in the form of a web server. The service facilitates searches across a range of databases, including sequence databases such as

UniProt UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from ...

, SwissProt, and the

Protein Data Bank The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cry ...

, and HMM databases such as

TIGRFAMs TIGRFAMs is a database of protein families designed to support manual and automated genome annotation. Each entry includes a multiple sequence alignment and hidden Markov model (HMM) built from the alignment. Sequences that score above the defined ...

and

SUPERFAMILY SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, str ...

. The four search types phmmer, hmmsearch, hmmscan and jackhmmer are supported (see Programs). The search function accepts single sequences as well as sequence alignments or profile HMMs. The search results are accompanied by a report on the taxonomic breakdown, and the

domain Domain may refer to: Mathematics *Domain of a function, the set of input values for which the (total) function is defined **Domain of definition of a partial function **Natural domain of a partial function **Domain of holomorphy of a function * Do ...

organisation of the hits. Search results can then be filtered according to either parameter. The web service is currently run out of the European Bioinformatics Institute (EBI) in the United Kingdom, while development of the algorithm is still performed by Sean Eddy's team in the United States. Major reasons for relocating the web service were to leverage the computing infrastructure at the EBI, and to cross-link HMMER searches with relevant databases that are also maintained by the EBI.

The HMMER3 release

The latest stable release of HMMER is version 3.0. HMMER3 is complete rewrite of the earlier HMMER2 package, with the aim of improving the speed of profile-HMM searches. Major changes are outlined below:

Improvements in speed

A major aim of the HMMER3 project, started in 2004 was to improve the speed of HMMER searches. While profile HMM-based homology searches were more accurate than BLAST-based approaches, their slower speed limited their applicability. The main performance gain is due to a heuristic filter that finds high-scoring un-gapped matches within database sequences to a query profile. This heuristic results in a computation time comparable to

with little impact on accuracy. Further gains in performance are due to a

log-likelihood The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood functi ...

model that requires no calibration for estimating

E-values In statistical hypothesis testing, e-values quantify the evidence in the data against a null hypothesis (e.g., "the coin is fair", or, in a medical context, "this new treatment has no effect"). They serve as a more robust alternative to p-value ...

, and allows the more accurate forward scores to be used for computing the significance of a homologous sequence. HMMER still lags behind BLAST in speed of DNA-based searches; however, DNA-based searches can be tuned such that an improvement in speed comes at the expense of accuracy.

Improvements in remote homology searching

The major advance in speed was made possible by the development of an approach for calculating the significance of results integrated over a range of possible alignments. In discovering remote homologs, alignments between query and hit proteins are often very uncertain. While most sequence alignment tools calculate match scores using only the best scoring alignment, HMMER3 calculates match scores by integrating across all possible alignments, to account for uncertainty in which alignment is best. HMMER sequence alignments are accompanied by posterior probability annotations, indicating which portions of the alignment have been assigned high confidence and which are more uncertain.

DNA sequence comparison

A major improvement in HMMER3 was the inclusion of DNA/DNA comparison tools. HMMER2 only had functionality to compare protein sequences.

Restriction to local alignments

While HMMER2 could perform local alignment (align a complete model to a subsequence of the target) and global alignment (align a complete model to a complete target sequence), HMMER3 only performs local alignment. This restriction is due to the difficulty in calculating the significance of hits when performing local/global alignments using the new algorithm.

References

External links

*
HMMER3 announcementA blog posting on HMMER policy on trademark, copyright, patents, and licensing
{{DEFAULTSORT:Hmmer Bioinformatics software Free science software Free software programmed in C Computational science