BLAST for Basic Local Alignment Search Tool is an
algorithm for comparing primary biological sequence information, such
as the amino-acid sequences of proteins or the nucleotides of DNA
BLAST search enables a researcher to compare a query
sequence with a library or database of sequences, and identify library
sequences that resemble the query sequence above a certain threshold.
Different types of BLASTs are available according to the query
sequences. For example, following the discovery of a previously
unknown gene in the mouse, a scientist will typically perform a BLAST
search of the human genome to see if humans carry a similar gene;
BLAST will identify sequences in the human genome that resemble the
mouse gene based on similarity of sequence. The
BLAST algorithm and
program were designed by Stephen Altschul, Warren Gish, Webb Miller,
Eugene Myers, and
David J. Lipman
1.1 Input 1.2 Output
2 Process 3 Algorithm
3.1 Parallel BLAST
4.1 Alternative versions 4.2 Accelerated versions
BLAST is one of the most widely used bioinformatics programs for
sequence searching. It addresses a fundamental problem in
bioinformatics research. The heuristic algorithm it uses is much
faster than other approaches, such as calculating an optimal
alignment. This emphasis on speed is vital to making the algorithm
practical on the huge genome databases currently available, although
subsequent algorithms can be even faster.
FASTA was developed by
David J. Lipman
Which bacterial species have a protein that is related in lineage to a certain protein with known amino-acid sequence What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined
BLAST is also often used as part of other algorithms that require
approximate sequence matching.
BLAST algorithm and the computer program that implements it were
developed by Stephen Altschul, Warren Gish, and
David Lipman at the
National Center for Biotechnology Information
Remove low-complexity region or sequence repeats in the query sequence.
"Low-complexity region" means a region of a sequence composed of few kinds of elements. These regions might give high scores that confuse the program to find the actual significant sequences in the database, so they should be filtered out. The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences. On the other hand, the program XNU is used to mask off the tandem repeats in protein sequences.
Make a k-letter word list of the query sequence.
Take k=3 for example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) "sequentially", until the last letter of the query sequence is included. The method is illustrated in figure 1.
Fig. 1 The method to establish the k-letter query word list. Adapted from Biological Sequence Analysis I, Current Topics in Genome Analysis .
List the possible matching words.
This step is one of the main differences between
BLAST and FASTA.
FASTA cares about all of the common words in the database and query
sequences that are listed in step 2; however,
BLAST only cares about
the high-scoring words. The scores are created by comparing the word
in the list in step 2 with all the 3-letter words. By using the
scoring matrix (substitution matrix) to score the comparison of each
residue pair, there are 20^3 possible match scores for a 3-letter
word. For example, the score obtained by comparing PQG with PEG and
PQA is respectively 15 and 12 with the
Organize the remaining high-scoring words into an efficient search tree.
This allows the program to rapidly compare the high-scoring words to the database sequences.
Repeat step 3 to 4 for each k-letter word in the query sequence. Scan the database sequences for exact matches with the remaining high-scoring words.
The BLAST program scans the database sequences for the remaining high-scoring word, such as PEG, of each position. If an exact match is found, this match is used to seed a possible un-gapped alignment between the query and database sequences.
Extend the exact matches to high-scoring segment pair (HSP).
The original version of BLAST stretches a longer alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
Fig. 2 The process to extend the exact match. Adapted from Biological Sequence Analysis I, Current Topics in Genome Analysis .
Fig. 3 The positions of the exact matches.
To save more time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed. BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity. Therefore, the possible matching words list in step 3 becomes longer. Next, the exact matched regions, within distance A from each other on the same diagonal in figure 3, will be joined as a longer new region. Finally, the new regions are then extended by the same method as in the original version of BLAST, and the HSPs' (High-scoring segment pair) scores of the extended regions are then created by using a substitution matrix as before.
List all of the HSPs in the database whose score is high enough to be considered.
We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.
Evaluate the significance of the HSP score.
BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). (It is proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD. For local alignments containing gaps it is not proved.). In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation
S ≥ x
= 1 − exp
x − μ
displaystyle pleft(Sgeq xright)=1-exp left(-e^ -lambda left(x-mu right) right)
displaystyle mu = frac log left(Km'n'right) lambda ;
The statistical parameters
displaystyle mathrm K
are estimated by fitting the distribution of the un-gapped local alignment scores, of the query sequence and a lot of shuffled versions (Global or local shuffling) of a database sequence, to the Gumbel extreme value distribution. Note that
displaystyle mathrm K
depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies).
are the effective lengths of the query and database sequences, respectively. The original sequence length is shortened to the effective length to compensate for the edge effect (an alignment start near the end of one of the query or database sequence is likely not to have enough sequence to build an optimal alignment). They can be calculated as
≈ m −
ln K m n
displaystyle m'approx m- frac ln Kmn H ;
≈ n −
ln K m n
displaystyle n'approx n- frac ln Kmn H ;
displaystyle mathrm H
is the average expected score per aligned pair of residues in an alignment of two random sequences. Altschul and Gish gave the typical values,
λ = 0.318
displaystyle lambda =0.318
displaystyle mathrm K =0.13
displaystyle mathrm H =0.40
, for un-gapped local alignment using
E ≈ 1 −
s > x D
displaystyle Eapprox 1-e^ -pleft(s>xDright)
p < 0.1
, E could be approximated by the Poisson distribution as
E ≈ p D
displaystyle Eapprox pD
This expectation or expect value "E" (often called an E score or E-value or e-value) assessing the significance of the HSP score for un-gapped local alignment is reported in the BLAST results. The calculation shown here is modified if individual HSPs are combined, such as when producing gapped alignments (described below), due to the variation of the statistical parameters.
Make two or more HSP regions into a longer alignment.
Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set with the maximal lower score (45>40). However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45(97). The original BLAST uses the Poisson method; gapped BLAST and the WU- BLAST uses the sum-of scores method.
Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
The original BLAST only generates un-gapped alignments including the initially found HSPs individually, even when there is more than one HSP found in one database sequence. BLAST2 produces a single alignment with gaps that can include all of the initially found HSP regions. Note that the computation of the score and its corresponding E-value involves use of adequate gap penalties.
Report every match whose expect score is lower than a threshold parameter E.
Parallel BLAST Parallel BLAST versions of split databases are implemented using MPI and Pthreads, and have been ported to various platforms including Windows, Linux, Solaris, Mac OS X, and AIX. Popular approaches to parallelize BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partition). Databases are split into equal sized pieces and stored locally on each node. Each query is run on all nodes in parallel and the resultant BLAST output files from all nodes merged to yield the final output. Program The BLAST program can either be downloaded and run as a command-line utility "blastall" or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms. The BLAST program is based on an open-source format, giving everyone access to it and enabling them to have the ability to change the program code. This has led to the creation of several BLAST "spin-offs". There are now a handful of different BLAST programs available, which can be used depending on what one is attempting to do and what they are working with. These different programs vary in query sequence input, the database being searched, and what is being compared. These programs and their details are listed below: BLAST is actually a family of programs (all included in the blastall executable). These include:
This program, given a DNA query, returns the most similar DNA
sequences from the DNA database that the user specifies.
This program, given a protein query, returns the most similar protein
sequences from the protein database that the user specifies.
BLAST (PSI-BLAST) (blastpgp)
This program is used to find distant relatives of a protein. First, a
list of all closely related proteins is created. These proteins are
combined into a general "profile" sequence, which summarises
significant features present in these sequences. A query against the
protein database is then run using this profile, and a larger group of
proteins is found. This larger group is used to construct another
profile, and the process is repeated.
By including related proteins in the search, PSI-
BLAST is much more
sensitive in picking up distant evolutionary relationships than a
standard protein-protein BLAST.
Of these programs, BLASTn and BLASTp are the most commonly used because they use direct comparisons, and do not require translations. However, since protein sequences are better conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx, and BLASTx, produce more reliable and accurate results when dealing with coding DNA. They also enable one to be able to directly see the function of the protein sequence, since by translating the sequence of interest before searching often gives you annotated protein hits. Alternative versions A version designed for comparing large genomes or DNA is BLASTZ. CS-BLAST (ContSxt-Specific BLAST) is an extended version of BLAST for searching protein sequences that finds twice as many remotely related sequences as BLAST at the same speed and error rate. In CS-BLAST, the mutation probabilities between amino acids depend not only on the single amino acid, as in BLAST, but also on its local sequence context. Washington University produced an alternative version of NCBI BLAST, called WU-BLAST. The rights have since been acquired to Advanced Biocomputing, LLC. In 2009, NCBI has released a new set of BLAST executables, the C++ based BLAST+, and has released parallel versions until 2.2.26. Starting with version 2.2.27 (April 2013), only BLAST+ executables are available. Among the changes is the replacement of the blastall executable with separate executables for the different BLAST programs, and changes in option handling. The formatdb utility (C based) has been replaced by makeblastdb (C++ based) and databases formatted by either one should be compatible for identical blast releases. The algorithms remain similar, however, the number of hits found and their order can vary significantly between the older and the newer version. Accelerated versions
Alternatives to BLAST An extremely fast but considerably less sensitive alternative to BLAST is BLAT (Blast Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster. Another software alternative similar to BLAT is PatternHunter. Advances in sequencing technology in the late 2000s has made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically use BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie. For protein identification, searching for known domains (for instance from Pfam) by matching with Hidden Markov Models is a popular alternative, such as HMMER. An alternative to BLAST for comparing two banks of sequences is PLAST. PLAST provides a high-performance general purpose bank to bank sequence similarity search tool relying on the PLAST and ORIS algorithms. Results of PLAST are very similar to BLAST, but PLAST is significantly faster and capable of comparing large sets of sequences with a small memory (i.e. RAM) footprint. For applications in metagenomics, where the task is to compare billions of short DNA reads against tens of millions of protein references, DIAMOND runs at up to 20,000 times as fast as BLASTX, while maintaining a high level of sensitivity. The open-source software MMseqs2 is an alternative to BLAST/PSI-BLAST, which improves on current search tools over the full range of speed-sensitivity trade-off, achieving sensitivities better than PSI- BLAST at more than 400 times its speed. BLAST output visualization To help users interpreting BLAST results, different software is available. According to installation and use, analysis features and technology, here are some available tools:
NCBI BLAST service general BLAST output interpreters, GUI-based: JAMBLAST, Blast Viewer, BLASTGrabber integrated BLAST environments: PLAN, BlastStation-Free BLAST output parsers: MuSeqBox, Zerg, BioParser, BLAST-Explorer specialized BLAST-related tools: MEGAN, BLAST2GENE, BOV, Circoletto
Uses of BLAST BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.
With the use of BLAST, you can possibly correctly identify a species
or find homologous species. This can be useful, for example, when you
are working with a
Locating domains When working with a protein sequence you can input it into BLAST, to locate known domains within the sequence of interest.
Establishing phylogeny Using the results received through BLAST you can create a phylogenetic tree using the BLAST web-page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so should only be relied upon for "first pass" phylogenetic analyses.
DNA mapping When working with a known species, and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest, to relevant sequences in the database(s).
Comparison When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.
BLAST and the Smith-Waterman Process
While both Smith-Waterman and
BLAST are used to find homologous
sequences by searching and comparing a query sequence with those in
the databases, they do have their differences.
Due to the fact that
BLAST is based on a heuristic algorithm, the
results received through BLAST, in terms of the hits found, may not be
the best possible results, as it will not provide you with all the
hits within the database.
BLAST misses hard to find matches.
A better alternative in order to find the best possible results would
be to use the Smith-Waterman algorithm. This method varies from the
BLAST method in two areas, accuracy and speed. The Smith-Waterman
option provides better accuracy, in that it finds matches that BLAST
cannot, because it does not miss any information. Therefore, it is
necessary for remote homology. However, when compared to BLAST, it is
more time consuming, not to mention that it requires large amounts of
computer usage and space. However, technologies to speed up the
Smith-Waterman process have been found to improve the time necessary
to perform a search dramatically. These technologies include FPGA
^ a b Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene;
Lipman, David (1990). "Basic local alignment search tool". Journal of
Molecular Biology. 215 (3): 403–410.
doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.
^ Casey, R. M. (2005). "
BLAST Sequences Aid in Genomics and
Proteomics". Business Intelligence Network.
^ Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein
similarity searches". Science. 227 (4693): 1435–41.
doi:10.1126/science.2983426. PMID 2983426.
^ Oehmen, C.; Nieplocha, J. (2006). "ScalaBLAST: A Scalable
BLAST for High-Performance Data-Intensive
Library resources about Sequence alignment
Resources in your library Resources in other libraries
BLAST+ executables — free source downloads
Biological Sequence Analysis I: Andy Baxevanis' lecture from NHGRI's
Current Topics in Genome Analysis series, covering contemporary areas
in genomics and bioinformatics
What's behind BLAST?: talk by
Baxevanis, Andreas D. (2005). "Chapter 11: Assessing Pairwise Sequence Similarity: BLAST and FASTA". In Andreas D. Baxevanis; B. F. Francis Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (Third ed.). New York: John Wiley & Sons. ISBN 978-0-471-47878-2. Wheeler, David; Bhagwat, Medha (2007). "Chapter 9: BLAST QuickStart". In Bergman, Nicholas H. Comparative Genomics Volumes 1 and 2. Methods in Molecular Biology. 395-396. Totowa, NJ: Humana Press. PMID 21250292. Mount DW (1 Jul 2007). "Using the Basic Local Alignment Search Tool (BLAST)". Cold Spring Harbor Protocols. 2007 (14): pdb.top17. doi:10.1101/pdb.top17. PMID 21357135.
v t e
Sequence databases: GenBank, European
BLAST Bowtie Clustal HMMER MUSCLE SAMtools TopHat
Server: ExPASy Ontology: Gene Ontology
International Society for Computational Biology
Intelligent Systems for Molecular Biology
Computational biology List of biological databases Sequencing Sequence database Sequence alignment Molec