BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome.
It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster at performing mRNA/DNA alignments and ~50 times faster at protein/protein alignments.
Overview
BLAT is one of multiple algorithms developed for the analysis and comparison of biological sequences such as DNA, RNA and proteins, with a primary goal of inferring homology in order to discover biological function of genomic sequences.
Unlike the classic Needleman-Wunsch and Smith-Waterman dynamic programming algorithms, BLAT is not guaranteed to find the mathematically optimal alignment between two sequences; rather, it first attempts to rapidly detect short sequences which are more likely to be homologous, and then it aligns and further extends the homologous regions. It is similar to the heuristic BLAST family of algorithms, but each tool approaches the problem of aligning biological sequences in a timely and efficient manner with different algorithmic techniques.
Uses of BLAT
BLAT can be used to align DNA sequences as well as protein and translated nucleotide (mRNA or DNA) sequences. It is designed to work best on highly similar sequences. The DNA search is most effective for primates and the protein search is effective for land vertebrates.
In addition, protein or translated sequence queries are more effective for identifying distant matches and for cross-species analysis than DNA sequence queries.
Typical uses of BLAT include the following:
*Alignment of multiple mRNA sequences onto a genome assembly in order to infer their genomic coordinates;
*Alignment of a protein or mRNA sequence from one species onto a sequence database from another species to determine homology. Provided the two species are not too divergent, cross-species alignment is generally effective with BLAT. This is possible because BLAT does not require perfect matches, but rather accepts mismatches in alignments;
*BLAT can be used for alignments of two protein sequences. However, it is not the tool of choice for these types of alignments. BLASTP, the Standard Protein BLAST tool, is more efficient at protein-protein alignments;
*Determination of the distribution of exonic and intronic regions of a gene;
*Detection of gene family members of a specific gene query;
*Display of the protein-coding sequence of a specific gene.
BLAT is designed to find matches between sequences of length at least 40 bases that share ≥95% nucleotide identity or ≥80% translated protein identity.
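The thresholds above can be made concrete with a small check. This is a minimal sketch, assuming an ungapped alignment supplied as two equal-length strings; the function names `percent_identity` and `meets_dna_threshold` are illustrative and not part of BLAT, and BLAT's own "Identity" measure is computed differently.

```python
def percent_identity(aligned_query: str, aligned_target: str) -> float:
    """Percentage of matching positions in an ungapped, equal-length alignment."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        return 0.0
    matches = sum(q == t for q, t in zip(aligned_query.upper(), aligned_target.upper()))
    return 100.0 * matches / len(aligned_query)


def meets_dna_threshold(aligned_query: str, aligned_target: str,
                        min_length: int = 40, min_identity: float = 95.0) -> bool:
    """Apply the length >= 40 and identity >= 95% figures quoted in the text."""
    return (len(aligned_query) >= min_length
            and percent_identity(aligned_query, aligned_target) >= min_identity)
```

For a 40-base alignment, two mismatches still give 95% identity and pass the check, while three mismatches (92.5%) do not.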
Process
BLAT is used to find regions in a target genomic database which are similar to a query sequence under examination. The general algorithmic process followed by BLAT is similar to BLAST's in that it first searches for short segments in the database and query sequences which have a certain number of matching elements. These alignment seeds are then extended in both directions of the sequences in order to form high-scoring pairs.
However, BLAT uses a different indexing approach from BLAST, which allows it to rapidly scan very large genomic and protein databases for similarities to a query sequence. It does this by keeping an indexed list (hash table) of the target database in memory, which significantly reduces the time required for the comparison of the query sequences with the target database. This index is built by taking the coordinates of all the non-overlapping k-mers (words with k letters) in the target database, except for highly repeated k-mers. BLAT then builds a list of all overlapping k-mers from the query sequence and searches for these in the target database, building up a list of hits where there are matches between the sequences (Figure 1 illustrates this process).
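The indexing scheme described above can be illustrated with a short sketch. It is not BLAT's implementation: the function names and the `max_occurrences` cutoff for "highly repeated" k-mers are assumptions made for the example, and a Python dictionary stands in for the hash table.

```python
from collections import defaultdict


def build_target_index(target: str, k: int, max_occurrences: int = 1000) -> dict:
    """Map each non-overlapping k-mer of the target to its start positions.

    K-mers seen more than max_occurrences times are dropped, mirroring the
    exclusion of highly repeated k-mers (the cutoff value is an assumption).
    """
    index = defaultdict(list)
    for start in range(0, len(target) - k + 1, k):  # step k: non-overlapping
        index[target[start:start + k]].append(start)
    return {kmer: pos for kmer, pos in index.items() if len(pos) <= max_occurrences}


def find_hits(query: str, index: dict, k: int) -> list:
    """Look up every overlapping k-mer of the query; return (query_pos, target_pos) hits."""
    hits = []
    for q_pos in range(len(query) - k + 1):  # step 1: overlapping
        for t_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, t_pos))
    return hits
```

Because the target k-mers do not overlap, the index holds roughly one entry per k letters of the database, which keeps it small enough to hold in memory. Scanning the query with overlapping k-mers ensures that a shared word is still found whatever its offset relative to the target's k-mer boundaries.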
Search stage
There are three different strategies used in order to search for candidate homologous regions:
#The first method requires single perfect matches between the query and database sequences i.e. the two k-mer words are exactly the same. This approach is not considered the most practical. This is because a small k-mer size is necessary in order to achieve high levels of sensitivity, but this increases the number of false positive hits, thus increasing the amount of time spent in the alignment stage of the algorithm.
#The second method allows one mismatch between the two k-mer words. This decreases the number of false positives, allowing larger k-mer sizes which are less computationally expensive to handle than those produced from the previous method. This method is very effective in identifying small homologous regions.
#The third method requires multiple perfect matches which are in close proximity to each other. As Kent shows, this is a very effective technique capable of taking into consideration small insertions and deletions within the homologous regions.
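The second strategy in the list above can be sketched as follows. One simple way to find k-mers that differ in a single letter is to look up every one-substitution variant of each query k-mer. The sketch assumes the target index is a plain dictionary from k-mer to target positions; the function names are hypothetical and the approach is illustrative rather than a description of BLAT's code.

```python
def one_mismatch_variants(kmer: str, alphabet: str = "ACGT") -> set:
    """Return the k-mer itself plus every variant differing at exactly one position."""
    variants = {kmer}
    for i, original in enumerate(kmer):
        for letter in alphabet:
            if letter != original:
                variants.add(kmer[:i] + letter + kmer[i + 1:])
    return variants


def near_perfect_hits(query: str, index: dict, k: int, alphabet: str = "ACGT") -> list:
    """Find (query_pos, target_pos) pairs whose k-mers differ in at most one letter."""
    hits = []
    for q_pos in range(len(query) - k + 1):
        # sorted() only makes the output order deterministic
        for variant in sorted(one_mismatch_variants(query[q_pos:q_pos + k], alphabet)):
            for t_pos in index.get(variant, []):
                hits.append((q_pos, t_pos))
    return hits
```

Each query k-mer now costs k × (alphabet size − 1) + 1 lookups instead of one, which is the price paid for tolerating a mismatch.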
When aligning nucleotides, BLAT uses the third method requiring two perfect word matches of size 11 (11-mers). When aligning proteins, the BLAT version determines the search methodology used: when the client/server version is used, BLAT searches for three perfect 4-mer matches; when the stand-alone version is used, BLAT searches for a single perfect 5-mer between the query and database sequences.
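The two-hit criterion used for nucleotide searches can be sketched as a filter over perfect k-mer hits. This is a simplified illustration under stated assumptions: the `window` and `max_diagonal_shift` values and the function name `two_hit_seeds` are invented for the example and are not BLAT's parameters.

```python
def two_hit_seeds(hits: list, k: int = 11, window: int = 100,
                  max_diagonal_shift: int = 2) -> list:
    """Pair up non-overlapping perfect k-mer hits that lie close together.

    hits: (query_pos, target_pos) tuples of perfect k-mer matches.
    Two hits are paired when the second starts at least k and at most `window`
    letters after the first in the target, and their diagonals
    (target_pos - query_pos) differ by at most `max_diagonal_shift`,
    which tolerates a small insertion or deletion between them.
    """
    ordered = sorted(hits, key=lambda h: h[1])  # sort by target position
    seeds = []
    for i, (q1, t1) in enumerate(ordered):
        for q2, t2 in ordered[i + 1:]:
            if t2 - t1 > window:
                break  # later hits are even further away in the target
            close_diagonal = abs((t2 - q2) - (t1 - q1)) <= max_diagonal_shift
            if t2 - t1 >= k and q2 > q1 and close_diagonal:
                seeds.append(((q1, t1), (q2, t2)))
    return seeds
```

Requiring two nearby hits rather than one makes chance matches much less likely to trigger the slower alignment stage, while the diagonal tolerance lets a seed survive a small insertion or deletion.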
BLAT vs. BLAST
Some of the differences between BLAT and BLAST are outlined below:
*BLAT indexes the genome/protein database, retains the index in memory, and then scans the query sequence for matches. BLAST, on the other hand, builds an index of the query sequences and searches through the database for matches. A BLAST variant called MegaBLAST indexes 4 databases to speed up alignments.
*BLAT can extend on multiple perfect and near-perfect matches (default is 2 perfect matches of length 11 for nucleotide searches and 3 perfect matches of length 4 for protein searches), while BLAST extends only when one or two matches occur close together.
*BLAT connects each homologous area between two sequences into a single larger alignment, in contrast to BLAST which returns each homologous area as a separate local alignment. The result of BLAST is a list of exons with each alignment extending just past the end of the exon. BLAT, however, correctly places each base of the mRNA onto the genome, using each base only once, and can be used to identify intron-exon boundaries (i.e. splice sites).
*BLAT is less sensitive than BLAST.
Program usage
BLAT can be used either as a web-based server-client program or as a stand-alone program.
Server-client
The web-based application of BLAT can be accessed from the UCSC Genome Bioinformatics Site. Building the index is a relatively slow procedure. Therefore, each genome assembly used by the web-based BLAT is associated with a BLAT server, in order to have a pre-computed index available for alignments. These web-based BLAT servers keep the index in memory for users to input their query sequences.
Once the query sequence is uploaded/pasted into the search field, the user can select various parameters such as which species' genome to target (there are currently over 50 species available) and the assembly version of that genome (for example, the human genome has four assemblies to select from), the query type (i.e. whether the sequence relates to DNA, protein etc.) and output settings (i.e. how to sort and visualise the output). The user can then run the search by either submitting the query or using the BLAT "I'm feeling lucky" search.
Bhagwat ''et al.'' provide step by step protocols for how to use BLAT to:
*Map an mRNA/cDNA sequence to a genomic sequence;
*Map a protein sequence to the genome;
*Perform homology searches.
Input
BLAT can handle long database sequences; however, it is more effective with short query sequences than with long ones. Kent recommends a maximum query length of 200,000 bases. The UCSC browser limits query sequences to less than 25,000 letters (i.e. nucleotides) for DNA searches and less than 10,000 letters (i.e. amino acids) for protein and translated sequence searches.
The BLAT Search Genome available on the UCSC website accepts query sequences as text (cut and pasted into the query box) or uploaded as text files. The BLAT Search Genome can accept multiple sequences of the same type at once, up to a maximum of 25. For multiple sequences, the total number of nucleotides must not exceed 50,000 for DNA searches or 25,000 letters for protein or translated sequence searches.
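The submission limits quoted in this section can be checked before pasting sequences into the web form. The following is a minimal sketch that assumes the sequences are supplied in FASTA format (an assumption, since the text only says "text") and uses the figures stated above; the live UCSC service may enforce different limits, and the names `LIMITS`, `parse_fasta` and `check_submission` are hypothetical.

```python
# Limits as quoted in the text; the live UCSC service may differ.
LIMITS = {
    "dna": {"per_sequence": 25_000, "total": 50_000},
    "protein": {"per_sequence": 10_000, "total": 25_000},
}
MAX_SEQUENCES = 25


def parse_fasta(text: str) -> list:
    """Return (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_submission(text: str, query_type: str = "dna") -> list:
    """Return a list of human-readable problems; an empty list means within limits."""
    limits = LIMITS[query_type]
    records = parse_fasta(text)
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences exceeds the maximum of {MAX_SEQUENCES}")
    for name, seq in records:
        if len(seq) >= limits["per_sequence"]:  # text says "less than"
            problems.append(f"{name}: {len(seq)} letters, must be under {limits['per_sequence']}")
    total = sum(len(seq) for _, seq in records)
    if total > limits["total"]:  # text says "must not exceed"
        problems.append(f"total of {total} letters exceeds {limits['total']}")
    return problems
```

Calling `check_submission(fasta_text, "protein")` applies the protein and translated-sequence limits instead of the DNA ones.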
An example of searching a target database with a DNA query sequence is shown in Figure 2.
Output
A BLAT search returns a list of results sorted in decreasing order of score. The following information is returned: the score of the alignment, the region of the query sequence that matches the database sequence, the size of the query sequence, the level of identity as a percentage of the alignment, and the chromosome and position that the query sequence maps to. Bhagwat ''et al.'' describe how the BLAT "Score" and "Identity" measures are calculated.
For each search result, the user is provided with a link to the UCSC Genome Browser so they can visualise the alignment on the chromosome. This is a major benefit of the web-based BLAT over the stand-alone BLAT. The user is able to obtain biological information associated with the alignment, such as information about the gene to which the query may match.
The user is also provided with a link to view the alignment of the query sequence with the genome assembly. The matches between the query and genome assembly are blue and the boundaries of the alignments are lighter in colour. These exon boundaries indicate splice sites.
The "I'm feeling lucky" search result returns the highest scoring alignment for the first query sequence based on the output sort option selected by the user.
Stand-alone
Stand-alone BLAT is more suitable for batch runs, and more efficient than the web-based BLAT. It is more efficient because it is able to store the genome in memory, unlike the web-based application which only stores the index in memory.
License
Both the source and precompiled binaries of BLAT are freely available for academic and personal use. Commercial licenses for stand-alone BLAT are distributed by Kent Informatics, Inc.
See also
*BLAST (Basic Local Alignment Search Tool)
*Sequence alignment software
External links
UCSC BLAT Search Genome
Kent Informatics, Inc.
BLAT source code
BLAT FAQ, by UCSC
Human BLAT Search