FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format, which is now ubiquitous in bioinformatics.

History

The original FASTP program was designed for protein sequence similarity searching. Because genetic information was expanding exponentially and computers in the 1980s had limited speed and memory, heuristic methods were introduced for aligning a query sequence against entire databases. FASTA, published in 1987, added the ability to do DNA:DNA searches and translated protein:DNA searches, and also provided a more sophisticated shuffling program for evaluating statistical significance. The package contains several programs that allow the alignment of protein sequences and DNA sequences. Nowadays, increased computer performance makes it possible to search a database for local alignments using the Smith–Waterman algorithm.

FASTA is pronounced "fast A" and stands for "FAST-All", because it works with any alphabet; it is an extension of the original "FAST-P" (protein) and "FAST-N" (nucleotide) alignment tools.

Uses

The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package include special translated search algorithms that correctly handle frameshift errors (which six-frame-translated searches do not handle very well) when comparing nucleotide to protein sequence data.

In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal Smith–Waterman algorithm.

A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance or whether it can be used to infer homology. The FASTA package is available from the University of Virginia and the European Bioinformatics Institute.

The FASTA file format used as input for this software is now widely used by other sequence database search tools (such as BLAST) and sequence alignment programs (Clustal, T-Coffee, etc.).
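As an illustration of the input format mentioned above, the following is a minimal sketch of a FASTA reader. It assumes the common form of the format, in which each record starts with a header line beginning with ">" and is followed by one or more lines of single-letter sequence codes. The function name `parse_fasta` and the example records are hypothetical and are not part of the FASTA package.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping.

    Assumption: headers start with '>' and sequence data may be wrapped
    over several lines. This is an illustrative reader only.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # drop the '>' marker
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Hypothetical example input with one protein and one DNA record.
example = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical DNA
ACGTACGT
"""
records = parse_fasta(example.splitlines())
```

Here `records` maps each header line to its full sequence, with wrapped sequence lines joined into a single string.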

Search method

FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence database, using local sequence alignment to find similar database sequences.

The FASTA program follows a largely heuristic method, which contributes to its high speed of execution. It first observes the pattern of word hits (word-to-word matches of a given length) and marks potential matches before performing a more time-consuming optimized search using a Smith–Waterman type of algorithm.

The word size, given by the k-mer parameter, controls the sensitivity and speed of the program. Increasing the k-mer value decreases the number of background hits that are found. From the word hits that are returned, the program looks for segments that contain a cluster of nearby hits. It then investigates these segments for a possible match.

There are some differences between fastn and fastp relating to the type of sequences used, but both use four steps and calculate three scores to describe and format the sequence similarity results. These are:

* Identify regions of highest density in each sequence comparison, taking a k-mer value of 1 or 2.
: In this step, all or a group of the identities between two sequences are found using a lookup table. The k-mer value determines how many consecutive identities are required for a match to be declared. Thus, the smaller the k-mer value, the more sensitive the search. Users frequently take k-mer = 2 for protein sequences and k-mer = 4 or 6 for nucleotide sequences. Short oligonucleotides are usually run with k-mer = 1. The program then finds all similar local regions between the two sequences, represented as diagonals of a certain length in a dot plot, by counting k-mer matches and penalizing intervening mismatches. In this way, local regions with the highest density of matches in a diagonal are isolated from background hits. For protein sequences, BLOSUM50 values are used for scoring k-mer matches. This ensures that groups of identities with high similarity scores contribute more to the local diagonal score than identities with low similarity scores. Nucleotide sequences use the identity matrix for the same purpose. The best 10 local regions selected from all the diagonals are then saved.

* Rescan the regions taken using the scoring matrices, trimming the ends of each region to include only those residues contributing to the highest score.
: The 10 regions are rescanned. This time the relevant scoring matrix is used while rescoring, to allow runs of identities shorter than the k-mer value. Conservative replacements that contribute to the similarity score are also taken into account. Although protein sequences use the BLOSUM50 matrix, scoring matrices based on the minimum number of base changes required for a specific replacement, on identities alone, or on an alternative measure of similarity such as PAM can also be used with the program. For each of the diagonal regions rescanned this way, a subregion with the maximum score is identified. The initial scores found in step 1 are used to rank the library sequences. The highest score is referred to as the init1 score.

* In an alignment, if several initial regions with scores greater than a CUTOFF value are found, check whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the scores of the joined regions, with a penalty of 20 points for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in step 2 is reported (init1).
: Here the program calculates an optimal alignment of initial regions as a combination of compatible regions with maximal score. This optimal alignment of initial regions can be rapidly calculated using a dynamic programming algorithm. The resulting score, initn, is used to rank the library sequences. This joining process increases sensitivity but decreases selectivity. A carefully calculated cutoff value is therefore used to control where this step is applied: a value approximately one standard deviation above the average score expected from unrelated sequences in the library. A 200-residue query sequence with k-mer = 2 uses a value of 28.

* Use a banded Smith–Waterman algorithm to calculate an optimal score for alignment.
: This step uses a banded Smith–Waterman algorithm to create an optimised score (opt) for each alignment of the query sequence to a database (library) sequence. It takes a band of 32 residues centered on the init1 region of step 2 for calculating the optimal alignment. After all sequences are searched, the program plots the initial scores of each database sequence in a histogram and calculates the statistical significance of the opt score. For protein sequences, the final alignment is produced using a full Smith–Waterman alignment. For DNA sequences, a banded alignment is provided.
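The first step above, finding k-mer matches with a lookup table and grouping them by diagonal, can be sketched as follows. This is a simplified illustration under stated assumptions and not the FASTA implementation: it counts exact k-mer matches only, whereas the program described above also penalizes intervening mismatches and scores protein matches with BLOSUM50. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def kmer_index(seq: str, k: int) -> Dict[str, List[int]]:
    """Lookup table: each k-mer of seq mapped to its start positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query: str, subject: str, k: int) -> Dict[int, int]:
    """Count exact k-mer matches per diagonal.

    A diagonal is identified by (subject position - query position),
    which is constant along an ungapped run of matches in a dot plot.
    """
    index = kmer_index(query, k)
    hits: Dict[int, int] = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            hits[j - i] += 1
    return dict(hits)


def best_diagonals(query: str, subject: str, k: int,
                   top: int = 10) -> List[Tuple[int, int]]:
    """Return up to `top` (diagonal, hit count) pairs, densest first."""
    hits = diagonal_hits(query, subject, k)
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical example: the query occurs in the subject at offset 2.
ranked = best_diagonals("ACGTACGT", "TTACGTACGTTT", k=4)
```

In this example the densest diagonal is offset 2, where the query lines up with its copy inside the subject; the repeat within the query also produces weaker hits on other diagonals, which is the kind of background that a larger k-mer value reduces.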

FASTA can exclude low-complexity regions before aligning the sequences: the low-complexity regions are encoded in lower case and the -S option is used. However, the BLAST program offers more options for correcting for biased composition statistics. The program PRSS is therefore included in the FASTA distribution package. PRSS shuffles the matching sequences in the database, either at the level of single letters or in short segments whose length the user can determine. The shuffled sequences are then aligned again; if the score is still higher than expected, this is because the low-complexity regions, although mixed up, still map to the query. From the scores that the shuffled sequences still attain, PRSS can then estimate the significance of the score of the original sequences. The higher the score of the shuffled sequences, the less significant the matches found between the original database sequence and the query sequence (David W. Mount: Bioinformatics Sequence and Genome Analysis, Edition 1, Cold Spring Harbor Laboratory Press, 2001, pp. 295–297).

The FASTA programs find regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.
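The shuffling idea behind PRSS described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PRSS implementation: it shuffles at the single-letter level only, scores with a plain Smith–Waterman recurrence using an assumed match/mismatch/linear-gap scheme rather than a substitution matrix, and summarizes significance with a simple z-score, which is a simplification chosen for this sketch. The function names and parameter values are hypothetical.

```python
import random
import statistics


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1,
             gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_z_score(query: str, subject: str, n_shuffles: int = 200,
                    seed: int = 0) -> float:
    """Compare the observed score with scores against shuffled subjects.

    Shuffling keeps the subject's composition but destroys its order,
    so high shuffled scores suggest the match is driven by composition.
    """
    rng = random.Random(seed)
    observed = sw_score(query, subject)
    letters = list(subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(query, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        return 0.0 if observed == mean else float("inf")
    return (observed - mean) / sd


# Hypothetical example: a sequence compared with an identical copy.
seq = "ACGTTGCAAGCTAGCTAGGATCCA"
z = shuffle_z_score(seq, seq)
```

A large positive value indicates that the original alignment scores well above what the shuffled sequences achieve; a value near zero indicates that shuffled sequences score about as well, in which case the match is less significant.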
