Biological background
The genetic instructions of every replicating cell in a living organism are contained within its DNA. Throughout the cell's lifetime, this information is transcribed and replicated by cellular mechanisms to produce proteins or to provide instructions for daughter cells during cell division, and the DNA may be altered during these processes. Such an alteration is known as a mutation.
Terminology
BLOSUM: Blocks Substitution Matrix.
Construction of BLOSUM matrices
BLOSUM matrices are obtained by using blocks of similar amino acid sequences as data, then applying statistical methods to the data to obtain the similarity scores. The statistical method proceeds in the following steps.
Eliminating Sequences
Eliminate the sequences that are more than r% identical. This can be done in two ways: either by removing sequences from the block, or by finding similar sequences and replacing them with a new sequence that represents the cluster. Elimination removes protein sequences that are more similar than the specified threshold.
Calculating Frequency & Probability
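Before any pairs are counted, the elimination step described under "Eliminating Sequences" groups near-identical sequences. The following is a minimal sketch of that grouping, not the original program's code: the function names are hypothetical, the block is assumed to be ungapped and of equal length, single-linkage clustering is assumed, and the boundary case is treated as "at least r%".

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences in a block must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block: list[str], r: float) -> list[list[str]]:
    """Group sequences so that any two sequences sharing at least r% identity
    end up in the same cluster (single linkage; an assumption of this sketch)."""
    clusters: list[list[str]] = []
    for seq in block:
        linked, rest = [], []
        for cluster in clusters:
            similar = any(percent_identity(seq, member) >= r for member in cluster)
            (linked if similar else rest).append(cluster)
        # Merge the new sequence with every cluster it is linked to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = rest + [merged]
    return clusters


# Hypothetical three-sequence block: the first two differ at one position.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "WYWYWYWYWY"]
clusters = cluster_block(block, 80.0)
print(len(clusters), "clusters")
```

Each resulting cluster can then be reduced to a single representative, or its members down-weighted, which corresponds to the two elimination strategies mentioned in the text.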
The BLOCKS database stores the sequence alignments of the most conserved regions of protein families. These alignments are used to derive the BLOSUM matrices. Only the sequences with a percentage of identity lower than the threshold are used. Within each block, the pairs of amino acids in each column of the multiple alignment are counted.
Log odds ratio
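The log-odds calculation takes its input from the pair counts of the preceding step. As an illustration, the counting of amino-acid pairs in each column of a block can be sketched as below. This is a simplified sketch: the function names are hypothetical, pairs are treated as unordered, and the weighting of clustered sequences is omitted.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block: list[str]) -> Counter:
    """Count unordered amino-acid pairs over every column of an aligned block."""
    counts: Counter = Counter()
    for column in zip(*block):  # one tuple of residues per alignment column
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def pair_frequencies(counts: Counter) -> dict:
    """Normalise the pair counts so that the frequencies sum to 1."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical block of three sequences and two columns.
counts = count_pairs(["AV", "AV", "SV"])
freqs = pair_frequencies(counts)
print(counts)
```

In this toy block the first column (A, A, S) contributes one A-A pair and two A-S pairs, and the second column (V, V, V) contributes three V-V pairs.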
The log-odds ratio compares how often each amino acid pair occurs in the observed data with how often the pair would be expected to occur. It is rounded off and used in the substitution matrix.
$$\text{LogOddsRatio} = 2\log_2\left(\frac{P(O)}{P(E)}\right)$$
where $P(O)$ is the probability of observing the pair and $P(E)$ is the expected probability of such a pair occurring, given the background probabilities of each amino acid.
BLOSUM Matrices
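The log-odds step and the rounding that turns it into a matrix entry can be sketched as a single function. The sketch assumes half-bit units (a factor of 2 with a base-2 logarithm); the function name and the example probabilities are hypothetical.

```python
import math


def log_odds_score(p_observed: float, p_expected: float) -> int:
    """Rounded half-bit log-odds score for one amino-acid pair."""
    if p_observed <= 0 or p_expected <= 0:
        raise ValueError("probabilities must be positive")
    return round(2 * math.log2(p_observed / p_expected))


# Hypothetical probabilities: a pair seen four times as often as expected,
# one seen as often as expected, and one seen a quarter as often.
print(log_odds_score(0.02, 0.005))
print(log_odds_score(0.01, 0.01))
print(log_odds_score(0.001, 0.004))
```

Pairs observed more often than expected receive positive scores, pairs observed as often as expected receive zero, and pairs observed less often receive negative scores.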
The odds for relatedness are calculated from the log-odds ratios, which are then rounded off to obtain the BLOSUM substitution matrices.
Score of the BLOSUM matrices
A scoring matrix, or table of values, is required for evaluating the significance of a sequence alignment; it describes, for example, the probability of a biologically meaningful amino-acid or nucleotide residue pair occurring in an alignment. Typically, when two nucleotide sequences are being compared, all that is scored is whether or not the two bases at a given position are the same. All matches are given the same score, as are all mismatches (typically +1 or +5 for matches, and -1 or -4 for mismatches). For proteins the situation is different. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another. The objective is to provide a relatively heavy penalty for aligning two residues if they have a low probability of being homologous (correctly aligned by evolutionary descent). Two major forces drive amino-acid substitution rates away from uniformity: substitutions occur with different frequencies, and some substitutions are less functionally tolerated than others. Thus, substitutions are selected against.
Commonly used substitution matrices include the blocks substitution (BLOSUM) and point accepted mutation (PAM) matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods.
Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm of the ratio between the likelihood of two amino acids appearing together for a biological reason and the likelihood of the same amino acids appearing together by chance. The matrices are based on the minimum percentage identity of the aligned protein sequences used in calculating them. Every possible identity or substitution is assigned a score based on its observed frequency in alignments of related proteins.
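As a point of contrast with the amino-acid matrices, the flat nucleotide scheme mentioned above, in which every match and every mismatch receives the same score, can be sketched in a few lines. The function name and example sequences are hypothetical, and the alignment is assumed to be ungapped.

```python
def score_nucleotide_alignment(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum a flat match/mismatch score over an ungapped nucleotide alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Two hypothetical sequences with five matching and two mismatching positions.
print(score_nucleotide_alignment("GATTACA", "GACTATA"))
print(score_nucleotide_alignment("GATTACA", "GACTATA", match=5, mismatch=-4))
```

A substitution matrix for proteins replaces the two constants with a lookup that depends on which pair of residues is aligned.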
A positive score is given to the more likely substitutions while a negative score is given to the less likely substitutions. To calculate a BLOSUM matrix, the following equation is used:
$$S_{ij} = \frac{1}{\lambda}\log\left(\frac{p_{ij}}{q_i q_j}\right)$$
Here, $p_{ij}$ is the probability of two amino acids $i$ and $j$ replacing each other in a homologous sequence, and $q_i$ and $q_j$ are the background probabilities of finding the amino acids $i$ and $j$ in any protein sequence. The factor $\lambda$ is a scaling factor, set such that the matrix contains easily computable integer values.
Variants
BLOSUM
BLOSUM80: more related proteins
BLOSUM62: midrange
BLOSUM45: distantly related proteins
Figure: the BLOSUM62 matrix, with the amino acids in the table grouped according to the chemistry of the side chain.
Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, by the probability that the same two amino acids might align by chance. The ratio is then converted to a logarithm and expressed as a log-odds score, as for PAM. BLOSUM matrices are usually scaled in half-bit units. A score of zero indicates that the two amino acids were found aligned in the database as often as expected by chance, a positive score indicates that they were found aligned more often than by chance, and a negative score indicates that they were found aligned less often than by chance.
PMB
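The interpretation of BLOSUM scores given in the preceding subsection (zero for chance level, positive for more often than chance, negative for less often) follows from inverting the half-bit scaling. The sketch below assumes half-bit units; the function name is hypothetical, and because matrix entries are rounded, the recovered ratio is only approximate for real matrix values.

```python
def odds_ratio_from_half_bits(score: int) -> float:
    """Approximate observed/expected ratio implied by a half-bit log-odds score."""
    return 2.0 ** (score / 2)


for s in (-2, 0, 4):
    print(s, odds_ratio_from_half_bits(s))
```

Under this scaling, every two score units correspond to a factor of two in how much more (or less) often a pair is aligned than chance would predict.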
PMB (Probability Matrix from Blocks) of 2004 uses the additivity of evolutionary distances to improve on BLOSUM's analysis of the BLOCKS database. The up-to-date 2001 version of BLOCKS was used to generate a new set of BLOSUM matrices. The "observed substitution frequencies" found in these BLOSUM matrices are used to estimate actual substitution frequencies (at higher evolutionary distance, i.e. lower r, later replacements can mask earlier ones). PMB thus defines a true evolutionary model, as PAM and JTT do. It is not a symmetric matrix.
RBLOSUM
The original code written by Henikoff and Henikoff does not exactly follow their paper's description of the algorithm. The BLOSUM62 produced by that program has been used for many years as the standard. Surprisingly, the miscalculated BLOSUM62 improves search performance compared to the 2008 corrected version of the same relative entropy (RBLOSUM64). A 2018 article claims that RBLOSUM is better than BLOSUM and CorBLOSUM.
CorBLOSUM
A 2016 paper finds further errors in the original code not addressed by the 2008 RBLOSUM correction. The corrected version from this paper, CorBLOSUM, manages to be more effective than BLOSUM at similarity search in about 75% of cases.
Some uses in bioinformatics
Research applications
BLOSUM scores have been used to predict and understand surface gene variants among hepatitis B virus carriers and T-cell epitopes.
Surface gene variants among hepatitis B virus carriers
DNA sequences of HBsAg were obtained from 180 patients, of whom 51 were chronic HBV carriers and 129 were newly diagnosed, and compared with consensus sequences built from 168 HBV sequences imported from GenBank. A literature review and BLOSUM scores were used to define potentially altered antigenicity.
Reliable prediction of T-cell epitopes
A novel input representation has been developed, consisting of a combination of sparse encoding, BLOSUM encoding, and input derived from hidden Markov models. The method predicts T-cell epitopes for the genome of hepatitis C virus, with possible applications in guiding the process of rational vaccine design.
Use in BLAST
BLOSUM matrices are also used as scoring matrices when comparing protein sequences (or DNA sequences translated into protein) to judge the quality of the alignment. This form of scoring system is used by a wide range of alignment software, including BLAST.
Comparing PAM and BLOSUM
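To illustrate how an alignment tool applies a BLOSUM matrix as a scoring matrix, the sketch below sums pair scores over an ungapped protein alignment. It is not BLAST's implementation: only a five-residue excerpt of BLOSUM62 (I, L, V, K, R) is included to keep the example self-contained, the function names are hypothetical, and gap penalties are omitted.

```python
# Excerpt of BLOSUM62 for the residues I, L, V, K and R only. A real
# application would load the full 20x20 matrix from a toolkit or library.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1, ("K", "R"): 2,
    ("I", "K"): -3, ("I", "R"): -3, ("L", "K"): -2, ("L", "R"): -2,
    ("V", "K"): -2, ("V", "R"): -3,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score for one aligned residue pair."""
    key = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
    return BLOSUM62_EXCERPT[key]


def score_ungapped(seq_a: str, seq_b: str) -> int:
    """Total substitution score of two aligned, equal-length protein sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


print(score_ungapped("LIVKR", "IIVRK"))  # conservative substitutions only
print(score_ungapped("KLV", "LKV"))      # hydrophobic residues swapped with lysine
```

Substitutions between chemically similar residues (such as I and L, or K and R) still contribute positively, whereas swapping a hydrophobic residue for a charged one lowers the total.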
In addition to BLOSUM matrices, an earlier scoring matrix, known as PAM, can be used. The two serve the same scoring purpose but are derived with differing methodologies. BLOSUM looks directly at mutations in motifs of related sequences, while PAM extrapolates evolutionary information from closely related sequences. Since PAM and BLOSUM are different methods for expressing the same kind of scoring information, the two can be compared, but because the scores are obtained so differently, a PAM100 does not equal a BLOSUM100.
The relationship between PAM and BLOSUM
The differences between PAM and BLOSUM
Availability
The "reference" version of BLOSUM is found in the NCBI toolkits. Both the older (deprecated) NCBI C Toolkit and the current NCBI C++ Toolkit provide the BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and BLOSUM90 matrices. Both also offer APIs for making use of the matrices. The original source code for calculating BLOSUM is also found on the NCBI website, at https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/. The archive, "blosum.tar.Z", contains the original 1992 version, whose miscalculation improves search performance. According to the README in blosum.tar.Z, the archive also contains pre-calculated BLOSUM outputs at the following similarity levels: "-2" (blosumn), 30, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 95, and 100.
Software Packages
Several software packages in different programming languages allow easy use of BLOSUM matrices, in addition to the aforementioned NCBI Toolkits.