bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a

substitution matrix In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a Nucleic acid sequence, nucleotide sequence or a Protein primary structure, protein sequence changes to other character states ove ...

used for

sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...

s. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM matrices were first introduced in a paper by Steven Henikoff and Jorja Henikoff. They scanned the BLOCKS database for very conserved regions of protein families (that do not have gaps in the sequence alignment) and then counted the relative frequencies of

amino acids Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the Proteinogenic amino acid, 22 α-amino acids incorporated into p ...

and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM Matrices.

Biological background

The genetic instructions of every replicating cell in a living organism are contained within its DNA. Throughout the cell's lifetime, this information is transcribed and replicated by cellular mechanisms to produce proteins or to provide instructions for daughter cells during cell division, and the possibility exists that the DNA may be altered during these processes. This is known as a

mutation In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA or viral replication, ...

. At the molecular level, there are regulatory systems that correct most — but not all — of these changes to the DNA before it is replicated. The functionality of a protein is highly dependent on its structure. Changing a single amino acid in a protein may reduce its ability to carry out this function, or the mutation may even change the function that the protein carries out. Changes like these may severely impact a crucial function in a cell, potentially causing the cell — and in extreme cases, the organism — to die. Conversely, the change may allow the cell to continue functioning albeit differently, and the mutation can be passed on to the organism's offspring. If this change does not result in any significant physical disadvantage to the offspring, the possibility exists that this mutation will persist within the population. The possibility also exists that the change in function becomes advantageous. The 20 amino acids translated by the

genetic code Genetic code is a set of rules used by living cell (biology), cells to Translation (biology), translate information encoded within genetic material (DNA or RNA sequences of nucleotide triplets or codons) into proteins. Translation is accomplished ...

vary greatly by the physical and chemical properties of their side chains. However, these amino acids can be categorised into groups with similar physicochemical properties. Substituting an amino acid with another from the same category is more likely to have a smaller impact on the structure and function of a protein than replacement with an amino acid from a different category. Sequence alignment is a fundamental research method for modern biology. The most common sequence alignment for protein is to look for similarity between different sequences in order to infer function or establish evolutionary relationships. This helps researchers better understand the origin and function of genes through the nature of homology and conservation. Substitution matrices are utilized in algorithms to calculate the similarity of different sequences of proteins; however, the utility of Dayhoff PAM Matrix has decreased over time due to the requirement of sequences with a similarity more than 85%. In order to fill in this gap, Henikoff and Henikoff introduced BLOSUM (BLOcks SUbstitution Matrix) matrix which led to marked improvements in alignments and in searches using queries from each of the groups of related proteins.

Terminology

;BLOSUM: Blocks Substitution Matrix, a

used for

s. ;Scoring metrics (statistical versus biological): When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix, or a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Scores for each position are obtained frequencies of substitutions in blocks of local alignments of protein sequences. ;BLOSUM r: :The matrix built from blocks with less than r% of similarity :* E.g., BLOSUM62 is the matrix built using sequences with less than 62% similarity (sequences with ≥ 62% identity were clustered together). :* Note: BLOSUM 62 is the default matrix for protein BLAST. Experimentation has shown that the BLOSUM-62 matrix is among the best for detecting most weak protein similarities. Several sets of BLOSUM matrices exist using different alignment databases, named with numbers. BLOSUM matrices with high numbers are designed for comparing closely related sequences, while those with low numbers are designed for comparing distant related sequences. For example, BLOSUM80 is used for closely related alignments, and BLOSUM45 is used for more distantly related alignments. The matrices were created by merging (clustering) all sequences that were more similar than a given percentage into one single sequence and then comparing those sequences (that were all more divergent than the given percentage value) only; thus reducing the contribution of closely related sequences. The percentage used was appended to the name, giving BLOSUM80 for example where sequences that were more than 80% identical were clustered.

Construction of BLOSUM matrices

BLOSUM matrices are obtained by using blocks of similar amino acid sequences as data, then applying statistical methods to the data to obtain the similarity scores. Statistical Methods Steps:

Eliminating Sequences

Eliminate the sequences that are more than r% identical. There are two ways to eliminate the sequences. It can be done either by removing sequences from the block or just by finding similar sequences and replace them by new sequences which could represent the cluster. Elimination is done to remove protein sequences that are more similar than the specified threshold.

Calculating Frequency & Probability

A database storing the sequence alignments of the most conserved regions of protein families. These alignments are used to derive the BLOSUM matrices. Only the sequences with a percentage of identity lower than the threshold are used. By using the block, counting the pairs of amino acids in each column of the multiple alignment.

Log odds ratio

It gives the ratio of the occurrence each amino acid combination in the observed data to the expected value of occurrence of the pair. It is rounded off and used in the substitution matrix.

LogOddRatio=2 \log_2

where

P \left(O \right)

is the probability of observing the pair and

P \left(E \right)

is the expected probability of such a pair occurring, given the background probabilities of each amino acid.

BLOSUM Matrices

The odds for relatedness are calculated from log odd ratio, which are then rounded off to get the substitution matrices BLOSUM matrices.

Score of the BLOSUM matrices

A scoring matrix or a table of values is required for evaluating the significance of a sequence alignment, such as describing the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Typically, when two nucleotide sequences are being compared, all that is being scored is whether or not two bases are the same at one position. All matches and mismatches are respectively given the same score (typically +1 or +5 for matches, and -1 or -4 for mismatches). But it is different for proteins. Substitution matrices for amino acids are more complicated and implicitly take into account everything that might affect the frequency with which any amino acid is substituted for another. The objective is to provide a relatively heavy penalty for aligning two residues together if they have a low probability of being homologous (correctly aligned by evolutionary descent). Two major forces drive the amino-acid substitution rates away from uniformity: substitutions occur with the different frequencies, and lessen functionally tolerated than others. Thus, substitutions are selected against. Commonly used substitution matrices include the blocks substitution (BLOSUM) and point accepted mutation (PAM) matrices. Both are based on taking sets of high-confidence alignments of many homologous proteins and assessing the frequencies of all substitutions, but they are computed using different methods. Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm for the ratio of the likelihood of two amino acids appearing with a biological sense and the likelihood of the same amino acids appearing by chance. The matrices are based on the minimum percentage identity of the aligned protein sequence used in calculating them.page 673 Every possible identity or substitution is assigned a score based on its observed frequencies in the alignment of related proteins. A positive score is given to the more likely substitutions while a negative score is given to the less likely substitutions. To calculate a BLOSUM matrix, the following equation is used: :

S_= \frac \log

Here,

p_

is the probability of two amino acids

i

and

j

replacing each other in a homologous sequence, and

q_i

and

q_j

are the background probabilities of finding the amino acids

i

and

j

in any protein sequence. The factor

\lambda

is a scaling factor, set such that the matrix contains easily computable integer values.

Variants

BLOSUM

BLOSUM80: more related proteins BLOSUM62: midrange BLOSUM45: distantly related proteins The BLOSUM62 matrix with the amino acids in the table grouped according to the chemistry of the side chain, as in (a). Each value in the matrix is calculated by dividing the frequency of occurrence of the amino acid pair in the BLOCKS database, clustered at the 62% level, divided by the probability that the same two amino acids might align by chance. The ratio is then converted to a logarithm and expressed as a log odds score, as for PAM. BLOSUM matrices are usually scaled in half-bit units. A score of zero indicates that the frequency with which a given two amino acids were found aligned in the database was as expected by chance, while a positive score indicates that the alignment was found more often than by chance, and negative score indicates that the alignment was found less often than by chance.

PMB

PMB (Probability Matrix from Blocks) of 2004 uses the additivity of evolutionary distances to improve on BLOSUM's analysis of the BLOCKS database. The up-to-date 2001 version of BLOCKS was used to generate a new set of BLOSUM matrices. The "observed substitution frequencies" found in these BLOSUM matrices are used to estimate actual substitution frequencies (with higher evolutionary distance, i.e. lower ''r'', some later replacement can mask earlier replacements). PMB thus defines a true evolutionary model like PAM and JTT do. It is not a symmetric matrix.

RBLOSUM

The original code written by Henikoff and Henikoff does not exactly act according to their paper's description of the algorithm. The BLOSUM62 from that program has been used for many years as standard. Surprisingly, the miscalculated BLOSUM62 improves search performance compared to the 2008 corrected version of the same relative entropy (RBLOSUM64). A 2018 article claims that RBLOSUM is better than BLOSUM and CorBLOSUM.

CorBLOSUM

A 2016 paper finds further errors in the original code not addressed by the 2008 RBLOSUM correction. The corrected version from this paper, CorBLOSUM, manages to be more effective than BLOSUM at similarity search in about 75% of cases.

Some uses in bioinformatics

Research applications

BLOSUM scores was used to predict and understand the surface gene variants among hepatitis B virus carriers and T-cell epitopes.

Surface gene variants among hepatitis B virus carriers

DNA sequences of HBsAg were obtained from 180 patients, in which 51 were chronic HBV carrier and 129 newly diagnosed patients, and compared with consensus sequences built with 168 HBV sequences imported from GenBank. Literature review and BLOSUM scores were used to define potentially altered antigenicity.

Reliable prediction of T-cell epitopes

A novel input representation has been developed consisting of a combination of sparse encoding, Blosum encoding, and input derived from hidden Markov models. this method predicts T-cell epitopes for the genome of hepatitis C virus and discuss possible applications of the prediction method to guide the process of rational vaccine design.

Use in BLAST

BLOSUM matrices are also used as a scoring matrix when comparing DNA sequences or protein sequences to judge the quality of the alignment. This form of scoring system is utilized by a wide range of alignment software including BLAST.

Comparing PAM and BLOSUM

In addition to BLOSUM matrices, a previously developed scoring matrix can be used. This is known as a PAM. The two result in the same scoring outcome, but use differing methodologies. BLOSUM looks directly at mutations in motifs of related sequences while PAM's extrapolate evolutionary information based on closely related sequences. Since both PAM and BLOSUM are different methods for showing the same scoring information, the two can be compared but due to the very different method of obtaining this score, a PAM100 does not equal a BLOSUM100.

=The relationship between PAM and BLOSUM

=The differences between PAM and BLOSUM

Availability

The "reference" version of BLOSUM is found in the NCBI toolkits. Both the older (deprecated) NCBI C Toolkit and the current NCBI C++ Toolkit provide the BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and BLOSUM90 matrices. Both also offer APIs for making use of the matrices. The original source code for calculating BLOSUM is also found on the NCBI website, at https://ftp.ncbi.nih.gov/repository/blocks/unix/blosum/. This archive "blosum.tar.Z" represents the original miscalculated version with improved search performance from 1992. The archive also contains pre-calculated BLOSUM outputs at the following similarity levels: "-2" (blosumn), 30, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 95, and 100.README from blosum.tar.Z

Software Packages

There are several software packages in different programming languages that allow easy use of Blosum matrices. Besides the aforementioned NCBI Toolkits, there are:
blosum
module for Python * BioJava library for

Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...

... and many more.

References

External links

*
BLOCKS WWW server

Data files of matrices including BLOSUM30–100 on the NCBI FTP server

Interactive BLOSUM Network Visualization
{{Webarchive, url=https://web.archive.org/web/20170130052934/http://ahmetrasit.com/blosum/ , date=30 January 2017 Biochemistry methods Computational phylogenetics Matrices (mathematics) Bioinformatics