In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences). A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and diversity of the sequences. Sequence logos are frequently used to depict sequence characteristics such as protein-binding sites in DNA or functional units in proteins.

Overview

A sequence logo consists of a stack of letters at each position. The relative sizes of the letters indicate their frequency in the sequences. The total height of the letters depicts the information content of the position, in bits.

Logo creation

To create a sequence logo, related DNA, RNA or protein sequences, or DNA sequences that share conserved binding sites, are aligned so that the most conserved parts line up. A sequence logo can then be created from the conserved multiple sequence alignment. The logo shows how well residues are conserved at each position: the more strongly a position is dominated by a single residue, the taller its letters, because conservation is higher at that position. Different residues at the same position are scaled according to their frequency. The height of the entire stack of residues is the information content of that position, measured in bits. Sequence logos can be used to represent conserved DNA binding sites, where transcription factors bind.

The information content (y-axis) of position i is given by:

for amino acids,

R_i = \log_2(20) - (H_i + e_n)

for nucleic acids,

R_i = \log_2(4) - (H_i + e_n)

where H_i is the uncertainty (sometimes called the Shannon entropy) of position i, summed over all bases or amino acids b:

H_i = - \sum_{b} f_{b,i} \times \log_2 f_{b,i}

Here, f_{b,i} is the relative frequency of base or amino acid b at position i, and e_n is the small-sample correction for an alignment of n sequences. The height of letter b in column i is given by:

\text{height} = f_{b,i} \times R_i
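As a worked illustration of the formulas above, consider a single DNA column. The frequencies are invented for the example, and the small-sample correction e_n is neglected, as would be reasonable only for a large alignment. A base with frequency 0 contributes nothing to the sum, by the usual convention.

```latex
f_{A,i} = 0.5, \quad f_{C,i} = 0.25, \quad f_{G,i} = 0.25, \quad f_{T,i} = 0

H_i = -(0.5 \log_2 0.5 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25) = 1.5 \text{ bits}

R_i = \log_2(4) - H_i = 2 - 1.5 = 0.5 \text{ bits}

\text{height}_A = 0.5 \times 0.5 = 0.25, \quad \text{height}_C = \text{height}_G = 0.25 \times 0.5 = 0.125
```

The stack for this column is therefore 0.5 bits tall in total, with A taking up half of it.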

The approximation for the small-sample correction, e_n, is given by:

e_n = \frac{1}{\ln 2} \times \frac{s-1}{2n}

where s is 4 for nucleotides and 20 for amino acids, and n is the number of sequences in the alignment.
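The calculation described in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any particular logo tool: the function names `small_sample_correction` and `logo_heights` are hypothetical, and the sketch assumes an ungapped alignment of equal-length sequences whose characters all belong to the given alphabet.

```python
import math
from collections import Counter


def small_sample_correction(s, n):
    """Approximate e_n = (1 / ln 2) * (s - 1) / (2 n)."""
    return (s - 1) / (2 * n * math.log(2))


def logo_heights(sequences, alphabet="ACGT"):
    """Return one dict per column mapping each letter to its height in bits.

    Assumes (for illustration) equal-length sequences with no gaps and
    no characters outside `alphabet`.
    """
    n = len(sequences)   # number of sequences in the alignment
    s = len(alphabet)    # 4 for nucleotides, 20 for amino acids
    e_n = small_sample_correction(s, n)
    columns = []
    for column in zip(*sequences):
        counts = Counter(column)
        freqs = {b: counts[b] / n for b in alphabet}
        # Uncertainty H_i; letters with frequency 0 contribute nothing.
        h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        # Information content R_i = log2(s) - (H_i + e_n).
        r = math.log2(s) - (h + e_n)
        # Height of each letter is f_{b,i} * R_i.
        columns.append({b: f * r for b, f in freqs.items()})
    return columns


heights = logo_heights(["AC", "AC", "AG", "AT"])
```

Because e_n grows as n shrinks, R_i can come out slightly negative for very small alignments with variable columns; the sketch leaves such values as they are, and a caller may wish to clamp them at zero before drawing.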

Consensus logo

A consensus logo is a simplified variation of a sequence logo that can be embedded in text format. Like a sequence logo, a consensus logo is created from a collection of aligned protein or DNA/RNA sequences and conveys information about the conservation of each position of a sequence motif or sequence alignment. However, a consensus logo displays only conservation information, and not explicitly the frequency information of each nucleotide or amino acid at each position. Instead of a stack made of several characters, denoting the relative frequency of each character, the consensus logo depicts the degree of conservation of each position using the height of the consensus character at that position.
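One way to picture the data behind a consensus logo is as a single (letter, height) pair per column. The sketch below is an assumption-laden illustration: it takes the most frequent character as the consensus character and uses the information content R_i from the logo-creation formulas as the "degree of conservation", which the text above does not specify. The function name `consensus_logo` is hypothetical, and gaps are assumed to be absent.

```python
import math
from collections import Counter


def consensus_logo(sequences, alphabet="ACGT"):
    """Return a list of (consensus_letter, height_in_bits), one per column.

    Assumption: height is the information content R_i of the column;
    ties for the most frequent letter are broken arbitrarily.
    """
    n = len(sequences)
    s = len(alphabet)
    e_n = (s - 1) / (2 * n * math.log(2))  # small-sample correction
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        # Uncertainty of the column from observed relative frequencies.
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        r = math.log2(s) - (h + e_n)
        letter, _ = counts.most_common(1)[0]
        result.append((letter, r))
    return result


logo = consensus_logo(["AC", "AC", "AG", "AT"])
```

Each pair could then be rendered by setting the font size of the letter in proportion to its height, which is what makes the representation embeddable in formatted text.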

Advantages and drawbacks

The main, and obvious, advantage of consensus logos over sequence logos is their ability to be embedded as text in any editor or viewer that supports Rich Text Format and, therefore, in scientific manuscripts. As described above, the consensus logo is a cross between sequence logos and consensus sequences. As a result, compared to a sequence logo, the consensus logo omits information (the relative contribution of each character to the conservation of that position in the motif/alignment). Hence, a sequence logo should be used preferentially whenever possible. That being said, the need to include graphic figures in order to display sequence logos has perpetuated the use of consensus sequences in scientific manuscripts, even though they fail to convey information on both conservation and frequency. Consensus logos therefore represent an improvement over consensus sequences whenever motif/alignment information has to be constrained to text.

Extensions

Hidden Markov models (HMMs) consider not only the information content of aligned positions in an alignment, but also that of insertions and deletions. In an HMM sequence logo used by Pfam, three rows are added to indicate the frequencies of occupancy (presence) and insertion, as well as the expected insertion length (see, for example, Pfam entry PF03377).

External links

* How to read sequence logos
* Erill, I., "A gentle introduction to information content in transcription factor binding sites" (eprint)
* What is (in) a sequence logo?