The nucleic acid notation currently in use was first formalized by the International Union of Pure and Applied Chemistry (IUPAC) in 1970. This universally accepted notation uses the Roman characters G, C, A, and T to represent the four nucleotides commonly found in deoxyribonucleic acid (DNA). Given the rapidly expanding role of genetic sequencing, synthesis, and analysis in biology, some researchers have developed alternative notations to further support the analysis and manipulation of genetic data. These notations generally exploit size, shape, and symmetry to accomplish these objectives.


IUPAC notation

Degenerate base symbols in biochemistry are an IUPAC representation for a position in a DNA sequence that can have multiple possible alternatives. These should not be confused with non-canonical bases, because each particular sequence will in fact have one of the regular bases at that position. Degenerate symbols are used to encode the consensus sequence of a population of aligned sequences, for example in phylogenetic analysis to summarise multiple sequences into one, or in BLAST searches, even though IUPAC degenerate symbols are masked (as they are not coded).

Under the commonly used IUPAC system, nucleobases are represented by the first letters of their chemical names: guanine, cytosine, adenine, and thymine. This shorthand also includes eleven "ambiguity" characters associated with every possible combination of the four DNA bases. The ambiguity characters were designed to encode positional variations in order to report DNA sequencing errors, consensus sequences, or single-nucleotide polymorphisms. The IUPAC notation, including ambiguity characters and suggested mnemonics, is shown in Table 1.

Despite its broad and nearly universal acceptance, the IUPAC system has a number of limitations, which stem from its reliance on the Roman alphabet. The poor legibility of upper-case Roman characters, which are generally used when displaying genetic data, may be chief among these limitations. The value of external projections in distinguishing letters has been well documented (Tinker, M. A. 1963. Legibility of Print. Iowa State University Press, Ames, IA). However, these projections are absent from upper-case letters, which in some cases are distinguishable only by subtle internal cues. Take, for example, the upper-case C and G used to represent cytosine and guanine. These characters generally make up half the characters in a genetic sequence but are differentiated only by a small internal tick (depending on the typeface). Nevertheless, these Roman characters are available in the ASCII character set most commonly used in textual communications, which reinforces this system's ubiquity.

Another shortcoming of the IUPAC notation arises from the fact that its eleven ambiguity characters were selected from the remaining characters of the Roman alphabet. The authors of the notation endeavored to select ambiguity characters with logical mnemonics. For example, S is used to represent the possibility of finding cytosine or guanine at a genetic locus, both of which form "strong" cross-strand binding interactions. Conversely, the "weaker" interactions of thymine and adenine are represented by a W. However, convenient mnemonics are not as readily available for the other ambiguity characters displayed in Table 1. This has made ambiguity characters difficult to use and may account for their limited application.
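As an illustration of how the four base letters and the eleven ambiguity characters can be handled in software, the following is a minimal Python sketch. The code table follows the standard IUPAC assignments; the names `IUPAC_CODES`, `matches`, and `consensus_symbol` are hypothetical and chosen only for this example.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one can stand for.
# The base strings are kept in alphabetical order so they can be compared directly.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "S": "CG",    # Strong pairing
    "W": "AT",    # Weak pairing
    "K": "GT",    # Keto
    "M": "AC",    # aMino
    "B": "CGT",   # not A
    "D": "AGT",   # not C
    "H": "ACT",   # not G
    "V": "ACG",   # not T
    "N": "ACGT",  # aNy base
}


def matches(pattern, sequence):
    """Return True if a concrete sequence is compatible with an IUPAC pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_CODES[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


def consensus_symbol(bases):
    """Return the single IUPAC character covering all bases seen at one position."""
    observed = "".join(sorted(set(bases.upper())))
    for code, members in IUPAC_CODES.items():
        if members == observed:
            return code
    raise ValueError(f"not a set of DNA bases: {bases!r}")
```

For instance, `consensus_symbol("CG")` returns `"S"`, and the pattern `GAWTC` is compatible with both `GAATC` and `GATTC` but not with `GACTC`.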


Nucleic acid nomenclature

The positions of the carbons in the ribose sugar that forms the backbone of the nucleic acid chain are numbered, and these numbers are used to indicate the direction of nucleic acids (5'→3' versus 3'→5'). This is referred to as directionality.
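Directionality matters whenever the complementary strand is written out: because the two strands run in opposite directions, the complement must also be reversed to be read 5'→3'. A minimal Python sketch, with the hypothetical helper name `reverse_complement` and plain G, C, A, T input assumed:

```python
# Translation table pairing each base with its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq_5_to_3):
    """Return the complementary strand, also written 5'->3'."""
    # Complement every base, then reverse so the result reads 5'->3'.
    return seq_5_to_3.upper().translate(COMPLEMENT)[::-1]
```

Under these assumptions, `reverse_complement("ATGC")` gives `"GCAT"`.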


Alternative visually enhanced notations

Legibility issues associated with IUPAC-encoded genetic data have led biologists to consider alternative strategies for displaying genetic data. These approaches to visualizing DNA sequences have generally relied on spatially distributed symbols and/or visually distinct shapes to encode lengthy nucleic acid sequences. Although several alternative notations for nucleotide sequences have been proposed, general uptake has been low. Several of these approaches are summarized below.


Stave projection

In 1986, Cowin et al. described a novel method for visualizing DNA sequences known as the Stave Projection. Their strategy was to encode nucleotides as circles on a series of horizontal bars, akin to notes on a musical stave. As illustrated in Figure 1, each gap on the five-line staff corresponds to one of the four DNA bases. The spatial distribution of the circles makes it far easier to distinguish individual bases and compare genetic sequences than with IUPAC-encoded data. The order of the bases (from top to bottom: G, A, T, C) is chosen so that the complementary strand can be read by turning the projection upside down.
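The upside-down property follows from the row order: a 180-degree turn swaps the G and C rows, swaps the A and T rows, and reverses the reading direction. The Python sketch below is a simplified text rendering, not the authors' implementation. It assumes one text row per gap, `o` for a circle, and `-` for an empty position; the function names are hypothetical.

```python
ROW_ORDER = "GATC"  # bases assigned to the four gaps, top to bottom


def stave(seq):
    """Render a sequence as four text rows, one per base, top to bottom."""
    return [
        "".join("o" if base == row_base else "-" for base in seq.upper())
        for row_base in ROW_ORDER
    ]


def rotate_180(rows):
    """Turn the projection upside down: flip row order and reading direction."""
    return [row[::-1] for row in reversed(rows)]


def read_stave(rows):
    """Read a projection left to right back into base letters."""
    bases = []
    for column in range(len(rows[0])):
        for row_base, row in zip(ROW_ORDER, rows):
            if row[column] == "o":
                bases.append(row_base)
    return "".join(bases)
```

With this rendering, `read_stave(rotate_180(stave("GATTACA")))` yields `"TGTAATC"`, the complementary strand of GATTACA read in its own 5'→3' direction.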


Geometric symbols

Zimmerman et al. took a different approach to visualizing genetic data. Rather than relying on spatially distributed circles to highlight genetic features, they exploited four geometrically diverse symbols found in a standard computer font to distinguish the four bases. The authors developed a simple WordPerfect macro to translate IUPAC characters into the more visually distinct symbols.


DNA Skyline

With the growing availability of font editors, Jarvius and Landegren devised a novel set of genetic symbols, known as the DNA Skyline font, which uses increasingly taller blocks to represent the different DNA bases. While reminiscent of Cowin et al.'s spatially distributed Stave Projection, the DNA Skyline font is easy to download and permits translation to and from the IUPAC notation by simply changing the font in most standard word processing applications.


Ambigraphic notations

Ambigrams (symbols that convey different meanings when viewed in a different orientation) have been designed to mirror structural symmetries found in the DNA double helix. By assigning ambigraphic characters to complementary bases (i.e. guanine: b, cytosine: q, adenine: n, and thymine: u), it is possible to complement DNA sequences by simply rotating the text 180 degrees. An ambigraphic nucleic acid notation also makes it easy to identify genetic palindromes, such as endonuclease restriction sites, as sections of text that can be rotated 180 degrees without changing the sequence.

One example of an ambigraphic nucleic acid notation is AmbiScript, a rationally designed nucleic acid notation that combines many of the visual and functional features of its predecessors. The notation also uses spatially offset characters to facilitate the visual review and analysis of genetic data. AmbiScript was also designed to indicate ambiguous nucleotide positions via compound symbols. This strategy aimed to offer a more intuitive alternative to the ambiguity characters first proposed by the IUPAC. As with Jarvius and Landegren's DNA Skyline fonts, AmbiScript fonts can be downloaded and applied to IUPAC-encoded sequence data.
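The rotation property can be checked with the b, q, n, u assignment given above. The Python sketch below models a 180-degree rotation as reversing the text and swapping each glyph for its rotated form; it uses the simple letter mapping from the text, not the AmbiScript fonts, and the function names are hypothetical.

```python
# Glyph assignment from the text: guanine b, cytosine q, adenine n, thymine u.
TO_AMBIGRAPHIC = {"G": "b", "C": "q", "A": "n", "T": "u"}
FROM_AMBIGRAPHIC = {glyph: base for base, glyph in TO_AMBIGRAPHIC.items()}

# What each glyph looks like after the page is rotated 180 degrees.
ROTATED = {"b": "q", "q": "b", "n": "u", "u": "n"}


def encode(seq):
    """Write a G/C/A/T sequence in the ambigraphic glyphs."""
    return "".join(TO_AMBIGRAPHIC[base] for base in seq.upper())


def decode(text):
    """Convert ambigraphic glyphs back to G/C/A/T letters."""
    return "".join(FROM_AMBIGRAPHIC[glyph] for glyph in text)


def rotate_180(text):
    """Rotate the text: reading order reverses and each glyph flips."""
    return "".join(ROTATED[glyph] for glyph in reversed(text))


def is_palindrome(seq):
    """A genetic palindrome reads the same after a 180-degree rotation."""
    text = encode(seq)
    return rotate_180(text) == text
```

For example, GAT is written `bnu`; rotating it gives `nuq`, which decodes to ATC, the complementary strand. The EcoRI restriction site GAATTC encodes to `bnnuuq`, which is unchanged by rotation.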


Triple Helix Base Pairing

Watson and Crick base pairs are indicated by a "•", a "-", or a "." (example: A•T, or poly(rC)•2poly(rC)). Hoogsteen triple helix base pairs are indicated by a "*" or a ":" (example: C•G*G+, or T•A*T, or C•G*G, or T•A*A).


See also

* IUPAC for amino acids
* DNA replication
* Nucleotide

