A nucleic acid sequence is a succession of bases, signified by a series of letters drawn from a set of five, that indicates the order of nucleotides forming alleles within a DNA (using GACT) or RNA (GACU) molecule. By convention, sequences are usually presented from the 5' end to the 3' end. For DNA, the sense strand is used. Because nucleic acids are normally linear (unbranched) polymers, specifying the sequence is equivalent to defining the covalent structure of the entire molecule. For this reason, the nucleic acid sequence is also termed the primary structure.
The sequence has the capacity to represent information. Biological deoxyribonucleic acid represents the information which directs the functions of an organism.
Nucleic acids also have a secondary structure and a tertiary structure. Primary structure is sometimes mistakenly referred to as ''primary sequence''. Conversely, there is no parallel concept of secondary or tertiary sequence.
Nucleotides
Nucleic acids consist of a chain of linked units called nucleotides. Each nucleotide consists of three subunits: a phosphate group and a sugar (ribose in the case of RNA, deoxyribose in DNA) make up the backbone of the nucleic acid strand, and attached to the sugar is one of a set of nucleobases. The nucleobases are important in base pairing of strands to form higher-level secondary and tertiary structures such as the famed double helix.
The possible letters are ''A'', ''C'', ''G'', and ''T'', representing the four nucleotide bases of a DNA strand (adenine, cytosine, guanine, and thymine), which are covalently linked to a phosphodiester backbone. In the typical case, the letters are printed abutting one another without gaps, as in the sequence AAAGTCTGAC, read left to right in the 5' to 3' direction. With regard to transcription, a sequence is on the coding strand if it has the same order as the transcribed RNA.
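The relationship between the coding strand and the transcribed RNA can be sketched in a few lines of Python. This is only an illustration of the letter-level correspondence (same order, with U in place of T); the function name `transcribe` is a hypothetical helper, not part of any library.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA letters matching a DNA coding strand.

    The coding strand has the same order as the transcribed RNA,
    so only the alphabet changes: T (thymine) becomes U (uracil).
    """
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    # The example sequence from the text, read 5' to 3'.
    print(transcribe("AAAGTCTGAC"))  # AAAGUCUGAC
```

The sketch works on letters only; it does not model the template strand or the enzymatic process itself.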
One sequence can be complementary to another sequence, meaning that the base at each position is the complement of the corresponding base in the other sequence (i.e., A to T, C to G) and that the order is reversed. For example, the complementary sequence to TTAC is GTAA. If one strand of the double-stranded DNA is considered the sense strand, then the other strand, considered the antisense strand, will have the complementary sequence to the sense strand.
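As a minimal sketch of the rule just described (complement each base, then reverse the order), the following Python function computes the complementary sequence of a DNA string. The names `COMPLEMENT` and `reverse_complement` are chosen here for illustration, and only the four unambiguous DNA letters are handled.

```python
# Watson-Crick pairing for DNA as stated in the text: A pairs with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the complementary sequence, written 5' to 3'."""
    # Walk the input backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    print(reverse_complement("TTAC"))  # GTAA
```

Applying the function twice returns the original sequence, which mirrors the sense/antisense relationship between the two strands.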
Notation
Comparing two nucleotide sequences and determining the percent difference between them:
* AATCCGCTAG
* AAACCCTTAG
* Given the two 10-nucleotide sequences, line them up and compare the differences between them. Calculate the percent difference by dividing the number of differing DNA bases by the total number of nucleotides. In the above case, there are three differences in the 10-nucleotide sequence, so the difference is 3/10 = 30%. Equivalently, 7/10 = 70% of the positions are identical, and subtracting that similarity from 100% gives the same 30% difference.
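The calculation above can be sketched in Python as follows. The function name `percent_difference` is hypothetical, and the sketch assumes the two sequences are already lined up and of equal length, as in the example; it does not handle gaps.

```python
def percent_difference(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    # Count positions where the aligned bases are not identical.
    mismatches = sum(x != y for x, y in zip(a, b))
    return 100.0 * mismatches / len(a)


if __name__ == "__main__":
    diff = percent_difference("AATCCGCTAG", "AAACCCTTAG")
    print(diff)          # 30.0 (three mismatches out of ten positions)
    print(100.0 - diff)  # 70.0 (percent similarity)
```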
While A, T, C, and G represent a particular nucleotide at a position, there are also letters that represent ambiguity, which are used when more than one kind of nucleotide could occur at that position. The rules of the International Union of Pure and Applied Chemistry (IUPAC) are set out in "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences" (NC-IUB, 1984).
These symbols are also valid for RNA, except with U (uracil) replacing T (thymine).
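One way to work with the ambiguity letters in software is a lookup from each symbol to the set of bases it may stand for. The mapping below follows the standard IUPAC ambiguity codes for DNA; the names `IUPAC_DNA` and `matches` are illustrative choices, and the sketch is not taken from any particular library.

```python
# Standard IUPAC symbols for DNA and the bases each one may represent.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG",   # strong pairing
    "W": "AT",   # weak pairing
    "K": "GT",   # keto
    "M": "AC",   # amino
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def matches(pattern: str, seq: str) -> bool:
    """True if every base of seq is allowed by the symbol at that position."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC_DNA[sym]
               for sym, base in zip(pattern.upper(), seq.upper()))


if __name__ == "__main__":
    print(matches("ARY", "AGC"))  # True: R allows G, Y allows C
    print(matches("ARY", "ACC"))  # False: R does not allow C
```

For RNA, the same table applies with U substituted for T in both the keys and the values.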
Apart from adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), DNA and RNA also contain bases that have been modified after the nucleic acid chain has been formed. In DNA, the most common modified base is 5-methylcytidine (m5C). In RNA, there are many modified bases, including pseudouridine (Ψ), dihydrouridine (D), inosine (I), ribothymidine (rT) and 7-methylguanosine (m7G). Hypoxanthine and xanthine are two of the many bases created through mutagen presence, both of them through deamination (replacement of the amine group with a carbonyl group). Hypoxanthine is produced from adenine, and xanthine is produced from guanine. Similarly, deamination of cytosine results in uracil.
Biological significance
In biological systems, nucleic acids contain information which is used by a living cell to construct specific proteins. The sequence of nucleobases on a nucleic acid strand is translated by cell machinery into a sequence of amino acids making up a protein strand. Each group of three bases, called a codon, corresponds to a single amino acid, and there is a specific genetic code by which each possible combination of three bases corresponds to a specific amino acid or to a stop signal.
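The codon-by-codon reading described above can be sketched in Python. The table below is the standard genetic code written with DNA letters (T rather than U), with `*` marking stop codons; the names `CODON_TABLE` and `translate` are hypothetical, and the sketch assumes the input starts in the correct reading frame.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence into one-letter amino acids.

    Reading stops at the first stop codon; trailing bases that do not
    fill a complete codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))  # MAW
```

Several codons map to the same amino acid, which is why the table has 64 entries but only 20 distinct amino acid letters plus the stop marker.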
The central dogma of molecular biology outlines the mechanism by which proteins are constructed using information contained in nucleic acids. DNA is transcribed into mRNA molecules, which travel to the ribosome where the mRNA is used as a template for the construction of the protein strand. Since nucleic acids can bind to molecules with complementary sequences, there is a distinction between "sense" sequences, which code for proteins, and the complementary "antisense" sequence, which is by itself nonfunctional but can bind to the sense strand.
Sequence determination
DNA sequencing is the process of determining the nucleotide sequence of a given DNA fragment. The sequence of the DNA of a living thing encodes the necessary information for that living thing to survive and reproduce. Therefore, determining the sequence is useful in fundamental research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases. Biotechnology is a burgeoning discipline, with the potential for many useful products and services.
RNA is typically not sequenced directly. Instead, it is copied to DNA by reverse transcriptase, and this DNA is then sequenced.
Current sequencing methods rely on the discriminatory ability of DNA polymerases, and therefore can only distinguish four bases. An inosine (created from adenosine during RNA editing) is read as a G, and 5-methyl-cytosine (created from cytosine by DNA methylation) is read as a C. With current technology, it is difficult to sequence small amounts of DNA, as the signal is too weak to measure. This is overcome by polymerase chain reaction (PCR) amplification.
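The loss of information described above can be illustrated with a small Python sketch. The token names `"I"` and `"m5C"` and the helper `as_read` are hypothetical conventions chosen for this example; only the two substitutions stated in the text are modelled.

```python
# How the two modified bases mentioned in the text appear in a read:
# inosine is read as G, and 5-methylcytosine is read as C.
READ_AS = {"I": "G", "m5C": "C"}


def as_read(bases: list[str]) -> str:
    """Collapse a list of base tokens to the four letters a read reports."""
    return "".join(READ_AS.get(base, base) for base in bases)


if __name__ == "__main__":
    # The modified and unmodified molecules give the same read.
    print(as_read(["A", "I", "m5C", "T"]))  # AGCT
    print(as_read(["A", "G", "C", "T"]))    # AGCT
```

Because both inputs collapse to the same string, the modification cannot be recovered from the read alone.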
Digital representation
Once a nucleic acid sequence has been obtained from an organism, it is stored ''in silico'' in digital format. Digital genetic sequences may be stored in sequence databases, be analyzed (see ''Sequence analysis'' below), be digitally altered and be used as templates for creating new actual DNA using artificial gene synthesis.
Sequence analysis
Digital genetic sequences may be analyzed using the tools of bioinformatics to attempt to determine their function.
Genetic testing
The DNA in an organism's genome can be analyzed to diagnose vulnerabilities to inherited diseases, and can also be used to determine a child's paternity (genetic father) or a person's ancestry. Normally, every person carries two variations of every gene, one inherited from their mother, the other inherited from their father. The human genome is believed to contain around 20,000–25,000 genes. In addition to studying chromosomes to the level of individual genes, genetic testing in a broader sense includes biochemical tests for the possible presence of genetic diseases, or mutant forms of genes associated with increased risk of developing genetic disorders.
Genetic testing identifies changes in chromosomes, genes, or proteins. Usually, testing is used to find changes that are associated with inherited disorders. The results of a genetic test can confirm or rule out a suspected genetic condition or help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed.
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be due to functional, structural, or evolutionary relationships between the sequences. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as insertion or deletion mutations (indels) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggests that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.
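To make the idea of mismatches and gaps concrete, here is a minimal sketch of a global pairwise alignment using the Needleman-Wunsch dynamic-programming method. The text does not name a specific algorithm, so this is one common choice rather than the method the article describes, and the scoring values (match +1, mismatch -1, gap -1) are assumptions picked for illustration.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -1) -> tuple[str, str, int]:
    """Globally align two sequences; return both aligned strings and the score."""
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two bases
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


if __name__ == "__main__":
    top, bottom, total = global_align("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print(total)  # 5: six matches and one gap
```

In the output, a `-` in one row opposite a base in the other corresponds to an indel, and two different letters in the same column correspond to a point mutation. Several alignments can share the best score; the traceback returns one of them.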
Computational phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation reflects the "molecular clock" hypothesis, under which a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time). It assumes that the effects of mutation and selection are constant across sequence lineages. Therefore, it does not account for possible differences among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations, which do not alter the meaning of a given codon, and other mutations that result in a different amino acid being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
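The relationship between sequence identity and divergence can be sketched numerically. The example below computes the fraction of identical sites between two aligned sequences, converts the observed difference into a distance with the Jukes-Cantor correction (a standard simple substitution model, not one named in the text above), and turns that distance into a divergence time under a strict molecular clock. The function names and the substitution rate used in the usage line are illustrative assumptions only.

```python
import math


def identity(a, b):
    """Fraction of identical sites between two aligned sequences,
    ignoring positions where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x == y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(a, b):
    """Expected substitutions per site under the Jukes-Cantor model:
    d = -(3/4) * ln(1 - (4/3) * p), where p is the observed difference."""
    p = 1.0 - identity(a, b)
    if p >= 0.75:
        # The correction is undefined once sequences look random.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def divergence_time(distance, rate_per_site_per_year):
    """Strict-clock estimate: both lineages accumulate changes at the
    same assumed rate, so distance = 2 * rate * time."""
    return distance / (2.0 * rate_per_site_per_year)


# Hypothetical sequences differing at 1 of 10 sites.
d = jukes_cantor_distance("GATTACAGAT", "GATTACAGAC")
# The rate below is an arbitrary placeholder, not a measured value.
t = divergence_time(d, 1e-9)
```

The corrected distance is slightly larger than the raw difference because some sites are expected to have changed more than once. The single fixed rate in `divergence_time` is exactly the assumption that relaxed-clock methods drop.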
Sequence motifs
Frequently the primary structure encodes motifs that are of functional importance. Some examples of sequence motifs are: the C/D and H/ACA boxes of snoRNAs, the Sm binding site found in spliceosomal RNAs such as U1, U2, U4, U5, U6, U12 and U3, the Shine-Dalgarno sequence, the Kozak consensus sequence and the RNA polymerase III terminator.
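A minimal sketch of motif scanning follows, assuming motifs are written as consensus strings with IUPAC ambiguity codes and matched exactly. The helper `find_motif`, the partial IUPAC table, and the test sequences are hypothetical. The consensus strings `GCCRCCAUGG` (Kozak, with R standing for a purine) and `AGGAGG` (Shine-Dalgarno) are the commonly quoted forms. Real motif finders usually score matches with position weight matrices instead of exact patterns.

```python
import re

# Partial IUPAC nucleotide code table (enough for this illustration).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "[AG]",      # purine
    "Y": "[CTU]",     # pyrimidine
    "N": "[ACGTU]",   # any base
}


def find_motif(sequence, consensus):
    """Return 0-based start positions of every (possibly overlapping)
    exact match of an IUPAC consensus string in the sequence."""
    pattern = "".join(IUPAC[symbol] for symbol in consensus.upper())
    # A lookahead lets the regex engine report overlapping matches.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


# Hypothetical mRNA fragment containing a Kozak-like context around AUG.
kozak_hits = find_motif("UUGCCACCAUGGCG", "GCCRCCAUGG")

# Overlapping Shine-Dalgarno-like matches in a made-up sequence.
sd_hits = find_motif("AGGAGGAGG", "AGGAGG")
```

Here `kozak_hits` holds the single match position and `sd_hits` holds two overlapping ones, which is why the lookahead form is used instead of a plain `re.finditer` over the pattern.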
Sequence entropy
In bioinformatics, a sequence entropy, also known as sequence complexity or information profile, is a numerical sequence providing a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. Manipulating information profiles enables the analysis of sequences with alignment-free techniques, for example in motif and rearrangement detection.
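To make the idea of a local complexity profile concrete, the sketch below computes the Shannon entropy of base composition in a sliding window. This is a simple stand-in chosen for illustration, not the specific information-profile construction referred to above. The function names and the window size are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the symbol frequencies in a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(sequence, window_size):
    """Entropy of every window of the given size, sliding one base at a
    time; the result has len(sequence) - window_size + 1 entries."""
    if window_size <= 0 or window_size > len(sequence):
        raise ValueError("window_size must be between 1 and len(sequence)")
    return [
        shannon_entropy(sequence[i:i + window_size])
        for i in range(len(sequence) - window_size + 1)
    ]


# Hypothetical sequence: a low-complexity run followed by mixed bases.
profile = entropy_profile("AAAAACGT", 4)
```

Windows inside the poly-A run score 0 bits, while the final window with four distinct bases reaches the 2-bit maximum for a four-letter alphabet. Because a window's composition does not depend on reading direction, reversing the sequence simply reverses the profile.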
See also
* Gene structure
* Nucleic acid structure determination
* Quaternary numeral system
* Single-nucleotide polymorphism (SNP)
External links
A bibliography on features, patterns, correlations in DNA and protein texts
Visualization of nucleotide sequence