The coding region of a gene, also known as the coding DNA sequence (CDS), is the portion of a gene's DNA or RNA that codes for a protein.
Studying the length, composition, regulation, splicing, structure, and function of coding regions in comparison with non-coding regions, across species and over evolutionary time, can reveal a great deal about gene organization and the evolution of prokaryotes and eukaryotes. This knowledge can further assist in mapping the human genome and developing gene therapy.
Definition
Although the term is sometimes used interchangeably with exon, the two are not the same: an exon can contain the coding region as well as the 3' and 5' untranslated regions (UTRs) of the RNA, so an exon may be only partially made up of coding region. The 3' and 5' untranslated regions of the RNA, which do not code for protein, are termed non-coding regions and are not discussed on this page.
Coding regions are also often confused with exomes, but the two terms are distinct. While the exome refers to all exons within a genome, the coding region refers to the sections of DNA (or of the primary transcript), or the single section of processed mRNA, that specifically code for a protein.
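The difference between exons and the coding region can be made concrete with interval arithmetic. The following is a minimal sketch, assuming half-open coordinates on the forward strand; the function name `coding_segments` and the example coordinates are hypothetical and chosen only for illustration.

```python
def coding_segments(exons, cds_start, cds_end):
    """Return the parts of each exon that fall inside the coding region.

    exons     -- list of (start, end) half-open intervals
    cds_start -- position of the first base of the start codon
    cds_end   -- position just past the last base of the stop codon
    """
    segments = []
    for start, end in sorted(exons):
        # Clip the exon to the coding interval; what is cut off is UTR.
        seg_start, seg_end = max(start, cds_start), min(end, cds_end)
        if seg_start < seg_end:
            segments.append((seg_start, seg_end))
    return segments


# Hypothetical three-exon gene: the first and last exons are part UTR.
exons = [(0, 100), (200, 300), (400, 500)]
cds = coding_segments(exons, cds_start=50, cds_end=450)
exon_length = sum(end - start for start, end in exons)
cds_length = sum(end - start for start, end in cds)
print(cds, exon_length, cds_length)
```

In this toy gene the exons total more bases than the coding region because the outer exons also carry untranslated sequence.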
History
In 1978, Walter Gilbert published "Why Genes in Pieces", which began to explore the idea that the gene is a mosaic: that each full nucleic acid strand is not coded continuously but is interrupted by "silent" non-coding regions. This was the first indication that a distinction was needed between the parts of the genome that code for protein, now called coding regions, and those that do not.
Composition

The evidence suggests a general interdependence between base composition patterns and coding region availability. The coding region is thought to have a higher GC-content than non-coding regions. Further research has found that the longer the coding strand, the higher its GC-content. Short coding strands remain comparatively GC-poor, similar to the low GC-content of the translational stop codons TAG, TAA, and TGA.
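GC-content is a simple ratio, so it can be illustrated in a few lines. The sketch below is only an illustration; the function name `gc_content` is an assumption, not a reference to any particular library.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


# The three stop codons mentioned above are all GC-poor.
for codon in ("TAG", "TAA", "TGA"):
    print(codon, round(gc_content(codon), 2))
```

Applied over a sliding window along a genome, the same ratio gives the kind of GC profile used to compare coding and non-coding regions.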
GC-rich areas are also where the ratio of point mutation types is altered slightly: there are more transitions, which are changes from purine to purine or pyrimidine to pyrimidine, than transversions, which are changes from purine to pyrimidine or pyrimidine to purine. Transitions are less likely to change the encoded amino acid and more likely to remain a silent mutation (especially if they occur in the third nucleotide of a codon), which is usually beneficial to the organism during translation and protein formation.
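The transition/transversion distinction depends only on whether the two bases belong to the same chemical class. A minimal sketch, with the hypothetical helper name `classify_substitution`:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base DNA substitution as a transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical; no substitution")
    # Same class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```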
This indicates that essential coding regions (gene-rich) are higher in GC-content and more stable and resistant to mutation than accessory and non-essential regions (gene-poor). However, it is still unclear whether this came about through neutral and random mutation or through a pattern of selection. There is also debate over whether the methods used to ascertain the relationship between GC-content and coding region, such as gene windows, are accurate and unbiased.
Structure and function

In DNA, the coding region is flanked by the promoter sequence upstream, on the 5' side of the coding strand (the 3' side of the template strand), and by the termination sequence downstream. During transcription, RNA polymerase (RNAP) binds to the promoter sequence and moves along the template strand to the coding region. RNAP then adds RNA nucleotides complementary to the template strand to form the mRNA, which therefore matches the coding strand with uracil in place of thymine. [Overview of transcription. (n.d.). Retrieved from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription .] This continues until the RNAP reaches the termination sequence.
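Transcription can be sketched as base-by-base complementation of the template strand. The example below is a simplification that ignores promoters, terminators, and RNA processing; it assumes the template is written in the 3'→5' direction so that the resulting mRNA reads 5'→3'. The function name `transcribe` is hypothetical.

```python
# DNA template base -> complementary RNA base (uracil pairs with adenine)
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') complementary to a template strand given 3'->5'."""
    return "".join(RNA_COMPLEMENT[base] for base in template_3_to_5.upper())


coding_strand = "ATGGCCTGGTAA"    # written 5'->3'
template_strand = "TACCGGACCATT"  # its complement, written 3'->5'
mrna = transcribe(template_strand)
print(mrna)
```

The mRNA produced this way carries the same sequence as the coding strand, with U wherever the coding strand has T.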
After transcription and maturation, the mature mRNA encompasses multiple parts important for its eventual translation into protein. The coding region in an mRNA is flanked by the 5' untranslated region (5'-UTR) and the 3' untranslated region (3'-UTR), which are in turn bounded by the 5' cap and the poly-A tail. During translation, the ribosome facilitates the attachment of tRNAs to the coding region, 3 nucleotides at a time (codons). The tRNAs transfer their associated amino acids to the growing polypeptide chain, eventually forming the protein defined in the initial DNA coding region.
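Reading the coding region three nucleotides at a time can be sketched as a table lookup. The example below assumes the standard genetic code and, as a deliberate simplification, starts at the first AUG it finds; real start-site selection is more involved. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break  # stop codon ends the coding region
        protein.append(residue)
    return "".join(protein)


# The leading GG and trailing GG stand in for the 5' and 3' UTRs.
print(translate("GGAUGGCCUGGUAAGG"))
```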
Regulation
The coding region can be modified in order to regulate gene expression. Alkylation is one form of regulation of the coding region. A gene that would otherwise have been transcribed can be silenced by targeting a specific sequence: the bases in this sequence are blocked with alkyl groups, which creates the silencing effect.
While the regulation of gene expression manages the abundance of RNA or protein made in a cell, these mechanisms can themselves be controlled by a regulatory sequence found before the start of the open reading frame in a strand of DNA. The regulatory sequence then determines where and when a protein-coding region is expressed.
RNA splicing ultimately determines what part of the sequence is translated and expressed; the process involves cutting out introns and joining exons. Where the spliceosome cuts, however, is guided by the recognition of splice sites, in particular the 5' splice site, which is one of the substrates for the first step in splicing. The coding regions lie within the exons, which are covalently joined to form the mature messenger RNA.
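Removing introns and joining exons can be sketched as string slicing. The example below is a toy model, assuming intron coordinates are already known and half-open; the optional check for introns that begin with GU and end with AG reflects the most common splice-site pattern, although other intron types exist. The function name `splice` and the sequences are hypothetical.

```python
def splice(pre_mrna: str, introns, require_canonical: bool = True) -> str:
    """Remove the given introns (half-open intervals) and join the exons."""
    mature = []
    position = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if require_canonical and not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} lacks GU...AG boundaries")
        mature.append(pre_mrna[position:start])  # exon before this intron
        position = end
    mature.append(pre_mrna[position:])  # final exon
    return "".join(mature)


exon1, intron, exon2 = "AUGGCC", "GUAAGUCCCAG", "UGGUAA"
pre_mrna = exon1 + intron + exon2
print(splice(pre_mrna, [(len(exon1), len(exon1) + len(intron))]))
```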
Mutations
Mutations in the coding region can have very diverse effects on the phenotype of the organism. While some mutations in this region of DNA/RNA can result in advantageous changes, others can be harmful and sometimes even lethal to the organism. In contrast, changes in the non-coding region may not always result in detectable changes in phenotype.
Mutation types

Various forms of mutation can occur in coding regions. One form is the silent mutation, in which a change in nucleotides does not result in any change in amino acid after transcription and translation. [Yang, J. (2016, March 23). What are Genetic Mutation? Retrieved from https://www.singerinstruments.com/resource/what-are-genetic-mutation/ .] There are also nonsense mutations, in which base alterations in the coding region create a premature stop codon, producing a shorter final protein. Point mutations, or single base pair changes in the coding region, that code for a different amino acid during translation are called missense mutations. Other types of mutations include frameshift mutations, such as insertions or deletions of a number of nucleotides that is not a multiple of three.
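The mutation types above can be told apart by comparing the codon before and after the change. The following is a minimal sketch assuming the standard genetic code and an in-frame CDS that starts at position 0; the names `classify_point_mutation` and `is_frameshift` are hypothetical, and substitutions that change a stop codon are not given a separate category here.

```python
from itertools import product

# Standard genetic code on DNA codons, enumerated in TCAG order; "*" is a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_point_mutation(cds: str, pos: int, alt: str) -> str:
    """Classify a single-base substitution at 0-based position pos of a CDS."""
    cds, alt = cds.upper(), alt.upper()
    if cds[pos] == alt:
        raise ValueError("alt base equals the reference base")
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


def is_frameshift(indel_length: int) -> bool:
    """An insertion or deletion shifts the frame unless its length divides by 3."""
    return indel_length % 3 != 0


cds = "ATGGCCTGGTAA"  # Met-Ala-Trp-stop
print(classify_point_mutation(cds, 5, "T"))  # GCC -> GCT
print(classify_point_mutation(cds, 3, "C"))  # GCC -> CCC
print(classify_point_mutation(cds, 8, "A"))  # TGG -> TGA
```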
Formation
Some mutations are hereditary (germline mutations), passed on from a parent to its offspring. [What is a gene mutation and how do mutations occur? - Genetics Home Reference - NIH. (n.d.). Retrieved from https://ghr.nlm.nih.gov/primer/mutationsanddisorders/genemutation .] Such mutated coding regions are present in all cells of the organism. Other mutations are acquired during an organism's lifetime (somatic mutations) and may not be the same from cell to cell.
These changes can be caused by mutagens, carcinogens, or other environmental agents (e.g. UV). Acquired mutations can also result from copying errors during DNA replication and are not passed down to offspring. Changes in the coding region can also be de novo (new); such changes are thought to occur shortly after fertilization, resulting in a mutation present in the offspring's DNA while being absent in both the sperm and egg cells.
Prevention
Multiple mechanisms acting at replication and translation help prevent lethality due to deleterious mutations in the coding region. These include proofreading by some DNA polymerases during replication, mismatch repair following replication, and the degeneracy of the third base within an mRNA codon, described by the wobble hypothesis.
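The degeneracy of the third codon position can be checked directly against the standard genetic code. The sketch below counts, for each codon position, what fraction of all single-base substitutions leave the encoded residue unchanged; stop codons are treated as one more "residue" for simplicity. The function name `synonymous_fraction_by_position` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction_by_position():
    """Return, for codon positions 0, 1 and 2, the fraction of single-base
    substitutions that do not change the encoded residue."""
    fractions = []
    for pos in range(3):
        same = total = 0
        for codon, residue in CODON_TABLE.items():
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a substitution
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                same += CODON_TABLE[mutant] == residue
        fractions.append(same / total)
    return fractions


for position, fraction in enumerate(synonymous_fraction_by_position(), start=1):
    print(f"codon position {position}: {fraction:.2f} synonymous")
```

The third position tolerates far more substitutions than the first two, which is why many third-base changes are silent.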
Constrained coding regions (CCRs)
While it is well known that the genome of one individual can differ extensively from that of another, recent research has found that some coding regions are highly constrained, or resistant to mutation, between individuals of the same species. This is similar to the concept of interspecies constraint in conserved sequences. Researchers termed these highly constrained sequences constrained coding regions (CCRs) and found that such regions may be under strong purifying selection. On average, there is approximately 1 protein-altering mutation every 7 coding bases, but some CCRs span over 100 consecutive bases with no observed protein-altering mutations, and some lack even synonymous mutations. [Havrilla, J. M., Pedersen, B. S., Layer, R. M., & Quinlan, A. R. (2018). A map of constrained coding regions in the human genome. ''Nature Genetics'', 88–95. ] These patterns of constraint between genomes may provide clues to the sources of rare developmental diseases or potentially even embryonic lethality. Clinically validated variants and de novo mutations in CCRs have previously been linked to disorders such as infantile epileptic encephalopathy, developmental delay, and severe heart disease.
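The idea of a long stretch of coding bases with no observed protein-altering variants can be illustrated by scanning the gaps between variant positions. This is only a toy sketch and not the method of the cited study; the function name `constrained_runs`, the 100-base threshold as a default, and the example positions are all assumptions made for illustration.

```python
def constrained_runs(cds_length, variant_positions, min_length=100):
    """Return half-open (start, end) runs of at least min_length coding bases
    that contain no observed protein-altering variant.

    variant_positions -- 0-based positions within the CDS
    """
    runs = []
    previous = -1  # sentinel just before the first base
    # Appending cds_length closes the final run at the end of the CDS.
    for position in sorted(set(variant_positions)) + [cds_length]:
        if position - previous - 1 >= min_length:
            runs.append((previous + 1, position))
        previous = position
    return runs


# Hypothetical 300-base CDS with four observed protein-altering variants.
print(constrained_runs(300, [10, 20, 150, 160]))
```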
Coding sequence detection

While identifying open reading frames within a DNA sequence is straightforward, identifying coding sequences is not, because the cell translates only a subset of all open reading frames into proteins. Currently, CDS prediction relies on sampling and sequencing mRNA from cells, although there remains the problem of determining which parts of a given mRNA are actually translated into protein. CDS prediction is a subset of gene prediction, which also covers the prediction of DNA sequences that code for other functional elements, such as RNA genes and regulatory sequences.
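The "straightforward" step, finding open reading frames, can be sketched as a scan of all six reading frames for a start codon followed by an in-frame stop codon. The example below assumes the standard start (ATG) and stop codons and reports coordinates on whichever strand was scanned; the names `find_orfs` and `reverse_complement` are hypothetical. Deciding which of the reported ORFs are real coding sequences is the hard part and is not attempted here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 2):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs in six frames.

    start and end are half-open, 0-based, and relative to the scanned strand;
    min_codons counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGGCCTAACC"))
```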
Gene overlapping occurs in both prokaryotes and eukaryotes, and relatively often in both DNA and RNA viruses, where it offers the evolutionary advantage of reducing genome size while retaining the ability to produce various proteins from the available coding regions. For both DNA and RNA, pairwise alignments can detect overlapping coding regions, including short open reading frames in viruses, but they require a known coding strand against which to compare the potential overlapping coding strand. An alternative method that uses single genome sequences does not require multiple genome sequences for comparison but needs an overlap of at least 50 nucleotides in order to be sensitive.
See also
* Coding strand: The DNA strand that codes for a protein
* Exon: The entire portion of the strand that is transcribed
* Mature mRNA: The portion of the mRNA transcription product that is translated
* Gene structure: The other elements that make up a gene
* Nested gene: Entire coding sequence lies within the bounds of a larger external gene
* Non-coding DNA: Parts of genomes that do not encode protein-coding genes
* Non-coding RNA: Molecules that do not encode proteins, so have no CDS
* Non-functional DNA: Parts of genomes with no relevant biological function