The coding region of a gene, also known as the coding sequence (CDS), is the portion of a gene's DNA or RNA that codes for protein.
Comparing the length, composition, regulation, splicing, structures, and functions of coding regions with those of non-coding regions, across species and time periods, can provide important information about gene organization and the evolution of prokaryotes and eukaryotes. This can further assist in mapping the human genome and developing gene therapy.
Definition
Although this term is sometimes used interchangeably with exon, the two are not the same: exons comprise the coding region as well as the 3' and 5' untranslated regions of the RNA, so an exon may be only partially made up of coding sequence. The 3' and 5' untranslated regions of the RNA, which do not code for protein, are termed non-coding regions and are not discussed on this page.
Coding regions are often confused with exomes, but the two terms are distinct. While the exome refers to all exons within a genome, the coding region refers to a single section of the DNA or RNA that codes for a particular protein.
History
In 1978, Walter Gilbert published "Why Genes in Pieces", which first began to explore the idea that the gene is a mosaic: each full nucleic acid strand is not coded continuously but is interrupted by "silent" non-coding regions. This was the first indication that there needed to be a distinction between the parts of the genome that code for protein, now called coding regions, and those that do not.
Composition
The evidence suggests that there is a general interdependence between base composition patterns and coding region availability. The coding region is thought to contain a higher GC-content than non-coding regions. Further research has found that the longer the coding strand, the higher the GC-content. Short coding strands are comparatively GC-poor, similar to the low GC-content of translational stop codons such as TAG, TAA, and TGA.
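As a rough illustration of the measure discussed above, GC-content can be computed as the fraction of G and C bases in a sequence. The function name `gc_content` is an assumption made for this sketch and does not come from any cited method.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Each comparison yields True/False, which sum() counts as 1/0.
    return sum(base in "GC" for base in seq) / len(seq)


# The three stop codons named in the text are all GC-poor.
for codon in ("TAG", "TAA", "TGA"):
    print(codon, round(gc_content(codon), 2))
```

Studies of the kind described here typically apply such a calculation over windows of a genome rather than to single codons; the window size would be a further parameter.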
GC-rich areas are also where the ratio of point mutation types is altered slightly: there are more transitions, which are changes from purine to purine or pyrimidine to pyrimidine, than transversions, which are changes from purine to pyrimidine or pyrimidine to purine. Transitions are less likely to change the encoded amino acid and so more often remain silent mutations (especially if they occur in the third nucleotide of a codon), which is usually beneficial to the organism during translation and protein formation.
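The purine/pyrimidine rule above can be sketched in a few lines. The helper names `classify_substitution` and `count_ts_tv` are hypothetical, and the sketch assumes two already aligned DNA sequences of equal length.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
DNA_BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= DNA_BASES:
        raise ValueError("expected two different DNA bases")
    # A transition stays within one chemical class; a transversion crosses it.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(ref_seq: str, alt_seq: str) -> tuple[int, int]:
    """Count (transitions, transversions) between two aligned sequences."""
    if len(ref_seq) != len(alt_seq):
        raise ValueError("sequences must have the same length")
    kinds = [
        classify_substitution(r, a)
        for r, a in zip(ref_seq.upper(), alt_seq.upper())
        if r != a
    ]
    return kinds.count("transition"), kinds.count("transversion")


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("A", "C"))  # purine to pyrimidine
```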
This indicates that essential coding regions (gene-rich) are higher in GC-content and more stable and resistant to mutation than accessory and non-essential regions (gene-poor). However, it is still unclear whether this came about through neutral and random mutation or through a pattern of selection. There is also debate on whether the methods used to ascertain the relationship between GC-content and coding region, such as gene windows, are accurate and unbiased.
Structure and function
In DNA, the coding region is flanked by the promoter sequence upstream, on the 5' end of the coding strand (the 3' end of the template strand), and the termination sequence downstream, on the 3' end of the coding strand. During transcription, the RNA Polymerase (RNAP) binds to the promoter sequence and moves along the template strand to the coding region. RNAP then adds RNA nucleotides complementary to the template strand of the coding region in order to form the mRNA, substituting uracil in place of thymine. [Overview of transcription. (n.d.). Retrieved from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription.] This continues until the RNAP reaches the termination sequence.
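The base-pairing step of transcription can be illustrated with a minimal sketch. It assumes the template strand is written 3'→5', so that the resulting mRNA reads 5'→3'; the function name `transcribe` and the example sequence are made up for illustration, and promoter binding and termination are not modelled.

```python
# DNA template base -> complementary RNA base (uracil pairs with adenine).
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') complementary to a template strand given 3'->5'."""
    return "".join(RNA_COMPLEMENT[base] for base in template_3_to_5.upper())


print(transcribe("TACGGCATT"))  # AUGCCGUAA
```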
After transcription and maturation, the mature mRNA formed encompasses multiple parts important for its eventual translation into protein. The coding region in an mRNA is flanked by the 5' untranslated region (5'-UTR) and 3' untranslated region (3'-UTR), which are in turn bounded by the 5' cap and Poly-A tail. During translation, the ribosome facilitates the attachment of the tRNAs to the coding region, 3 nucleotides at a time (codons). The tRNAs transfer their associated amino acids to the growing polypeptide chain, eventually forming the protein defined in the initial DNA coding region.
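Reading a coding region three nucleotides at a time can be sketched as follows. The sketch assumes the standard genetic code, written in the usual compact form with bases ordered T, C, A, G, and uses one-letter amino acid codes with `*` for stop; the function name `translate` is an illustrative choice.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons enumerated TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA coding region codon by codon until a stop codon."""
    dna = mrna.upper().replace("U", "T")  # look codons up in DNA letters
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGCCGUAA"))  # MP
```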
Regulation
The coding region can be modified in order to regulate gene expression. Alkylation is one form of regulation of the coding region: a gene that would otherwise be transcribed can be silenced by targeting a specific sequence, whose bases are blocked with alkyl groups to create the silencing effect.
While the regulation of gene expression manages the abundance of RNA or protein made in a cell, these mechanisms can themselves be controlled by a regulatory sequence located before the start of the open reading frame in a strand of DNA. The regulatory sequence then determines where and when a protein coding region is expressed.
RNA splicing ultimately determines what part of the sequence becomes translated and expressed; the process involves cutting out introns and joining exons together. Where the spliceosome cuts, however, is guided by the recognition of splice sites, in particular the 5' splicing site, which is one of the substrates for the first step in splicing. The coding regions are within the exons, which become covalently joined together to form the mature messenger RNA.
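The relationship between exons, untranslated regions and the coding region can be illustrated with a toy example. Everything here is hypothetical: the sequence, the exon coordinates and the function names `splice` and `split_utrs`. The sketch also makes the simplifying assumption that the coding region starts at the first AUG of the mature mRNA, which is not always true in real transcripts.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon slices (0-based, end-exclusive), discarding the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def split_utrs(mature_mrna: str) -> tuple[str, str, str]:
    """Split a mature mRNA into (5' UTR, coding region, 3' UTR)."""
    start = mature_mrna.find("AUG")  # simplification: first AUG is the start
    if start == -1:
        raise ValueError("no start codon found")
    for i in range(start, len(mature_mrna) - 2, 3):
        if mature_mrna[i:i + 3] in STOP_CODONS:
            end = i + 3  # the stop codon is counted as part of the CDS
            return mature_mrna[:start], mature_mrna[start:end], mature_mrna[end:]
    raise ValueError("no in-frame stop codon found")


# exon 1 (8 nt) + intron (9 nt, GU...AG) + exon 2 (8 nt)
pre_mrna = "GGAUGGCC" + "GUAAGUCAG" + "UUUUAAGG"
mature = splice(pre_mrna, [(0, 8), (17, 25)])
five_utr, cds, three_utr = split_utrs(mature)
print(mature, five_utr, cds, three_utr)
```

In this example both exons contain untranslated bases, so the exons together are longer than the coding region they carry.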
Mutations
Mutations in the coding region can have very diverse effects on the phenotype of the organism. While some mutations in this region of DNA/RNA can result in advantageous changes, others can be harmful and sometimes even lethal to an organism's survival. Still other changes in the coding region may not result in any detectable change in phenotype.
Mutation types
There are various forms of mutations that can occur in coding regions. One form is silent mutations, in which a change in nucleotides does not result in any change in amino acid after transcription and translation. [Yang, J. (2016, March 23). What are Genetic Mutation? Retrieved from https://www.singerinstruments.com/resource/what-are-genetic-mutation/.] There also exist nonsense mutations, where base alterations in the coding region code for a premature stop codon, producing a shorter final protein. Point mutations, or single base pair changes in the coding region, that code for a different amino acid during translation are called missense mutations. Other types of mutations include insertions and deletions, which cause frameshift mutations when the number of bases added or removed is not a multiple of three.
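The categories above can be told apart by comparing the codon before and after a change. The following is a minimal sketch assuming the standard genetic code and a coding sequence given in DNA letters starting at the first base of the start codon; the names `classify_point_mutation` and `is_frameshift` are hypothetical, and changes that remove an existing stop codon are not treated separately.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(cds: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution at a 0-based position in a CDS."""
    cds = cds.upper()
    offset = position % 3            # position of the base within its codon
    codon_start = position - offset
    old_codon = cds[codon_start:codon_start + 3]
    new_codon = old_codon[:offset] + new_base.upper() + old_codon[offset + 1:]
    old_aa, new_aa = CODON_TABLE[old_codon], CODON_TABLE[new_codon]
    if new_aa == old_aa:
        return "silent"
    if new_aa == "*":
        return "nonsense"
    return "missense"


def is_frameshift(indel_length: int) -> bool:
    """An insertion or deletion shifts the frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


cds = "ATGTGGCCGTAA"  # codons: ATG TGG CCG TAA
print(classify_point_mutation(cds, 8, "A"))  # CCG -> CCA
print(classify_point_mutation(cds, 5, "A"))  # TGG -> TGA
print(classify_point_mutation(cds, 6, "A"))  # CCG -> ACG
```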
Formation
Some forms of mutations are hereditary (germline mutations), or passed on from a parent to its offspring. [What is a gene mutation and how do mutations occur? - Genetics Home Reference - NIH. (n.d.). Retrieved from https://ghr.nlm.nih.gov/primer/mutationsanddisorders/genemutation.] Such mutated coding regions are present in all cells within the organism. Other forms of mutations are acquired (somatic mutations) during an organism's lifetime, and may not be constant cell-to-cell.
These changes can be caused by mutagens, carcinogens, or other environmental agents (e.g. UV). Acquired mutations can also result from copy errors during DNA replication, and are not passed down to offspring. Changes in the coding region can also be de novo (new); such changes are thought to occur shortly after fertilization, resulting in a mutation present in the offspring's DNA while being absent in both the sperm and egg cells.
Prevention
There exist multiple mechanisms, acting during replication and translation, that prevent lethality due to deleterious mutations in the coding region. Such measures include proofreading by some DNA Polymerases during replication, mismatch repair following replication, and the 'Wobble Hypothesis', which describes the degeneracy of the third base within an mRNA codon.
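The degeneracy of the third codon position can be checked directly against the genetic code. This sketch assumes the standard genetic code and counts, for each codon position, what fraction of single-base changes to a sense codon leave the amino acid unchanged; the function name `synonymous_fraction` is an illustrative choice.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(codon_position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that keep the encoded amino acid the same, over all sense codons."""
    same = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":  # skip stop codons
            continue
        for base in BASES:
            if base == codon[codon_position]:
                continue
            mutant = codon[:codon_position] + base + codon[codon_position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == amino_acid
    return same / total


for position in range(3):
    print(position + 1, round(synonymous_fraction(position), 3))
```

Under these assumptions the third position tolerates far more changes than the first or second, which is the pattern the wobble hypothesis describes.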
Constrained coding regions (CCRs)
While it is well known that the genome of one individual can have extensive differences when compared to the genome of another, recent research has found that some coding regions are highly constrained, or resistant to mutation, between individuals of the same species. This is similar to the concept of interspecies constraint in conserved sequences. Researchers termed these highly constrained sequences constrained coding regions (CCRs), and have also discovered that such regions may be under strong purifying selection. On average, there is approximately 1 protein-altering mutation every 7 coding bases, but some CCRs can have over 100 bases in sequence with no observed protein-altering mutations, some without even synonymous mutations. [Havrilla, J. M., Pedersen, B. S., Layer, R. M., & Quinlan, A. R. (2018). A map of constrained coding regions in the human genome. Nature Genetics, 88–95. doi: 10.1101/220814] These patterns of constraint between genomes may provide clues to the sources of rare developmental diseases or potentially even embryonic lethality. Clinically validated variants and de novo mutations in CCRs have been previously linked to disorders such as infantile epileptic encephalopathy, developmental delay and severe heart disease.
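The idea of a stretch of coding bases with no observed protein-altering variants can be sketched as a simple interval search. This is a toy illustration only, not the method used in the cited study: the function name `variant_free_runs`, the 0-based coordinates and the example positions are all assumptions.

```python
def variant_free_runs(cds_length, variant_positions, min_length=1):
    """Return (start, end) intervals, 0-based and end-exclusive, of coding
    bases that contain no observed protein-altering variant."""
    runs = []
    previous = -1  # position of the last variant seen (none yet)
    # Treat the end of the CDS as a final boundary after the last variant.
    for position in sorted(set(variant_positions)) + [cds_length]:
        if position - previous - 1 >= min_length:
            runs.append((previous + 1, position))
        previous = position
    return runs


# Hypothetical 20-base coding region with variants seen at positions 3, 4 and 15.
print(variant_free_runs(20, [3, 4, 15]))
print(variant_free_runs(20, [3, 4, 15], min_length=5))
```

A real analysis would also need to account for how many individuals were sequenced and how well each base was covered, since a missing variant in a poorly sampled region says little about constraint.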
Coding sequence detection
While identification of open reading frames within a DNA sequence is straightforward, identifying coding sequences is not, because the cell translates only a subset of all open reading frames to proteins. Currently CDS prediction uses sampling and sequencing of mRNA from cells, although there is still the problem of determining which parts of a given mRNA are actually translated to protein. CDS prediction is a subset of gene prediction, the latter also including prediction of DNA sequences that code not only for protein but also for other functional elements such as RNA genes and regulatory sequences.
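The straightforward part, finding open reading frames, can be sketched as a scan for a start codon followed by an in-frame stop codon. This minimal example searches only the three forward reading frames of a DNA sequence; the function name `find_orfs` and the `min_codons` parameter are assumptions for illustration. As the text notes, an ORF found this way is only a candidate and is not necessarily a translated coding sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end) spans, 0-based and end-exclusive, that run from
    an ATG to the next in-frame stop codon in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop; keep ORFs that are long enough.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATGATT"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

A fuller search would also scan the reverse complement, giving six reading frames in total.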
Gene overlapping occurs in both prokaryotes and eukaryotes, and is relatively common in both DNA and RNA viruses, where it offers an evolutionary advantage by reducing genome size while retaining the ability to produce various proteins from the available coding regions. For both DNA and RNA, pairwise alignments can detect overlapping coding regions, including short open reading frames in viruses, but require a known coding strand against which to compare the potential overlapping coding strand. An alternative method using a single genome sequence does not require multiple genome sequences for comparison, but requires an overlap of at least 50 nucleotides in order to be sensitive.
See also
* Coding strand: The DNA strand that codes for a protein
* Exon: The entire portion of the strand that is transcribed
* Mature mRNA: The portion of the mRNA transcription product that is translated
* Gene structure: The other elements that make up a gene
* Nested gene: Entire coding sequence lies within the bounds of a larger external gene
* Non-coding DNA: Parts of genomes that do not encode protein-coding genes
* Non-coding RNA: Molecules that do not encode proteins, so have no CDS