Repeated sequences (also known as repetitive elements, repeating units or repeats) are short or long patterns of
nucleic acids
(DNA or RNA) that occur in multiple copies throughout the
genome
. In many organisms, a significant fraction of the
genomic DNA
is repetitive, with over two-thirds of the sequence consisting of repetitive elements in humans. Some of these repeated sequences are necessary for maintaining important genome structures such as
telomeres
or
centromeres
.
Repeated sequences are categorized into different classes depending on features such as structure, length, location, origin, and mode of multiplication. The disposition of repetitive elements throughout the genome can consist either in directly-adjacent arrays called
tandem repeats
or in repeats dispersed throughout the genome called
interspersed repeats. Tandem repeats and interspersed repeats are further categorized into subclasses based on the length of the repeated sequence and/or the mode of multiplication.
While some repeated DNA sequences are important for cellular functioning and genome maintenance, other repetitive sequences can be harmful. Many repetitive DNA sequences have been linked to human diseases such as Huntington's disease and Friedreich's ataxia. Some repetitive elements are neutral: they arise through transposition or crossing over and persist when there is no selection for specific sequences.
However, an abundance of neutral repeats can still influence genome evolution as they accumulate over time. Overall, repeated sequences are an important area of focus because they can provide insight into human diseases and genome evolution.
History of Discovery
In the 1950s,
Barbara McClintock
first observed DNA transposition and illustrated the functions of the
centromere
and
telomere
at the Cold Spring Harbor Symposium. McClintock's work set the stage for the discovery of repeated sequences because transposition, centromere structure, and telomere structure are all possible through repetitive elements, yet this was not fully understood at the time. The term "repeated sequence" was first used by
Roy John Britten and D. E. Kohne in 1968; through their experiments on the reassociation of DNA, they found that more than half of eukaryotic genomes consisted of repetitive DNA. Although the repetitive DNA sequences were conserved and ubiquitous, their biological role was not yet known. In the 1990s, more research was conducted to elucidate the evolutionary dynamics of
minisatellite
and
microsatellite
repeats because of their importance in DNA-based forensics and
molecular ecology
. Dispersed DNA repeats were increasingly recognized as a potential source of genetic
variation and
regulation
. Discoveries of deleterious repetitive DNA-related diseases stimulated further interest in this area of study. In the 2000s, the data from full eukaryotic genome sequencing enabled the identification of different promoters, enhancers, and regulatory RNAs which are all coded by repetitive regions. Today, the structural and regulatory roles of repetitive DNA sequences remain an active area of research.
Types and Functions
Many repeat sequences are likely to be non-functional, decaying remnants of transposable elements; these have been labelled "junk" or "selfish" DNA. Nevertheless, occasionally some repeats may be
exapted for other functions.
Tandem Repeats
Tandem repeats
are repeated sequences which are directly adjacent to each other in the genome. Tandem repeats may vary in the number of nucleotides comprising the repeated sequence, as well as the number of times the sequence repeats. When the repeating sequence is only 2-10 nucleotides long, the repeat is referred to as a short tandem repeat (STR) or
microsatellite
. When the repeating sequence is 10-60 nucleotides long, the repeat is referred to as a
minisatellite
. For minisatellites and microsatellites, the number of times the sequence repeats at a single locus can range from twice to hundreds of times.
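The definition above can be illustrated with a minimal Python sketch that scans a DNA string for perfect tandem repeats. The function name `find_tandem_repeats` and its parameters are hypothetical, chosen only for this illustration; the default unit length of 2-10 nucleotides follows the microsatellite range given above, and real repeat-finding tools also handle imperfect copies, which this sketch does not.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=10, min_copies=2):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    A tandem repeat here is a unit of min_unit..max_unit nucleotides that is
    immediately followed by at least (min_copies - 1) further identical copies.
    The shortest unit that repeats at a given position is reported.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    # A (CA)n microsatellite embedded in unique flanking sequence.
    print(find_tandem_repeats("GGCACACACACATT"))
    # The telomeric hexamer repeated three times.
    print(find_tandem_repeats("TTAGGGTTAGGGTTAGGG"))
```

Raising `min_unit` to 10 and `max_unit` to 60 would apply the same scan to the minisatellite range, again only for perfect copies.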
Tandem repeats have a wide variety of biological functions in the genome. For example, minisatellites are often hotspots of meiotic
homologous recombination
in eukaryotic organisms.
Recombination is when two homologous chromosomes align, break, and rejoin to swap pieces. Recombination is important as a source of genetic diversity, as a mechanism for repairing damaged DNA, and a necessary step in the appropriate segregation of chromosomes in meiosis.
The presence of repeated sequence DNA makes it easier for areas of homology to align, thereby controlling when and where recombination occurs.
In addition to playing an important role in recombination, tandem repeats also play important structural roles in the genome. For example,
telomeres
are composed mainly of tandem TTAGGG repeats. These repeats fold into highly organized
G quadruplex structures which protect the ends of chromosomal DNA from degradation.
Repetitive elements are enriched in the middle of chromosomes as well.
Centromeres
are the highly compact regions of chromosomes which join sister chromatids together and also allow the mitotic spindle to attach and separate sister chromatids during cell division. Human centromeres are composed largely of an approximately 171 base pair tandem repeat named the α-satellite repeat.
Pericentromeric heterochromatin, the DNA which surrounds the centromere and is important for structural maintenance, is composed of a mixture of different satellite subfamilies including the α-, β- and γ-satellites as well as HSATII, HSATIII, and sn5 repeats.
Some repetitive sequences, such as those with structural roles discussed above, play roles necessary for proper biological functioning. Other tandem repeats have deleterious roles which drive diseases. Many other tandem repeats, however, have unknown or poorly understood functions.
Interspersed Repeats
Interspersed repeats are identical or similar DNA sequences which are found in different locations throughout the genome. Interspersed repeats are distinguished from tandem repeats in that the repeated sequences are not directly adjacent to each other but instead may be scattered among different chromosomes or far apart on the same chromosome. Most interspersed repeats are
transposable elements
(TEs), mobile sequences which can be “cut and pasted” or “copied and pasted” into different places in the genome.
TEs were originally called “jumping genes” for their ability to move, yet this term is somewhat misleading as not all TEs are discrete genes.
Transposable elements that are transcribed into RNA, reverse-transcribed into DNA, then reintegrated into the genome are called
retrotransposons
.
Just as tandem repeats are further subcategorized based on the length of the repeating sequence, there are many different types of retrotransposons. Long interspersed nuclear elements (
LINEs
) are typically 3-7 kilobases in length.
Short interspersed nuclear elements (
SINEs
) are typically 100-300 base pairs and no longer than 600 base pairs.
Long terminal repeat retrotransposons (LTRs) are a third major class of retrotransposons and are characterized by highly repetitive sequences at the ends of the repeat.
When a transposable element does not proceed through RNA as an intermediate, it is called a
DNA transposon
.
Other classification systems refer to retrotransposons as “Class I” and DNA transposons as “Class II” transposable elements.
Transposable elements are estimated to constitute 45% of the human genome. Since uncontrolled propagation of TEs could wreak havoc on the genome, many regulatory mechanisms have evolved to silence their spread, including DNA methylation, histone modifications, non-coding RNAs (ncRNAs) including small interfering RNA (siRNA), chromatin remodelers, histone variants, and other epigenetic factors.
However, TEs play a wide variety of important biological functions. When TEs are introduced into a new host, such as from a virus, they increase genetic diversity.
In some cases, host organisms find new functions for the proteins which arise from expressing TEs in an evolutionary process called TE exaptation.
Recent research also suggests that TEs serve to maintain higher-order chromatin structure and 3D genome organization. Furthermore, TEs contribute to regulating the expression of other genes by serving as distal
enhancers
and transcription factor binding sites.
The prevalence of interspersed elements in the genome has garnered attention for more research on their origins and functions. Some specific interspersed elements have been characterized, such as the Alu repeat and LINE1.
Direct and Inverted Repeats
While tandem and interspersed repeats are distinguished based on their location in the genome, direct and inverted repeats are distinguished based on the ordering of the nucleotide bases.
Direct repeats occur when a nucleotide sequence is repeated with the same directionality.
Inverted repeats
occur when a nucleotide sequence is followed downstream by its reverse complement. For example, a direct repeat of "CATCAT" would be another repetition of "CATCAT." In contrast, the inverted repeat would be "ATGATG," the reverse complement of "CATCAT." When there are no nucleotides separating the two halves, as in "CATCATATGATG," the sequence is called a palindromic repeat. Inverted repeats can play structural roles in DNA and RNA by forming stem loops and cruciforms.
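The distinction can be checked with a short Python sketch. It assumes the common definition of an inverted repeat as a sequence followed by its reverse complement, which is what allows the two arms to base-pair into a stem loop; the helper names (`reverse_complement`, `is_direct_repeat`, `is_inverted_repeat`, `is_palindromic`) are hypothetical and only illustrate the idea.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_direct_repeat(first, second):
    """True if the second copy has the same sequence and orientation."""
    return first.upper() == second.upper()


def is_inverted_repeat(first, second):
    """True if the second copy is the reverse complement of the first."""
    return second.upper() == reverse_complement(first)


def is_palindromic(seq):
    """True if the sequence reads the same on both strands (5' to 3')."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    print(reverse_complement("CATCAT"))
    print(is_direct_repeat("CATCAT", "CATCAT"))
    print(is_inverted_repeat("CATCAT", "ATGATG"))
    print(is_palindromic("CATCATATGATG"))
```

Under this definition, a sequence followed by its plain reversal (without complementing) is a mirror repeat rather than an inverted repeat, and `is_palindromic` returns `False` for it.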
Repeated Sequences in Human Disease
For humans, some repeated DNA sequences are associated with diseases. Specifically, tandem repeat sequences underlie several
human disease conditions, particularly trinucleotide repeat diseases such as
Huntington’s disease
,
fragile X syndrome
, several
spinocerebellar ataxias
,
myotonic dystrophy
and
Friedreich’s ataxia
.
Trinucleotide repeat expansions in the
germline
over successive generations can lead to increasingly severe manifestations of the disease.
These trinucleotide repeat expansions may occur through
strand slippage
during
DNA replication
or during
DNA repair
synthesis.
It has been noted that
genes
containing pathogenic CAG repeats often encode proteins that themselves have a role in the
DNA damage
response and that repeat expansions may impair specific DNA repair pathways.
Faulty repair of DNA damages in repeat sequences may cause further expansion of these sequences, thus setting up a vicious cycle of pathology.
Huntington's Disease
Huntington's disease
is a neurodegenerative disorder which is due to the expansion of repeated trinucleotide sequence CAG in
exon
1 of the ''
huntingtin
'' gene (''HTT''). This gene encodes the protein huntingtin, which plays a role in preventing apoptosis (programmed cell death) and in the
repair of oxidative DNA damage. In Huntington's disease, the expanded CAG trinucleotide sequence encodes a mutant huntingtin protein with an expanded polyglutamine domain. This domain causes the protein to form aggregates in nerve cells, preventing normal cellular function and resulting in neurodegeneration.
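As a rough illustration of how a CAG expansion lengthens a polyglutamine domain, the following Python sketch counts the longest uninterrupted run of CAG codons in a DNA string and writes out the glutamine (Q) tract it would encode. The function names are hypothetical, the input is a toy sequence rather than real ''HTT'' sequence, and the sketch ignores reading frame and any clinical repeat-length thresholds.

```python
import re


def longest_cag_run(seq):
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def polyglutamine_tract(n_repeats):
    """CAG encodes glutamine (one-letter code Q), so n CAGs give n Qs."""
    return "Q" * n_repeats


if __name__ == "__main__":
    toy_allele = "ATG" + "CAG" * 5 + "TTC" + "CAG" * 2  # illustrative only
    n = longest_cag_run(toy_allele)
    print(n, polyglutamine_tract(n))
```

Comparing the value returned for two alleles shows directly how each added CAG unit adds one glutamine to the encoded tract.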
Fragile X Syndrome
Fragile X syndrome
is caused by the expansion of the trinucleotide sequence CGG in the ''FMR1'' gene on the X chromosome. This gene produces the RNA-binding protein FMRP. In Fragile X syndrome, the expanded repeat makes the gene unstable and leads to silencing of ''FMR1''. Because the gene resides on the X chromosome, females, who have two X chromosomes, are typically less affected than males, who have only one X chromosome and one Y chromosome, because the second X chromosome can compensate for the silenced copy on the other.
Spinocerebellar Ataxias
Expanded CAG trinucleotide repeat sequences underlie several types of
spinocerebellar ataxia
(SCA), including SCA1, SCA2, SCA3, SCA6, SCA7, SCA12, and SCA17.
Similar to Huntington's disease, the polyglutamine tract created by this trinucleotide expansion causes aggregation of proteins, preventing normal cellular function and causing neurodegeneration.
Friedreich's Ataxia
Friedreich's ataxia
is a type of ataxia caused by an expanded GAA repeat sequence in the frataxin gene (''FXN''). The frataxin gene encodes frataxin, a mitochondrial protein involved in energy production and cellular respiration. The expanded GAA sequence lies in the first intron and silences the gene, reducing the amount of functional frataxin protein. The loss of frataxin impairs mitochondrial function as a whole and can present in patients as difficulty walking.
Myotonic Dystrophy
Myotonic dystrophy
is a disorder that presents as muscle weakness and consists of two main types: DM1 and DM2. Both types of myotonic dystrophy are due to expanded DNA sequences. In DM1 the expanded sequence is CTG, found in the ''DMPK'' gene, while in DM2 it is CCTG, found in the ''ZNF9'' gene. Unlike the CAG expansion in Huntington's disease, the expanded repeats in DM1 and DM2 lie in non-coding regions of their genes and are not translated into protein. It has been shown, however, that there is a link between RNA toxicity and the repeat sequences in DM1 and DM2.
Amyotrophic Lateral Sclerosis and Frontotemporal Dementia
Not all diseases caused by repeated DNA sequences are trinucleotide repeat diseases. The diseases
amyotrophic lateral sclerosis
and
frontotemporal dementia
can be caused by expanded hexanucleotide GGGGCC repeat sequences in the ''
C9orf72
'' gene, causing RNA toxicity that leads to neurodegeneration.
Biotechnology
Repetitive DNA is hard to
sequence
using
next-generation sequencing
techniques because
sequence assembly
from short reads often cannot determine the length of a repetitive region that is longer than the reads themselves. This issue is particularly serious for microsatellites, which are made of short 1-6 bp repeat units.
Although they are difficult to sequence, these short repeats have great value in DNA fingerprinting and evolutionary studies. Many researchers have historically left out repetitive sequences when analyzing and publishing whole genome data due to technical limitations.
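The assembly problem can be made concrete with a toy Python sketch. It builds two hypothetical genomes that differ only in the copy number of a CA repeat and compares the sets of distinct reads each one yields. The flanking sequences, function names, and read lengths are all assumptions made for illustration, and the model ignores sequencing errors and coverage depth, which real pipelines may use as additional evidence.

```python
def toy_genome(n_copies, unit="CA", left="GATTACAGG", right="TTGCCGATA"):
    """Build a toy sequence: unique flank + tandem repeat + unique flank."""
    return left + unit * n_copies + right


def read_set(seq, read_len):
    """Return the set of all distinct error-free reads of a fixed length."""
    return {seq[i:i + read_len] for i in range(len(seq) - read_len + 1)}


if __name__ == "__main__":
    short_allele = toy_genome(10)  # 20 bp of repeat
    long_allele = toy_genome(20)   # 40 bp of repeat

    # Reads shorter than the repeat: both alleles give the same read set,
    # so the repeat length cannot be recovered from the reads alone.
    print(read_set(short_allele, 8) == read_set(long_allele, 8))

    # Reads long enough to span the shorter repeat and both flanks:
    # the read sets now differ, so the alleles can be told apart.
    print(read_set(short_allele, 30) == read_set(long_allele, 30))
```

The second comparison hints at why reads that span an entire repeat together with unique flanking sequence on both sides are needed to resolve its length.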
Bustos et al. proposed one method of sequencing long stretches of repetitive DNA.
The method combines the use of a linear vector for stabilization with exonuclease III for progressive deletion of regions rich in simple sequence repeats (SSRs). First, SSR-rich fragments are cloned into a linear vector that can stably incorporate tandem repeats up to 30 kb. Expression of the repeats is prevented by the transcriptional terminators in the vector. The second step involves the use of exonuclease III. The enzyme removes nucleotides from the 3' end, which produces unidirectional deletions of the SSR fragments. Finally, the products carrying the deleted fragments are multiplied and analyzed with colony PCR. The sequence is then built by ordered sequencing of a set of clones containing different deletions.