Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, microRNA, piRNA, ribosomal RNA, and regulatory RNAs). Other functional regions of the non-coding DNA fraction include regulatory sequences that control gene expression; scaffold attachment regions; origins of DNA replication; centromeres; and telomeres. Some non-coding regions appear to be mostly nonfunctional, such as introns, pseudogenes, intergenic DNA, and fragments of transposons and viruses.

Fraction of non-coding genomic DNA

In bacteria, the coding regions typically take up 88% of the genome. The remaining 12% consists largely of non-coding genes and regulatory sequences, which means that almost all of the bacterial genome has a function. The amount of coding DNA in eukaryotes is usually a much smaller fraction of the genome because eukaryotic genomes contain large amounts of repetitive DNA not found in prokaryotes. The human genome contains between 1 and 2% coding DNA. The exact number is not known because there are disputes over the number of functional coding exons and over the total size of the human genome. This means that 98–99% of the human genome consists of non-coding DNA, and this includes many functional elements such as non-coding genes and regulatory sequences.

Genome size in eukaryotes can vary over a wide range, even between closely related species. This puzzling observation was originally known as the C-value Paradox, where "C" refers to the haploid genome size. The paradox was resolved with the discovery that most of the differences were due to the expansion and contraction of repetitive DNA and not the number of genes. Some researchers speculated that this repetitive DNA was mostly junk DNA. The reasons for the changes in genome size are still being worked out and this problem is called the C-value Enigma.

This led to the observation that the number of genes does not seem to correlate with perceived notions of complexity because the number of genes seems to be relatively constant, an issue termed the G-value Paradox. For example, the genome of the unicellular ''Polychaos dubium'' (formerly known as ''Amoeba dubia'') has been reported to contain more than 200 times the amount of DNA in humans (i.e. more than 600 billion pairs of bases vs a bit more than 3 billion in humans). The pufferfish ''Takifugu rubripes'' genome is only about one eighth the size of the human genome, yet seems to have a comparable number of genes. Genes take up about 30% of the pufferfish genome and the coding DNA is about 10%. (Non-coding DNA = 90%.) The reduced size of the pufferfish genome is due to a reduction in the length of introns and less repetitive DNA.

''Utricularia gibba'', a bladderwort plant, has a very small nuclear genome (100.7 Mb) compared to most plants. It likely evolved from an ancestral genome that was 1,500 Mb in size. The bladderwort genome has roughly the same number of genes as other plants but the total amount of coding DNA comes to about 30% of the genome. The remainder of the genome (70% non-coding DNA) consists of promoters and regulatory sequences that are shorter than those in other plant species. The genes contain introns but there are fewer of them and they are smaller than the introns in other plant genomes. There are noncoding genes, including many copies of ribosomal RNA genes. The genome also contains telomere sequences and centromeres as expected. Much of the repetitive DNA seen in other eukaryotes has been deleted from the bladderwort genome since that lineage split from those of other plants. About 59% of the bladderwort genome consists of transposon-related sequences but since the genome is so much smaller than other genomes, this represents a considerable reduction in the amount of this DNA. The authors of the original 2013 article note that claims of additional functional elements in the non-coding DNA of animals do not seem to apply to plant genomes. According to a New York Times piece, during the evolution of this species, "... genetic junk that didn’t serve a purpose was expunged, and the necessary stuff was kept." According to Victor Albert of the University of Buffalo, the plant is able to expunge its so-called junk DNA and "have a perfectly good multicellular plant with lots of different cells, organs, tissue types and flowers, and you can do it without the junk. Junk is not needed."

Types of non-coding DNA sequences

Noncoding genes

There are two types of genes: protein-coding genes and noncoding genes. Noncoding genes are an important part of non-coding DNA and they include genes for transfer RNA and ribosomal RNA. These genes were discovered in the 1960s. Prokaryotic genomes contain genes for a number of other noncoding RNAs but noncoding RNA genes are much more common in eukaryotes. Typical classes of noncoding genes in eukaryotes include genes for small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), microRNAs (miRNAs), short interfering RNAs (siRNAs), PIWI-interacting RNAs (piRNAs), and long noncoding RNAs (lncRNAs). In addition, there are a number of unique RNA genes that produce catalytic RNAs. Noncoding genes account for only a few percent of prokaryotic genomes but they can represent a vastly higher fraction in eukaryotic genomes. In humans, the noncoding genes take up at least 6% of the genome, largely because there are hundreds of copies of ribosomal RNA genes. Protein-coding genes occupy about 38% of the genome, a fraction that is much higher than the coding region because genes contain large introns. The total number of noncoding genes in the human genome is controversial. Some scientists think that there are only about 5,000 noncoding genes while others believe that there may be more than 100,000 (see the article on non-coding RNA). The difference is largely due to debate over the number of lncRNA genes.

Promoters and regulatory elements

Promoters are DNA segments near the 5' end of the gene where transcription begins. They are the sites where RNA polymerase binds to initiate RNA synthesis. Every gene has a noncoding promoter.

Regulatory elements are sites that control the transcription of a nearby gene. They are almost always sequences where transcription factors bind to DNA, and these transcription factors can either activate transcription (activators) or repress transcription (repressors). Regulatory elements were discovered in the 1960s and their general characteristics were worked out in the 1970s by studying specific transcription factors in bacteria and bacteriophage. Promoters and regulatory sequences represent an abundant class of noncoding DNA but they mostly consist of a collection of relatively short sequences, so they don't take up a very large fraction of the genome. The exact amount of regulatory DNA in mammalian genomes is unclear because it is difficult to distinguish between spurious transcription factor binding sites and those that are functional. The binding characteristics of typical DNA-binding proteins were characterized in the 1970s, and the biochemical properties of transcription factors predict that in cells with large genomes the majority of binding sites will be fortuitous and not biologically functional. Many regulatory sequences occur near promoters, usually upstream of the transcription start site of the gene. Some occur within a gene and a few are located downstream of the transcription termination site. In eukaryotes, there are some regulatory sequences that are located at a considerable distance from the promoter region. These distant regulatory sequences are often called enhancers, but there is no rigorous definition of enhancer that distinguishes it from other transcription factor binding sites.
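The expectation that most binding sites in a large genome are fortuitous can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a deliberately simplified model (all four bases equally frequent and independent, and an exact match to a single fixed motif). The function name, the 8 bp motif length, and the two genome sizes are illustrative assumptions, not values taken from the work summarized here.

```python
def expected_random_sites(genome_bp, motif_len, both_strands=True):
    """Expected number of chance matches to one fixed motif.

    Simplifying assumptions (not from the article): bases are
    equiprobable and independent, and a site is an exact match.
    """
    per_position = 0.25 ** motif_len      # chance of a match at one position
    strands = 2 if both_strands else 1    # a factor can bind either strand
    return genome_bp * strands * per_position


# Hypothetical inputs: an 8 bp motif in a ~5 Mb bacterial-sized genome
# versus a ~3.1 Gb human-sized genome.
for label, size in [("bacterial-sized", 5_000_000), ("human-sized", 3_100_000_000)]:
    print(label, round(expected_random_sites(size, 8)))
```

Under these assumptions the expected number of chance matches scales linearly with genome size, so a genome several hundred times larger offers several hundred times more fortuitous sites for the same factor. Real binding is not all-or-nothing and base composition is uneven, so the numbers are only indicative.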

Introns

Introns are the parts of a gene that are transcribed into the precursor RNA sequence, but ultimately removed by RNA splicing during the processing to mature RNA. Introns are found in both types of genes: protein-coding genes and noncoding genes. They are present in prokaryotes but they are much more common in eukaryotic genomes. Group I and group II introns take up only a small percentage of the genome when they are present. Spliceosomal introns (see Figure) are only found in eukaryotes and they can represent a substantial proportion of the genome. In humans, for example, introns in protein-coding genes cover 37% of the genome. Combining that with about 1% coding sequences means that protein-coding genes occupy about 38% of the human genome. The calculations for noncoding genes are more complicated because there is considerable dispute over the total number of noncoding genes, but taking only the well-defined examples means that noncoding genes occupy at least 6% of the genome.

Untranslated regions

The standard biochemistry and molecular biology textbooks describe non-coding nucleotides in mRNA located between the 5' end of the gene and the translation initiation codon. These regions are called 5'-untranslated regions or 5'-UTRs. Similar regions called 3'-untranslated regions (3'-UTRs) are found at the 3' end of the gene. The 5'-UTRs and 3'-UTRs are very short in bacteria but they can be several hundred nucleotides in length in eukaryotes. They contain short elements that control the initiation of translation (5'-UTRs) and transcription termination (3'-UTRs) as well as regulatory elements that may control mRNA stability, processing, and targeting to different regions of the cell.

Origins of replication

DNA synthesis begins at specific sites called origins of replication. These are regions of the genome where the DNA replication machinery is assembled and the DNA is unwound to begin DNA synthesis. In most cases, replication proceeds in both directions from the replication origin. The main features of replication origins are sequences where specific initiation proteins are bound. A typical replication origin covers about 100-200 base pairs of DNA. Prokaryotes have one origin of replication per chromosome or plasmid but there are usually multiple origins in eukaryotic chromosomes. The human genome contains about 100,000 origins of replication representing about 0.3% of the genome.
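The 0.3% figure can be checked with simple arithmetic. The sketch below assumes a haploid human genome of about 3.1 billion base pairs (an assumed round value) and uses a hypothetical helper name; it is a sanity check, not a measurement.

```python
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate haploid genome size


def genome_fraction(n_elements, element_bp, genome_bp=HUMAN_GENOME_BP):
    """Fraction of the genome covered by n_elements of element_bp each."""
    return n_elements * element_bp / genome_bp


# About 100,000 origins, each covering roughly 100 to 200 bp.
low = genome_fraction(100_000, 100)
high = genome_fraction(100_000, 200)
print(f"{low:.2%} to {high:.2%}")
```

With these inputs the lower bound comes to roughly 0.3% of the genome, consistent with the figure quoted above; the upper bound is about twice that.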

Centromeres

Centromeres are the sites where spindle fibers attach to newly replicated chromosomes in order to segregate them into daughter cells when the cell divides. Each eukaryotic chromosome has a single functional centromere that's seen as a constricted region in a condensed metaphase chromosome. Centromeric DNA consists of a number of repetitive DNA sequences that often take up a significant fraction of the genome because each centromere can be millions of base pairs in length. In humans, for example, the sequences of all 24 centromeres have been determined and they account for about 6% of the genome. However, it's unlikely that all of this noncoding DNA is essential since there is considerable variation in the total amount of centromeric DNA in different individuals. Centromeres are another example of functional noncoding DNA sequences that have been known for almost half a century and it's likely that they are more abundant than coding DNA.

Telomeres

Telomeres are regions of repetitive DNA at the end of a chromosome, which provide protection from chromosomal deterioration during DNA replication. Recent studies have shown that telomeres function to aid in their own stability. Telomeric repeat-containing RNAs (TERRA) are transcripts derived from telomeres. TERRA has been shown to maintain telomerase activity and lengthen the ends of chromosomes.

Scaffold attachment regions

Both prokaryotic and eukaryotic genomes are organized into large loops of protein-bound DNA. In eukaryotes, the bases of the loops are called scaffold attachment regions (SARs) and they consist of stretches of DNA that bind an RNA/protein complex to stabilize the loop. There are about 100,000 loops in the human genome and each SAR consists of about 100 bp of DNA. The total amount of DNA devoted to SARs accounts for about 0.3% of the human genome.

Pseudogenes

Pseudogenes are mostly former genes that have become non-functional due to mutation, but the term also refers to inactive DNA sequences that are derived from RNAs produced by functional genes (processed pseudogenes). Pseudogenes are only a small fraction of noncoding DNA in prokaryotic genomes because they are eliminated by negative selection. In some eukaryotes, however, pseudogenes can accumulate because selection isn't powerful enough to eliminate them (see the nearly neutral theory of molecular evolution). The human genome contains about 15,000 pseudogenes derived from protein-coding genes and an unknown number derived from noncoding genes. They may cover a substantial fraction of the genome (~5%) since many of them contain former intron sequences. Pseudogenes are junk DNA by definition and they evolve at the neutral rate as expected for junk DNA. Some former pseudogenes have secondarily acquired a function and this leads some scientists to speculate that most pseudogenes are not junk because they have a yet-to-be-discovered function.

Repeat sequences, transposons and viral elements

Transposons and retrotransposons are mobile genetic elements. Retrotransposon repeated sequences, which include long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), account for a large proportion of the genomic sequences in many species. Alu sequences, classified as a short interspersed nuclear element, are the most abundant mobile elements in the human genome. Some examples have been found of SINEs exerting transcriptional control of some protein-encoding genes.

Endogenous retrovirus sequences are the product of reverse transcription of retrovirus genomes into the genomes of germ cells. Mutation within these retro-transcribed sequences can inactivate the viral genome.

Over 8% of the human genome is made up of (mostly decayed) endogenous retrovirus sequences, as part of the over 42% fraction that is recognizably derived from retrotransposons, while another 3% can be identified as the remains of DNA transposons. Much of the remaining half of the genome that is currently without an explained origin is expected to have originated from transposable elements that were active so long ago (> 200 million years) that random mutations have rendered them unrecognizable. Genome size variation in at least two kinds of plants is mostly the result of retrotransposon sequences.

Highly repetitive DNA

Highly repetitive DNA consists of short stretches of DNA that are repeated many times in tandem (one after the other). The repeat segments are usually between 2 bp and 10 bp but longer ones are known. Highly repetitive DNA is rare in prokaryotes but common in eukaryotes, especially those with large genomes. It is sometimes called satellite DNA.

Most of the highly repetitive DNA is found in centromeres and telomeres (see above) and most of it is functional although some might be redundant. The other significant fraction resides in short tandem repeats (STRs; also called microsatellites) consisting of short stretches of a simple repeat such as ATC. There are about 350,000 STRs in the human genome and they are scattered throughout the genome with an average length of about 25 repeats. Variations in the number of STR repeats can cause genetic diseases when they lie within a gene but most of these regions appear to be non-functional junk DNA where the number of repeats can vary considerably from individual to individual. This is why these length differences are used extensively in DNA fingerprinting.
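To make the idea of a tandem repeat concrete, the sketch below scans a sequence for short units repeated back to back, using a regular expression with a backreference. It is an illustrative toy, not a forensic or bioinformatics tool: the function name, the thresholds, and the two example sequences are all made up for this example.

```python
import re


def find_strs(seq, min_unit=2, max_unit=10, min_repeats=3):
    """Return (start, unit, repeat_count) for each tandem repeat found."""
    # Lazy unit match so the shortest repeating unit is preferred,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Two made-up alleles that differ only in the number of ATC repeats.
allele_a = "GGATCATCATCATCGG"
allele_b = "GGATCATCATCATCATCATCGG"
print(find_strs(allele_a))
print(find_strs(allele_b))
```

The two example alleles are reported with four and six copies of ATC respectively, which is the kind of length difference that profiling methods compare between individuals. Because the minimum unit length here is 2, a run of a single base would be reported as a two-base unit; a real tool would handle such cases, and imperfect repeats, more carefully.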

Junk DNA

"Junk DNA" refers broadly to "any DNA sequence that does not play a functional role in development, physiology, or some other organism-level capacity." The term "junk DNA" was used in the 1960s, but it only became widely known in 1972 in a paper by Susumu Ohno. Ohno noted that the mutational load from deleterious mutations placed an upper limit on the number of functional loci that could be expected given a typical mutation rate. He hypothesized that mammalian genomes could not have more than 30,000 loci under selection before the "cost" from the mutational load would cause an inescapable decline in fitness, and eventually extinction. The presence of junk DNA also explained the observation that even closely related species can have widely (orders-of-magnitude) different genome sizes (the C-value paradox).

Some authors assert that the term "junk DNA" occurs mainly in popular science and is no longer used in serious research articles. However, examination of ''Web of Science'' shows immediately that this is at best an oversimplification. Graur, for example, calculated that each human couple would need to have a vast number of children to maintain the population if the entire genome were functional:

The situation becomes much more absurd and untenable if we assume that the entire genome is functional, as proclaimed by creationists .... Under the assumption of 100% functionality and the range of deleterious mutation rates used in this paper, maintaining a constant population size would necessitate that each couple on average produce a minimum of 272 and a maximum of 5 × 10⁵³ children.
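The logic behind such estimates can be sketched with the classical mutational-load approximation, in which a population at mutation-selection balance has mean fitness e^(-U), where U is the number of new deleterious mutations per individual per generation. This is a textbook simplification offered for illustration, not necessarily the exact model used in the quoted paper, and every parameter value below is a hypothetical input.

```python
import math


def offspring_needed_per_couple(genome_bp, mut_rate, functional_frac, deleterious_frac):
    """Offspring per couple needed to offset mutational load.

    Assumes mean fitness = exp(-U) (classical approximation), with
    U = new deleterious mutations per diploid genome per generation.
    """
    u = 2 * genome_bp * mut_rate * functional_frac * deleterious_frac
    return 2 * math.exp(u)


# Hypothetical inputs: 3.1e9 bp haploid genome, 1e-8 mutations per site
# per generation, 10% of mutations in functional DNA are deleterious.
for frac in (0.1, 0.5, 1.0):
    n = offspring_needed_per_couple(3.1e9, 1e-8, frac, 0.1)
    print(f"functional fraction {frac:.0%}: about {n:,.0f} offspring per couple")
```

Because the required number of offspring grows exponentially with the functional fraction, modest changes in the assumed inputs move the result from plausible to biologically impossible, which is the core of the argument.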

Likewise, in a recent review Palazzo and Kejiou noted the impossibility of maintaining a population with 100% functionality, and pointed out that "many researchers continue to state, erroneously, that all non-coding DNA was once thought to be junk."

Since the late 1970s it has become apparent that most of the DNA in large genomes finds its origin in the selfish amplification of transposable elements, of which W. Ford Doolittle and Carmen Sapienza in 1980 wrote in the journal ''Nature'': "When a given DNA, or class of DNAs, of unproven phenotypic function can be shown to have evolved a strategy (such as transposition) which ensures its genomic survival, then no other explanation for its existence is necessary." The amount of junk DNA can be expected to depend on the rate of amplification of these elements and the rate at which non-functional DNA is lost. Another source is genome duplication followed by a loss of function due to redundancy. In the same issue of ''Nature'', Leslie Orgel and Francis Crick wrote that junk DNA has "little specificity and conveys little or no selective advantage to the organism".

The term "junk DNA" may provoke a strong reaction and some have recommended using more neutral terminology such as "nonfunctional DNA." Junk DNA is often confused with non-coding DNA.

ENCODE Project

The Encyclopedia of DNA Elements (

ENCODE The Encyclopedia of DNA Elements (ENCODE) is a public research project which aims to identify functional elements in the human genome. ENCODE also supports further biomedical research by "generating community resources of genomics data, software ...

) project found, by direct biochemical approaches, that at least 80% of human genomic DNA has biochemical activity such as "transcription, transcription factor association, chromatin structure, and histone modification". Though this was not necessarily unexpected, given previous decades of research that had discovered many functional non-coding regions, some scientists criticized the conclusion for conflating biochemical activity with

biological function

. Some have argued that neither the accessibility of segments of the genome to transcription factors nor their transcription guarantees that those segments have biochemical function and that their transcription is selectively advantageous. After all, non-functional sections of the genome can be transcribed, given that transcription factors typically bind to short sequences that occur by chance throughout the genome. Others, however, have argued against relying solely on estimates from comparative genomics because of its limited scope: non-coding DNA has been found to be involved in epigenetic activity and in complex networks of genetic interactions, and it is studied in evolutionary developmental biology.

Prior to ENCODE, the much lower estimates of functionality were based on genomic conservation across mammalian lineages. Estimates of the biologically functional fraction of the human genome based on comparative genomics range between 8 and 15%. One consistent indication of the biological functionality of a genomic region is that its sequence has been maintained by purifying selection (that is, mutating the sequence away is deleterious to the organism). Under this definition, 90% of the genome is 'junk'. Some stress, however, that 'junk' is not 'garbage' and that the large body of nonfunctional transcripts produced by 'junk DNA' can evolve functional elements ''de novo''. Widespread transcription and splicing in the human genome have also been discussed as another indicator of genetic function, in addition to genomic conservation, which may miss poorly conserved functional sequences. Moreover, much of the apparent junk DNA is involved in epigenetic regulation and appears to be necessary for the development of complex organisms.
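The argument that short binding sequences occur by chance across the genome can be made concrete with a rough calculation. The sketch below is illustrative only: it assumes every base is independent and equally likely, which real genomes only approximate, and it uses a round figure of 3.1 billion base pairs for the human haploid genome. The function name and parameters are hypothetical and do not come from ENCODE or any cited study.

```python
def expected_chance_matches(genome_length: int, motif_length: int,
                            both_strands: bool = False) -> float:
    """Expected number of exact matches to one fixed motif in a random sequence.

    Assumption: bases are independent and uniform (probability 1/4 each).
    With both_strands=True the count is simply doubled, which overcounts
    motifs that equal their own reverse complement.
    """
    positions = genome_length - motif_length + 1  # possible start positions
    if positions <= 0:
        return 0.0
    per_position = 0.25 ** motif_length           # chance of a match at one position
    strands = 2 if both_strands else 1
    return positions * per_position * strands


# Assumed round figure for the human haploid genome, in base pairs.
GENOME_LENGTH = 3_100_000_000

for k in (6, 8, 10, 12):
    print(f"motif length {k}: ~{expected_chance_matches(GENOME_LENGTH, k):,.0f} chance matches")
```

Under these simplifying assumptions, an 8-base motif is expected roughly 47,000 times on one strand by chance alone, which is why the mere presence of a binding sequence is weak evidence of function.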
Contributing to the debate is the lack of consensus on what constitutes a "functional" element in the genome, since geneticists, evolutionary biologists, and molecular biologists employ different approaches and definitions of "function", often without making clear in the literature what they mean. Furthermore, the methods used have limitations: genetic approaches may miss functional elements that do not manifest physically in the organism; evolutionary approaches have difficulty producing accurate multispecies sequence alignments, since the genomes of even closely related species vary considerably; and biochemical approaches, though highly reproducible, yield biochemical signatures that do not automatically signify a function. Kellis et al. noted that 70% of the transcription coverage was less than 1 transcript per cell (and may thus be based on spurious background transcription). On the other hand, they argued that a 12–15% fraction of human DNA may be under functional constraint, and that this may still be an underestimate when lineage-specific constraints are included. Ultimately, genetic, evolutionary, and biochemical approaches can all be used in a complementary way to identify regions that may be functional in human biology and disease. Some critics have argued that functionality can only be assessed in reference to an appropriate

null hypothesis

. In this case, the null hypothesis would be that these parts of the genome are non-functional and have properties, be it on the basis of conservation or biochemical activity, that would be expected of such regions based on our general understanding of

molecular evolution

and

biochemistry

. According to these critics, until a region in question has been shown to have additional features, beyond what is expected of the null hypothesis, it should provisionally be labelled as non-functional.
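One simple way to picture this null-hypothesis reasoning is a one-sided binomial test on sequence conservation. The sketch below is an illustration under strong assumptions, not a method taken from the critics cited here: it treats every site in a region as independent and assumes a single known neutral rate at which sites differ between two species. The function name, parameters, and example numbers are all hypothetical.

```python
from math import comb


def conservation_p_value(sites: int, observed_differences: int,
                         neutral_rate: float) -> float:
    """P(X <= observed_differences) for X ~ Binomial(sites, neutral_rate).

    Null hypothesis: the region is non-functional, so each site differs
    between the two species with the neutral probability. A small value
    means the region is more conserved than the null predicts.
    """
    if not 0.0 <= neutral_rate <= 1.0:
        raise ValueError("neutral_rate must be between 0 and 1")
    if not 0 <= observed_differences <= sites:
        raise ValueError("observed_differences must be between 0 and sites")
    return sum(
        comb(sites, k) * neutral_rate ** k * (1.0 - neutral_rate) ** (sites - k)
        for k in range(observed_differences + 1)
    )


# Hypothetical example: a 200-site region with 10 observed differences,
# compared against an assumed neutral difference rate of 0.15 per site.
p = conservation_p_value(sites=200, observed_differences=10, neutral_rate=0.15)
print(f"p-value under the neutral null: {p:.3g}")
```

Only when such a value is small would the region show "additional features" beyond the null in the sense described above; otherwise it would provisionally stay labelled as non-functional.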

Genome-wide association studies (GWAS) and non-coding DNA

Genome-wide association studies

(GWAS) identify linkages between alleles and observable traits such as phenotypes and diseases. Most of the associations are between

single-nucleotide polymorphisms

(SNPs) and the trait being examined, and most of these SNPs are located in non-functional DNA. The association establishes a linkage that helps map the DNA region responsible for the trait, but it does not necessarily identify the mutations causing the disease or phenotypic difference. SNPs that are tightly linked to traits are the ones most likely to identify a causal mutation. (The association is referred to as tight

linkage disequilibrium

.) About 12% of these polymorphisms are found in coding regions; about 40% are located in introns; and most of the rest are found in intergenic regions, including regulatory sequences.
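Linkage disequilibrium between two loci is commonly summarised by the coefficient D and the squared correlation r². The sketch below computes both from haplotype and allele frequencies using the standard textbook definitions. The function name and the example frequencies are hypothetical and are not taken from any GWAS dataset.

```python
from typing import Tuple


def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D measures how far the haplotype frequency departs from independence.
    d = p_ab - p_a * p_b
    denominator = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denominator == 0.0:
        raise ValueError("both loci must be polymorphic (allele frequency strictly between 0 and 1)")
    r_squared = d * d / denominator
    return d, r_squared


# Hypothetical frequencies: the two alleles almost always travel together.
d, r2 = linkage_disequilibrium(p_ab=0.28, p_a=0.30, p_b=0.30)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

An r² close to 1 corresponds to the "tight" linkage described above: the genotyped SNP is a good proxy for nearby variants, which is why an associated SNP need not itself be the causal mutation.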

External links

Plant DNA C-values Database
at

Royal Botanic Gardens, Kew

Fungal Genome Size Database
at

Estonian Institute of Zoology and Botany

ENCODE: The human encyclopaedia
at ''
