
''De novo'' gene birth is the process by which new genes evolve from DNA sequences that were ancestrally non-genic. ''De novo'' genes represent a subset of novel genes, and may be protein-coding or instead act as RNA genes. The processes that govern ''de novo'' gene birth are not well understood, although several models describe possible mechanisms by which it may occur. Although ''de novo'' gene birth may have occurred at any point in an organism's evolutionary history, ancient ''de novo'' gene birth events are difficult to detect. Most studies of ''de novo'' genes to date have thus focused on young genes, typically taxonomically restricted genes (TRGs) that are present in a single species or lineage, including so-called orphan genes, defined as genes that lack any identifiable homolog. Not all orphan genes arise ''de novo'', however; some instead emerge through fairly well characterized mechanisms such as gene duplication (including retroposition) or horizontal gene transfer followed by sequence divergence, or by gene fission/fusion. Although ''de novo'' gene birth was once viewed as a highly unlikely occurrence, several unequivocal examples have now been described, and some researchers speculate that ''de novo'' gene birth could play a major role in evolutionary innovation.


History

As early as the 1930s, J. B. S. Haldane and others suggested that copies of existing genes may lead to new genes with novel functions. In 1970, Susumu Ohno published the seminal text ''Evolution by Gene Duplication''. For some time subsequently, the consensus view was that virtually all genes were derived from ancestral genes, with François Jacob famously remarking in a 1977 essay that "the probability that a functional protein would appear ''de novo'' by random association of amino acids is practically zero."

In the same year, however, Pierre-Paul Grassé coined the term "overprinting" to describe the emergence of genes through the expression of alternative open reading frames (ORFs) that overlap preexisting genes. These new ORFs may be out of frame with or antisense to the preexisting gene. They may also be in frame with the existing ORF, creating a truncated version of the original gene, or represent 3’ extensions of an existing ORF into a nearby ORF. The first two types of overprinting may be thought of as a particular subtype of ''de novo'' gene birth: although the new ORF overlaps a previously coding region of the genome, the primary amino-acid sequence of the new protein is entirely novel and derived from a frame that did not previously contain a gene. The first examples of this phenomenon in bacteriophages were reported in a series of studies from 1976 to 1978, and since then numerous other examples have been identified in viruses, bacteria, and several eukaryotic species.

The phenomenon of exonization also represents a special case of ''de novo'' gene birth, in which, for example, often-repetitive intronic sequences acquire splice sites through mutation, leading to ''de novo'' exons. This was first described in 1994 in the context of ''Alu'' sequences found in the coding regions of primate mRNAs. Such ''de novo'' exons are frequently found in minor splice variants, which may allow the evolutionary “testing” of novel sequences while retaining the functionality of the major splice variant(s). Still, it was thought by some that most or all eukaryotic proteins were constructed from a constrained pool of “starter type” exons. Using the sequence data available at the time, a 1991 review estimated the number of unique, ancestral eukaryotic exons to be fewer than 60,000, while a 1992 article estimated that the vast majority of proteins belonged to no more than 1,000 families.

Around the same time, however, the sequence of chromosome III of the budding yeast ''Saccharomyces cerevisiae'' was released, representing the first time an entire chromosome from any eukaryotic organism had been sequenced. Sequencing of the entire yeast nuclear genome was then completed by early 1996 through a massive, collaborative international effort. In his review of the yeast genome project, Bernard Dujon noted that the unexpected abundance of genes lacking any known homologs was perhaps the most striking finding of the entire project.

In 2006 and 2007, a series of studies provided arguably the first documented examples of ''de novo'' gene birth that did not involve overprinting. These studies were conducted using the accessory gland transcriptomes of ''Drosophila yakuba'' and ''Drosophila erecta'', and they identified 20 putative lineage-restricted genes that appeared unlikely to have resulted from gene duplication. Levine and colleagues identified and confirmed five ''de novo'' candidate genes specific to ''Drosophila melanogaster'' and/or the closely related ''Drosophila simulans'' through a rigorous approach that combined bioinformatic and experimental techniques. Since these initial studies, many groups have identified specific cases of ''de novo'' gene birth events in diverse organisms.

The first ''de novo'' gene identified in yeast, ''BSC4'', was described in ''S. cerevisiae'' in 2008. This gene shows evidence of purifying selection, is expressed at both the mRNA and protein levels, and when deleted is synthetically lethal with two other yeast genes, all of which indicate a functional role for the ''BSC4'' gene product. Historically, one argument against the notion of widespread ''de novo'' gene birth is the evolved complexity of protein folding. Bsc4 was later shown to adopt a partially folded state that combines properties of native and non-native protein folding. In plants, the first ''de novo'' gene to be functionally characterized was ''QQS'', an ''Arabidopsis thaliana'' gene identified in 2009 that regulates carbon and nitrogen metabolism. The first functionally characterized ''de novo'' gene identified in mice, a noncoding RNA gene, was also described in 2009. In primates, a 2008 informatic analysis estimated that 15 of 270 primate orphan genes had been formed ''de novo''. A 2009 report identified the first three ''de novo'' human genes, one of which is a therapeutic target in chronic lymphocytic leukemia. Since then, many genome-level studies have identified large numbers of orphan genes in many organisms, although the extent to which they arose ''de novo'', and the degree to which they can be deemed functional, remain debated.


Identification


Identification of ''de novo'' emerging sequences

There are two major approaches to the systematic identification of novel genes: genomic phylostratigraphy and synteny-based methods. Both approaches are widely used, individually or in a complementary fashion.


Genomic phylostratigraphy

Genomic phylostratigraphy involves examining each gene in a focal, or reference, species and inferring the presence or absence of ancestral homologs through the use of the BLAST sequence alignment algorithms or related tools. Each gene in the focal species can be assigned an age (also known as a “conservation level” or “genomic phylostratum”) that is based on a predetermined phylogeny, with the age corresponding to the most distantly related species in which a homolog is detected. When a gene lacks any detectable homolog outside of its own genome, or close relatives, it is said to be a novel, taxonomically restricted or orphan gene.

Phylostratigraphy is limited by the set of closely related genomes that are available, and results are dependent on BLAST search criteria. In addition, it is often difficult to determine from a lack of observed sequence similarity alone whether a novel gene has emerged ''de novo'' or has diverged from an ancestral gene beyond recognition, for instance following a duplication event. This was pointed out by a study that simulated the evolution of genes of equal age and found that distant orthologs can be undetectable for rapidly evolving genes. On the other hand, when accounting for changes in the rate of evolution in young regions of genes, a phylostratigraphic approach was more accurate at assigning gene ages in simulated data. Subsequent studies using simulated evolution found that phylostratigraphy failed to detect an ortholog in the most distantly related species for 13.9% of ''D. melanogaster'' genes and 11.4% of ''S. cerevisiae'' genes. However, a reanalysis of studies that used phylostratigraphy in yeast, fruit flies and humans found that even when accounting for such error rates and excluding difficult-to-stratify genes from the analyses, the qualitative conclusions were unaffected. The impact of phylostratigraphic bias on studies examining various features of ''de novo'' genes remains debated.
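The age-assignment rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strata, species lists, gene names and search results are hypothetical, and the homolog search itself (for example with BLAST) is assumed to have been run separately.

```python
from typing import List, Set, Tuple

# Hypothetical phylogeny for illustration, ordered from youngest to oldest.
# Each stratum lists only the species that first join the tree at that level.
STRATA: List[Tuple[str, Set[str]]] = [
    ("focal species", {"D. melanogaster"}),
    ("melanogaster subgroup", {"D. simulans", "D. yakuba"}),
    ("Diptera", {"A. gambiae"}),
    ("Metazoa", {"H. sapiens"}),
]


def assign_phylostratum(hit_species: Set[str],
                        strata: List[Tuple[str, Set[str]]]) -> str:
    """Return the oldest stratum containing a species with a detected homolog.

    hit_species is the set of species in which a similarity search reported
    a homolog; producing that set is outside the scope of this sketch.
    """
    age = strata[0][0]  # default: no homolog outside the focal species
    for name, species in strata:
        if hit_species & species:
            age = name  # keep overwriting so the most distant hit wins
    return age


# Made-up search results for three hypothetical genes.
hits = {
    "geneA": {"D. melanogaster"},
    "geneB": {"D. melanogaster", "D. simulans"},
    "geneC": {"D. melanogaster", "H. sapiens"},
}
ages = {gene: assign_phylostratum(species, STRATA)
        for gene, species in hits.items()}
```

In this toy input, ''geneA'' would be treated as an orphan candidate because no homolog is found outside the focal species. As the text notes, such a result does not by itself distinguish ''de novo'' origin from divergence beyond recognition.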


Synteny-based approaches

Synteny-based approaches use the order and relative positioning of genes (or other features) to identify the potential ancestors of candidate ''de novo'' genes. Syntenic alignments are anchored by conserved “markers.” Genes are the most common marker in defining syntenic blocks, although k-mers and exons are also used. Confirmation that the syntenic region lacks coding potential in outgroup species allows a ''de novo'' origin to be asserted with higher confidence. The strongest possible evidence for ''de novo'' emergence is the inference of the specific "enabling" mutation(s) that created coding potential, typically through the analysis of smaller sequence regions, termed microsyntenic regions, of closely related species. One challenge in applying synteny-based methods is that synteny can be difficult to detect across longer timescales. To address this, various optimization techniques have been created, such as using exons clustered irrespective of their specific order to define syntenic blocks, or algorithms that use well-conserved genomic regions to expand microsyntenic blocks. There are also difficulties associated with applying synteny-based approaches to genome assemblies that are fragmented, or to lineages with high rates of chromosomal rearrangements, as is common in insects. Synteny-based approaches can be applied to genome-wide surveys of ''de novo'' genes and represent a promising area of algorithmic development for gene birth dating. Some groups have combined synteny-based approaches with similarity searches to develop standardized, stringent pipelines that can be applied to any group of genomes, with the aim of resolving discrepancies among the various published lists of ''de novo'' genes.
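A minimal sketch of the anchoring idea, assuming gene order is available as simple lists and that an ortholog table for the conserved marker genes has already been computed. The function name, gene identifiers and inputs are hypothetical and do not correspond to any published pipeline.

```python
from typing import Dict, List, Optional


def syntenic_interval(focal_order: List[str],
                      outgroup_order: List[str],
                      orthologs: Dict[str, str],
                      candidate: str) -> Optional[List[str]]:
    """Locate the outgroup region syntenic to a candidate gene.

    The nearest conserved genes on either side of the candidate are used as
    anchors. The function returns the outgroup genes annotated between the
    two anchor orthologs, or None if an anchor is missing on either side.
    Every ortholog listed is assumed to be present in outgroup_order.
    """
    i = focal_order.index(candidate)
    left = next((g for g in reversed(focal_order[:i]) if g in orthologs), None)
    right = next((g for g in focal_order[i + 1:] if g in orthologs), None)
    if left is None or right is None:
        return None  # synteny cannot be established from these anchors
    a = outgroup_order.index(orthologs[left])
    b = outgroup_order.index(orthologs[right])
    lo, hi = sorted((a, b))  # tolerate an inverted block
    return outgroup_order[lo + 1:hi]


# Hypothetical gene orders: "new" is the candidate in the focal species.
focal = ["g1", "g2", "new", "g3", "g4"]
outgroup = ["o1", "o2", "o3", "o4"]
table = {"g1": "o1", "g2": "o2", "g3": "o3", "g4": "o4"}
between = syntenic_interval(focal, outgroup, table, "new")
```

An empty result means that no annotated gene lies between the anchors in the outgroup. Confirming a ''de novo'' origin would still require inspecting the underlying sequence for coding potential, which this sketch does not attempt.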


Determination of status

Even when the evolutionary origin of a particular coding sequence has been established, there is still a lack of consensus about what constitutes a genuine ''de novo'' gene birth event. One reason for this is a lack of agreement on whether or not the entirety of the sequence must be non-genic in origin. For protein-coding ''de novo'' genes, it has been proposed that they be divided into subtypes based on the proportion of the ORF in question that was derived from previously noncoding sequence. Furthermore, for ''de novo'' gene birth to occur, the sequence in question must be a gene, which has led to questioning of what constitutes a gene, with some models establishing a strict dichotomy between genic and non-genic sequences, and others proposing a more fluid continuum.

All definitions of genes are linked to the notion of function, as it is generally agreed that a genuine gene should encode a functional product, be it RNA or protein. There are, however, different views of what constitutes function, depending on whether a given sequence is assessed using genetic, biochemical, or evolutionary approaches. The ambiguity of the concept of ‘function’ is especially problematic for the ''de novo'' gene birth field, where the objects of study are often rapidly evolving. To address these challenges, the Pittsburgh Model of Function deconstructs ‘function’ into five meanings to describe the different properties that are acquired by a locus undergoing ''de novo'' gene birth: Expression, Capacities, Interactions, Physiological Implications, and Evolutionary Implications.

It is generally accepted that a genuine ''de novo'' gene is expressed in at least some context, allowing selection to operate, and many studies use evidence of expression as an inclusion criterion in defining ''de novo'' genes. The expression of sequences at the mRNA level may be confirmed individually through techniques such as quantitative PCR, or globally through RNA sequencing (RNA-seq). Similarly, expression at the protein level can be determined with high confidence for individual proteins using techniques such as mass spectrometry or western blotting, while ribosome profiling (Ribo-seq) provides a global survey of translation in a given sample. Ideally, to confirm that a gene arose ''de novo'', a lack of expression of the syntenic region in outgroup species would also be demonstrated.

Genetic approaches that detect a specific phenotype or change in fitness upon disruption of a particular sequence are useful to infer function. Other experimental approaches, including screens for protein-protein and/or genetic interactions, may also be employed to confirm a biological effect for a particular ''de novo'' ORF.

Evolutionary approaches may be employed to infer the existence of a molecular function from computationally derived signatures of selection. In the case of TRGs, one common signature of selection is the ratio of nonsynonymous to synonymous substitutions (dN/dS ratio), calculated from different species from the same taxon. Similarly, in the case of species-specific genes, polymorphism data may be used to calculate a pN/pS ratio from different strains or populations of the focal species. Given that young, species-specific ''de novo'' genes lack deep conservation by definition, detecting statistically significant deviations from 1 can be difficult without an unrealistically large number of sequenced strains/populations. An example of this can be seen in ''Mus musculus'', where three very young ''de novo'' genes lack signatures of selection despite well-demonstrated physiological roles. For this reason, pN/pS approaches are often applied to groups of candidate genes, allowing researchers to infer that at least some of them are evolutionarily conserved, without being able to specify which. Other signatures of selection, such as the degree of nucleotide divergence within syntenic regions, conservation of ORF boundaries, or, for protein-coding genes, a coding score based on nucleotide hexamer frequencies, have instead been employed.
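To illustrate why pooling helps, the following sketch computes a single pN/pS ratio across a group of candidate genes by summing polymorphism counts and site counts before taking the ratio. The counts are invented for illustration, and the numbers of nonsynonymous and synonymous sites per gene are assumed to have been estimated beforehand by a standard method that is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GeneCounts:
    nonsyn_polymorphisms: int  # observed nonsynonymous polymorphisms
    syn_polymorphisms: int     # observed synonymous polymorphisms
    nonsyn_sites: float        # estimated number of nonsynonymous sites
    syn_sites: float           # estimated number of synonymous sites


def pooled_pn_ps(genes: Iterable[GeneCounts]) -> float:
    """Pooled pN/pS: polymorphisms per site, summed over all genes."""
    genes = list(genes)
    pn = (sum(g.nonsyn_polymorphisms for g in genes)
          / sum(g.nonsyn_sites for g in genes))
    ps = (sum(g.syn_polymorphisms for g in genes)
          / sum(g.syn_sites for g in genes))
    if ps == 0:
        raise ValueError("no synonymous polymorphisms; ratio is undefined")
    return pn / ps


# Invented counts for three hypothetical young genes.
candidates = [
    GeneCounts(2, 3, 220.0, 80.0),
    GeneCounts(1, 2, 150.0, 50.0),
    GeneCounts(3, 4, 300.0, 100.0),
]
ratio = pooled_pn_ps(candidates)
```

Each gene here carries only a handful of polymorphisms, so a per-gene ratio would be very noisy. The pooled value is more stable, but as the text points out, it only supports a statement about the group and cannot say which individual genes are under selection. A real analysis would also need a significance test, which is omitted.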


Prevalence


Estimates of numbers

Frequency and number estimates of ''de novo'' genes in various lineages vary widely and are highly dependent on methodology. Studies may identify ''de novo'' genes by phylostratigraphy/BLAST-based methods alone, or may employ a combination of computational techniques, and may or may not assess experimental evidence for expression and/or biological role. Furthermore, genome-scale analyses may consider all or most ORFs in the genome, or may instead limit their analysis to previously annotated genes.

The ''D. melanogaster'' lineage is illustrative of these differing approaches. An early survey using a combination of BLAST searches performed on cDNA sequences along with manual searches and synteny information identified 72 new genes specific to ''D. melanogaster'' and 59 new genes specific to three of the four species in the ''D. melanogaster'' species complex. This report found that only 2/72 (~2.8%) of ''D. melanogaster''-specific new genes and 7/59 (~11.9%) of new genes specific to the species complex were derived ''de novo'', with the remainder arising via duplication/retroposition. Similarly, an analysis of 195 young (<35 million years old) ''D. melanogaster'' genes identified from syntenic alignments found that only 16 had arisen ''de novo''. In contrast, an analysis focused on transcriptomic data from the testes of six ''D. melanogaster'' strains identified 106 fixed and 142 segregating ''de novo'' genes. For many of these, ancestral ORFs were identified but were not expressed. A newer study found that up to 39% of orphan genes in the ''Drosophila'' clade may have emerged ''de novo'', as they overlap with non-coding regions of the genome.

Highlighting the differences between inter- and intra-species comparisons, a study in natural ''Saccharomyces paradoxus'' populations found that the number of ''de novo'' polypeptides identified more than doubled when considering intra-species diversity. In primates, one early study identified 270 orphan genes (unique to humans, chimpanzees, and macaques), of which 15 were thought to have originated ''de novo''. Later reports identified many more ''de novo'' genes in humans alone that are supported by transcriptional and proteomic evidence. Studies in other lineages/organisms have also reached different conclusions with respect to the number of ''de novo'' genes present in each organism, as well as the specific sets of genes identified. A sample of these large-scale studies is described in the table below.

Generally speaking, it remains debated whether duplication and divergence or ''de novo'' gene birth represents the dominant mechanism for the emergence of new genes, in part because ''de novo'' genes are likely to both emerge and be lost more frequently than other young genes. In a study on the origin of orphan genes in three different eukaryotic lineages, the authors found that on average only around 30% of orphan genes can be explained by sequence divergence.
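The proportions quoted above follow directly from the reported counts. The short check below recomputes them; the dictionary labels are informal shorthand introduced here, not terms from the cited studies.

```python
def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage rounded to one decimal."""
    return round(100 * part / total, 1)


# (de novo genes, genes examined) as reported in the text above.
reported = {
    "D. melanogaster-specific new genes": (2, 72),
    "species-complex-specific new genes": (7, 59),
    "young D. melanogaster genes (<35 My)": (16, 195),
    "primate orphan genes": (15, 270),
}
shares = {label: percent(k, n) for label, (k, n) in reported.items()}
```

Under these counts, every study listed attributes a minority of the new or orphan genes it examined to ''de novo'' origin, though the studies differ in methodology and are not directly comparable.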


Dynamics

It is important to distinguish between the frequency of ''de novo'' gene birth and the number of ''de novo'' genes in a given lineage. If ''de novo'' gene birth is frequent, it might be expected that genomes would tend to grow in their gene content over time; however, the gene content of genomes is usually relatively stable. This implies that a frequent gene death process must balance ''de novo'' gene birth, and indeed, ''de novo'' genes are distinguished by their rapid turnover relative to established genes. In support of this notion, recently emerged ''Drosophila'' genes are much more likely to be lost, primarily through pseudogenization, with the youngest orphans being lost at the highest rate; this is despite the fact that some ''Drosophila'' orphan genes have been shown to rapidly become essential. A similar trend of frequent loss among young gene families was observed in the nematode genus '' Pristionchus''. Similarly, an analysis of five mammalian transcriptomes found that most ORFs in mice were either very old or species specific, implying frequent birth and death of ''de novo'' transcripts. A comparable trend could be shown by further analyses of six primate transcriptomes. In wild ''S. paradoxus'' populations, ''de novo'' ORFs emerge and are lost at similar rates. Nevertheless, there remains a positive correlation between the number of species-specific genes in a genome and the evolutionary distance from its most recent ancestor. A rapid gain and loss of ''de novo'' genes was also found on a population level by analyzing nine natural three-spined stickleback populations. In addition to the birth and death of ''de novo'' genes at the level of the ORF, mutational and other processes also subject genomes to constant “transcriptional turnover”. 
One study in murines found that while all regions of the ancestral genome were transcribed at some point in at least one descendant, the portion of the genome under active transcription in a given strain or subspecies is subject to rapid change. The transcriptional turnover of noncoding RNA genes is particularly fast compared to coding genes.


Example ''de novo'' gene table


Features


General features

Recently emerged ''de novo'' genes differ from established genes in a number of ways. Across a broad range of species, young and/or taxonomically restricted genes have been reported to be shorter in length than established genes, to evolve more rapidly, and to be less expressed. Although these trends could be a result of homology detection bias, a reanalysis of several studies that accounted for this bias found that the qualitative conclusions reached were unaffected. Young genes also tend to have their hydrophobic amino acids clustered more closely together along the primary sequence. The expression of young genes has also been found to be more tissue- or condition-specific than that of established genes. In particular, relatively high expression of ''de novo'' genes was observed in male reproductive tissues in ''Drosophila'', stickleback, mice, and humans, as well as in the human brain. In animals with adaptive immune systems, higher expression in the brain and testes may be a function of the immune-privileged nature of these tissues. An analysis in mice found specific expression of intergenic transcripts in the thymus and spleen (in addition to the brain and testes). It has been proposed that in vertebrates ''de novo'' transcripts must first be expressed in tissues lacking immune cells before they can be expressed in tissues that have immune surveillance.


Features that promote ''de novo'' gene birth

It is also of interest to compare features of recently emerged ''de novo'' genes to the pool of non-genic ORFs from which they emerge. Theoretical modeling has shown that such differences are the product both of selection for features that increase the likelihood of functionalization, and of neutral evolutionary forces that influence allelic turnover. Experiments in ''S. cerevisiae'' showed that predicted transmembrane domains were strongly associated with beneficial fitness effects when young ORFs were overexpressed, but not when established (older) ORFs were overexpressed. Experiments in ''E. coli'' showed that random peptides tended to have more benign effects when they were enriched for amino acids that were small and that promoted intrinsic structural disorder.


Lineage-dependent features

Features of ''de novo'' genes can depend on the species or lineage being examined. This appears to be partly a result of varying GC content among genomes, together with the fact that young genes bear more similarity to non-genic sequences from the genome in which they arose than do established genes. Features of the resulting protein, such as the percentage of transmembrane residues and the relative frequency of various predicted secondary structural features, show a strong GC dependency in orphan genes, whereas in more ancient genes these features are only weakly influenced by GC content.

The relationship between gene age and the amount of predicted intrinsic structural disorder (ISD) in the encoded proteins has been subject to considerable debate. It has been claimed that ISD is also a lineage-dependent feature, exemplified by the fact that in organisms with relatively high GC content, ranging from ''D. melanogaster'' to the parasite ''Leishmania major'', young genes have high ISD, while in a low-GC genome such as budding yeast, several studies have shown that young genes have low ISD. However, a study that excluded young genes with dubious evidence for functionality, defined in binary terms as being under selection for gene retention, found that the remaining young yeast genes have high ISD, suggesting that the yeast result may be due to contamination of the set of young genes with ORFs that do not meet this definition, and hence are more likely to have properties that reflect GC content and other non-genic features of the genome. Beyond the very youngest orphans, this study found that ISD tends to decrease with increasing gene age, and that this is primarily due to amino acid composition rather than GC content. Within shorter time scales, analysis of the ''de novo'' genes with the most validation suggests that younger genes are more disordered in ''Lachancea'', but less disordered in ''Saccharomyces''. Intrinsic structural disorder and aggregation propensity did not show significant differences with age in some studies of mammals and primates, but did in other studies of mammals. One large study of the entire Pfam protein domain database showed enrichment of younger protein domains for disorder-promoting amino acids across animals, but enrichment on the basis of amino acid availability in plants.


Role of epigenetic modifications

An examination of ''de novo'' genes in ''A. thaliana'' found that they are both hypermethylated and generally depleted of histone modifications. In agreement with either the proto-gene model or contamination with non-genes, methylation levels of ''de novo'' genes were intermediate between established genes and intergenic regions. The methylation patterns of these ''de novo'' genes are stably inherited, and methylation levels were highest, and most similar to established genes, in ''de novo'' genes with verified protein-coding ability. In the pathogenic fungus ''Magnaporthe oryzae'', less conserved genes tend to have methylation patterns associated with low levels of transcription. A study in yeasts also found that ''de novo'' genes are enriched at recombination hotspots, which tend to be nucleosome-free regions. In ''Pristionchus pacificus'', orphan genes with confirmed expression display chromatin states that differ from those of similarly expressed established genes. Orphan gene start sites have epigenetic signatures that are characteristic of enhancers, in contrast to conserved genes that exhibit classical promoters. Many unexpressed orphan genes are decorated with repressive histone modifications, while a lack of such modifications facilitates transcription of an expressed subset of orphans, supporting the notion that open chromatin promotes the formation of novel genes.


Structural features

As structure is usually more conserved than sequence, comparing structures between orthologs could provide deeper insights into ''de novo'' gene emergence and evolution and help to confirm these genes as true ''de novo'' genes. Nevertheless, so far only very few ''de novo'' proteins have been structurally and functionally characterized, largely because of problems with protein purification and subsequent stability. Progress has been made using different purification tags, cell types and chaperones. The ‘antifreeze glycoprotein’ (AFGP) in Arctic codfishes prevents their blood from freezing in Arctic waters. Bsc4, a short non-essential ''de novo'' protein in yeast, has been shown to consist mainly of beta-sheets and to have a hydrophobic core. It is associated with DNA repair under nutrient-deficient conditions. The ''Drosophila'' ''de novo'' protein Goddard was first characterized in 2017. Male ''Drosophila melanogaster'' knockdown flies were unable to produce sperm. It was later shown that this defect was due to a failure of individualization of elongated spermatids. Using computational phylogenomics and structure prediction, experimental structural analyses, and cell biological assays, it was proposed that half of Goddard's structure is disordered and the other half is alpha-helical. These analyses also indicated that Goddard's orthologs have similar structures. Goddard's structure therefore appears to have been largely conserved since its emergence.


Mechanisms


Pervasive expression

With the development of technologies such as RNA-seq and Ribo-seq, eukaryotic genomes are now known to be pervasively transcribed and translated. Many ORFs that are either unannotated, or annotated as long non-coding RNAs (lncRNAs), are translated at some level, in either a condition-specific or a tissue-specific manner. Though infrequent, these translation events expose non-genic sequence to selection. This pervasive expression forms the basis for several models describing ''de novo'' gene birth. It has been speculated that the epigenetic landscape of ''de novo'' genes in the early stages of formation may be particularly variable within and among populations, resulting in variable gene expression and thereby allowing young genes to explore the “expression landscape.” The ''QQS'' gene in ''A. thaliana'' is one example of this phenomenon; its expression is negatively regulated by DNA methylation that, while heritable for several generations, varies widely in its levels both among natural accessions and within wild populations. Epigenetic mechanisms are also largely responsible for the permissive transcriptional environment in the testes, particularly through the incorporation into nucleosomes of non-canonical histone variants that are replaced by histone-like protamines during spermatogenesis.


Intergenic ORFs as elementary structural modules

An analysis of fold potential diversity shows that the majority of the amino acid sequences encoded by the intergenic ORFs of ''S. cerevisiae'' are predicted to be foldable. More importantly, these amino acid sequences with folding potential can serve as elementary building blocks for ''de novo'' genes or be integrated into pre-existing genes.


Order of events

For the birth of a ''de novo'' protein-coding gene to occur, a non-genic sequence must both be transcribed and acquire an ORF before becoming translated. These events could occur in either order, and there is evidence supporting both an “ORF first” and a “transcription first” model. An analysis of ''de novo'' genes that are segregating in ''D. melanogaster'' found that sequences that are transcribed had similar coding potential to the orthologous sequences from lines lacking evidence of transcription. This finding supports the notion that many ORFs can exist prior to being transcribed. The antifreeze glycoprotein gene ''AFGP'', which emerged ''de novo'' in Arctic codfishes, provides a more definitive example in which the ''de novo'' emergence of the ORF was shown to precede that of the promoter region. Furthermore, putatively non-genic ORFs long enough to encode functional peptides are numerous in eukaryotic genomes, and are expected to occur at high frequency by chance. By tracing the evolutionary history of ORF sequences and the transcriptional activation of human ''de novo'' genes, a study showed that some ORFs were ready to confer biological significance upon their birth. At the same time, transcription of eukaryotic genomes is far more extensive than previously thought, and there are documented examples of genomic regions that were transcribed prior to the appearance of an ORF that became a ''de novo'' gene. The proportion of ''de novo'' genes that are protein-coding is unknown, but the appearance of “transcription first” has led some to posit that protein-coding ''de novo'' genes may first exist as RNA gene intermediates. The case of bifunctional RNAs, which are both translated and function as RNA genes, shows that such a mechanism is plausible. The two events may occur simultaneously when chromosomal rearrangement is the event that precipitates gene birth.
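The statement that ORFs are expected to arise by chance can be made concrete with a back-of-the-envelope calculation. The sketch below is not drawn from any of the studies summarized here. It assumes, purely for illustration, that bases are independent, that G and C are equally frequent, and that A and T are equally frequent. The function names and the choice of 100 codons are hypothetical. It computes the chance that a random run of codons contains no stop codon (TAA, TAG or TGA) and ignores the requirement for a start codon.

```python
def stop_codon_probability(gc_content: float) -> float:
    """Chance that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc_content / 2 and
    P(A) = P(T) = (1 - gc_content) / 2 (an illustrative simplification).
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must lie between 0 and 1")
    p_at = (1.0 - gc_content) / 2.0  # probability of A (and of T)
    p_gc = gc_content / 2.0          # probability of G (and of C)
    p_taa = p_at * p_at * p_at
    p_tag = p_at * p_at * p_gc
    p_tga = p_at * p_gc * p_at
    return p_taa + p_tag + p_tga


def prob_stop_free(n_codons: int, gc_content: float = 0.5) -> float:
    """Chance that n_codons consecutive random codons contain no stop codon."""
    if n_codons < 0:
        raise ValueError("n_codons must be non-negative")
    return (1.0 - stop_codon_probability(gc_content)) ** n_codons


if __name__ == "__main__":
    # Compare a 100-codon window at three hypothetical GC contents.
    for gc in (0.4, 0.5, 0.6):
        print(f"GC={gc:.1f}  P(stop-free, 100 codons)={prob_stop_free(100, gc):.4f}")
```

Under these assumptions a stop-free window of a given length is rare at any single position, but a genome offers a very large number of positions and six reading frames. Because all three stop codons are AT-rich, the per-codon stop probability falls as GC content rises, which is one reason ORF properties in young genes can track GC content.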


Models

Several theoretical models and possible mechanisms of ''de novo'' gene birth have been described. The models are generally not mutually exclusive, and it is possible that multiple mechanisms may give rise to ''de novo'' genes. An example is the type III antifreeze protein gene, which originates from an old sialic acid synthase (''SAS'') gene, in an Antarctic zoarcid fish.


“Out of Testis” hypothesis

An early case study of ''de novo'' gene birth, which identified five ''de novo'' genes in ''D. melanogaster'', noted preferential expression of these genes in the testes, and several additional ''de novo'' genes were identified using transcriptomic data derived from the testes and male accessory glands of ''D. yakuba'' and ''D. erecta''. This is in agreement with other studies showing rapid evolution of genes related to reproduction across a range of lineages, suggesting that sexual selection may play a key role in adaptive evolution and ''de novo'' gene birth. A subsequent large-scale analysis of six ''D. melanogaster'' strains identified 248 testis-expressed ''de novo'' genes, of which ~57% were not fixed. A recent study of twelve ''Drosophila'' species additionally identified a higher proportion of ''de novo'' genes with testis-biased expression compared to the annotated proteome. It has been suggested that the large number of ''de novo'' genes with male-specific expression identified in ''Drosophila'' is likely due to the fact that such genes are preferentially retained relative to other ''de novo'' genes, for reasons that are not entirely clear. Two putative ''de novo'' genes in ''Drosophila'' (''Goddard'' and ''Saturn'') were shown to be required for normal male fertility. A genetic screen of over 40 putative ''de novo'' genes with testis-enriched expression in ''Drosophila melanogaster'' revealed that one of them, ''atlas'', was required for proper chromatin condensation during the final stages of spermatogenesis in males. ''atlas'' evolved from the fusion of a protein-coding gene that arose at the base of the ''Drosophila'' genus and a conserved non-coding RNA. Comparative analysis of the transcriptomes of the testis and the accessory glands, a somatic male tissue that is important for fertility, in ''D. melanogaster'' suggests that ''de novo'' genes make a greater contribution to the transcriptomic complexity of the testis than to that of the accessory glands. Single-cell RNA-seq of ''D. melanogaster'' testis revealed that the expression pattern of ''de novo'' genes was biased toward early spermatogenesis.

In humans, a study that identified 60 human-specific ''de novo'' genes found that their average expression, as measured by RNA-seq, was highest in the testes. Another study looking at mammalian-specific genes more generally also found enriched expression in the testes. Transcription in mammalian testes is thought to be particularly promiscuous, due in part to elevated expression of the transcription machinery and an open chromatin environment. Along with the immune-privileged nature of the testes, this promiscuous transcription is thought to create the ideal conditions for the expression of non-genic sequences required for ''de novo'' gene birth. Testes-specific expression seems to be a general feature of all novel genes, as an analysis of ''Drosophila'' and vertebrate species found that young genes showed testes-biased expression regardless of their mechanism of origination.


Preadaptation model

The preadaptation model of ''de novo'' gene birth uses mathematical modeling to show that when sequences that are normally hidden are exposed to weak or shielded selection, the resulting pool of “cryptic” sequences (i.e. proto-genes) can be purged of “self-evidently deleterious” variants, such as those prone to lead to protein aggregation, and thus enriched in potential adaptations relative to a completely non-expressed and unpurged set of sequences. This revealing and purging of cryptic deleterious non-genic sequences is a byproduct of pervasive transcription and translation of intergenic sequences, and is expected to facilitate the birth of functional ''de novo'' protein-coding genes. This is because, once the most deleterious variants have been eliminated, what remains is more likely to be adaptive than expected from random sequences. Using the evolutionary definition of function (i.e. that a gene is by definition under purifying selection against loss), the preadaptation model assumes that “gene birth is a sudden transition to functionality” that occurs as soon as an ORF acquires a net beneficial effect. In order to avoid being deleterious, newborn genes are expected to display exaggerated versions of genic features associated with the avoidance of harm. This is in contrast to the proto-gene model, which expects newborn genes to have features intermediate between old genes and non-genes. The mathematics of the preadaptation model assume that the distribution of fitness effects is bimodal, with new sequences of mutations tending to break something or tinker, but rarely in between. Following this logic, populations may either evolve local solutions, in which selection operates on each individual locus and a relatively high error rate is maintained, or a global solution with a low error rate which permits the accumulation of deleterious cryptic sequences. ''De novo'' gene birth is thought to be favored in populations that evolve local solutions, as the relatively high error rate will result in a pool of cryptic variation that is “preadapted” through the purging of deleterious sequences. Local solutions are more likely in populations with a high effective population size. In support of the preadaptation model, an analysis of ISD in mice and yeast found that young genes have higher ISD than old genes, while random non-genic sequences tend to show the lowest levels of ISD. Although the observed trend may have partly resulted from a subset of young genes derived by overprinting, higher ISD in young genes is also seen among overlapping viral gene pairs. With respect to other predicted structural features such as β-strand content and aggregation propensity, the peptides encoded by proto-genes are similar to non-genic sequences and categorically distinct from canonical genes.
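The enrichment argument at the heart of the preadaptation model can be illustrated with simple arithmetic. The sketch below is not the published model's mathematics. It is a toy calculation that assumes every cryptic sequence falls into one of three hypothetical classes (strongly deleterious, effectively neutral, potentially adaptive) and that purging removes a fixed share of the deleterious class. All names and numbers are illustrative assumptions.

```python
def adaptive_fraction(p_deleterious: float, p_adaptive: float,
                      purge_efficiency: float) -> float:
    """Share of surviving cryptic sequences that are potentially adaptive.

    p_deleterious    -- starting share of strongly deleterious sequences
    p_adaptive       -- starting share of potentially adaptive sequences
    purge_efficiency -- share of deleterious sequences removed by selection
    All three values are hypothetical inputs, not measured quantities.
    """
    for value in (p_deleterious, p_adaptive, purge_efficiency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must lie between 0 and 1")
    if p_deleterious + p_adaptive > 1.0:
        raise ValueError("class shares cannot sum to more than 1")
    p_neutral = 1.0 - p_deleterious - p_adaptive
    surviving_deleterious = p_deleterious * (1.0 - purge_efficiency)
    surviving_total = surviving_deleterious + p_neutral + p_adaptive
    if surviving_total == 0.0:
        raise ValueError("no sequences survive purging")
    return p_adaptive / surviving_total


if __name__ == "__main__":
    # Hypothetical pool: 90% deleterious, 1% potentially adaptive.
    for efficiency in (0.0, 0.5, 1.0):
        share = adaptive_fraction(0.9, 0.01, efficiency)
        print(f"purge efficiency {efficiency:.1f}: adaptive share {share:.3f}")
```

In this toy setting, the more completely the deleterious class is purged, the larger the share of potentially adaptive sequences among those that remain. The absolute number of adaptive sequences does not change; only the composition of the pool does.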


Proto-gene model

The proto-gene model agrees with the preadaptation model about the importance of pervasive expression, and refers to the set of pervasively expressed sequences that do not meet all definitions of a gene as “proto-genes”. In contrast to the preadaptation model, the proto-gene model suggests that newborn genes have features intermediate between old genes and non-genes. Specifically, this model envisages a more gradual process under selection from the non-genic to the genic state, rejecting the binary classification of gene and non-gene. In an extension of the proto-gene model, it has been proposed that as proto-genes become more gene-like, their potential for adaptive change gives way to selected effects; thus, the predicted impact of mutations on fitness is dependent on the evolutionary status of the ORF. This notion is supported by the fact that overexpression of established ORFs in ''S. cerevisiae'' tends to be less beneficial (and more harmful) than does overexpression of emerging ORFs. Several features of ORFs correlate with ORF age as determined by phylostratigraphic analysis, with young ORFs having properties intermediate between old ORFs and non-genes; this has been taken as evidence in favor of the proto-gene model, in which proto-gene state is a continuum. This evidence has been criticized, because the same apparent trends are also expected under a model in which identity as a gene is binary. Under this model, when each age group contains a different ratio of genes vs. non-genes, Simpson's paradox can generate correlations in the wrong direction.
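The criticism above rests on a mixture effect that is easy to reproduce. The sketch below is a hypothetical illustration, not an analysis from the literature. It assumes that every sequence is strictly either a gene or a non-gene, that each category has a single fixed trait value, and that age classes differ only in the share of true genes they contain. It demonstrates the simpler binary-mixture confound rather than a full Simpson's paradox reversal. All names and numbers are invented for the example.

```python
# Hypothetical trait values: each sequence is strictly one or the other.
GENE_VALUE = 0.8
NON_GENE_VALUE = 0.2


def class_mean(gene_fraction: float) -> float:
    """Mean trait value of an age class that mixes genes and non-genes."""
    if not 0.0 <= gene_fraction <= 1.0:
        raise ValueError("gene_fraction must lie between 0 and 1")
    return gene_fraction * GENE_VALUE + (1.0 - gene_fraction) * NON_GENE_VALUE


# Assumed share of true genes in each age class (illustrative only).
age_classes = {"youngest": 0.2, "intermediate": 0.6, "oldest": 1.0}
means = {name: class_mean(fraction) for name, fraction in age_classes.items()}

if __name__ == "__main__":
    for name, mean in means.items():
        print(f"{name}: mean trait {mean:.2f}")
```

The class means rise smoothly with age even though no individual sequence has an intermediate value, so an apparent continuum across age classes cannot by itself distinguish a gradual model from a binary one.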


Grow slow and moult model

The “grow slow and moult” model describes a potential mechanism of ''de novo'' gene birth, particular to protein-coding genes. In this scenario, existing protein-coding ORFs expand at their ends, especially their 3’ ends, leading to the creation of novel N- and C-terminal domains. Novel C-terminal domains may first evolve under weak selection via occasional expression through read-through translation, as in the preadaptation model, only later becoming constitutively expressed through a mutation that disrupts the stop codon. Genes experiencing high translational readthrough tend to have intrinsically disordered C-termini. Furthermore, existing genes are often close to repetitive sequences that encode disordered domains. These novel, disordered domains may initially confer some non-specific binding capability that becomes gradually refined by selection. Sequences encoding these novel domains may occasionally separate from their parent ORF, leading or contributing to the creation of a ''de novo'' gene. Interestingly, an analysis of 32 insect genomes found that novel domains (i.e. those unique to insects) tend to evolve fairly neutrally, with only a few sites under positive selection, while their host proteins remain under purifying selection, suggesting that new functional domains emerge gradually and somewhat stochastically.


Escape from adaptive conflict

The escape from adaptive conflict (EAC) model proposes a possible way for a new gene duplicate to become fixed: conflict between contrasting functions within a single gene drives the fixation of the new duplicate.


Human health

In addition to its significance for the field of evolutionary biology, ''de novo'' gene birth has implications for human health. It has been speculated that novel genes, including ''de novo'' genes, may play an outsized role in species-specific traits; however, many species-specific genes lack functional annotation. Nevertheless, there is evidence to suggest that human-specific ''de novo'' genes are involved in diseases such as cancer. ''NYCM'', a ''de novo'' gene unique to humans and chimpanzees, regulates the pathogenesis of neuroblastomas in mouse models, and the primate-specific ''PART1'', an lncRNA gene, has been identified as both a tumor suppressor and an oncogene in different contexts. Several other human- or primate-specific ''de novo'' genes, including ''PBOV1'', ''GR6'', ''MYEOV'', ''ELFN1-AS1'', and ''CLLU1'', are also linked to cancer. Some have even suggested considering tumor-specifically expressed, evolutionarily novel genes as their own class of genetic elements, noting that many such genes are under positive selection and may be neofunctionalized in the context of tumors. The specific expression of many ''de novo'' genes in the human brain also raises the intriguing possibility that ''de novo'' genes influence human cognitive traits. One such example is ''FLJ33706'', a ''de novo'' gene that was identified in GWAS and linkage analyses for nicotine addiction and shows elevated expression in the brains of Alzheimer's patients. Generally speaking, expression of young, primate-specific genes is enriched in the fetal human brain relative to the expression of similarly young genes in the mouse brain. Most of these young genes, several of which originated ''de novo'', are expressed in the neocortex, which is thought to be responsible for many aspects of human-specific cognition. Many of these young genes show signatures of positive selection, and functional annotations indicate that they are involved in diverse molecular processes, but are enriched for transcription factors. In addition to their roles in cancer processes, human genes that originated ''de novo'' have been implicated in the maintenance of pluripotency and in immune function. The preferential expression of ''de novo'' genes in the testes is also suggestive of a role in reproduction. Given that the function of many ''de novo'' human genes remains uncharacterized, it seems likely that an appreciation of their contribution to human health and development will continue to grow.

Note: For purposes of this table, genes are defined as orphan genes (when species-specific) or TRGs (when limited to a closely related group of species) when the mechanism of origination has not been investigated, and as ''de novo'' genes when ''de novo'' origination has been inferred, irrespective of method of inference. The designation of ''de novo'' genes as “candidates” or “proto-genes” reflects the language used by the authors of the respective studies.


See also

* Molecular evolution
* Population genetics
* Evolvability
* Overlapping gene
* Orphan gene

