History
As early as the 1930s,
Identification
Identification of ''de novo'' emerging sequences
There are two major approaches to the systematic identification of novel genes:
Genomic phylostratigraphy
Genomic phylostratigraphy involves examining each gene in a focal, or reference, species and inferring the presence or absence of ancestral homologs through the use of the
Synteny-based approaches
Synteny-based approaches use the order and relative positioning of genes (or other features) to identify the potential ancestors of candidate ''de novo'' genes. Syntenic alignments are anchored by conserved “markers.” Genes are the most common marker in defining syntenic blocks, although k-mers and exons are also used. Confirmation that the syntenic region lacks coding potential in outgroup species allows a ''de novo'' origin to be asserted with higher confidence. The strongest possible evidence for ''de novo'' emergence is the inference of the specific "enabling" mutation(s) that created coding potential, typically through the analysis of smaller sequence regions, termed microsyntenic regions, of closely related species. One challenge in applying synteny-based methods is that synteny can be difficult to detect across longer timescales. To address this, various optimization techniques have been created, such as defining syntenic blocks from exons clustered irrespective of their specific order, or using algorithms that expand microsyntenic blocks from well-conserved genomic regions. Synteny-based approaches are also difficult to apply to genome assemblies that are fragmented, or to lineages with high rates of chromosomal rearrangement, as is common in insects. Synteny-based approaches can be applied to genome-wide surveys of ''de novo'' genes and represent a promising area of algorithmic development for gene birth dating. Some have combined synteny-based approaches with similarity searches to develop standardized, stringent pipelines that can be applied to any group of genomes, in an attempt to address discrepancies among the various lists of ''de novo'' genes that have been generated.
Determination of status
Even when the evolutionary origin of a particular coding sequence has been established, there is still a lack of consensus about what constitutes a genuine ''de novo'' gene birth event. One reason for this is a lack of agreement on whether or not the entirety of the sequence must be non-genic in origin. For protein-coding ''de novo'' genes, it has been proposed that ''de novo'' genes be divided into subtypes based on the proportion of the ORF in question that was derived from a previously noncoding sequence. Furthermore, for ''de novo'' gene birth to occur, the sequence in question must be a gene, which has led to a questioning of what constitutes a gene, with some models establishing a strict dichotomy between genic and non-genic sequences, and others proposing a more fluid continuum. All definitions of genes are linked to the notion of function, as it is generally agreed that a genuine gene should encode a functional product, be it RNA or protein. There are, however, different views of what constitutes function, depending on whether a given sequence is assessed using genetic, biochemical, or evolutionary approaches. The ambiguity of the concept of ‘function’ is especially problematic for the ''de novo'' gene birth field, where the objects of study are often rapidly evolving. To address these challenges, the Pittsburgh Model of Function deconstructs ‘function’ into five meanings to describe the different properties that are acquired by a locus undergoing ''de novo'' gene birth: Expression, Capacities, Interactions, Physiological Implications, and Evolutionary Implications. It is generally accepted that a genuine ''de novo'' gene is expressed in at least some context, allowing selection to operate, and many studies use evidence of expression as an inclusion criterion in defining ''de novo'' genes. The expression of sequences at the mRNA level may be confirmed individually through techniques such as quantitative PCR, or globally through RNA sequencing (RNA-seq).
Similarly, expression at the protein level can be determined with high confidence for individual proteins using techniques such as
Prevalence
Estimates of numbers
Frequency and number estimates of ''de novo'' genes in various lineages vary widely and are highly dependent on methodology. Studies may identify ''de novo'' genes by phylostratigraphy/BLAST-based methods alone, or may employ a combination of computational techniques, and may or may not assess experimental evidence for expression and/or biological role. Furthermore, genome-scale analyses may consider all or most ORFs in the genome, or may instead limit their analysis to previously annotated genes. The ''D. melanogaster'' lineage is illustrative of these differing approaches. An early survey using a combination of BLAST searches performed on cDNA sequences along with manual searches and synteny information identified 72 new genes specific to ''D. melanogaster'' and 59 new genes specific to three of the four species in the ''D. melanogaster'' species complex. This report found that only 2/72 (~2.8%) of ''D. melanogaster''-specific new genes and 7/59 (~11.9%) of new genes specific to the species complex were derived ''de novo'', with the remainder arising via duplication/retroposition. Similarly, an analysis of 195 young (<35 million years old) ''D. melanogaster'' genes identified from syntenic alignments found that only 16 had arisen ''de novo''. In contrast, an analysis focused on transcriptomic data from the testes of six ''D. melanogaster'' strains identified 106 fixed and 142 segregating ''de novo'' genes. For many of these, ancestral ORFs were identified but were not expressed. A newer study found that up to 39% of orphan genes in the ''Drosophila'' clade may have emerged ''de novo'', as they overlap with non-coding regions of the genome. Highlighting the differences between inter- and intra-species comparisons, a study in natural ''
Dynamics
It is important to distinguish between the frequency of ''de novo'' gene birth and the number of ''de novo'' genes in a given lineage. If ''de novo'' gene birth is frequent, it might be expected that genomes would tend to grow in their gene content over time; however, the gene content of genomes is usually relatively stable. This implies that a frequent gene death process must balance ''de novo'' gene birth, and indeed, ''de novo'' genes are distinguished by their rapid turnover relative to established genes. In support of this notion, recently emerged ''Drosophila'' genes are much more likely to be lost, primarily through pseudogenization, with the youngest orphans being lost at the highest rate; this is despite the fact that some ''Drosophila'' orphan genes have been shown to rapidly become essential. A similar trend of frequent loss among young gene families was observed in the nematode genus ''Pristionchus''. Similarly, an analysis of five mammalian transcriptomes found that most ORFs in mice were either very old or species specific, implying frequent birth and death of ''de novo'' transcripts. A comparable trend was shown by further analyses of six primate transcriptomes. In wild ''S. paradoxus'' populations, ''de novo'' ORFs emerge and are lost at similar rates. Nevertheless, there remains a positive correlation between the number of species-specific genes in a genome and the evolutionary distance from its most recent ancestor. A rapid gain and loss of ''de novo'' genes was also found on a population level by analyzing nine natural three-spined stickleback populations. In addition to the birth and death of ''de novo'' genes at the level of the ORF, mutational and other processes also subject genomes to constant “transcriptional turnover”.
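The balance between gene birth and gene loss described in this section can be illustrated with a minimal sketch. It assumes, purely for illustration, a constant number of new ''de novo'' genes per time step and a fixed per-gene loss probability; the function names and parameter values are hypothetical and are not taken from any of the studies discussed here. Under these assumptions the gene count approaches an equilibrium (births divided by loss probability) and does not grow without bound.

```python
def simulate_gene_count(births_per_step, loss_prob, steps, start=0.0):
    """Expected number of de novo genes under a constant birth/loss model.

    births_per_step: assumed number of new genes arising per time step
    loss_prob: assumed probability that any existing gene is lost per step
    Both values are illustrative assumptions, not empirical estimates.
    """
    counts = [start]
    n = start
    for _ in range(steps):
        # Each step: a fraction of existing genes is lost, then new genes arise.
        n = n - loss_prob * n + births_per_step
        counts.append(n)
    return counts


def equilibrium_count(births_per_step, loss_prob):
    """Gene count at which expected births and losses cancel out."""
    return births_per_step / loss_prob


if __name__ == "__main__":
    trajectory = simulate_gene_count(births_per_step=5, loss_prob=0.05, steps=400)
    print(round(trajectory[-1], 2), equilibrium_count(5, 0.05))
```

In this toy model, a higher birth rate raises the equilibrium but does not by itself cause open-ended growth, which is consistent with the observation that gene content stays roughly stable despite frequent gene birth.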
One study in murines found that while all regions of the ancestral genome were transcribed at some point in at least one descendant, the portion of the genome under active transcription in a given strain or subspecies is subject to rapid change. The transcriptional turnover of noncoding RNA genes is particularly fast compared to coding genes.
Example ''de novo'' gene table
Features
General Features
Recently emerged ''de novo'' genes differ from established genes in a number of ways. Across a broad range of species, young and/or taxonomically restricted genes have been reported to be shorter in length than established genes, to evolve more rapidly, and to be less expressed. Although these trends could be a result of homology detection bias, a reanalysis of several studies that accounted for this bias found that the qualitative conclusions reached were unaffected. Another feature is the tendency for young genes to have their hydrophobic amino acids clustered more closely together along the primary sequence. The expression of young genes has also been found to be more tissue- or condition-specific than that of established genes. In particular, relatively high expression of ''de novo'' genes was observed in male reproductive tissues in ''Drosophila'', stickleback, mice, and humans, and in the human brain. In animals with adaptive immune systems, higher expression in the brain and testes may be a function of the immune-privileged nature of these tissues. An analysis in mice found specific expression of intergenic transcripts in the thymus and spleen (in addition to the brain and testes). It has been proposed that in vertebrates ''de novo'' transcripts must first be expressed in tissues lacking immune cells before they can be expressed in tissues that have immune surveillance.
Features that promote ''de novo'' gene birth
It is also of interest to compare features of recently emerged ''de novo'' genes to the pool of non-genic ORFs from which they emerge. Theoretical modeling has shown that such differences are the product both of selection for features that increase the likelihood of functionalization, and of neutral evolutionary forces that influence allelic turnover. Experiments in ''S. cerevisiae'' showed that predicted transmembrane domains were strongly associated with beneficial fitness effects when young ORFs were overexpressed, but not when established (older) ORFs were overexpressed. Experiments in ''E. coli'' showed that random peptides tended to have more benign effects when they were enriched for amino acids that were small and that promoted intrinsic structural disorder.
Lineage-dependent features
Features of ''de novo'' genes can depend on the species or lineage being examined. This appears to be partly a result of the varying GC content of genomes, together with the fact that young genes bear more similarity to non-genic sequences from the genome in which they arose than do established genes. Features in the resulting protein, such as the percentage of transmembrane residues and the relative frequency of various predicted secondary structural features, show a strong GC dependency in orphan genes, whereas in more ancient genes these features are only weakly influenced by GC content. The relationship between gene age and the amount of predicted intrinsic structural disorder (ISD) in the encoded proteins has been subject to considerable debate. It has been claimed that ISD is also a lineage-dependent feature, exemplified by the fact that in organisms with relatively high GC content, ranging from ''D. melanogaster'' to the parasite ''
Role of epigenetic modifications
An examination of ''de novo'' genes in ''A. thaliana'' found that they are both
Structural features
As structure is usually more conserved than sequence, comparing structures between orthologs could provide deeper insights into ''de novo'' gene emergence and evolution and help to confirm these genes as true ''de novo'' genes. Nevertheless, so far only very few ''de novo'' proteins have been structurally and functionally characterized, largely because of problems with protein purification and subsequent stability. Progress has been made using different purification tags, cell types and chaperones. The ‘antifreeze glycoprotein’ (AFGP) in Arctic codfishes prevents their blood from freezing in arctic waters. Bsc4, a short non-essential ''de novo'' protein in yeast, has been shown to consist mainly of beta-sheets and to have a hydrophobic core. It is associated with DNA repair under nutrient-deficient conditions. The ''Drosophila'' ''de novo'' protein Goddard was first characterized in 2017. Knockdown ''Drosophila melanogaster'' male flies were not able to produce sperm. More recently, it was shown that this defect was due to a failure of individualization of elongated spermatids. Using computational phylogenomic and structure predictions, experimental structural analyses, and cell biological assays, it was proposed that half of Goddard's structure is disordered and the other half is alpha-helical. These analyses also indicated that Goddard's orthologs show similar results. Goddard's structure therefore appears to have been mainly conserved since its emergence.
Mechanisms
Pervasive expression
With the development of technologies such as RNA-seq and Ribo-seq, eukaryotic genomes are now known to be pervasively transcribed and translated. Many ORFs that are either unannotated, or annotated as long non-coding RNAs (lncRNAs), are translated at some level, in a condition- or tissue-specific manner. Though infrequent, these translation events expose non-genic sequence to selection. This pervasive expression forms the basis for several models describing ''de novo'' gene birth. It has been speculated that the epigenetic landscape of ''de novo'' genes in the early stages of formation may be particularly variable between and among populations, resulting in variable gene expression and thereby allowing young genes to explore the “expression landscape.” The ''QQS'' gene in ''A. thaliana'' is one example of this phenomenon; its expression is negatively regulated by DNA methylation that, while heritable for several generations, varies widely in its levels both among natural accessions and within wild populations. Epigenetic mechanisms are also largely responsible for the permissive transcriptional environment in the testes, particularly through the incorporation into nucleosomes of non-canonical histone variants that are replaced by histone-like protamines during spermatogenesis.
Intergenic ORFs as elementary structural modules
Analysis of the fold potential diversity shows that the majority of the amino acid sequences encoded by the intergenic ORFs of ''S. cerevisiae'' are predicted to be foldable. More importantly, these amino acid sequences with folding potential can serve as elementary building blocks for ''de novo'' genes or integrate into pre-existing genes.
Order of events
For birth of a ''de novo'' protein-coding gene to occur, a non-genic sequence must both be transcribed and acquire an ORF before becoming translated. These events could occur in either order, and there is evidence supporting both an “ORF first” and a “transcription first” model. An analysis of ''de novo'' genes that are segregating in ''D. melanogaster'' found that sequences that are transcribed had similar coding potential to the orthologous sequences from lines lacking evidence of transcription. This finding supports the notion that many ORFs can exist prior to being transcribed. The antifreeze glycoprotein gene ''AFGP'', which emerged ''de novo'' in Arctic codfishes, provides a more definitive example in which the ''de novo'' emergence of the ORF was shown to precede the promoter region. Furthermore, putatively non-genic ORFs long enough to encode functional peptides are numerous in eukaryotic genomes, and expected to occur at high frequency by chance. By tracing the evolutionary history of ORF sequences and the transcriptional activation of human ''de novo'' genes, a study showed that some ORFs were ready to confer biological significance upon their birth. At the same time, transcription of eukaryotic genomes is far more extensive than previously thought, and there are documented examples of genomic regions that were transcribed prior to the appearance of an ORF that became a ''de novo'' gene. The proportion of ''de novo'' genes that are protein-coding is unknown, but the appearance of “transcription first” has led some to posit that protein-coding ''de novo'' genes may first exist as RNA gene intermediates. The case of bifunctional RNAs, which are both translated and function as RNA genes, shows that such a mechanism is plausible. The two events may occur simultaneously when chromosomal rearrangement is the event that precipitates gene birth.
Models
Several theoretical models and possible mechanisms of ''de novo'' gene birth have been described. The models are generally not mutually exclusive, and it is possible that multiple mechanisms may give rise to ''de novo'' genes. An example is the type III antifreeze protein gene, which originates from an old sialic acid synthase (''SAS'') gene, in an Antarctic zoarcid fish.
“Out of Testis” hypothesis
An early case study of ''de novo'' gene birth, which identified five ''de novo'' genes in ''D. melanogaster'', noted preferential expression of these genes in the testes, and several additional ''de novo'' genes were identified using transcriptomic data derived from the testes and male accessory glands of ''D. yakuba'' and ''D. erecta''. This is in agreement with other studies that showed there is rapid evolution of genes related to reproduction across a range of lineages, suggesting that sexual selection may play a key role in adaptive evolution and ''de novo'' gene birth. A subsequent large-scale analysis of six ''D. melanogaster'' strains identified 248 testis-expressed ''de novo'' genes, of which ~57% were not fixed. A recent study on twelve ''Drosophila'' species additionally identified a higher proportion of ''de novo'' genes with testis-biased expression compared to the annotated proteome. It has been suggested that the large number of ''de novo'' genes with male-specific expression identified in ''Drosophila'' is likely due to the fact that such genes are preferentially retained relative to other ''de novo'' genes, for reasons that are not entirely clear. Interestingly, two putative ''de novo'' genes in ''Drosophila'' (''Goddard'' and ''Saturn'') were shown to be required for normal male fertility. A genetic screen of over 40 putative ''de novo'' genes with testis-enriched expression in ''Drosophila melanogaster'' revealed that one of the ''de novo'' genes, ''atlas'', was required for proper chromatin condensation during the final stages of spermatogenesis in males. ''atlas'' evolved from the fusion of a protein-coding gene that arose at the base of the ''Drosophila'' genus and a conserved non-coding RNA. Comparative analysis of the transcriptomes of the testis and accessory glands (a somatic tissue of males that is important for fertility) of ''D. melanogaster'' suggests that ''de novo'' genes make a greater contribution to the transcriptomic complexity of the testis than to that of the accessory glands. Single-cell RNA-seq of ''D. melanogaster'' testis revealed that the expression pattern of ''de novo'' genes was biased toward early spermatogenesis. In humans, a study that identified 60 human-specific ''de novo'' genes found that their average expression, as measured by RNA-seq, was highest in the testes. Another study looking at mammalian-specific genes more generally also found enriched expression in the testes. Transcription in mammalian testes is thought to be particularly promiscuous, due in part to elevated expression of the transcription machinery and an open chromatin environment. Along with the immune-privileged nature of the testes, this promiscuous transcription is thought to create the ideal conditions for the expression of non-genic sequences required for ''de novo'' gene birth. Testes-specific expression seems to be a general feature of all novel genes, as an analysis of ''Drosophila'' and vertebrate species found that young genes showed testes-biased expression regardless of their mechanism of origination.
Preadaptation model
The preadaptation model of ''de novo'' gene birth uses mathematical modeling to show that when sequences that are normally hidden are exposed to weak or shielded selection, the resulting pool of “cryptic” sequences (i.e. proto-genes) can be purged of “self-evidently deleterious” variants, such as those prone to lead to protein aggregation, and thus enriched in potential adaptations relative to a completely non-expressed and unpurged set of sequences. This revealing and purging of cryptic deleterious non-genic sequences is a byproduct of pervasive transcription and translation of intergenic sequences, and is expected to facilitate the birth of functional ''de novo'' protein-coding genes. This is because by eliminating the most deleterious variants, what is left is, by a process of elimination, more likely to be adaptive than expected from random sequences. Using the evolutionary definition of function (i.e. that a gene is by definition under purifying selection against loss), the preadaptation model assumes that “gene birth is a sudden transition to functionality” that occurs as soon as an ORF acquires a net beneficial effect. In order to avoid being deleterious, newborn genes are expected to display exaggerated versions of genic features associated with the avoidance of harm. This is in contrast to the proto-gene model, which expects newborn genes to have features intermediate between old genes and non-genes. The mathematics of the preadaptation model assume that the distribution of fitness effects is bimodal, with new sequences of mutations tending to break something or tinker, but rarely in between. Following this logic, populations may either evolve local solutions, in which selection operates on each individual locus and a relatively high error rate is maintained, or a global solution with a low error rate which permits the accumulation of deleterious cryptic sequences. 
''De novo'' gene birth is thought to be favored in populations that evolve local solutions, as the relatively high error rate will result in a pool of cryptic variation that is “preadapted” through the purging of deleterious sequences. Local solutions are more likely in populations with a high
Proto-gene model
The proto-gene model agrees with the preadaptation model about the importance of pervasive expression, and refers to the set of pervasively expressed sequences that do not meet all definitions of a gene as “proto-genes”. In contrast to the preadaptation model, the proto-gene model suggests that newborn genes have features intermediate between old genes and non-genes. Specifically, this model envisages a more gradual process, under selection, from a non-genic to a genic state, rejecting the binary classification of gene and non-gene. In an extension of the proto-gene model, it has been proposed that as proto-genes become more gene-like, their potential for adaptive change gives way to selected effects; thus, the predicted impact of mutations on fitness is dependent on the evolutionary status of the ORF. This notion is supported by the fact that overexpression of established ORFs in ''S. cerevisiae'' tends to be less beneficial (and more harmful) than does overexpression of emerging ORFs. Several features of ORFs correlate with ORF age as determined by phylostratigraphic analysis, with young ORFs having properties intermediate between old ORFs and non-genes; this has been taken as evidence in favor of the proto-gene model, in which proto-gene state is a continuum. This evidence has been criticized, because the same apparent trends are also expected under a model in which identity as a gene is binary. Under this model, when each age group contains a different ratio of genes vs. non-genes,
Grow slow and moult model
The “grow slow and moult” model describes a potential mechanism of ''de novo'' gene birth, particular to protein-coding genes. In this scenario, existing protein-coding ORFs expand at their ends, especially their 3’ ends, leading to the creation of novel N- and C-terminal domains. Novel C-terminal domains may first evolve under weak selection via occasional expression through read-through translation, as in the preadaptation model, only later becoming constitutively expressed through a mutation that disrupts the stop codon. Genes experiencing high translational readthrough tend to have intrinsically disordered C-termini. Furthermore, existing genes are often close to repetitive sequences that encode disordered domains. These novel, disordered domains may initially confer some non-specific binding capability that becomes gradually refined by selection. Sequences encoding these novel domains may occasionally separate from their parent ORF, leading or contributing to the creation of a ''de novo'' gene. Interestingly, an analysis of 32 insect genomes found that novel domains (i.e. those unique to insects) tend to evolve fairly neutrally, with only a few sites under positive selection, while their host proteins remain under purifying selection, suggesting that new functional domains emerge gradually and somewhat stochastically.
Escape from adaptive conflict
The escape from adaptive conflict (EAC) model proposes a possible way for a new gene duplication to become fixed: conflict between contrasting functions within a single gene drives the fixation of the duplication.
Human health
In addition to its significance for the field of evolutionary biology, ''de novo'' gene birth has implications for human health. It has been speculated that novel genes, including ''de novo'' genes, may play an outsized role in species-specific traits; however, many species-specific genes lack functional annotation. Nevertheless, there is evidence to suggest that human-specific ''de novo'' genes are involved in diseases such as cancer. ''NYCM'', a ''de novo'' gene unique to humans and chimpanzees, regulates the pathogenesis of neuroblastomas in mouse models, and the primate-specific ''PART1'', an lncRNA gene, has been identified as both a tumor suppressor and an oncogene in different contexts. Several other human- or primate-specific ''de novo'' genes, including ''PBOV1'', ''GR6'', ''MYEOV'', ''ELFN1-AS1'', and ''CLLU1'', are also linked to cancer. Some have even suggested considering tumor-specifically expressed, evolutionarily novel genes as their own class of genetic elements, noting that many such genes are under positive selection and may be neofunctionalized in the context of tumors. The specific expression of many ''de novo'' genes in the human brain also raises the intriguing possibility that ''de novo'' genes influence human cognitive traits. One such example is ''FLJ33706'', a ''de novo'' gene that was identified in GWAS and linkage analyses for nicotine addiction and shows elevated expression in the brains of Alzheimer's patients. Generally speaking, expression of young, primate-specific genes is enriched in the fetal human brain relative to the expression of similarly young genes in the mouse brain. Most of these young genes, several of which originated ''de novo'', are expressed in the neocortex, which is thought to be responsible for many aspects of human-specific cognition.
Many of these young genes show signatures of positive selection, and functional annotations indicate that they are involved in diverse molecular processes, but are enriched for transcription factors. In addition to their roles in cancer processes, ''de novo'' originated human genes have been implicated in the maintenance of pluripotency and in immune function. The preferential expression of ''de novo'' genes in the testes is also suggestive of a role in reproduction. Given that the function of many ''de novo'' human genes remains uncharacterized, it seems likely that an appreciation of their contribution to human health and development will continue to grow.
Note: For purposes of this table, genes are defined as orphan genes (when species-specific) or TRGs (when limited to a closely related group of species) when the mechanism of origination has not been investigated, and as ''de novo'' genes when ''de novo'' origination has been inferred, irrespective of method of inference. The designation of ''de novo'' genes as “candidates” or “proto-genes” reflects the language used by the authors of the respective studies.