Genome mining describes the exploitation of genomic information for the discovery of biosynthetic pathways of natural products and their possible interactions. It depends on computational technology and bioinformatics tools. The mining process relies on large amounts of data (DNA sequences and annotations) accessible in genomic databases. By applying data mining algorithms, the data can be used to generate new knowledge in several areas of medicinal chemistry, such as discovering novel natural products.

History

In the mid- to late 1980s, researchers increasingly focused on genetic studies as sequencing technologies advanced. The GenBank database was established in 1982 to collect, manage, store, and distribute DNA sequence data as DNA sequences became increasingly available. Since 1992, as the volume of genetic data grew, biotechnology companies have used human DNA sequences to develop protein and antibody drugs through genome mining. In the late 1990s, many companies, such as Amgen, Immunex, and Genentech, developed drugs that progressed to the clinical stage by adopting genome mining. Since the Human Genome Project was completed in the early 2000s, researchers have been sequencing the genomes of many microorganisms. Subsequently, many of these genomes have been carefully studied to identify new genes and biosynthetic pathways.

Algorithms

As large quantities of genomic sequence data began to accumulate in public databases, genetic algorithms became important for deciphering the enormous collection of genomic data. They are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection. The following algorithms and tools are commonly used in genome mining:
* antiSMASH (Antibiotics and Secondary Metabolite Analysis Shell) is a pipeline for identifying and annotating secondary metabolite biosynthetic gene clusters in genome sequences.
* PRISM (Prediction Informatics for Secondary Metabolites) uses a combinatorial approach to predict the chemical structures of genetically encoded nonribosomal peptides and type I and II polyketides.
* SIM (statistically based sequence similarity) methods, such as FASTA and PSI-BLAST, are used to infer homology between sequences, including orthologous relationships.
* BLAST (Basic Local Alignment Search Tool) is an approach for rapid sequence comparison.
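
The idea of sequence comparison behind tools such as BLAST can be illustrated with exact local alignment. The sketch below computes a Smith-Waterman local alignment score, the dynamic-programming method that heuristic tools approximate for speed; it is not how BLAST itself is implemented. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not defaults of any tool.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or open a gap in either sequence
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Local alignment: scores never drop below zero
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two toy sequences sharing the local region "ACGT"
    print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

Only the previous row of the matrix is kept, so memory grows with the length of one sequence rather than the product of both.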

Applications

Genome mining is applied to the discovery of natural products, where it facilitates the characterization of novel molecules and biosynthetic pathways.

Natural product discovery

The production of natural products is directed by the biosynthetic gene clusters (BGCs) encoded in the microorganism. By adopting genome mining, the BGCs that produce the target natural product can be predicted. Important biosynthetic enzymes and product classes include polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), ribosomally synthesized and post-translationally modified peptides (RiPPs), and terpenoids, among many others. By mining for these enzymes, researchers can determine the classes of compounds that BGCs encode and compare target gene clusters to known gene clusters. To verify the relation between the BGCs and natural products, the target BGCs can be expressed in a suitable host through the use of molecular cloning.
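
Comparing a target gene cluster to known gene clusters can be sketched as a set-similarity ranking. The example below is a minimal illustration and not the method of any particular tool: it assumes each cluster has already been reduced to a set of enzyme domain labels, and it ranks known clusters by Jaccard similarity to the query. The cluster names and domain sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: shared labels divided by all labels."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_known_clusters(query: Set[str],
                        known: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs, most similar known cluster first."""
    scores = [(name, jaccard(query, domains)) for name, domains in known.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical reference clusters described by domain labels (illustrative only)
KNOWN_CLUSTERS = {
    "known_pks_cluster": {"KS", "AT", "ACP", "KR"},
    "known_nrps_cluster": {"A", "C", "PCP", "TE"},
}

# Hypothetical query cluster predicted from a newly sequenced genome
QUERY_CLUSTER = {"KS", "AT", "ACP", "TE"}

if __name__ == "__main__":
    for name, score in rank_known_clusters(QUERY_CLUSTER, KNOWN_CLUSTERS):
        print(f"{name}: {score:.2f}")
```

A set-based score ignores gene order and sequence identity, so it is best read as a coarse first filter before more detailed comparison.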

Databases and tools

Genetic data has been accumulated in databases. Researchers are able to utilize algorithms to decipher the data accessible from databases for the discovery of new processes, targets, and products. The following are databases and tools:
* GenBank is a database that provides genomic datasets for analysis.
* UCSC Genome Browser offers interactive access to genome sequence data from a variety of vertebrate and invertebrate species.
* AntiSMASH-DB allows comparing the sequences of newly sequenced BGCs against those of previously predicted and experimentally characterized ones.
* BIG-FAM is a biosynthetic gene cluster family database.
* DoBISCUIT is a database of secondary metabolite biosynthetic gene clusters.
* MIBiG (Minimum Information about a Biosynthetic Gene cluster specification) provides a standard for annotations and metadata on biosynthetic gene clusters and their molecular products.
* Interactive Tree of Life (iTOL) is a web-based tool for the display, manipulation and annotation of phylogenetic trees.
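
Sequence data retrieved from such databases is often handled as FASTA-formatted text, in which a header line starting with `>` is followed by one or more lines of sequence. The sketch below assumes that format and shows a minimal parser together with a GC-content calculation. The record names and sequences are made up for illustration, and the code does not query any database.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Store the previous record before starting a new one
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""  # ID is the first token
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical records for illustration only
EXAMPLE_FASTA = """>seq1 hypothetical record
ACGTACGT
GGCC
>seq2 another hypothetical record
ATATAT
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE_FASTA.splitlines()).items():
        print(name, len(seq), round(gc_content(seq), 3))
```

For real analyses an established library parser is usually preferable, since it handles format variants that this sketch ignores.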
