Computational genomics refers to the use of computational and statistical analysis to decipher biology from

genome sequence A genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as ...

s and related data, including both

DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...

and

RNA Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...

sequence as well as other "post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence, such as genomic DNA microarrays). These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of

bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

and

computational biology Computational biology refers to the use of techniques in computer science, data analysis, mathematical modeling and Computer simulation, computational simulations to understand biological systems and relationships. An intersection of computer sci ...

, but with a focus on using whole genomes (rather than individual genes) to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

History

The roots of computational genomics are shared with those of

. During the 1960s, Margaret Dayhoff and others at the National Biomedical Research Foundation assembled databases of homologous protein sequences for evolutionary study. Their research developed a

phylogenetic tree A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In ...

that determined the evolutionary changes that were required for a particular protein to change into another protein based on the underlying

amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the 22 α-amino acids incorporated into proteins. Only these 22 a ...

sequences. This led them to create a scoring matrix that assessed the likelihood of one protein being related to another. Beginning in the 1980s, databases of genome sequences began to be recorded, but this presented new challenges in the form of searching and comparing the databases of gene information. Unlike text-searching algorithms that are used on websites such as Google or Wikipedia, searching for sections of genetic similarity requires one to find strings that are not simply identical, but similar. This led to the development of the Needleman-Wunsch algorithm, which is a dynamic programming algorithm for comparing sets of amino acid sequences with each other by using scoring matrices derived from the earlier research by Dayhoff. Later, the BLAST algorithm was developed for performing fast, optimized searches of gene sequence databases. BLAST and its derivatives are probably the most widely used algorithms for this purpose. The emergence of the phrase "computational genomics" coincides with the availability of complete sequenced genomes in the mid-to-late 1990s. The first meeting of the Annual Conference on Computational Genomics was organized by scientists from The Institute for Genomic Research (TIGR) in 1998, providing a forum for this speciality and effectively distinguishing this area of science from the more general fields of

Genomics Genomics is an interdisciplinary field of molecular biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, ...

Computational Biology Computational biology refers to the use of techniques in computer science, data analysis, mathematical modeling and Computer simulation, computational simulations to understand biological systems and relationships. An intersection of computer sci ...

. The first use of this term in scientific literature, according to

MEDLINE MEDLINE (Medical Literature Analysis and Retrieval System Online, or MEDLARS Online) is a bibliographic database of life sciences and biomedical information. It includes bibliographic information for articles from academic journals covering medic ...

abstracts, was just one year earlier in

Nucleic Acids Research ''Nucleic Acids Research'' is an open-access peer-reviewed scientific journal published since 1974 by the Oxford University Press. The journal covers research on nucleic acids, such as DNA and RNA, and related work. According to the ''Journal Cita ...

. The final Computational Genomics conference was held in 2006, featuring a keynote talk by Nobel Laureate Barry Marshall, co-discoverer of the link between

Helicobacter pylori ''Helicobacter pylori'', previously known as ''Campylobacter pylori'', is a gram-negative, Flagellum#bacterial, flagellated, Bacterial cellular morphologies#Helical, helical bacterium. Mutants can have a rod or curved rod shape that exhibits l ...

and stomach ulcers. As of 2014, the leading conferences in the field include

Intelligent Systems for Molecular Biology Intelligent Systems for Molecular Biology (ISMB) is an annual academic conference on the subjects of bioinformatics and computational biology organised by the International Society for Computational Biology (ISCB). The principal focus of the con ...

(ISMB) and Research in Computational Molecular Biology (RECOMB). The development of computer-assisted mathematics (using products such as

Mathematica Wolfram (previously known as Mathematica and Wolfram Mathematica) is a software system with built-in libraries for several areas of technical computing that allows machine learning, statistics, symbolic computation, data manipulation, network ...

Matlab MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementat ...

) has helped engineers, mathematicians and computer scientists to start operating in this domain, and a public collection of case studies and demonstrations is growing, ranging from whole genome comparisons to

gene expression Gene expression is the process (including its Regulation of gene expression, regulation) by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, proteins or non-coding RNA, ...

analysis. This has increased the introduction of different ideas, including concepts from systems and control, information theory, strings analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, while students fluent in both topics start being formed in the multiple courses created in the past few years.

Contributions of computational genomics research to biology

Contributions of computational genomics research to biology include: * proposing

cellular signalling In biology, cell signaling (cell signalling in British English) is the Biological process, process by which a Cell (biology), cell interacts with itself, other cells, and the environment. Cell signaling is a fundamental property of all Cell (biol ...

networks * proposing mechanisms of genome evolution * predict precise locations of all human

genes In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA. There are two types of molecular genes: protei ...

using

comparative genomics Comparative genomics is a branch of biological research that examines genome sequences across a spectrum of species, spanning from humans and mice to a diverse array of organisms from bacteria to chimpanzees. This large-scale holistic approach c ...

techniques with several mammalian and vertebrate

species A species () is often defined as the largest group of organisms in which any two individuals of the appropriate sexes or mating types can produce fertile offspring, typically by sexual reproduction. It is the basic unit of Taxonomy (biology), ...

* predict conserved genomic regions that are related to early

embryonic development In developmental biology, animal embryonic development, also known as animal embryogenesis, is the developmental stage of an animal embryo. Embryonic development starts with the fertilization of an egg cell (ovum) by a sperm, sperm cell (spermat ...

* discover potential links between repeated sequence motifs and tissue-specific

* measure regions of genomes that have undergone unusually rapid evolution

Genome comparison

Computational tools have been developed to assess the similarity of genomic sequences. Some of them are

alignment Alignment may refer to: Archaeology * Alignment (archaeology), a co-linear arrangement of features or structures with external landmarks * Stone alignment, a linear arrangement of upright, parallel megalithic standing stones Biology * Struc ...

-based distances such as Average Nucleotide Identity. These methods are highly specific, while being computationally slow. Other, alignment-free methods, include statistical and probabilistic approaches. One example is Mash, a probabilistic approach using minhash. In this method, given a number k, a genomic sequence is transformed into a shorter sketch through a random

hash function A hash function is any Function (mathematics), function that can be used to map data (computing), data of arbitrary size to fixed-size values, though there are some hash functions that support variable-length output. The values returned by a ...

on the possible k-mers. For example, if

k=2

, sketches of size 4 are being constructed and given the following hash function the sketch of the sequence is which are the smallest hash values of its k-mers of size 2. These sketches are then compared to estimate the fraction of shared k-mers ( Jaccard index) of the corresponding sequences. It is worth noticing that a hash value is a binary number. In a real genomic setting a useful size of k-mers ranges from 14 to 21, and the size of the sketches would be around 1000. By reducing the size of the sequences, even hundreds of times, and comparing them in an alignment-free way, this method reduces significantly the time of estimation of the similarity of sequences.

Clusterization of genomic data

Clustering data is a tool used to simplify statistical analysis of a genomic sample. For example, in the authors developed a tool (BiG-SCAPE) to analize sequence similarity networks of biosynthetic gene clusters (BGC). In successive layers of clusterization of biosynthetic gene clusters are used in the automated tool BiG-MAP, both to filter redundant data and identify gene clusters families. This tool profiles the abundance and expressions levels of BGC's in microbiome samples.

Biosynthetic gene clusters

Bioinformatic tools have been developed to predict, and determine the abundance and expression of, this kind of gene cluster in microbiome samples, from metagenomic data. Since the size of metagenomic data is considerable, filtering and clusterization thereof are important parts of these tools. These processes can consist of dimensionality -reduction techniques, such as Minhash, and clusterization algorithms such as

k-medoids -medoids is a classical partitioning technique of clustering that splits the data set of objects into clusters, where the number of clusters assumed known ''a priori'' (which implies that the programmer must specify k before the execution of a - ...

and affinity propagation. Also several metrics and similarities have been developed to compare them. Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs). BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine), a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion. Satria et. al, 2021 across BiG-SLiCE demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential, opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.

Compression algorithms

References

External links

* Harvard Extension School Biophysics 101, Genomics and Computational Biology, http://www.courses.fas.harvard.edu/~bphys101/info/syllabus.html * University of Bristol course in Computational Genomics, http://www.computational-genomics.net/ {{Portal bar, Biology, Medicine Bioinformatics Genomics Computational fields of study