Integrative bioinformatics is a discipline of bioinformatics that focuses on problems of data integration for the life sciences.
With the rise of high-throughput (HTP) technologies in the life sciences, particularly in molecular biology, the amount of collected data has grown exponentially. Furthermore, the data are scattered over a plethora of both public and private repositories and are stored in a large number of different formats. This situation makes it very difficult to search these data and to perform the analysis needed to extract new knowledge from the complete set of available data. Integrative bioinformatics attempts to tackle this problem by providing unified access to life science data.
Approaches
Semantic web approaches
In the Semantic Web approach, data from multiple websites or databases is searched via metadata. Metadata is machine-readable code that defines the contents of the page for the program, so that comparisons between the data and the search terms are more accurate. This serves to decrease the number of results that are irrelevant or unhelpful. Some metadata exists as definitions called ontologies, which can be tagged by either users or programs; these serve to facilitate searches by using key terms or phrases to find and return the data.
Advantages of this approach include the generally higher quality of the data returned in searches and, with proper tagging, the ability of ontologies to find entries that do not explicitly state the search term but are still relevant. One disadvantage of this approach is that results are returned in the format of their database of origin, so direct comparisons may be difficult. Another problem is that the terms used in tagging and searching can sometimes be ambiguous and may cause confusion among the results.
[Van Ophuizen, E.A.A. & Leunissen, J.A.M. (2010). "An evaluation of the performance of three semantic background knowledge sources in comparative anatomy." Journal of Integrative Bioinformatics. Retrieved 28 October 2012.] In addition, the semantic web approach is still considered an emerging technology and is not in wide-scale use at this time.
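The way an ontology lets a search return entries that never mention the search term can be sketched with a toy "is-a" hierarchy. The following is a minimal illustration only: the term names, the record identifiers, and the `descendants` and `search` helpers are all hypothetical, and real ontologies carry far richer relations than this sketch assumes.

```python
from collections import deque

# Toy is-a hierarchy: child term -> parent terms (all names are illustrative).
IS_A = {
    "cardiac muscle": ["muscle tissue"],
    "skeletal muscle": ["muscle tissue"],
    "muscle tissue": ["tissue"],
    "epithelium": ["tissue"],
}

# Toy records tagged with ontology terms (hypothetical entries).
RECORDS = {
    "rec1": {"cardiac muscle"},
    "rec2": {"epithelium"},
    "rec3": {"muscle tissue"},
    "rec4": {"skeletal muscle"},
}


def descendants(term, is_a):
    """Return the term plus every term that is transitively a kind of it."""
    # Invert the child -> parents map so we can walk downwards.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def search(term, records, is_a):
    """Return sorted IDs of records tagged with the term or any descendant."""
    wanted = descendants(term, is_a)
    return sorted(rid for rid, tags in records.items() if tags & wanted)


if __name__ == "__main__":
    # "rec1" and "rec4" are found although they are not tagged "muscle tissue".
    print(search("muscle tissue", RECORDS, IS_A))
```

A plain keyword match on "muscle tissue" would return only `rec3`; expanding the query through the hierarchy also returns the records tagged with more specific terms.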
One application of ontology-based search in the biomedical sciences was GoPubMed, a knowledge-based search engine that searched the PubMed database of scientific literature.
Another use of ontologies is within databases such as SwissProt, Ensembl and TrEMBL, which use this technology to search through the stores of human proteome-related data for tags related to the search term.
[Verschelde, et al. (2007). "Ontology-Assisted Database Integration to Support Natural Language Processing and Biomedical Data-mining." Journal of Integrative Bioinformatics. Retrieved 28 October 2012.]
Some of the research in this field has focused on creating new and specific ontologies. Other researchers have worked on verifying the results of existing ontologies.
In a specific example, the goal of Verschelde, et al. was to integrate several different ontology libraries into a larger one that contained more definitions from different subspecialties (medical, molecular biological, etc.) and was able to distinguish between ambiguous tags; the result was a data-warehouse-like effect, with easy access to multiple databases through the use of ontologies.
In a separate project, Bertens, et al. constructed a latticework of three ontologies (for anatomy and development of model organisms) on a novel framework ontology of generic organs. For example, a search for 'heart' in this ontology would return the heart plans for each of the vertebrate species whose ontologies were included. The stated goal of the project is to facilitate comparative and evolutionary studies.
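The idea of a framework ontology that links a generic organ to the matching term in each species-specific ontology can be sketched as a simple lookup. This is an assumption-laden illustration, not the structure used by Bertens, et al.: the species names, term identifiers, labels, and the `lookup` helper are all invented for the example.

```python
# Species-specific anatomy ontologies (toy stand-ins; identifiers are invented).
SPECIES_ONTOLOGIES = {
    "species_a": {"A:001": "heart", "A:002": "liver"},
    "species_b": {"B:101": "heart", "B:102": "kidney"},
    "species_c": {"C:201": "heart tube", "C:202": "liver"},
}

# Framework ontology: generic organ -> the term ID linked to it in each species.
FRAMEWORK = {
    "heart": {"species_a": "A:001", "species_b": "B:101", "species_c": "C:201"},
    "liver": {"species_a": "A:002", "species_c": "C:202"},
    "kidney": {"species_b": "B:102"},
}


def lookup(generic_organ):
    """Return {species: (term_id, label)} for every species linked to the organ."""
    links = FRAMEWORK.get(generic_organ, {})
    return {
        species: (term_id, SPECIES_ONTOLOGIES[species][term_id])
        for species, term_id in links.items()
    }


if __name__ == "__main__":
    # One generic query fans out to the corresponding term in each species.
    for species, (term_id, label) in lookup("heart").items():
        print(species, term_id, label)
```

Because the generic term is the join point, a species whose ontology labels the structure differently (here the hypothetical "heart tube") is still returned by a search for "heart".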
Data warehousing approaches
In the data warehousing strategy, the data from different sources are extracted and integrated in a single database. For example, various 'omics' datasets may be integrated to provide insights into biological systems. Examples include data from genomics, transcriptomics, proteomics, interactomics, and metabolomics. Ideally, changes in these sources are regularly synchronized to the integrated database. The data is presented to users in a common format. Many programs that aid in creating such warehouses are designed to be versatile so that they can be used in diverse research projects. One advantage of this approach is that data is available for analysis at a single site, using a uniform schema. Some disadvantages are that the datasets are often huge and difficult to keep up to date, and that compiling such a warehouse is costly.
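The extract-and-integrate step can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions: the two sources, their field names, the gene identifiers, the values, and the `measurement` table are all invented, and a real warehouse would also need the periodic synchronization mentioned above.

```python
import csv
import io
import json
import sqlite3

# Two toy "sources" in different formats (contents are invented for illustration).
SOURCE_A_CSV = "gene,expression\nGENE1,2.5\nGENE2,0.7\n"
SOURCE_B_JSON = (
    '[{"gene_id": "GENE1", "protein_abundance": 130.0},'
    ' {"gene_id": "GENE3", "protein_abundance": 42.0}]'
)


def build_warehouse():
    """Load both sources into one in-memory SQLite database with a uniform schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement (gene TEXT, kind TEXT, value REAL, source TEXT)"
    )
    # Extract and transform source A (CSV with transcript-level values).
    for row in csv.DictReader(io.StringIO(SOURCE_A_CSV)):
        conn.execute(
            "INSERT INTO measurement VALUES (?, ?, ?, ?)",
            (row["gene"], "transcript", float(row["expression"]), "source_a"),
        )
    # Extract and transform source B (JSON with protein-level values).
    for item in json.loads(SOURCE_B_JSON):
        conn.execute(
            "INSERT INTO measurement VALUES (?, ?, ?, ?)",
            (item["gene_id"], "protein", float(item["protein_abundance"]), "source_b"),
        )
    conn.commit()
    return conn


def measurements_for(conn, gene):
    """Return (kind, value, source) rows for one gene, ordered by kind."""
    cur = conn.execute(
        "SELECT kind, value, source FROM measurement WHERE gene = ? ORDER BY kind",
        (gene,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    warehouse = build_warehouse()
    # One query now spans both original sources.
    print(measurements_for(warehouse, "GENE1"))
```

Once both sources share one schema, a single query answers a question that would otherwise require parsing a CSV file and a JSON document separately.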
Standardized formats for different types of data (e.g., protein data) are now emerging due to the influence of groups like the Proteomics Standards Initiative (PSI). Some data warehousing projects even require the submission of data in one of these new formats.
Other approaches
Data mining uses statistical methods to search for patterns in existing data. This method generally returns many patterns, some spurious and some significant, and every pattern the program finds must be evaluated individually. Some current research focuses on combining existing data mining techniques with novel pattern analysis methods that reduce the time spent reviewing each pattern found by the initial program and instead return a few results with a high likelihood of relevance.
[Belmamoune, et al. (2010). "Mining and Analysing Spatio-Temporal Patterns of Gene Expression in An Integrative database Framework." Journal of Integrative Bioinformatics. Retrieved 27 October 2012.] One drawback of this approach is that it does not integrate multiple databases, which means that comparisons across databases are not possible. The major advantage to this approach is that it allows for the generation of new hypotheses to test.
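The text does not say which methods are used to narrow many candidate patterns down to a few likely ones. As one standard technique that serves a similar purpose, the sketch below applies the Benjamini-Hochberg procedure, which keeps a subset of patterns while controlling the expected fraction of spurious ones. It is a swapped-in illustration, not the method of the cited work, and the p-values are invented.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of patterns kept by the Benjamini-Hochberg procedure.

    p_values: one p-value per candidate pattern.
    alpha: target false discovery rate.
    """
    m = len(p_values)
    # Indices of the patterns, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep track of the largest rank whose p-value is under its threshold.
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every pattern ranked at or below the cutoff is kept.
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Hypothetical p-values for six candidate patterns.
    candidate_p_values = [0.001, 0.04, 0.03, 0.2, 0.008, 0.6]
    print(benjamini_hochberg(candidate_p_values))
```

With these hypothetical inputs, four of the six patterns fall below an uncorrected 0.05 threshold, but only the two with the smallest p-values survive the correction, which is the kind of reduction in manual review that this section describes.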
See also
* Biological database
* Biological data visualization
* InterMine - an open-source biological data warehouse system
External links
* Journal of Integrative Bioinformatics
* IMBio
* GoPubMed
* BMC Bioinformatics
* Netherlands Bioinformatics Centre