
Literature-based discovery (LBD), also called literature-related discovery (LRD), is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications (the "literature") to find new relationships between existing knowledge (the "discovery"). Literature-based discovery aims to discover new knowledge by connecting information that has been explicitly stated in the literature to deduce connections that have not been explicitly stated. LBD can help researchers quickly discover and explore hypotheses, gain information on relevant advances inside and outside their niches, and increase interdisciplinary information sharing. The most basic and widespread type of LBD is called the ABC paradigm because it centers on three concepts called A, B and C. It states that if there is a connection between A and B and one between B and C, then there is one between A and C which, if not explicitly stated, is yet to be explored.
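
As a minimal sketch of the ABC paradigm, the following Python snippet takes a set of explicitly stated connections and returns every A-C pair that is implied through some B but not stated itself. The function name `abc_candidates` and the toy input are assumptions made for illustration; real systems extract the connections from the literature and rank the results.

```python
from collections import defaultdict

def abc_candidates(relations):
    """Return implied A-C pairs (with their linking Bs) that are not stated explicitly.

    `relations` is an iterable of (source, target) pairs, each read as
    "source is connected to target" in some publication.
    """
    stated = set(relations)
    targets = defaultdict(set)
    for a, b in stated:
        targets[a].add(b)

    candidates = defaultdict(set)
    for a, bs in targets.items():
        for b in bs:
            # .get() avoids inserting new keys while we iterate over `targets`.
            for c in targets.get(b, ()):
                if c != a and (a, c) not in stated:
                    candidates[(a, c)].add(b)
    return dict(candidates)

# Toy input, illustrative only.
relations = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud syndrome"),
]
print(abc_candidates(relations))
```

Each returned key is a candidate A-C connection and each value is the set of B concepts that supports it, which is the information a researcher would need to judge whether the hypothesis is worth exploring.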


History

The LBD technique was pioneered by Don R. Swanson in the 1980s. He hypothesized that the combination of two separately published results indicating an A-B relationship and a B-C relationship is evidence of an A-C relationship that is unknown or unexplored. He used this to propose fish oil as a treatment for Raynaud syndrome due to their shared relationship with blood viscosity. This hypothesis was later shown to have merit in a prospective study, and he went on to propose other discoveries using similar methods.


Swanson linking

''Swanson linking'' is a term proposed in 2003 that refers to connecting two pieces of knowledge previously thought to be unrelated. For example, it may be known that illness A is caused by chemical B, and that drug C reduces the amount of chemical B in the body. However, because the respective articles were published separately from one another (called "disjoint data"), the relationship between illness A and drug C may be unknown. ''Swanson linking'' aims to find these relationships and report them. Although the ABC paradigm is widely used, critics have argued that much of science is not captured by simple assertions but is instead built from analogies and images at a higher level of abstraction.


Systems

LBD generally comes in two flavours: open and closed discovery. In open discovery, only A is given. The approach finds Bs and uses them to return possibly interesting Cs to the user, thus ''generating hypotheses'' from A. In closed discovery, both A and C are given, and the approach seeks the Bs that can link the two, thus ''testing a hypothesis'' about A and C.

A number of systems to perform literature-based discovery have been developed over the years, extending the original idea of Don Swanson, and the evaluation of the quality of such systems is an active area of research. Some systems include web versions for increased user-friendliness. A common approach in many systems is the use of MeSH terms to represent scientific articles; this is done by the systems Manjal, BITOLA and LitLinker. One well-known system within the field is called ''Arrowsmith'' and is tailored to find connections between two disjoint sets of articles, an approach labeled "two-node" search. Another well-known system, LION LBD, uses PubTator to annotate PubMed scientific articles with concepts such as chemicals, genes/proteins, mutations, diseases and species, together with sentence-level annotation of cancer hallmarks that describe fundamental cancer processes and behaviour. It uses co-occurrence metrics to rank relations between concepts and performs both open and closed discovery.

While some LBD systems are based on traditional statistical methods, others leverage more sophisticated machine learning methods, such as neural networks. Some LBD systems represent the connections between concepts as a knowledge graph, and thus employ techniques of graph theory. The graph-based representation is also the foundation for LBD systems that employ graph databases like Neo4j, enabling discovery via graph query languages such as Cypher. Graph-based LBD systems represent the relations between concepts using different relation types, such as those in the UMLS Semantic Network. Some approaches go further and try to apply contextualized relations, an approach also used by the Gene Ontology for its Causal Activity Modeling (GO-CAM).
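
The difference between open and closed discovery can be sketched with simple document-level co-occurrence counts. In the Python snippet below, every document is reduced to a set of concept labels (standing in for, say, MeSH terms). The scoring rule, which sums over linking terms the weaker of the two co-occurrence counts, is an assumption chosen for brevity and is not the metric of any particular published system; all function names and the toy corpus are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how many documents mention each unordered pair of concepts."""
    counts = Counter()
    for concepts in docs:
        for x, y in combinations(sorted(set(concepts)), 2):
            counts[(x, y)] += 1
    return counts

def cooc(counts, x, y):
    """Co-occurrence count of x and y (0 if they never appear together)."""
    return counts[tuple(sorted((x, y)))]

def neighbours(counts, x):
    """Concepts that co-occur with x in at least one document."""
    return {b if a == x else a for (a, b) in counts if x in (a, b)}

def closed_discovery(counts, a, c):
    """Closed discovery: rank linking terms B that co-occur with both A and C."""
    bs = neighbours(counts, a) & neighbours(counts, c)
    # Assumed score: the weaker of the A-B and B-C links.
    return sorted(bs, key=lambda b: (-min(cooc(counts, a, b), cooc(counts, b, c)), b))

def open_discovery(counts, a):
    """Open discovery: rank targets C reachable via some B but never seen with A."""
    direct = neighbours(counts, a)
    scores = Counter()
    for b in direct:
        for c in neighbours(counts, b) - direct - {a}:
            scores[c] += min(cooc(counts, a, b), cooc(counts, b, c))
    return sorted(scores, key=lambda c: (-scores[c], c))

# Toy corpus with abstract labels; each set is one document.
docs = [
    {"A", "B1"}, {"A", "B1"}, {"A", "B2"},
    {"B1", "C1"}, {"B1", "C1"}, {"B2", "C1"}, {"B2", "C2"},
]
counts = cooccurrence(docs)
print(open_discovery(counts, "A"))          # candidate Cs, best first
print(closed_discovery(counts, "A", "C1"))  # linking Bs, best first
```

Open discovery starts from a single concept and returns a ranked list of targets, while closed discovery starts from both ends and returns the intermediate terms that connect them.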


Use of databases

Besides extracting information from the body of scientific articles, LBD systems often employ structured knowledge from biocurated biological resources, like the Online Mendelian Inheritance in Man (OMIM).


List of systems

These are the published LBD systems, ordered by date of publication:
* 1986 - Arrowsmith
* 2000 - BITOLA V1
* 2001 - DAD
* 2003 - LitLinker
* 2004 - ACS
* 2004 - Manjal
* 2004 - IRIDESCENT
* 2005 - BITOLA V2
* 2006 - LitLinker V2
* 2007 - Arrowsmith V2
* 2008 - Anni 2.0
* 2008 - CoPub Discovery
* 2009 - RajoLink
* 2010 - Sem-BT
* 2015 - Obvio
* 2016 - Spark
* 2017 - Mine the gap
* 2019 - LION LBD


Semantic typing

A common task in literature-based discovery is assigning words or concepts to semantic types. A concept might be classified under one type or under several. For example, in the Unified Medical Language System (UMLS) the term ''migraine'' is classified under the type ''disease and syndrome'', while the term ''magnesium'' is classified under two types: ''biologically active substance'' and ''element, ion, or isotope''. The ''typing'' of concepts focuses discovery on connections between particular classes of concepts, e.g. ''diseases''-''genes'' or ''diseases''-''drugs''.
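
A small Python sketch of how semantic types can restrict candidate connections to a chosen class pairing is given below. The type assignments for ''migraine'' and ''magnesium'' are the ones quoted above; the dictionary layout and the helper names `has_type` and `filter_pairs` are assumptions made for illustration and do not reflect the actual UMLS data format.

```python
# Semantic types per concept; a concept may carry more than one type.
SEMANTIC_TYPES = {
    "migraine": {"disease and syndrome"},
    "magnesium": {"biologically active substance", "element, ion, or isotope"},
}

def has_type(concept, wanted, types=SEMANTIC_TYPES):
    """True if the concept carries at least one of the wanted semantic types."""
    return bool(types.get(concept, set()) & set(wanted))

def filter_pairs(pairs, source_types, target_types, types=SEMANTIC_TYPES):
    """Keep only (source, target) pairs whose ends match the requested types."""
    return [
        (s, t)
        for s, t in pairs
        if has_type(s, source_types, types) and has_type(t, target_types, types)
    ]

# Hypothetical candidate connections produced by an earlier discovery step.
candidates = [
    ("magnesium", "migraine"),
    ("migraine", "magnesium"),
    ("magnesium", "unknown term"),
]
kept = filter_pairs(
    candidates,
    source_types={"biologically active substance"},
    target_types={"disease and syndrome"},
)
print(kept)
```

Concepts without a known type are dropped here; a real system might instead keep them for manual review.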


System evaluation

The evaluation of literature-based discoveries is challenging, and includes both experimental and ''in silico'' methods. Evaluation methods try to quantify the knowledge generated by systems, which should be provided in an amount and richness that is useful for scientists. Evaluation is difficult in LBD for several reasons: disagreement about the role of LBD systems in research and thus about what makes a successful one; difficulty in determining how useful, interesting or actionable a discovery is; and difficulty in objectively defining a ‘discovery’, which hinders the creation of a standard evaluation set that quantifies when a discovery has been replicated or found.

A popular method used in LBD is to ''replicate previous discoveries''. These are usually LBD-based discoveries, as they are relatively easy to quantify compared to other discoveries. There are only a handful of such discoveries, and approaches tuned to perform well on them might not generalise. In this type of evaluation, the literature published before the discovery to be replicated is used to generate a ranked list of discovery candidates as target or linking terms. Success is measured by reporting the rank of the term(s) of interest; the closer to the top of the list, the better the approach.

''Literature- or time-slicing'' involves splitting the existing literature at a point in time. The LBD system is then exposed to the literature before the split and is evaluated by how many of the discoveries in the later period it can discover. LBD systems have used term co-occurrences, relationships from external biomedical resources (e.g. SemMedDB) and semantic relationships to generate the gold standards. A high-precision approach is to use expert opinion to generate the gold standard, but this is time-consuming, expensive and tends to produce low recall. The advantage of time-slicing over the replication of previous discoveries is evaluation on a large number of test instances. This raises the need for evaluation metrics that can quantify performance on large, ranked lists. LBD works have used metrics popular in information retrieval, including precision, recall, area under the curve (AUC), precision at ''k'', mean average precision (MAP) and others.

The approach of ''proposing new discoveries or treatments'' goes beyond replicating past discoveries or predicting time-sliced instances of a particular relationship and shows that a system is capable of being used in realistic situations. This is usually accompanied by peer-reviewed publication in the domain or vetting by a domain expert.
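
Time-slicing and the ranked-list metrics mentioned above can be sketched in a few lines of Python. The snippet assumes that each concept pair is tagged with the year it first appeared in the literature; the data, the cutoff and the `ranked` list standing in for a system's output are all hypothetical.

```python
def time_slice(first_seen, cutoff):
    """Split {pair: year first reported} into pairs known before the cutoff
    (system input) and pairs first reported from the cutoff on (gold standard)."""
    before = {p for p, year in first_seen.items() if year < cutoff}
    after = {p for p, year in first_seen.items() if year >= cutoff}
    return before, after

def precision_at_k(ranked, gold, k):
    """Fraction of the top-k predictions that are in the gold standard."""
    return sum(1 for p in ranked[:k] if p in gold) / k

def recall(ranked, gold):
    """Fraction of gold-standard pairs that appear anywhere in the ranking."""
    return len(set(ranked) & gold) / len(gold) if gold else 0.0

def average_precision(ranked, gold):
    """Mean of precision@i over the ranks i at which a gold pair is retrieved."""
    hits, total = 0, 0.0
    for i, p in enumerate(ranked, start=1):
        if p in gold:
            hits += 1
            total += hits / i
    return total / len(gold) if gold else 0.0

def mean_average_precision(rankings, golds):
    """MAP: average precision averaged over several queries."""
    aps = [average_precision(r, g) for r, g in zip(rankings, golds)]
    return sum(aps) / len(aps) if aps else 0.0

# Hypothetical first-publication years for concept pairs.
first_seen = {("A", "B"): 1990, ("B", "C"): 1995, ("A", "C"): 2005, ("A", "D"): 2010}
known, gold = time_slice(first_seen, cutoff=2000)

# Hypothetical ranked output of an LBD system that only saw `known`.
ranked = [("A", "C"), ("A", "E"), ("A", "D")]
print(precision_at_k(ranked, gold, k=2), recall(ranked, gold), average_precision(ranked, gold))
```

Because the gold standard is derived automatically from the later time slice, the same functions can be applied to thousands of test instances, which is the main advantage of time-slicing noted above.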


Text mining

The automation of literature-based discovery relies heavily on text mining. The language in scientific articles often includes ambiguities, and an important step for coherent parsing of the literature is extracting the sense of each term in the context in which it is used, a task called word-sense disambiguation (WSD). For example, gene symbols such as CT (''PCYT1A'') and MR (''NR3C2'') can be confused with the acronyms for computed tomography and magnetic resonance, requiring sophisticated disambiguation systems. Terms are often reconciled to ontologies or other sources of unique identifiers, such as the Unified Medical Language System (UMLS). This process of mapping multiple different utterances to a single name or identifier is called normalization.
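
The following Python sketch illustrates normalization together with a deliberately crude form of word-sense disambiguation. The identifiers (`EX:0001` and so on) are placeholders rather than real UMLS codes, the cue words are hand-picked assumptions, and counting cue-word overlap is only a stand-in for the far more sophisticated disambiguation methods used in practice.

```python
import re

# Placeholder identifiers (NOT real UMLS codes) mapped to their surface forms.
SYNONYMS = {
    "EX:0001": ["Raynaud syndrome", "Raynaud's phenomenon"],
    "EX:0002": ["magnetic resonance"],
    "EX:0003": ["NR3C2"],
}

# Ambiguous abbreviation -> candidate identifiers with hand-picked cue words.
AMBIGUOUS = {
    "MR": {
        "EX:0002": {"imaging", "scan", "scanner"},
        "EX:0003": {"gene", "expression", "receptor"},
    },
}

def _key(text):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return re.sub(r"\s+", " ", text.strip()).casefold()

LOOKUP = {_key(form): cid for cid, forms in SYNONYMS.items() for form in forms}

def normalize(mention, context=""):
    """Map a mention to an identifier, or None if unknown or undecidable."""
    senses = AMBIGUOUS.get(mention.strip())  # abbreviations are matched case-sensitively
    if senses:
        words = set(re.findall(r"[a-z0-9]+", context.casefold()))
        scores = {cid: len(cues & words) for cid, cues in senses.items()}
        best = max(scores.values())
        winners = [cid for cid, s in scores.items() if s == best]
        # Refuse to guess when no cue matched or when senses are tied.
        return winners[0] if best > 0 and len(winners) == 1 else None
    return LOOKUP.get(_key(mention))

print(normalize("raynaud's   phenomenon"))
print(normalize("MR", context="MR imaging of the brain"))
print(normalize("MR", context="MR gene expression in the kidney"))
```

Returning `None` for undecidable mentions keeps ambiguous terms out of the downstream discovery step instead of silently linking them to the wrong concept.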


Usage


Life sciences

LBD has already been used in different ways to identify new connections between biomedical entities and new candidate genes and treatments for illnesses.


Drug discovery

LBD has seen use in drug development and repurposing as well as predicting adverse drug reactions. The method of literature-based discovery has been used to search for treatments for a number of human diseases, including:
* diabetic retinopathy
* dilated cardiomyopathy
* Parkinson's disease
* prostate cancer
* gastric cancer
* multiple sclerosis


Gene and protein function discovery

The approach has also been used to propose relations of genes with particular diseases, like breast cancer. In the context of systems vaccinology, it was used to identify proteins related to interferon gamma and that play a role in the response to vaccines. It has also been used to propose mechanisms for currently used drugs.


Biomarker discovery

LBD has been explored as a tool to identify biomarkers for the diagnosis and prognosis of diseases, e.g. for the risk of type 2 diabetes.


Other uses

Besides providing scientific hypotheses about the world, LBD has also been used to improve data analysis, via the automatic identification of possible confounding factors using the medical literature. It has also been used to better understand disease etiology and the relation between different diseases, for example by looking for the genes connecting myocardial infarction and depression, and for connections between psychiatric and somatic diseases.


Beyond life sciences

LBD has mostly been deployed in the biomedical domain, but it has also been applied outside of it, in research into developing water purification systems, accelerating the development of developing countries and identifying promising research collaborations.


See also

* Arrowsmith System
* Implicature
* Latent semantic indexing
* Metaphor
* Text mining
* Biocuration
* BioCreative


Additional reading

* Wilson, Patrick (1977). ''Public Knowledge, Private Ignorance: Toward a Library and Information Policy''. Greenwood Publishing Group. p. 156.

