Biomedical text mining (including biomedical natural language processing or BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics, and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Considerations

Applying text mining approaches to biomedical text requires specific considerations common to the domain.

Availability of annotated text data

Figure: Westergaard et al. (2018), PLOS Computational Biology, Fig. 1.

Large annotated corpora used in the development and training of general purpose text mining methods (e.g., sets of movie dialogue, product reviews, or Wikipedia article text) are not specific for biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora. Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges and biomedical informatics researchers. Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH).

Machine learning-based methods often require very large data sets as training data to build useful models. Manual annotation of large text corpora is not realistically possible. Training data may therefore be products of weak supervision or purely statistical methods.
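
As a rough illustration of weak supervision, the sketch below labels sentences with a few hand-written heuristics (often called labeling functions) and combines their votes by simple majority. The function names, the tiny term lists, and the tie-handling rule are all assumptions made for this example; practical systems would typically draw their heuristics from resources such as UMLS and combine votes with more careful models.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary and suffix list, for illustration only.
DISEASE_TERMS = {"diabetes", "asthma", "hypertension"}
DISEASE_SUFFIXES = ("itis", "oma", "emia")

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def tokens(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def lf_dictionary(sentence):
    """Vote POSITIVE if a known disease term appears, otherwise abstain."""
    return POSITIVE if DISEASE_TERMS & set(tokens(sentence)) else ABSTAIN


def lf_suffix(sentence):
    """Vote POSITIVE if any token ends in a disease-like suffix."""
    hit = any(t.endswith(DISEASE_SUFFIXES) for t in tokens(sentence))
    return POSITIVE if hit else ABSTAIN


def lf_no_patient_mention(sentence):
    """Vote NEGATIVE if the sentence never mentions a patient."""
    mentions = set(tokens(sentence)) & {"patient", "patients"}
    return ABSTAIN if mentions else NEGATIVE


LABELING_FUNCTIONS = [lf_dictionary, lf_suffix, lf_no_patient_mention]


def weak_label(sentence, labeling_functions=LABELING_FUNCTIONS):
    """Combine non-abstaining votes by majority; abstain on ties or no votes."""
    votes = [lf(sentence) for lf in labeling_functions]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics, no clear majority
    return ranked[0][0]
```

Labels produced this way are noisy by design. The sentence "Hepatitis was diagnosed." receives one positive and one negative vote here, so the sketch abstains rather than guessing.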

Data structure variation

Like other text documents, biomedical documents contain unstructured data. Research publications follow different formats, contain different types of information, and are interspersed with figures, tables, and other non-text content. Both unstructured text and semi-structured document elements, such as tables, may contain important information that should be text mined. Clinical documents may vary in structure and language between departments and locations. Other types of biomedical text, such as drug labels, may follow general structural guidelines but lack further details.

Uncertainty

Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.

Supporting clinical needs

Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians. This is a concern in environments where clinical decision support is expected to be informative and accurate. A comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases has been published.

Interoperability with clinical systems

New text mining systems must work with existing standards, electronic medical records, and databases. Methods for interfacing with clinical systems such as LOINC have been developed but require extensive organizational effort to implement and maintain.

Patient privacy

Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.

Processes

Specific subtasks are of particular concern when processing biomedical text.

Named entity recognition

Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes, chemical compounds and drugs, and disease names have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.
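
A vocabulary-supported approach to entity recognition can be sketched as a dictionary lookup. In the example below, the dictionary contents, the entity type labels, and the function name `tag_entities` are hypothetical; it is a minimal sketch of the general idea rather than a description of any particular published system.

```python
import re

# Hypothetical toy dictionary mapping surface forms to entity types.
ENTITY_DICTIONARY = {
    "brca1": "GENE",
    "tp53": "GENE",
    "insulin": "PROTEIN",
    "aspirin": "CHEMICAL",
    "breast cancer": "DISEASE",
}


def tag_entities(text, dictionary=ENTITY_DICTIONARY):
    """Return (start, end, surface form, entity type) for each dictionary match.

    Longer terms are listed first in the pattern so that multi-word names
    are preferred over any shorter names they contain.
    """
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


mentions = tag_entities("BRCA1 mutations increase the risk of breast cancer.")
```

Exact matching of this kind misses spelling variants, abbreviations, and names absent from the dictionary, which is one reason learned models are also used for this task.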

Document classification and clustering

Biomedical documents may be classified or clustered based on their contents and topics. In classification, document categories are specified manually, while in clustering, documents form algorithm-dependent, distinct groups. These two tasks are representative of supervised and unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features. Methods for biomedical document clustering have relied upon ''k''-means clustering.
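
The following is a minimal sketch of ''k''-means clustering applied to documents, written with the Python standard library only. It assumes raw term counts as document features and hand-picked initial centroids so that the result is reproducible; the example documents and function names are hypothetical, and practical systems would normally use weighted features and a tested library implementation.

```python
import math
import re
from collections import Counter


def vectorize(documents):
    """Turn each document into a term-count vector over a shared vocabulary."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    vocabulary = sorted({t for doc_tokens in tokenized for t in doc_tokens})
    vectors = []
    for doc_tokens in tokenized:
        counts = Counter(doc_tokens)
        vectors.append([float(counts[t]) for t in vocabulary])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, initial_centroids, max_iterations=100):
    """Lloyd's algorithm; returns one cluster index per input vector."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical example documents.
DOCUMENTS = [
    "gene expression in tumor cells",
    "tumor gene mutation and expression",
    "patient blood pressure measurement",
    "blood pressure in elderly patient",
]
vectors = vectorize(DOCUMENTS)
labels = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
```

With these inputs the two molecular biology documents fall into one cluster and the two clinical documents into the other. No category names are supplied at any point, which is what distinguishes clustering from classification.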

Relationship discovery

Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time (i.e., temporal relationships), or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.
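
One simple way to surface candidate relationships is to count how often two recognized entities appear in the same sentence. The sketch below assumes a hypothetical, pre-defined entity list in place of a real named entity recognition step and uses a naive sentence splitter; co-occurrence only suggests that a relationship may exist and says nothing about its type or direction.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list; in practice this would come from an NER step.
ENTITIES = {"tp53", "mdm2", "brca1", "aspirin"}


def cooccurrence_counts(text, entities=ENTITIES):
    """Count how often each pair of known entities shares a sentence."""
    counts = Counter()
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = sorted(set(re.findall(r"[a-z0-9]+", sentence.lower())) & entities)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


TEXT = (
    "MDM2 binds TP53. TP53 is regulated by MDM2 in many cells. "
    "BRCA1 was also measured."
)
pair_counts = cooccurrence_counts(TEXT)
```

Pairs are stored in alphabetical order, so the two sentences mentioning both MDM2 and TP53 are counted under the single key `("mdm2", "tp53")`.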

Hedge cue detection

The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.
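
A lexicon-based detector gives a sense of what hedge cue detection involves. The cue list below is a deliberately short, hypothetical sample, and the function names are assumptions for this sketch; published approaches generally rely on annotated corpora and trained models rather than a fixed word list.

```python
import re

# Hypothetical, deliberately short list of hedge cues.
HEDGE_CUES = [
    "may", "might", "could", "suggest", "suggests", "possibly",
    "potentially", "appears to", "is thought to",
]

# Longer cues come first so that "suggests" is tried before "suggest".
_CUE_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(HEDGE_CUES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_hedge_cues(sentence):
    """Return the hedge cues found in a sentence, in order of appearance."""
    return [m.group(0).lower() for m in _CUE_PATTERN.finditer(sentence)]


def is_hedged(sentence):
    """True if the sentence contains at least one listed hedge cue."""
    return bool(find_hedge_cues(sentence))
```

A word list alone produces false positives (for example, "May" used as a month) and cannot determine which part of a sentence a cue applies to, so it should be read as a starting point only.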

Claim detection

Multiple researchers have developed methods to identify specific scientific claims from literature. In practice, this process involves both isolating phrases and sentences denoting the core arguments made by the authors of a document (a process known as argument mining, employing tools used in fields such as political science) and comparing claims to find potential contradictions between them.

Information extraction

Information extraction, or IE, is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or knowledge base. In the biomedical domain, IE is used to generate links between concepts described in text, such as ''gene A inhibits gene B'' and ''gene C is involved in disease G.'' Biomedical knowledge bases containing this type of information are generally products of extensive manual curation, so replacement of manual efforts with automated methods remains a compelling area of research.
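
The two example statements above can be turned into structured triples with a small pattern-based extractor. This is a minimal sketch: the relation labels, the verb list, and the assumption that entities are written as "gene X" or "disease X" are all hypothetical simplifications, and real text requires far more robust entity and relation handling.

```python
import re

# Hypothetical relation phrases mapped to normalized relation labels.
RELATION_VERBS = {
    "inhibits": "INHIBITS",
    "activates": "ACTIVATES",
    "is involved in": "INVOLVED_IN",
}

_VERBS = "|".join(
    re.escape(v) for v in sorted(RELATION_VERBS, key=len, reverse=True)
)
_TRIPLE = re.compile(
    r"(?P<subject>(?:gene|disease) \w+) "
    r"(?P<verb>" + _VERBS + r") "
    r"(?P<object>(?:gene|disease) \w+)",
    re.IGNORECASE,
)


def extract_triples(text):
    """Return (subject, relation, object) triples matched by the pattern."""
    return [
        (m.group("subject"), RELATION_VERBS[m.group("verb").lower()], m.group("object"))
        for m in _TRIPLE.finditer(text)
    ]


triples = extract_triples(
    "gene A inhibits gene B and gene C is involved in disease G."
)
```

Each triple is in a form that could be stored as a row in a template or as an edge in a knowledge base.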

Information retrieval and question answering

Biomedical text mining supports applications for identifying documents and concepts matching search queries. Search engines such as PubMed search allow users to query literature databases with words or phrases present in document contents, metadata, or indices such as MeSH. Similar approaches may be used for medical literature retrieval. For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.

On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset (CORD-19) to enable text mining of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the Allen Institute for AI. Other participants include Google, Microsoft Research, the Center for Security and Emerging Technology, and the Chan Zuckerberg Initiative.
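
The keyword and index-term search described at the start of this section can be sketched with a small inverted index. The document records, the field names `text` and `terms`, and the AND-only query semantics are assumptions made for this example and do not describe how any particular search engine is implemented.

```python
import re
from collections import defaultdict

# Hypothetical records: free text plus index terms in the style of subject headings.
DOCUMENTS = {
    1: {
        "text": "Aspirin reduces the risk of myocardial infarction.",
        "terms": ["Aspirin", "Myocardial Infarction"],
    },
    2: {
        "text": "Statins lower cholesterol in adults.",
        "terms": ["Cholesterol", "Cardiovascular Diseases"],
    },
    3: {
        "text": "Low-dose aspirin and bleeding risk.",
        "terms": ["Aspirin", "Hemorrhage"],
    },
}


def tokenize(text):
    """Lowercase text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the ids of documents whose text or terms contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for field in [doc["text"], *doc["terms"]]:
            for token in tokenize(field):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return set()
    result = set(index.get(query_tokens[0], set()))
    for token in query_tokens[1:]:
        result &= index.get(token, set())
    return result


INDEX = build_index(DOCUMENTS)
```

Because index terms are indexed alongside the text, the query "cardiovascular" retrieves document 2 even though that word never appears in its text.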

Resources

Corpora

The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references. Items marked "Yes" under "Freely Available" can be downloaded from a publicly accessible location.

Word embeddings

Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as word vectors or word embeddings. Sources of pre-trained embeddings specific for biomedical vocabulary are listed in the table below. The majority are results of the word2vec model developed by Mikolov ''et al.'' or variants of word2vec.
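
Once words are mapped to vectors, relatedness between terms is commonly estimated with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are learned from large corpora and usually have hundreds of dimensions, and the names `TOY_EMBEDDINGS` and `most_similar` are hypothetical.

```python
import math

# Made-up vectors for illustration only; these are not trained embeddings.
TOY_EMBEDDINGS = {
    "aspirin": [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "kidney": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings=TOY_EMBEDDINGS):
    """Return the other word whose vector is closest by cosine similarity."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))
```

With these toy values the nearest neighbour of "aspirin" is "ibuprofen", mirroring the way trained embeddings tend to place words used in similar contexts close together.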

Applications

Text mining applications in the biomedical field include computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations.

Gene cluster identification

Methods have been developed for associating gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature.

Protein interactions

Automatic extraction of protein interactions and associations of proteins to functional concepts (e.g. gene ontology terms) has been explored. The search engine PIE was developed to identify and return protein-protein interaction mentions from MEDLINE-indexed articles. The extraction of kinetic parameters or of the subcellular location of proteins from text has also been addressed by information extraction and text mining technology.

Gene-disease associations

Text mining can aid in gene prioritization, or identification of genes most likely to contribute to genetic disease. One group compared several vocabularies, representations and ranking algorithms to develop gene prioritization benchmarks.

Gene-trait associations

An agricultural genomics group identified genes related to bovine reproductive traits using text mining, among other approaches.

Protein-disease associations

Text mining enables an unbiased evaluation of protein-disease relationships within a vast quantity of unstructured textual data.

Applications of phrase mining to disease associations

A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), to semantically score all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The study validated existing relationships and pointed to previously unrecognized biological processes in cardiovascular pathophysiology.

Software tools

Search engines

Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific for research literature include PubMed search, Europe PubMed Central search, GeneView, and APSE. Similarly, search engines and indexing systems specific for biomedical data have been developed, including DataMed and OmicsDI. Some search engines, such as Essie, OncoSearch, PubGene, and GoPubMed, were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.

Medical record analysis systems

Electronic medical records (EMRs) and electronic health records (EHRs) are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text and difficult to search, leading to challenges with patient care. Numerous complete systems and tools have been developed to analyse these free-text portions. The MedLEE system was originally developed for analysis of chest radiology reports but later extended to other report topics. The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts. The CLAMP system offers similar functionality with a user-friendly interface.

Frameworks

Computational frameworks have been developed to rapidly build tools for biomedical text mining tasks. SwellShark is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision (e.g., UMLS semantic types). The SparkText framework uses Apache Spark data streaming, a NoSQL database, and basic machine learning methods to build predictive models from scientific articles.

APIs

Some biomedical text mining and natural language processing tools are available through application programming interfaces, or APIs. NOBLE Coder performs concept recognition through an API.

Conferences

The following academic conferences and workshops host discussions and presentations in biomedical text mining advances. Most publish proceedings.

Journals

A variety of academic journals publishing manuscripts on biology and medicine include topics in text mining and natural language processing software. Some journals, including the ''Journal of the American Medical Informatics Association'' (JAMIA) and the ''Journal of Biomedical Informatics'', are popular publications for these topics.


External links

Bio-NLP resources, systems and application database collection

The BioNLP mailing list archives

Corpora for biomedical text mining

The BioCreative evaluations of biomedical text mining technologies
