Biomedical text mining (including biomedical natural language processing or BioNLP) refers to the methods and study of how
text mining
may be applied to texts and literature of the
biomedical
domain. As a field of research, biomedical text mining incorporates ideas from
natural language processing
,
bioinformatics
,
medical informatics
and
computational linguistics
. The strategies in this field have been applied to the biomedical literature available through services such as
PubMed
.
In recent years, the scientific literature has shifted to electronic publishing, but the volume of information available can be overwhelming. This shift has created high demand for text mining techniques, which include information retrieval (IR) and entity recognition (ER).
IR allows the retrieval of relevant papers according to the topic of interest, e.g. through PubMed. ER identifies specific biological terms (e.g.
proteins
or
genes
) for further processing.
Considerations
Applying text mining approaches to biomedical text requires specific considerations common to the domain.
Availability of annotated text data

Large annotated
corpora
used in the development and training of general purpose text mining methods (e.g., sets of movie dialogue, product reviews, or Wikipedia article text) are not specific for biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora.
Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges
and biomedical informatics researchers. Text mining researchers frequently combine these corpora with the
controlled vocabularies
and
ontologies
available through the
National Library of Medicine's Unified Medical Language System (UMLS) and
Medical Subject Headings (MeSH).
Machine learning
-based methods often require very large data sets as training data to build useful models. Manual annotation of large text corpora is not realistically possible. Training data may therefore be products of weak supervision or purely statistical methods.
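As a rough illustration of weak supervision, the sketch below combines a few hand-written labeling functions by majority vote to produce noisy training labels. The function names, the heuristics, and the `GENE`/`DRUG` label set are hypothetical and chosen only for this example; practical systems typically use many more heuristics and a learned model to reconcile their votes.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

def lf_mentions_dose(sentence):
    # Hypothetical heuristic: a dosage unit hints at a drug-related sentence.
    return "DRUG" if " mg" in sentence.lower() else ABSTAIN

def lf_gene_like_token(sentence):
    # Hypothetical heuristic: an upper-case token containing a digit, e.g. "TP53".
    for token in sentence.split():
        if token.isupper() and any(ch.isdigit() for ch in token):
            return "GENE"
    return ABSTAIN

def lf_expression_keyword(sentence):
    # Hypothetical heuristic: the word "expression" hints at a gene-related sentence.
    return "GENE" if "expression" in sentence.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_dose, lf_gene_like_token, lf_expression_keyword]

def weak_label(sentence, lfs=LABELING_FUNCTIONS):
    """Return the majority label among non-abstaining functions, or ABSTAIN."""
    votes = Counter(v for v in (lf(sentence) for lf in lfs) if v is not ABSTAIN)
    return votes.most_common(1)[0][0] if votes else ABSTAIN

for s in ["TP53 expression was reduced.",
          "Patients received 50 mg daily.",
          "The weather was fine."]:
    print(s, "->", weak_label(s))
```

Labels produced this way are noisy by design; they trade annotation accuracy for the ability to label far more text than could be annotated by hand.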
Data structure variation
Like other text documents, biomedical documents contain
unstructured data
.
Research publications follow different formats, contain different types of information, and are interspersed with figures, tables, and other non-text content. Both unstructured text and semi-structured document elements, such as tables, may contain important information that should be text mined. Clinical documents may vary in structure and language between departments and locations. Other types of biomedical text, such as drug labels, may follow general structural guidelines but lack further details.
Uncertainty
Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.
Supporting clinical needs
Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians.
This is a concern in environments where
clinical decision support is expected to be informative and accurate. A comprehensive overview of the development and uptake of NLP methods applied to free-text clinical notes related to chronic diseases has been published.
Interoperability with clinical systems
New text mining systems must work with existing standards, electronic medical records, and databases.
Methods for interfacing with clinical systems such as
LOINC have been developed but require extensive organizational effort to implement and maintain.
Patient privacy
Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.
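One simple, and by itself insufficient, approach to anonymization is pattern-based scrubbing of obvious identifiers. The sketch below is an assumption-laden illustration: the three patterns (dates, phone numbers, and a record number prefixed with "MRN") and the placeholder labels are hypothetical, and real de-identification must also handle names, addresses, and many other identifier types.

```python
import re

# Hypothetical identifier patterns, for illustration only.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]

def scrub(note):
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        note = pattern.sub(f"[{label}]", note)
    return note

note = "Seen on 03/14/2019, MRN: 448812. Call 555-867-5309 with results."
print(scrub(note))
```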
Processes
Specific subtasks are of particular concern when processing biomedical text.
Named entity recognition
Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes, chemical compounds and drugs, and disease names have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.
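A vocabulary-supported approach can be sketched as a greedy longest-match lookup against a term list. The `LEXICON` entries, the label names, and the `tag_entities` function below are hypothetical and serve only to illustrate the idea; they do not represent any particular published system.

```python
import re

# Hypothetical toy vocabulary; a real system would load terms from a
# curated resource such as a controlled vocabulary.
LEXICON = {
    "brca1": "GENE",
    "tp53": "GENE",
    "tumor necrosis factor": "PROTEIN",
    "breast cancer": "DISEASE",
    "tamoxifen": "DRUG",
}

def tag_entities(text, lexicon=LEXICON, max_len=3):
    """Return (start, end, surface, label) tuples using greedy longest match."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            surface = text[start:end]
            label = lexicon.get(surface.lower())
            if label:
                entities.append((start, end, surface, label))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities

sentence = "Mutations in BRCA1 are linked to breast cancer; tamoxifen is one treatment."
for entity in tag_entities(sentence):
    print(entity)
```

Dictionary lookup of this kind cannot recognize names absent from the vocabulary, which is one motivation for the learned approaches mentioned above.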
Document classification and clustering
Biomedical documents may be
classified
or
clustered based on their contents and topics. In classification, document categories are specified manually, while in clustering, documents form algorithm-dependent, distinct groups.
These two tasks are representative of
supervised and
unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features. Methods for biomedical document clustering have relied upon
''k''-means clustering.
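The following is a minimal sketch of ''k''-means clustering applied to bag-of-words document vectors, using only the Python standard library. The four toy documents, the whitespace tokenizer, and the deterministic farthest-point initialisation are assumptions made for the example; published biomedical clustering methods use richer features and more careful initialisation.

```python
from collections import Counter

def vectorize(docs):
    """Turn each document into a bag-of-words count vector over a shared vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the vector farthest from its nearest existing centroid.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "gene expression microarray tumor",
    "patient trial dosage outcome",
    "microarray gene expression profile",
    "dosage trial patient safety",
]
print(kmeans(vectorize(docs), k=2))
```

No category labels are supplied; the grouping emerges from word overlap alone, which is what distinguishes clustering from classification.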
Relationship discovery
Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time (i.e.,
temporal relationships), or
causal
relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.
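A simple baseline for relation discovery is to count how often two recognized entities appear in the same sentence. The sketch below assumes the entity list is already known (for instance from a prior NER step); the naive sentence splitter, the substring matching, and the example text are illustrative assumptions, and co-occurrence alone does not establish the type or direction of a relationship.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(text, entities):
    """Count sentence-level co-occurrences for every pair of known entities."""
    pairs = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        present = sorted(e for e in entities if e.lower() in lowered)
        pairs.update(combinations(present, 2))
    return pairs

text = ("Insulin signalling activates IRS1. Glucagon opposes insulin. "
        "IRS1 is phosphorylated after insulin stimulation.")
counts = cooccurrence(text, ["insulin", "IRS1", "glucagon"])
for pair, n in counts.most_common():
    print(pair, n)
```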
Hedge cue detection
The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.
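A lexicon-based detector gives a rough idea of how hedge cues can be flagged. The cue list and function names below are hypothetical and deliberately small; published systems generally learn cues and their scope from annotated corpora rather than relying on a fixed list.

```python
import re

# Illustrative cue list only; not drawn from any specific annotated corpus.
HEDGE_CUES = ["may", "might", "could", "suggest", "suggests", "possibly",
              "appears to", "is thought to", "putative"]

# Longer cues come first so multi-word cues win over their substrings.
_CUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in
                      sorted(HEDGE_CUES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def find_hedge_cues(sentence):
    """Return the hedge cues found in a sentence, in order of appearance."""
    return [m.group(1).lower() for m in _CUE_RE.finditer(sentence)]

def is_hedged(sentence):
    """True if the sentence contains at least one known hedge cue."""
    return bool(find_hedge_cues(sentence))

print(find_hedge_cues("These results suggest that gene A may regulate gene B."))
print(is_hedged("Gene A inhibits gene B."))
```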
Claim detection
Multiple researchers have developed methods to identify specific scientific claims from literature.
In practice, this process involves both isolating phrases and sentences denoting the core arguments made by the authors of a document (a process known as
argument mining, employing tools used in fields such as political science) and comparing claims to find potential contradictions between them.
Information extraction
Information extraction, or IE, is the process of automatically identifying structured information from
unstructured or partially structured text. IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or
knowledge base
. In the biomedical domain, IE is used to generate links between concepts described in text, such as ''gene A inhibits gene B'' and ''gene C is involved in disease G.'' Biomedical knowledge bases containing this type of information are generally products of extensive manual curation, so replacement of manual efforts with automated methods remains a compelling area of research.
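Statements of the form given above can be turned into structured triples with a pattern-based extractor. The regular expression, the small set of predicates, and the `extract_triples` function are assumptions for illustration; real biomedical text rarely follows such a rigid template, and practical IE systems combine NER with learned relation classifiers.

```python
import re

# Hypothetical pattern covering only the sentence shapes used in this example.
RELATION_RE = re.compile(
    r"(?P<subject>gene \w+) "
    r"(?P<predicate>inhibits|activates|is involved in) "
    r"(?P<object>(?:gene|disease) \w+)",
    re.IGNORECASE,
)

def extract_triples(text):
    """Return (subject, predicate, object) triples matched in the text."""
    return [(m["subject"], m["predicate"], m["object"])
            for m in RELATION_RE.finditer(text)]

text = "Gene A inhibits gene B. Gene C is involved in disease G."
for triple in extract_triples(text):
    print(triple)
```

Each triple could then populate a row of a template or an edge in a knowledge base.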
Information retrieval and question answering
Biomedical text mining supports applications for identifying documents and concepts matching search queries. Search engines such as
PubMed
search allow users to query literature databases with words or phrases present in document contents,
metadata
, or
indices such as
MeSH
. Similar approaches may be used for
medical literature retrieval. For more fine-grained results, some applications permit users to search with
natural language queries and identify specific biomedical relationships.
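Keyword-based retrieval can be illustrated with a small TF-IDF ranking function. This is a sketch under simplifying assumptions (whitespace tokenization, a smoothed IDF formula chosen for the example, three toy documents) and is not a description of how PubMed or any other production search engine ranks results.

```python
import math
from collections import Counter

def tokenize(text):
    """Lower-case and split on whitespace; punctuation is not handled."""
    return text.lower().split()

def rank(query, docs):
    """Return indices of matching documents, best TF-IDF score first."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    scores = []
    for i, counts in enumerate(tokenized):
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in tokenized if term in c)  # document frequency
            if df:
                idf = math.log((1 + n) / (1 + df)) + 1  # smoothed IDF
                score += counts[term] * idf
        scores.append((score, i))
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]

docs = [
    "protein kinase signaling in yeast",
    "statin therapy lowers cholesterol in patients",
    "cholesterol biosynthesis and statin resistance",
]
print(rank("statin therapy", docs))
```

Rarer query terms receive a higher weight, so a document matching both "statin" and the less common "therapy" ranks above one matching "statin" alone.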
On 16 March 2020, the
National Library of Medicine
and others launched the COVID-19 Open Research Dataset (CORD-19) to enable
text mining
of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the
Allen Institute for AI. Other participants include
Google
,
Microsoft Research
, the
Center for Security and Emerging Technology
, and the
Chan Zuckerberg Initiative
.
Resources
Corpora
The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as
MeSH
. Items marked "Yes" under "Freely Available" can be downloaded from a publicly accessible location.
Word embeddings
Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as
word vectors or word embeddings. Sources of pre-trained embeddings specific for biomedical vocabulary are listed in the table below. The majority are results of the
word2vec
model developed by Mikolov ''et al.'' or variants of word2vec.
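Once words are mapped to vectors, relatedness between terms is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; pre-trained biomedical embeddings typically have hundreds of dimensions, and the values here are not taken from any real model.

```python
import math

# Made-up vectors for illustration; not drawn from any trained embedding.
EMBEDDINGS = {
    "aspirin":     [0.9, 0.1, 0.0],
    "ibuprofen":   [0.8, 0.2, 0.1],
    "kinase":      [0.1, 0.9, 0.2],
    "phosphatase": [0.0, 0.8, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings=EMBEDDINGS):
    """Return the other vocabulary word with the highest cosine similarity."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("aspirin"))
print(nearest("kinase"))
```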
Applications

Text mining applications in the biomedical field include computational approaches to assist with studies in
protein docking,
protein interactions, and protein-disease associations.
Text mining techniques have several advantages over traditional manual curation for identifying associations. Text mining algorithms can identify and extract information from a vast amount of literature more efficiently than manual curation, and can integrate data from different sources, including literature, databases, and experimental results. These algorithms have transformed the process of identifying and prioritizing novel genes and gene-disease associations that were previously overlooked.


These methods facilitate systematic searches of overlooked scientific and biomedical literature that may reveal significant associations between research findings. Combining information in this way can lead to new discoveries and hypotheses, especially when datasets are integrated. The quality of a database is as important as its size. Text mining resources such as iProLINK (integrated Protein Literature Information and Knowledge) have been developed to curate data sources that can aid text mining research in areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. Curated databases such as UniProt can improve access to targeted information not only for genetic sequences, but also for literature and phylogeny.
Gene cluster identification
Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.
Protein interactions
Automatic extraction of protein interactions and associations of proteins to functional concepts (e.g.
gene ontology
terms) has been explored. The search engine PIE was developed to identify and return protein-protein interaction mentions from
MEDLINE
-indexed articles. The extraction of kinetic parameters from text and the identification of the
subcellular location of proteins have also been addressed by information extraction and text mining technology.
Gene-disease associations
Computational gene prioritization is an essential step in understanding the genetic basis of diseases, particularly within
genetic linkage
analysis. Text mining and other computational tools extract relevant information, including gene-disease associations, among others, from numerous data sources, then apply different
ranking algorithms to prioritize the genes based on their relevance to the specific disease. Text mining and gene prioritization allow researchers to focus their efforts on the most promising candidates for further research.
Computational tools for gene prioritization continue to be developed and analyzed. One group studied the performance of various text-mining techniques for disease gene prioritization. They investigated different domain vocabularies, text representation schemes, and ranking algorithms in order to find the best approach for identifying disease-causing genes to establish a
benchmark.
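As a minimal sketch of literature-based gene prioritization, candidate genes can be ranked by how often their mentions co-occur with a disease term. The gene names, the abstracts, and the scoring rule (fraction of a gene's mentioning abstracts that also mention the disease) are hypothetical stand-ins; the benchmarked approaches described above use richer vocabularies, text representations, and ranking algorithms.

```python
from collections import Counter

def prioritize(candidate_genes, disease, abstracts):
    """Rank genes by the fraction of their mentions that co-occur with the disease."""
    mentions, co_mentions = Counter(), Counter()
    for abstract in abstracts:
        text = abstract.lower()
        for gene in candidate_genes:
            if gene.lower() in text:  # naive substring matching
                mentions[gene] += 1
                if disease.lower() in text:
                    co_mentions[gene] += 1
    scored = [
        (co_mentions[g] / mentions[g] if mentions[g] else 0.0, g)
        for g in candidate_genes
    ]
    # Highest score first; ties are broken alphabetically.
    return [g for score, g in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical gene names and abstracts, for illustration only.
abstracts = [
    "geneA variants were enriched in cardiomyopathy cohorts.",
    "geneA expression in cardiomyopathy models.",
    "geneB regulates yeast budding.",
    "geneB was also reported in one cardiomyopathy family.",
    "geneC controls leaf shape.",
]
print(prioritize(["geneC", "geneB", "geneA"], "cardiomyopathy", abstracts))
```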
Gene-trait associations
An agricultural genomics group identified genes related to
bovine
reproductive traits using text mining, among other approaches.
Applications of phrase mining to disease associations
A text mining study assembled a collection of 709 core
extracellular matrix proteins
and associated proteins based on two databases:
MatrixDB (matrixdb.univ-lyon1.fr) and
UniProt
. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), to semantically score all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The text mining study validated existing relationships and pointed to previously unrecognized biological processes in cardiovascular pathophysiology.
Software tools
Search engines
Search engines designed to
retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific for research literature include
PubMed
search,
Europe PubMed Central
search, GeneView, and APSE. Similarly, search engines and indexing systems specific for biomedical data have been developed, including DataMed and OmicsDI.
Some search engines, such as Essie, OncoSearch,
PubGene, and
GoPubMed were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.
Medical record analysis systems
Electronic medical records (EMRs) and electronic health records (EHRs) are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text and difficult to search, leading to challenges with patient care. Numerous complete systems and tools have been developed to analyse these free-text portions. The MedLEE system was originally developed for analysis of chest radiology reports but was later extended to other report topics. The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts. The CLAMP system offers similar functionality with a user-friendly interface.
Frameworks
Computational frameworks have been developed to rapidly build tools for biomedical text mining tasks. SwellShark is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision (e.g.,
UMLS semantic types). The SparkText framework uses
Apache Spark
data streaming, a
NoSQL
database, and basic
machine learning
methods to build
predictive models from scientific articles.
APIs
Some biomedical text mining and natural language processing tools are available through
application programming interfaces, or APIs. NOBLE Coder performs concept recognition through an API.
Conferences
The following academic conferences and workshops host discussions and presentations on biomedical text mining advances. Most publish proceedings.
Journals
A variety of academic journals publishing manuscripts on biology and medicine include topics in text mining and natural language processing software. Some journals, including the ''Journal of the American Medical Informatics Association'' (JAMIA) and the ''Journal of Biomedical Informatics'', are popular publications for these topics.
Further reading
Biomedical Literature Mining Publications (BLIMP): A comprehensive and regularly updated index of publications on (bio)medical text mining
External links
Bio-NLP resources, systems and application database collection
The BioNLP mailing list archives
Corpora for biomedical text mining
The BioCreative evaluations of biomedical text mining technologies