Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs.
The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians, and is at the base of the work of biological databases.
Biocuration as a profession

A biocurator is a professional scientist who curates, collects, annotates, and validates information that is disseminated by biological and model organism databases.
It is a relatively new profession: the first mentions in the scientific literature date from 2006, in the context of work on databases such as the Immune Epitope Database and Analysis Resource.
Biocurators are usually PhD-level scientists with a mix of experience in wet lab work and in computational representations of knowledge (e.g. via ontologies).
The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database interoperability. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories.
Biocurators are present in diverse research environments, but may not self-identify as biocurators. Projects such as ELIXIR (the European life-sciences Infrastructure for biological Information) and GOBLET (Global Organization for Bioinformatics Learning, Education and Training) promote training and support biocuration as a career path.
In 2011, biocuration was already recognized as a profession, but there were no formal degree courses to prepare curators for biological data in a targeted fashion. With the growth of the field, the University of Cambridge and the EMBL-EBI started to jointly offer a Postgraduate Certificate in Biocuration, considered a step towards recognising biocuration as a discipline in its own right. There is a perceived increase in demand for biocuration, and a need for graduate programs to offer additional biocuration training.
Organizations that employ biocurators, such as the Clinical Genome Resource (ClinGen), often provide specialized materials and training for biocuration.
Biological knowledgebases
The role of biocurators is best known in the field of biological knowledgebases. Such databases, like UniProt and PDB, rely on professional biocurators to organize information. Among other things, biocurators work to improve data quality, for example by merging duplicated entries.
An important part of those knowledgebases are model organism databases, which rely on biocurators to curate information regarding organisms of particular kinds. Some notable examples of model organism databases are FlyBase, PomBase, and ZFIN, dedicated to curating information about ''Drosophila'', ''Schizosaccharomyces pombe'' and zebrafish, respectively.
Curation and annotation
Biocuration is the integration of biological information into on-line databases in a semantically standardized way, using appropriate unique traceable identifiers, and providing necessary metadata including source and provenance.
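The elements named in this definition (traceable identifiers plus source and provenance metadata) can be pictured as the fields of a single curated record. The following is a minimal sketch rather than the schema of any particular database: the class name, field names and the `validate` helper are hypothetical, and the publication and curator identifiers are placeholders, not real references.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CuratedAnnotation:
    """One curated statement. All field names are illustrative assumptions."""
    subject_id: str     # identifier of the curated entity, e.g. a protein
    term_id: str        # identifier of the ontology term applied to it
    evidence_code: str  # short code for the kind of supporting evidence
    source: str         # identifier of the publication the statement comes from
    curator: str        # who made the annotation (provenance)
    curated_on: str     # ISO 8601 date of the curation event (provenance)


def is_prefixed_identifier(identifier: str) -> bool:
    """Accept identifiers written as 'prefix:local', such as 'GO:0006915'."""
    prefix, separator, local_part = identifier.partition(":")
    return bool(prefix and separator and local_part) and " " not in identifier


def validate(annotation: CuratedAnnotation) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field_name in ("subject_id", "term_id", "source"):
        value = getattr(annotation, field_name)
        if not is_prefixed_identifier(value):
            problems.append(f"{field_name} is not a prefixed identifier: {value!r}")
    if not annotation.curator:
        problems.append("curator is missing, so provenance is incomplete")
    return problems


# Illustrative record only; the PMID and ORCID values are placeholders.
example = CuratedAnnotation(
    subject_id="UniProtKB:P04637",
    term_id="GO:0006915",
    evidence_code="IDA",
    source="PMID:0000000",
    curator="ORCID:0000-0000-0000-0000",
    curated_on="2021-01-01",
)

if __name__ == "__main__":
    print(validate(example))
```

Keeping the source and the curator as mandatory fields is one simple way to make every statement traceable back to the evidence and to the person who recorded it.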
Ontologies, controlled vocabularies and standard names
Biocurators commonly employ and take part in the creation and development of shared biomedical ontologies: structured, controlled vocabularies that encompass many biological and medical knowledge domains, such as the Open Biomedical Ontologies. These domains include genomics and proteomics, anatomy, animal and plant development, biochemistry, metabolic pathways, taxonomic classification, and mutant phenotypes. Given the variety of existing ontologies, there are guidelines that orient researchers on how to choose a suitable one.
The Unified Medical Language System is one such system; it integrates and distributes millions of terms used in the life sciences domain.
Biocurators enforce the consistent use of gene nomenclature guidelines and participate in the genetic nomenclature committees of various model organisms, often in collaboration with the HUGO Gene Nomenclature Committee (HGNC). They also enforce other nomenclature guidelines like those provided by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB), one example of which is the Enzyme Commission (EC) number.
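As an illustration of how a nomenclature standard can be checked mechanically, the sketch below parses an EC number into its four hierarchical levels and reports the top-level enzyme class. The function name and the returned dictionary layout are assumptions made for this example; accepting `-` for unspecified trailing levels reflects a common convention for partial EC numbers, and provisional serial numbers used by some databases are not handled.

```python
import re

# Top-level EC classes, keyed by the first number of an EC identifier.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}

# Optional "EC" prefix, then four dot-separated levels; '-' marks an unspecified level.
EC_PATTERN = re.compile(r"^(?:EC\s*)?([1-7])\.(\d+|-)\.(\d+|-)\.(\d+|-)$")


def parse_ec_number(text: str) -> dict:
    """Split an EC number into levels; raise ValueError if it is malformed."""
    match = EC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid EC number: {text!r}")
    levels = match.groups()
    # Once a level is unspecified, every later level must be unspecified too.
    seen_dash = False
    for level in levels[1:]:
        if level == "-":
            seen_dash = True
        elif seen_dash:
            raise ValueError(f"specified level after '-' in: {text!r}")
    return {
        "levels": levels,
        "enzyme_class": EC_CLASSES[int(levels[0])],
        "is_partial": "-" in levels,
    }


if __name__ == "__main__":
    print(parse_ec_number("EC 3.4.21.4"))
```

A check of this kind only confirms that an identifier is well formed; whether the number is actually assigned still has to be verified against the official enzyme list.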
More generally, the use of persistent identifiers is praised by the community, so as to improve clarity and facilitate knowledge
DNA annotation
In genome annotation, for example, the identifiers defined by ontologists and consortia are used to describe parts of the genome. The Gene Ontology (GO), for instance, curates terms for biological processes, which are used to describe what is known about specific genes.
Text annotation
As of 2021, life sciences communication is still done primarily via free natural languages, like English or German, which hold a degree of ambiguity and make it hard to connect knowledge. So, besides annotating biological sequences, biocurators also annotate texts, linking words to unique identifiers. This aids in disambiguation, clarifying the meaning intended, and making the texts processable by computers. One application of text annotation is to specify the exact gene a scientist is referring to.
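Linking words to unique identifiers can be sketched as a dictionary lookup over the text. This is a deliberately simple illustration, not how any specific annotation tool works: the `LEXICON` contents and the `annotate` function are assumptions for the example, and the identifiers shown should be checked against the HGNC before being relied on.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon mapping surface forms to gene identifiers.
# Two different spellings point to the same identifier, which is what
# makes the annotation useful for disambiguation.
LEXICON: Dict[str, str] = {
    "TP53": "HGNC:11998",
    "p53": "HGNC:11998",
    "BRCA1": "HGNC:1100",
}

Annotation = Tuple[int, int, str, str]  # (start, end, matched text, identifier)


def annotate(text: str, lexicon: Dict[str, str]) -> List[Annotation]:
    """Return character-offset annotations for every lexicon term in the text."""
    annotations: List[Annotation] = []
    for surface_form, identifier in lexicon.items():
        # Whole-word, case-insensitive match. Real gene symbols are often
        # case-sensitive and ambiguous, so a curator would still review hits.
        pattern = re.compile(r"\b" + re.escape(surface_form) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append((match.start(), match.end(), match.group(0), identifier))
    return sorted(annotations)


if __name__ == "__main__":
    sentence = "Mutations in p53 and BRCA1 were studied."
    for start, end, matched, identifier in annotate(sentence, LEXICON):
        print(start, end, matched, identifier)
```

Storing character offsets next to the identifier keeps the annotation separate from the text itself, so the original sentence never has to be modified.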
Publicly available text annotations make it possible for biologists to take further advantage of biomedical text. Europe PMC has an Application Programming Interface which centralizes text annotations from a variety of sources and makes them available in a Graphical User Interface called SciLite. PubTator Central also provides annotations, but is fully based on computerized text-mining and does not provide a user interface. There are also programs that allow users to manually annotate the biomedical texts they are interested in, such as the ezTag system.
Variant Curation
A type of biocuration within the field of medical genetics, variant curation is a process for assessing genetic changes according to the likelihood that they may cause disease.
This is an evidence-based process that uses data from a multitude of sources. These sources can include population data, computational data, functional data, segregation data, ''de novo'' data, and allelic data, among others.
It is a collaborative process that can be automated; however, manual curation is considered the gold standard.
There is no single standardised process of variant curation; different researchers and organisations use different variant curation processes.
However, a set of internationally accepted standards and guidelines for the interpretation of genetic variants has been jointly developed by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.
These are known as the ACMG/AMP guidelines. They provide a framework for classifying genetic variants as “pathogenic”, “likely pathogenic”, “uncertain significance”, “likely benign” or “benign”, in order from most likely to least likely to cause disease. The guidelines also define levels of evidence: very strong, strong, moderate and supporting. The combination of the types of evidence found, and the levels at which those pieces of evidence exist, allows each variant to be classified along the scale from "pathogenic" to "benign".
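How counts of evidence at each level can combine into one of the five classes is sketched below. This is an illustration only: the function and parameter names are hypothetical, and the thresholds are an assumption based on the combining rules published with the 2015 ACMG/AMP guidelines, so they should be verified against the guideline itself. The sketch also ignores later refinements and gene-specific adaptations, and is not suitable for clinical use.

```python
def classify_variant(very_strong: int = 0, strong: int = 0, moderate: int = 0,
                     supporting: int = 0, benign_standalone: int = 0,
                     benign_strong: int = 0, benign_supporting: int = 0) -> str:
    """Combine counts of evidence items into a five-tier classification.

    The first four arguments count pathogenic evidence by strength; the last
    three count benign evidence. Thresholds are assumed from the 2015
    ACMG/AMP combining rules and must be checked against the source.
    """
    pathogenic = (
        (very_strong >= 1 and (strong >= 1 or moderate >= 2
                               or (moderate >= 1 and supporting >= 1)
                               or supporting >= 2))
        or strong >= 2
        or (strong == 1 and (moderate >= 3
                             or (moderate == 2 and supporting >= 2)
                             or (moderate == 1 and supporting >= 4)))
    )
    likely_pathogenic = (
        (very_strong >= 1 and moderate == 1)
        or (strong == 1 and 1 <= moderate <= 2)
        or (strong == 1 and supporting >= 2)
        or moderate >= 3
        or (moderate == 2 and supporting >= 2)
        or (moderate == 1 and supporting >= 4)
    )
    benign = benign_standalone >= 1 or benign_strong >= 2
    likely_benign = ((benign_strong == 1 and benign_supporting >= 1)
                     or benign_supporting >= 2)

    # Conflicting pathogenic and benign evidence leaves the variant unresolved.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "uncertain significance"
    if pathogenic:
        return "pathogenic"
    if likely_pathogenic:
        return "likely pathogenic"
    if benign:
        return "benign"
    if likely_benign:
        return "likely benign"
    return "uncertain significance"


if __name__ == "__main__":
    print(classify_variant(very_strong=1, strong=1))
    print(classify_variant(moderate=3))
    print(classify_variant())
```

Treating contradictory evidence as "uncertain significance" mirrors the idea that a classification needs consistent support in one direction; a curator would then review such variants manually.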
International Society for Biocuration (ISB)
The International Society for Biocuration (ISB) is a non-profit organisation that "promotes the field of biocuration and provides a forum for information exchange through meetings and workshops." It grew out of the International Biocuration Conferences and was founded in early 2009.
The ISB offers two awards to biocurators in the community: the Biocuration Career Award (given annually) and the ISB Award for Exceptional Contributions to Biocuration (given biannually).
The official journal of the ISB, ''Database'', is a venue specialized in articles about databases and biocuration.
Community curation
Traditionally, biocuration has been done by dedicated experts, who integrate data into databases. Community curation has emerged as a promising approach to improve the dissemination of knowledge from published data and provide a cost-effective way to improve the scalability of biocuration. In some cases, community help is leveraged in jamborees that introduce domain experts to curation tasks carried out during the event, while other efforts rely on asynchronous contributions from experts and non-experts.
Biological databases

Several biological databases include author contributions in their functional curation strategy to some extent, which may range from associating gene identifiers with publications or free-text, to more structured and detailed annotation of sequences and functional data, outputting curation to the same standards as professional biocurators. Most community curation at model organism databases involves annotation by original authors of published research (first-pass annotation) to effectively obtain accurate identifiers for objects to be curated, or identify data-types for detailed curation. For example:
* WormBase successfully solicits first-pass annotation from users and has integrated author curation with the micropublication process. WormBase also integrates text-mining into its platform, providing suggestions to community curators.
* FlyBase sends email requests to authors of new publications, inviting them to list the genes and data types described via an online tool, and has also mobilized the community to write gene summary paragraphs.
Other databases, such as PomBase, rely on publication authors to submit highly detailed, ontology-based annotations for their publications, and meta-data associated with genome-wide data-sets using controlled vocabularies. A web-based tool, Canto, was developed to facilitate community submissions. Since Canto is freely available, generic and highly configurable, it has been adopted by other projects. Curation is subject to review by professional curators, resulting in high-quality, in-depth curation of all molecular data-types.
The widely used UniProt knowledgebase also has a community curation mechanism that allows researchers to add information about proteins.
Wiki-style resources
Bio-wikis rely on their communities to provide content and a series of wiki-style resources are available for biocuration.
''AuthorReward'', for example, is an extension to MediaWiki that quantifies researchers' contributions to biology wikis. RiceWiki was an example of a wiki-based database for community curation of rice genes equipped with ''AuthorReward''. CAZypedia is another such wiki for community biocuration of information on carbohydrate-active enzymes (CAZymes).
WikiProteins/WikiProfessional was a project to semantically organize biological data, led by Barend Mons.
The 2007 project had direct contributions from Jimmy Wales, Wikipedia co-founder, and took Wikidata as an inspiration.
A currently active project that runs on an adaptation of MediaWiki software is WikiPathways, which crowdsources information about biological pathways.
Wikipedia
There is some overlap between the work of biocurators and Wikipedia, with boundaries between scientific databases and Wikipedia becoming increasingly blurred.
Databases like Rfam and the Protein Data Bank, for example, make heavy use of Wikipedia and its editors to curate information. However, most databases offer highly structured data that is searchable in complex combinations, which is usually not possible on Wikipedia, although Wikidata aims at solving this problem to some extent.
The Gene Wiki project used Wikipedia for collaborative curation of thousands of genes and gene products, such as titin and insulin. Several projects also employ Wikipedia as a platform for curation of medical information.
Another way that Wikipedia is used for biocuration is via its list articles. For example, the Comprehensive Antibiotic Resistance Database integrates its assessment of databases about antibiotic resistance into a particular Wikipedia list.
Wikidata
The Wikimedia knowledge base Wikidata is increasingly being used by the biocuration community as an integrative repository across life sciences. Wikidata is seen by some as an alternative with better prospects of maintenance and interoperability than smaller independent biological knowledge bases.
Wikidata has been used to curate information on SARS-CoV-2 and the COVID-19 pandemic, and by the Gene Wiki project to curate information about genes. Data from biocuration on Wikidata is reused on external resources via SPARQL queries. Some projects use curation via Wikidata as a path to improve life-sciences information on Wikipedia.
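Reuse through SPARQL can be illustrated with a small lookup that finds the Wikidata item carrying a given UniProt accession. This is a minimal sketch: the function names and the User-Agent string are assumptions made for the example, and it relies on `P352` being the Wikidata property for the UniProt protein ID and on the public query endpoint at `query.wikidata.org`. The network call only runs when the script is executed directly.

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_uniprot_lookup_query(uniprot_id: str) -> str:
    """Build a SPARQL query for items whose UniProt protein ID (P352) matches."""
    # Restrict the input to letters and digits so it cannot alter the query.
    if not re.fullmatch(r"[A-Za-z0-9]+", uniprot_id):
        raise ValueError(f"unexpected characters in accession: {uniprot_id!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:P352 "{uniprot_id}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def run_query(query: str) -> list:
    """Send the query to the public endpoint and return the result bindings."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url,
        headers={
            # Descriptive User-Agent; the value here is a placeholder.
            "User-Agent": "biocuration-example/0.1 (illustrative script)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    for row in run_query(build_uniprot_lookup_query("P04637")):
        print(row["item"]["value"], row["itemLabel"]["value"])
```

Separating query construction from execution makes the query text easy to inspect or paste into the Wikidata query service before it is used in a pipeline.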
Gamified resources
An approach to involve the crowd in biocuration is via gamified platforms that use game design principles to boost engagement. A few examples are:
* Mark2Cure, a gamified platform for community curation of biomedical abstracts
* Cochrane Crowd, a platform by Cochrane for curating clinical trials and for categorizing and summarizing biomedical literature.
* CIViC, a portal for annotation of genomic variants related to cancer, which tracks scores and keeps leaderboards.
* APICURON, a database to credit and acknowledge the work of biocurators, that collects and aggregates biocuration events from third party resources and generates achievements and leaderboards.
Computational text mining for curation
Natural-language processing and text mining technologies can help biocurators extract information for manual curation. Text mining can scale curation efforts, supporting the identification of gene names, for example, as well as the partial inference of ontologies. The conversion of unstructured assertions to structured information makes use of techniques like named entity recognition and parsing of dependencies. Text-mining of biomedical concepts faces challenges regarding variations in reporting, and the community is working to increase the machine-readability of articles.
During the COVID-19 pandemic, biomedical text mining was heavily used to cope with the large amount of published scientific research about the topic (over 50,000 articles).
The popular NLP Python package spaCy has a modification for biomedical texts, scispaCy, which is maintained by the Allen Institute for AI.
Among the challenges for text mining applied to biocuration is the difficulty of accessing the full texts of biomedical articles due to paywalls, which links the challenges of biocuration to those of the open-access movement.
A complementary approach to biocuration via text mining involves applying optical character recognition to biomedical figures, coupled to automatic annotation algorithms. This has been used to extract gene information from pathway figures, for example.
Suggestions to improve the written text to facilitate annotations range from using controlled natural languages to providing clear association of concepts (such as genes and proteins) with the particular species of interest.
While challenges remain, text mining is already an integral part of the workflow of biocuration in several biological knowledgebases.
BioCreative challenges
The BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) Challenge, now usually written BioCreative, is a community-wide effort to develop and evaluate text mining and information extraction systems for the life sciences. The challenge was first launched in 2004 and has since become an important event in the biocuration and bioinformatics communities.
The main goal of the challenge is to foster the development of advanced computational tools that can effectively extract information from the vast amount of biological data available.

The BioCreative Challenge is organized into several subtasks that cover various aspects of text mining and information extraction in the life sciences. These subtasks include gene normalization, relation extraction, entity recognition, and document classification. Participants in the challenge are provided with a set of annotated data to develop and test their systems, and their performance is evaluated based on various metrics, such as precision, recall, and F-score.
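The evaluation metrics named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the official BioCreative scoring procedure, which may differ in details such as averaging across documents. The function name `precision_recall_f1` and the example annotations are hypothetical. Each annotation is represented as a (document, identifier) pair, and a prediction counts as correct only if it matches a gold annotation exactly.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and balanced F-score over annotation sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold annotations that were recovered.
    recall = true_positives / len(gold) if gold else 0.0
    # F-score: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold-standard and system annotations.
gold = {
    ("doc1", "GENE:0001"),
    ("doc1", "GENE:0002"),
    ("doc2", "GENE:0003"),
    ("doc2", "GENE:0004"),
}
predicted = {
    ("doc1", "GENE:0001"),
    ("doc1", "GENE:0002"),
    ("doc2", "GENE:0009"),  # false positive
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this example two of the three predictions are correct and two of the four gold annotations are recovered, so precision is 2/3, recall is 1/2, and the F-score is 4/7.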
The BioCreative Challenge has led to the development of many innovative text mining and information extraction systems that have greatly improved the efficiency and accuracy of biocuration efforts. These systems have been integrated into many biocuration pipelines and have helped to speed up the curation process and enhance the quality of curated data.
See also
*AgBase
*Biological database
*Digital curation
*International Society for Biocuration
*Model Organism Database
*OBO Foundry
External links
*International Society for Biocuration
*BioCreative
*Online course on biocuration at EMBL-EBI