Entity Extraction

Named-entity recognition (NER), also known as (named) entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

Most research on NER/NEE systems has been structured as taking an unannotated block of text and producing an annotated block of text that highlights the names of entities. A typical annotation might detect and classify a person name consisting of one token, a two-token company name, and a temporal expression.

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored 93.39% F-measure, while human annotators scored 97.60% and 96.95%.
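
The step from unannotated to annotated text can be sketched as follows. This is only an illustration: the sentence, the `Entity` class, the `annotate` helper, the label names, and the bracketed output format are all assumptions made for this example, not a format defined by MUC-7 or by any particular NER system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int   # index of the first token in the span
    end: int     # index one past the last token in the span
    label: str   # entity category, e.g. "PERSON"


def annotate(tokens, entities):
    """Return the text with each entity span wrapped as [LABEL ...]."""
    starts = {e.start: e for e in entities}
    out, i = [], 0
    while i < len(tokens):
        ent = starts.get(i)
        if ent is None:
            # Ordinary token: copy it through unchanged.
            out.append(tokens[i])
            i += 1
        else:
            # Entity span: emit all of its tokens inside one bracket.
            out.append("[" + ent.label + " " + " ".join(tokens[ent.start:ent.end]) + "]")
            i = ent.end
    return " ".join(out)


# Hypothetical sentence and labels, chosen only for illustration.
tokens = "Alice joined Example Corp in 2015".split()
entities = [
    Entity(0, 1, "PERSON"),        # one-token person name
    Entity(2, 4, "ORGANIZATION"),  # two-token company name
    Entity(5, 6, "TIME"),          # temporal expression
]
annotated = annotate(tokens, entities)
print(annotated)
```

Representing each entity as a token span plus a label keeps detection (where the name is) separate from classification (what kind of name it is).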


Named-entity recognition platforms

Notable NER platforms include:
* GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API.
* OpenNLP includes rule-based and statistical named-entity recognition.
* spaCy features fast statistical NER as well as an open-source named-entity visualizer.


Problem definition

In the expression ''named entity'', the word ''named'' restricts the task to those entities for which one or many strings, such as words or phrases, stand (fairly) consistently for some referent. This is closely related to rigid designators, as defined by Kripke, although in practice NER deals with many names and referents that are not philosophically "rigid". For instance, the ''automotive company created by Henry Ford in 1903'' can be referred to as ''Ford'' or ''Ford Motor Company'', although "Ford" can refer to many other entities as well (see Ford). Rigid designators include proper names as well as terms for certain biological species and substances, but exclude pronouns (such as "it"; see coreference resolution), descriptions that pick out a referent by its properties (see also De dicto and de re), and names for kinds of things as opposed to individuals (for example "Bank").

Full named-entity recognition is often broken down, conceptually and possibly also in implementations, into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, or location). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking. The second phase requires choosing an ontology by which to organize categories of things.

Temporal expressions and some numerical expressions (e.g., money, percentages) may also be considered as named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in "June"). In the first case, the year ''2001'' refers to the ''2001st year of the Gregorian calendar''. In the second case, the month ''June'' may refer to the month of an undefined year (''past June'', ''next June'', ''every June'', etc.). It is arguable that the definition of ''named entity'' is loosened in such cases for practical reasons. The definition of the term ''named entity'' is therefore not strict and often has to be explained in the context in which it is used.

Certain hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for ''question answering'' and consist of 29 types and 64 subtypes. Sekine's extended hierarchy, proposed in 2002, is made of 200 subtypes (Sekine's Extended Named Entity Hierarchy, nlp.cs.nyu.edu, retrieved on 2013-07-21). More recently, in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.
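
The flat, non-nested segmentation view described above is commonly encoded with per-token BIO tags (B = begins a name, I = inside a name, O = outside any name). The following is a minimal sketch under that assumption; the function name `spans_to_bio`, the exclusive `end` index, and the label `ORG` are choices made for this illustration rather than anything prescribed by the text.

```python
def spans_to_bio(n_tokens, spans):
    """Convert non-overlapping (start, end, label) spans to BIO tags.

    `end` is exclusive. Raises ValueError if two spans overlap, since the
    flat segmentation view does not allow nesting.
    """
    tags = ["O"] * n_tokens
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = "B-" + label          # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the name
    return tags


tokens = ["She", "works", "at", "Bank", "of", "America", "."]
# "Bank of America" is one flat ORG span; the inner name "America" is ignored.
bio_tags = spans_to_bio(len(tokens), [(3, 6, "ORG")])
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Because every token receives exactly one tag, a nested annotation such as "America" inside "Bank of America" cannot be expressed, which is why the sketch rejects overlapping spans instead of silently overwriting them.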


Formal evaluation

To evaluate the quality of an NER system's output, several measures have been defined. The usual measures are called precision, recall, and F1 score. However, several issues remain in just how to calculate those values.

These statistical measures work reasonably well for the obvious cases of finding or missing a real entity exactly, and for finding a non-entity. However, NER can fail in many other ways, many of which are arguably "partially correct" and should not be counted as complete successes or failures. For example, identifying a real entity, but:
* with fewer tokens than desired (for example, missing the last token of "John Smith, M.D.")
* with more tokens than desired (for example, including the first word of "The University of MD")
* partitioning adjacent entities differently (for example, treating "Smith, Jones Robinson" as 2 vs. 3 entities)
* assigning it a completely wrong type (for example, calling a personal name an organization)
* assigning it a related but inexact type (for example, "substance" vs. "drug", or "school" vs. "organization")
* correctly identifying an entity, when what the user wanted was a smaller- or larger-scope entity (for example, identifying "James Madison" as a personal name, when it's part of "James Madison University").

Some NER systems impose the restriction that entities may never overlap or nest, which means that in some cases one must make arbitrary or task-specific choices.

One overly simple method of measuring accuracy is merely to count what fraction of all tokens in the text were correctly or incorrectly identified as part of entity references (or as being entities of the correct type). This suffers from at least two problems: first, the vast majority of tokens in real-world text are not part of entity names, so the baseline accuracy (always predict "not an entity") is extravagantly high, typically >90%; and second, mispredicting the full span of an entity name is not properly penalized (finding only a person's first name when his last name follows might be scored as ½ accuracy).

In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:
* Precision is the fraction of predicted entity name spans that line up ''exactly'' with spans in the gold standard evaluation data. I.e. when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.
* Recall is similarly the fraction of names in the gold standard that appear at exactly the same location in the predictions.
* F1 score is the harmonic mean of these two.

It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class, is a hard error and does not contribute positively to either precision or recall. Thus, this measure may be said to be pessimistic: it can be the case that many "errors" are close to correct, and might be adequate for a given purpose. For example, one system might always omit titles such as "Ms." or "Ph.D.", but be compared to a system or ground-truth data that expects titles to be included. In that case, every such name is treated as an error. Because of such issues, it is important actually to examine the kinds of errors, and decide how important they are given one's goals and requirements.

Evaluation models based on token-by-token matching have been proposed. Such models may give partial credit for overlapping matches (such as using the Intersection over Union criterion). They allow a finer-grained evaluation and comparison of extraction systems.
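
The exact-match scoring and the partial-credit alternative described above can be sketched as follows. This is an illustrative sketch, not the official CoNLL scorer: the function names, the `(start, end, label)` span format with an exclusive `end`, and the decision to give zero overlap credit when labels differ are all assumptions made here.

```python
def exact_match_prf(gold, predicted):
    """Precision, recall and F1 where only exact (start, end, label) matches count."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    # Harmonic mean of precision and recall; defined as 0 when nothing matches.
    f1 = (2 * precision * recall / (precision + recall)) if true_pos else 0.0
    return precision, recall, f1


def span_iou(a, b):
    """Token-level intersection over union of two (start, end, label) spans."""
    if a[2] != b[2]:
        return 0.0  # assumption: no partial credit across different labels
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union else 0.0


# Gold: "Hans Blick" as one PER span (tokens 0-1) plus a hypothetical LOC at token 5.
gold = [(0, 2, "PER"), (5, 6, "LOC")]
# Prediction splits the person name into two one-token PER spans.
predicted = [(0, 1, "PER"), (1, 2, "PER"), (5, 6, "LOC")]

precision, recall, f1 = exact_match_prf(gold, predicted)
print(f"exact match: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print("IoU of 'Hans' against 'Hans Blick':", span_iou(gold[0], predicted[0]))
```

Under exact matching, both halves of the split name count as errors and the gold name counts as missed, whereas the overlap measure still credits the half-covered span; which behaviour is preferable depends on the goals of the evaluation.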


Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models such as machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semisupervised approaches have been suggested to avoid part of the annotation effort. Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.
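
Sequence models such as linear-chain conditional random fields are usually decoded by searching for the best-scoring tag sequence as a whole, typically with the Viterbi algorithm, instead of choosing each token's tag independently. The sketch below shows only that decoding step. It is not a trained CRF: the tag set, the `transition_score` rule, and all emission scores are hand-set, hypothetical values chosen to make the effect visible.

```python
import math

TAGS = ["O", "B-PER", "I-PER"]


def transition_score(prev_tag, tag):
    """Hypothetical transition scores: I-PER may not directly follow O."""
    if tag == "I-PER" and prev_tag == "O":
        return -math.inf
    return 0.0


def viterbi(emissions):
    """Return the highest-scoring tag sequence.

    `emissions` holds one dict per token, mapping each tag to a score.
    """
    # best[i][tag] = score of the best path that ends in `tag` at position i.
    # The start of the sentence is treated like a preceding "O" tag.
    best = [{t: emissions[0][t] + transition_score("O", t) for t in TAGS}]
    back = []
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transition_score(p, tag))
            column[tag] = best[-1][prev] + transition_score(prev, tag) + scores[tag]
            pointers[tag] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


tokens = ["Hans", "Blick", "spoke"]
# Hand-set scores for illustration only; a real system would learn them.
emissions = [
    {"O": 1.0, "B-PER": 0.8, "I-PER": 0.0},
    {"O": 0.2, "B-PER": 0.5, "I-PER": 2.0},
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.1},
]
greedy = [max(TAGS, key=lambda t: scores[t]) for scores in emissions]
decoded = viterbi(emissions)
print("greedy :", greedy)
print("viterbi:", decoded)
```

With these scores, per-token choice yields an I-PER tag directly after O, which is not a well-formed name, while joint decoding recovers a single two-token person span.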


Problem domains

In 2001, research indicated that even state-of-the-art NER systems were brittle, meaning that NER systems developed for one domain did not typically perform well on other domains. Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts from conversational telephone speech.

Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entity of interest in that domain has been names of genes and gene products. There has also been considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.


Current challenges and research

Despite high F1 numbers reported on the MUC-7 dataset, the problem of named-entity recognition is far from being solved. The main efforts are directed at reducing the annotation labor by employing semi-supervised learning, achieving robust performance across domains, and scaling up to fine-grained entity types. In recent years, many projects have turned to crowdsourcing, which is a promising solution to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER. Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries.

Some researchers have compared the NER performance of different statistical models such as HMM (hidden Markov model), ME (maximum entropy), and CRF (conditional random fields), and of different feature sets. Some researchers have also recently proposed a graph-based semi-supervised learning model for language-specific NER tasks.

A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia can be seen as an instance of extremely fine-grained named-entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

Michael Jordan is a professor at Berkeley

Another field that has seen progress but remains challenging is the application of NER to Twitter and other microblogs, considered "noisy" due to non-standard orthography, shortness and informality of texts. NER challenges in English Tweets have been organized by research communities to compare performances of various approaches, such as bidirectional LSTMs, Learning-to-Search, or CRFs.


See also

* Controlled vocabulary
* Coreference resolution
* Entity linking (aka named entity normalization, entity disambiguation)
* Information extraction
* Knowledge extraction
* Onomastics
* Record linkage
* Smart tag (Microsoft)

