In geographic information systems, toponym resolution is the process of relating a toponym, i.e. the mention of a place, to an unambiguous spatial footprint of the same place. The places mentioned in digitized text collections constitute a rich data source for researchers in many disciplines. However, toponyms in language use are ambiguous, and it is difficult to assign them a definite real-world referent. Over time, established geographic names may change (as in "Byzantium" > "Constantinople" > "Istanbul"), or they may be reused verbatim ("Boston" in England, UK vs. "Boston" in Massachusetts, USA) or with modifications (as in "York" vs. "New York"). To map a set of place names or toponyms that occur in a document to their corresponding latitude/longitude coordinates, a polygon, or any other spatial footprint, a disambiguation step is necessary. A toponym resolution algorithm is an automatic method that performs a mapping from a toponym to a spatial footprint. Some methods for toponym resolution employ a gazetteer of possible mappings between names and spatial footprints.


Resolution process

The "unambiguous spatial footprint of the same place" in the definition may in fact be unambiguous, or "not so unambiguous". The resolution process can occur in several different ''contexts of uncertainty'':
* When the evidence is geographical and carries no uncertainty. For example, obtaining the country name for the place where a photo was taken, when the place is a GPS position (with 10 meters of error) 1000 km away from any country border.
* When the evidence is geographical, but with considerable uncertainty. Imagine a similar scenario where the GPS error is 100 meters and the place is about 100 meters from a country border.
* When the evidence is only textual. Imagine a letter in which the narrator is a tourist describing a trip after returning from vacation. The only evidence is textual, in the narrative.
* Mixed sources of evidence: more than one piece of evidence, none of them precise.
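
The difference between the first two contexts can be sketched as a comparison between the GPS error radius and the distance to the nearest border. The function name and the rule itself are illustrative assumptions, not a standard test:

```python
def is_unambiguous(gps_error_m: float, distance_to_border_m: float) -> bool:
    """Hypothetical rule: the evidence is treated as unambiguous only when the
    whole GPS error circle lies strictly on one side of the nearest border."""
    return gps_error_m < distance_to_border_m


# First scenario: 10 m of error, 1000 km from the border.
far_from_border = is_unambiguous(10, 1_000_000)
# Second scenario: 100 m of error, about 100 m from the border.
near_border = is_unambiguous(100, 100)
```

Under this rule the first scenario counts as certain and the second does not, because the error circle may reach across the border.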


From geographical evidence

Toponym resolution is sometimes a simple conversion from a name to an abbreviation, especially when the abbreviation is used as a standard geocode. An example is converting the official country name Afghanistan into its ISO country code, AF. In annotating media and metadata, conversion using a map and geographical evidence (e.g. GPS) is the most usual approach to obtaining a toponym, or a geocode that represents the toponym.
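
A name-to-geocode conversion of this kind can be sketched as a table lookup. The table below is a small hand-built excerpt keyed by common English country names, and the function name is hypothetical. A real system would load the full ISO 3166 list:

```python
from typing import Optional

# Hand-built excerpt of ISO 3166-1 alpha-2 codes (illustrative, not complete).
ISO_ALPHA2 = {
    "afghanistan": "AF",
    "canada": "CA",
    "yemen": "YE",
}


def country_to_geocode(name: str) -> Optional[str]:
    """Return the two-letter code for a country name, or None if it is unknown."""
    return ISO_ALPHA2.get(name.strip().casefold())


code = country_to_geocode("Afghanistan")
```

Normalizing case and whitespace before the lookup is a design choice for this sketch. It does not handle alternative spellings or official long names.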


From textual evidence

In contrast to geocoding of postal addresses, which are typically stored in structured database records, toponym resolution is typically applied to large unstructured text document collections to associate the locations mentioned in them with maps. If some of those text documents are geotagged (e.g. because they are micro-blog posts with latitude and longitude automatically added), they can be used to infer the varying geographical specificity of arbitrary terms, e.g. "cable car" or "high tide". The process of annotating media (e.g., image, text, video) using spatial footprints is known as geotagging.

In order to automatically geotag a text document, the following steps are usually undertaken: ''toponym recognition'' (i.e., spotting textual references to geographic locations) and ''toponym resolution'' (i.e., selecting an appropriate location interpretation for each geographic reference). ''Toponym recognition'' can be considered a special case of named-entity recognition where the objective is merely to derive location entities. The result of named-entity recognition can be further improved using hand-crafted or statistical rules. For obtaining location interpretations, ''resolution'' models tend to leverage gazetteers (i.e., huge databases of locations) such as GeoNames and OpenStreetMap.

A naive approach to resolving toponyms is to pick the most populated interpretation from the list of candidates. This approach seems viable for text in which the toponyms ''Toronto'' and ''London'' refer to their most common interpretations, located in Canada and Britain respectively. In a piece from a news article written in a localized context, however, it fails to pinpoint the toponym ''London'' as the city located in Ontario, Canada. Hence, selecting the highest population cannot work well for toponyms in a localized context.

Additionally, ''toponym resolution'' does not address metonymy in general. Nonetheless, a resolution technique can still disambiguate a metonymic reference as long as it is identified as a toponym in the recognition phase. For instance, in a sentence where ''Canada'' is used as a metonym for "the government of Canada", a generic named-entity recognizer can still identify it as a location, and a toponym resolver is thus able to disambiguate it.
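
The naive population-based approach can be sketched against a toy gazetteer. The entries, coordinates, and population figures below are rough illustrative values, not data taken from GeoNames or OpenStreetMap, and all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    region: str
    lat: float
    lon: float
    population: int  # rough, illustrative figure


# Toy gazetteer: lower-cased toponym -> possible interpretations.
GAZETTEER = {
    "london": [
        Candidate("London", "England, UK", 51.51, -0.13, 8_800_000),
        Candidate("London", "Ontario, Canada", 42.98, -81.25, 420_000),
    ],
    "toronto": [
        Candidate("Toronto", "Ontario, Canada", 43.65, -79.38, 2_800_000),
        Candidate("Toronto", "Ohio, USA", 40.46, -80.60, 5_000),
    ],
}


def resolve_by_population(toponym: str) -> Optional[Candidate]:
    """Pick the most populated interpretation, ignoring all document context."""
    candidates = GAZETTEER.get(toponym.casefold(), [])
    return max(candidates, key=lambda c: c.population, default=None)


choice = resolve_by_population("London")
```

Because the function never looks at the surrounding text, it returns the English interpretation of ''London'' for every document, which is the failure in localized contexts described above.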


Approaches

Toponym resolution methods can be generally divided into supervised and unsupervised models. Supervised methods typically cast the problem as a learning task in which the model first extracts contextual and non-contextual features, and a classifier is then trained on a labelled dataset. The Adaptive model is one of the prominent models proposed for resolving toponyms. For each interpretation of a toponym, the model derives context-sensitive features based on geographical proximity and sibling relationships with other interpretations. In addition to context-related features, the model benefits from context-free features, including population and audience location.

Unsupervised models, on the other hand, do not require annotated data. They are superior to supervised models when the annotated corpus is not sufficiently large and supervised models may not generalize well. Unsupervised models tend to better exploit the interplay of toponyms mentioned in a document. The Context-Hierarchy Fusion model estimates the geographic scope of documents and leverages the connections between nearby place names as evidence to resolve toponyms. By mapping the problem to a conflict-free set cover problem, this model achieves a coherent and robust resolution.

Furthermore, adopting Wikipedia and knowledge bases has been shown to be effective in toponym resolution. TopoCluster models the geographical senses of words by incorporating Wikipedia pages of locations and disambiguates toponyms using the spatial senses of the words in the text.
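
The idea of using geographical proximity between co-occurring toponyms as evidence can be illustrated with a simple heuristic. This sketch is not the Adaptive, Context-Hierarchy Fusion, or TopoCluster model. It picks the combination of interpretations with the smallest total pairwise distance, and its candidate table uses approximate, illustrative coordinates:

```python
import math
from itertools import combinations, product

# Toy candidate table: toponym -> list of (region, latitude, longitude).
CANDIDATES = {
    "London": [("England, UK", 51.51, -0.13), ("Ontario, Canada", 42.98, -81.25)],
    "Windsor": [("England, UK", 51.48, -0.61), ("Ontario, Canada", 42.31, -83.04)],
    "Sarnia": [("Ontario, Canada", 42.97, -82.40)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve_by_proximity(toponyms):
    """Choose one interpretation per toponym so that the sum of pairwise
    distances is minimal. Exhaustive search: only suitable for short lists."""
    best, best_cost = None, math.inf
    for combo in product(*(CANDIDATES[t] for t in toponyms)):
        cost = sum(
            haversine_km(a[1], a[2], b[1], b[2]) for a, b in combinations(combo, 2)
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    return {t: c[0] for t, c in zip(toponyms, best)}


resolved = resolve_by_proximity(["London", "Windsor", "Sarnia"])
```

With ''Sarnia'' in the same document, the Ontario interpretations of ''London'' and ''Windsor'' are selected. Without it, the two English places are closer together and win instead, which shows how strongly such a heuristic depends on the other toponyms in the text.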


Geoparsing

''Geoparsing'' is a special toponym resolution process of converting free-text descriptions of places (such as "twenty miles northeast of Jalalabad") into unambiguous geographic identifiers, such as geographic coordinates expressed as latitude-longitude. One can also geoparse location references from other forms of media, for example audio content in which a speaker mentions a place. With geographic coordinates, the features can be mapped and entered into geographic information systems. Two primary uses of the geographic coordinates derived from unstructured content are to plot portions of the content on maps and to search the content using a map as a filter.

Geoparsing goes beyond geocoding. Geocoding analyzes unambiguous structured location references, such as postal addresses and rigorously formatted numerical coordinates. Geoparsing handles ambiguous references in unstructured discourse, such as "Al Hamra," which is the name of several places, including towns in both Syria and Yemen. A geoparser is a piece of software or a (web) service that helps in this process. Some examples:
* GEOLocate – Automated georeferencing
* BioGeomancer – Semi-automatic georeferencing
* GEOnet Names Server – Freely available GIS information for areas outside of the U.S.A. and Antarctica, updated monthly by the National Geospatial-Intelligence Agency (NGA) and the U.S. Board on Geographic Names (US BGN)
* Geographic Names Information System (GNIS) – Freely available database containing information on almost 2 million physical features, places, and landmarks in the U.S.A.
* CLAVIN – CLAVIN (Cartographic Location And Vicinity INdexer) is an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution.
* Geoparser.io – A web service that identifies places mentioned in text, disambiguates those places, and returns GeoJSON with detailed metadata about the places found in the text.
* Geocode.xyz – A web service that identifies both place names and street addresses mentioned in text.
* geoparsepy – A free Python geoparsing library supporting free text location identification and disambiguation using the OpenStreetMap database
* Tagbox.ai – A geoparser API service
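
Once a description such as "twenty miles northeast of Jalalabad" has been parsed into a distance, a direction, and a resolved anchor place, the final geometric step can be sketched with the spherical destination-point formula. The anchor coordinates below are an approximate assumption, the names are hypothetical, and parsing the text itself is outside the scope of this sketch:

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344

# Compass directions as bearings in degrees clockwise from north.
BEARINGS = {
    "north": 0, "northeast": 45, "east": 90, "southeast": 135,
    "south": 180, "southwest": 225, "west": 270, "northwest": 315,
}


def offset_point(lat, lon, distance_miles, direction):
    """Return (lat, lon) reached by travelling the given distance from the
    start point along a constant initial bearing on a spherical Earth."""
    bearing = math.radians(BEARINGS[direction])
    delta = distance_miles * KM_PER_MILE / EARTH_RADIUS_KM  # angular distance
    phi1, lam1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(
        math.sin(phi1) * math.cos(delta)
        + math.cos(phi1) * math.sin(delta) * math.cos(bearing)
    )
    lam2 = lam1 + math.atan2(
        math.sin(bearing) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


JALALABAD = (34.43, 70.45)  # approximate coordinates, assumed for illustration
lat, lon = offset_point(*JALALABAD, 20, "northeast")
```

The result is a single point. A fuller treatment would return an area instead, since "twenty miles northeast" is itself an imprecise description.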


See also

* Geographic information system
* Information extraction
* Natural language processing