Toponym resolution

In geographic information systems, toponym resolution is the process of relating a toponym, i.e. the mention of a place, to an unambiguous spatial footprint of the same place. [1]

The places mentioned in digitized text collections constitute a rich data source for researchers in many disciplines. However, toponyms in language use are ambiguous, and it is difficult to assign them a definite real-world referent. Over time, established geographic names may change (as in "Byzantium" > "Constantinople" > "Istanbul"), or they may be reused verbatim ("Boston" in England, UK vs. "Boston" in Massachusetts, USA) or with modifications (as in "York" vs. "New York"). To map a set of place names or toponyms that occur in a document to their corresponding latitude/longitude coordinates, a polygon, or any other spatial footprint, a disambiguation step is necessary. A toponym resolution algorithm is an automatic method that performs a mapping from a toponym to a spatial footprint.

Some methods for toponym resolution employ a gazetteer of possible mappings between names and spatial footprints. [2]

Resolution process

The "unambiguous spatial footprint of the same place" in the definition [1] may in practice be fully unambiguous or only approximately so. The resolution process can occur in several contexts of uncertainty:

From geographical evidence

Toponym resolution is sometimes a simple conversion from a name to an abbreviation, especially when the abbreviation is used as a standard geocode. For example, the official country name Afghanistan can be converted into the ISO country code AF.
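As a minimal sketch of this kind of name-to-geocode conversion, the lookup below maps a country name to its ISO 3166-1 alpha-2 code. The table is a tiny hand-written excerpt and the function name `country_to_code` is hypothetical; a real system would load the complete standard list.

```python
# Tiny illustrative excerpt of ISO 3166-1 alpha-2 codes (not the full standard).
ISO_ALPHA2 = {
    "afghanistan": "AF",
    "canada": "CA",
    "united kingdom": "GB",
}


def country_to_code(name):
    """Return the alpha-2 code for a country name, or None if it is unknown."""
    # Normalise whitespace and case so "Afghanistan" and " afghanistan " match.
    return ISO_ALPHA2.get(name.strip().casefold())


print(country_to_code("Afghanistan"))  # AF
```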

When annotating media and metadata, the most common approach is to combine a map with geographical evidence (e.g. GPS) to obtain the toponym, or a geocode that represents the toponym.

From textual evidence

In contrast to geocoding of postal addresses, which are typically stored in structured database records, toponym resolution is typically applied to large unstructured text document collections to associate the locations mentioned in them with maps. If some of those text documents are geotagged (e.g. because they are micro-blog posts with latitude and longitude automatically added), they can be used to infer the varying geographical specificity of arbitrary terms, e.g. "cable car" or "high tide". [3]

The process of annotating media (e.g., image, text, video) using spatial footprints is known as geotagging. In order to automatically geotag a text document, two steps are usually undertaken: toponym recognition (i.e., spotting textual references to geographic locations) and toponym resolution (i.e., selecting an appropriate location interpretation for each geographic reference).
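The two steps can be sketched as a small pipeline. This is only an illustration under strong assumptions: recognition is reduced to matching a hypothetical toy name list (`KNOWN_NAMES`), and resolution is delegated to whatever callable the caller supplies. Neither function reflects a particular published system.

```python
import re

# Hypothetical toy list of place names; real recognizers use NER models
# and much larger gazetteers.
KNOWN_NAMES = {"Toronto", "London"}


def recognize(text):
    """Toponym recognition: return (start, end, surface) for each known name."""
    alternatives = "|".join(re.escape(name) for name in sorted(KNOWN_NAMES))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


def geotag(text, resolve):
    """Run recognition, then hand each mention to the resolver callable.

    `resolve(surface, text)` is assumed to return some spatial footprint.
    """
    return [(surface, resolve(surface, text)) for _, _, surface in recognize(text)]


headline = "High-speed rail between Toronto and London by 2025"
print([surface for _, _, surface in recognize(headline)])  # ['Toronto', 'London']
```

Keeping the resolver as a parameter makes it easy to swap in different disambiguation strategies without touching the recognition step.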

Toponym recognition can be considered a special case of named-entity recognition in which the objective is to derive only location entities. The result of named-entity recognition can be further improved using hand-crafted rules or statistical rules. [4]

For obtaining location interpretations, resolution models tend to leverage gazetteers (i.e., huge databases of locations) such as GeoNames and OpenStreetMap. A naive approach to resolve toponyms is to pick the most populated interpretation from the list of candidates. For example, in the following excerpt:

Toronto man living, working in London 'uncertain of future' in U.K. after Brexit

CBC

The naive approach seems viable here, since the toponyms Toronto and London refer to their most common interpretations, located in Canada and Britain respectively, whereas in the following piece from a news article:

High-speed rail between Toronto and London by 2025

CBC

This approach fails to pinpoint the toponym London as the city located in Ontario, Canada. Hence, selecting the highest population does not work well for toponyms in a localized context.
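The population heuristic and its failure mode can be sketched as follows. The gazetteer here is a hypothetical two-entry toy table, and the coordinates and populations are rough, illustrative figures rather than values taken from GeoNames or OpenStreetMap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    region: str
    lat: float
    lon: float
    population: int


# Toy gazetteer with approximate, illustrative values.
GAZETTEER = {
    "London": [
        Candidate("London", "England, UK", 51.51, -0.13, 8_800_000),
        Candidate("London", "Ontario, Canada", 42.98, -81.25, 420_000),
    ],
    "Toronto": [
        Candidate("Toronto", "Ontario, Canada", 43.65, -79.38, 2_800_000),
    ],
}


def resolve_by_population(toponym):
    """Naive resolution: pick the most populated candidate, ignoring context."""
    candidates = GAZETTEER.get(toponym, [])
    return max(candidates, key=lambda c: c.population, default=None)


print(resolve_by_population("London").region)  # England, UK
```

Because the function never looks at the surrounding text, it returns the English London for both headlines, which is the wrong answer for the high-speed rail example.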

Additionally, toponym resolution does not address metonymy in general. Nonetheless, a resolution technique can still disambiguate a metonymic reference as long as it is identified as a toponym in the recognition phase. For instance, in the following excerpt:

Canada is also adjusting its driving laws to account for cannabis DUIs.

Esquire

Canada is used metonymically here and refers to "the government of Canada". However, it can be identified as a location by a generic named-entity recognizer, and thus a toponym resolver is able to disambiguate it.

Approaches

Toponym resolution methods can generally be divided into supervised and unsupervised models. Supervised methods typically cast the problem as a learning task in which the model first extracts contextual and non-contextual features, and a classifier is then trained on a labelled dataset. The Adaptive model [5] is one of the prominent models proposed for resolving toponyms. For each interpretation of a toponym, the model derives context-sensitive features based on geographical proximity and sibling relationships with other interpretations. In addition to context-related features, the model benefits from context-free features, including population and audience location. Unsupervised models, on the other hand, do not require annotated data. They are superior to supervised models when the annotated corpus is not sufficiently large, in which case supervised models may not generalize well. [6]

Unsupervised models tend to better exploit the interplay of toponyms mentioned in a document. The Context-Hierarchy Fusion [6] model estimates the geographic scope of documents and leverages the connections between nearby place names as evidence to resolve toponyms. By means of mapping the problem to a conflict-free set cover problem, this model achieves a coherent and robust resolution.
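The general idea of using nearby place names as evidence can be illustrated with a much simpler heuristic than the model cited above: choose one candidate per toponym so that the chosen locations lie as close together as possible. The sketch below is not the Context-Hierarchy Fusion algorithm and does not solve a set cover problem; the gazetteer, its approximate coordinates, and the function names are all assumptions made for illustration.

```python
import math
from itertools import combinations, product

# Toy gazetteer: toponym -> list of (region, lat, lon), approximate values.
GAZETTEER = {
    "Toronto": [("Ontario, Canada", 43.65, -79.38)],
    "London": [("England, UK", 51.51, -0.13), ("Ontario, Canada", 42.98, -81.25)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def resolve_jointly(toponyms):
    """Pick one candidate per toponym, minimising the summed pairwise distance."""
    options = [GAZETTEER[t] for t in toponyms]

    def spread(assignment):
        return sum(
            haversine_km(a[1], a[2], b[1], b[2])
            for a, b in combinations(assignment, 2)
        )

    # Exhaustive search over all combinations; fine for a toy example only.
    best = min(product(*options), key=spread)
    return dict(zip(toponyms, best))


print(resolve_jointly(["Toronto", "London"])["London"][0])  # Ontario, Canada
```

The exhaustive search grows exponentially with the number of ambiguous toponyms, which is one reason practical systems rely on more structured formulations.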

Furthermore, the use of Wikipedia and knowledge bases has been shown to be effective in toponym resolution. TopoCluster [7] models the geographical senses of words by incorporating Wikipedia pages of locations and disambiguates toponyms using the spatial senses of the words in the text.

Geoparsing

Geoparsing is a special toponym resolution process of converting free-text descriptions of places (such as "twenty miles northeast of Jalalabad") into unambiguous geographic identifiers, such as geographic coordinates expressed as latitude-longitude. One can also geoparse location references from other forms of media, for example audio content in which a speaker mentions a place. With geographic coordinates, the features can be mapped and entered into geographic information systems. Two primary uses of the geographic coordinates derived from unstructured content are to plot portions of the content on maps and to search the content using a map as a filter.
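Once the anchor place has been resolved, a relative description such as "twenty miles northeast of Jalalabad" can be turned into coordinates with the standard great-circle destination formula. The sketch below assumes a spherical Earth, treats "northeast" as a bearing of exactly 45 degrees, and uses approximate coordinates for Jalalabad; the function name `destination_point` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344


def destination_point(lat, lon, bearing_deg, distance_km):
    """Return (lat, lon) reached from a start point along an initial bearing."""
    delta = distance_km / EARTH_RADIUS_KM  # angular distance in radians
    theta = math.radians(bearing_deg)
    phi1 = math.radians(lat)
    lam1 = math.radians(lon)
    phi2 = math.asin(
        math.sin(phi1) * math.cos(delta)
        + math.cos(phi1) * math.sin(delta) * math.cos(theta)
    )
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


# Approximate coordinates for Jalalabad, used only for illustration.
JALALABAD = (34.43, 70.45)
lat, lon = destination_point(*JALALABAD, bearing_deg=45.0, distance_km=20 * KM_PER_MILE)
print(round(lat, 2), round(lon, 2))
```

In practice such descriptions are vague, so a geoparser might return a region around the computed point rather than a single coordinate pair.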

Geoparsing goes beyond geocoding. Geocoding analyzes unambiguous structured location references, such as postal addresses and rigorously formatted numerical coordinates. Geoparsing handles ambiguous references in unstructured discourse, such as "Al Hamra," which is the name of several places, including towns in both Syria and Yemen.

A geoparser is a piece of software or a (web) service that helps in this process.

References

  1. Leidner, Jochen L. (2007). Toponym Resolution in Text: Annotation, Evaluation and Applications of Spatial Grounding (PhD). University of Edinburgh. hdl:1842/1849.
  2. Hill, Linda L. (2006). Georeferencing: The geographic associations of information. The MIT Press. ISBN 978-0262083546.
  3. Berggren, Max; Karlgren, Jussi; Östling, Robert; Parkvall, Mikael (2016). "Inferring the location of authors from words in their texts". Proceedings of the Nordic Conference on Computational Linguistics. arXiv:1612.06671.
  4. Lieberman, Michael D.; Samet, Hanan (2011). Multifaceted toponym recognition for streaming news. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. pp. 843–852. doi:10.1145/2009916.2010029.
  5. Lieberman, Michael D.; Samet, Hanan (2012). Adaptive context features for toponym resolution in streaming news. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. pp. 731–740. doi:10.1145/2348283.2348381.
  6. Kamalloo, Ehsan; Rafiei, Davood (2018). A Coherent Unsupervised Model for Toponym Resolution. Proceedings of the 2018 World Wide Web Conference. pp. 1287–1296. arXiv:1805.01952. doi:10.1145/3178876.3186027.
  7. DeLozier, Grant; Baldridge, Jason; London, Loretta (2015). Gazetteer-Independent Toponym Resolution Using Geographic Word Profiles. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pp. 2382–2388.
  8. "Perl Advent Calendar 2016 - A Geo Parser for vast amounts of Text".
