Geohash

Last updated
The 6g cell and its sub-grid. Geohash-grid.png
The 6g cell and its sub-grid.

Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer [1] which encodes a geographic location into a short string of letters and digits. Similar ideas were introduced by G.M. Morton in 1966. [2] It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves.

Contents

Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision). Geohashing guarantees that the longer a shared prefix between two geohashes is, the spatially closer they are together. The reverse of this is not guaranteed, as two points can be very close but have a short or no shared prefix.

History

The core part of the Geohash algorithm and the first initiative to similar solution was documented in a report of G.M. Morton in 1966, "A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing". [2] The Morton work was used for efficient implementations of Z-order curve, like in this modern (2014) Geohash-integer version (based on directly interleaving 64-bit integers), but his geocode proposal was not human-readable and was not popular.

Apparently, in the late 2000s, G. Niemeyer still didn't know about Morton's work, and reinvented it, adding the use of base32 representation. In February 2008, together with the announcement of the system, [1] he launched the website http://geohash.org , which allows users to convert geographic coordinates to short URLs which uniquely identify positions on the Earth, so that referencing them in emails, forums, and websites is more convenient.

Many variations have been developed, including OpenStreetMap's short link [3] (using base64 instead of base32) in 2009, the 64-bit Geohash [4] in 2014, the exotic Hilbert-Geohash [5] in 2016, and others.

Typical and main usages

To obtain the Geohash, the user provides an address to be geocoded, or latitude and longitude coordinates, in a single input box (most commonly used formats for latitude and longitude pairs are accepted), and performs the request.

Besides showing the latitude and longitude corresponding to the given Geohash, users who navigate to a Geohash at geohash.org are also presented with an embedded map, and may download a GPX file, or transfer the waypoint directly to certain GPS receivers. Links are also provided to external sites that may provide further details around the specified location.

For example, the coordinate pair 57.64911,10.40744 (near the tip of the peninsula of Jutland, Denmark) produces a slightly shorter hash of u4pruydqqvj.

The main usages of Geohashes are:

Geohashes have also been proposed to be used for geotagging.

When used in a database, the structure of geohashed data has two advantages. First, data indexed by geohash will have all points for a given rectangular area in contiguous slices (the number of slices depends on the precision required and the presence of geohash "fault lines"). This is especially useful in database systems where queries on a single index are much easier or faster than multiple-index queries. Second, this index structure can be used for a quick-and-dirty proximity search: the closest points are often among the closest geohashes.

Technical description

A formal description for Computational and Mathematical views.

Textual representation

For exact latitude and longitude translations Geohash is a spatial index of base 4, because it transforms the continuous latitude and longitude space coordinates into a hierarchical discrete grid, using a recurrent four-partition of the space. To be a compact code it uses base 32 and represents its values by the following alphabet, that is the "standard textual representation".

Decimal0123456789101112131415
Base 320123456789bcdefg
 
Decimal16171819202122232425262728293031
Base 32hjkmnpqrstuvwxyz

The "Geohash alphabet" (32ghs) uses all digits 0-9 and all lower case letters except "a", "i", "l" and "o".

For example, using the table above and the constant , the Geohash ezs42 can be converted to a decimal representation by ordinary positional notation:

[ezs42]32ghs =
=
=
= =

Geometrical representation

The geometry of the Geohash has a mixed spatial representation:

Geohash-OddEvenDigits.png
The curve of the grid of 32 cells was obtained merging 2 by 2 cells of the "next level" (64 cells grid illustrated here) to obtain a geometrical representation of the "odd-digit Geohash". MortonCurve64grid-mergingCells.png
The curve of the grid of 32 cells was obtained merging 2 by 2 cells of the "next level" (64 cells grid illustrated here) to obtain a geometrical representation of the "odd-digit Geohash".

It is possible to build the "И-order curve" from the Z-order curve by merging neighboring cells and indexing the resulting rectangular grid by the function . The illustration shows how to obtain the grid of 32 rectangular cells from the grid of 64 square cells.

The most important property of Geohash for humans is that it preserves spatial hierarchy in the code prefixes.
For example, in the "1 Geohash digit grid" illustration of 32 rectangles, above, the spatial region of the code e (rectangle of greyish blue circle at position 4,3) is preserved with prefix e in the "2 digit grid" of 1024 rectangles (scale showing em and greyish green to blue circles at grid).

Algorithm and example

Using the hash ezs42 as an example, here is how it is decoded into a decimal latitude and longitude. The first step is decoding it from textual "base 32ghs", as showed above, to obtain the binary representation:

.

This operation results in the bits 0110111111110000010000010. Starting to count from the left side with the digit 0 in the first position, the digits in the odd positions form the longitude code (0111110000000), while the digits in the even positions form the latitude code (101111001001).

Each binary code is then used in a series of divisions, considering one bit at a time, again from the left to the right side. For the latitude value, the interval -90 to +90 is divided by 2, producing two intervals: -90 to 0, and 0 to +90. Since the first bit is 1, the higher interval is chosen, and becomes the current interval. The procedure is repeated for all bits in the code. Finally, the latitude value is the center of the resulting interval. Longitudes are processed in an equivalent way, keeping in mind that the initial interval is -180 to +180.

For example, in the latitude code 101111001001, the first bit is 1, so we know our latitude is somewhere between 0 and 90. Without any more bits, we'd guess the latitude was 45, giving us an error of ±45. Since more bits are available, we can continue with the next bit, and each subsequent bit halves this error. This table shows the effect of each bit. At each stage, the relevant half of the range is highlighted in green; a low bit selects the lower range, a high bit selects the upper range.

The column "mean value" shows the latitude, simply the mean value of the range. Each subsequent bit makes this value more precise.

Latitude code 101111001001
bit positionbit valueminmidmaxmean valuemaximum error
01-90.0000.00090.00045.00045.000
100.00045.00090.00022.50022.500
210.00022.50045.00033.75011.250
3122.50033.75045.00039.3755.625
4133.75039.37545.00042.1882.813
5139.37542.18845.00043.5941.406
6042.18843.59445.00042.8910.703
7042.18842.89143.59442.5390.352
8142.18842.53942.89142.7150.176
9042.53942.71542.89142.6270.088
10042.53942.62742.71542.5830.044
11142.53942.58342.62742.6050.022
Longitude code 0111110000000
bit positionbit valueminmidmaxmean valuemaximum error
00-180.0000.000180.000-90.00090.000
11-180.000-90.0000.000-45.00045.000
21-90.000-45.0000.000-22.50022.500
31-45.000-22.5000.000-11.25011.250
41-22.500-11.2500.000-5.6255.625
51-11.250-5.6250.000-2.8132.813
60-5.625-2.8130.000-4.2191.406
70-5.625-4.219-2.813-4.9220.703
80-5.625-4.922-4.219-5.2730.352
90-5.625-5.273-4.922-5.4490.176
100-5.625-5.449-5.273-5.5370.088
110-5.625-5.537-5.449-5.5810.044
120-5.625-5.581-5.537-5.6030.022

(The numbers in the above table have been rounded to 3 decimal places for clarity)

Final rounding should be done carefully in a way that

So while rounding 42.605 to 42.61 or 42.6 is correct, rounding to 43 is not.

Digits and precision in km

geohash lengthlat bitslng bitslat errorlng errorkm error
123±23±23±2,500 km (1,600 mi)
255±2.8±5.6±630 km (390 mi)
378±0.70±0.70±78 km (48 mi)
41010±0.087±0.18±20 km (12 mi)
51213±0.022±0.022±2.4 km (1.5 mi; 2,400 m)
61515±0.0027±0.0055±0.61 km (0.38 mi; 610 m)
71718±0.00068±0.00068±0.076 km (0.047 mi; 76 m)
82020±0.000085±0.00017±0.019 km (0.012 mi; 19 m)

Limitations when used for deciding proximity

Edge cases

Geohashes can be used to find points in proximity to each other based on a common prefix. However, edge case locations close to each other but on opposite sides of the 180 degree meridian will result in Geohash codes with no common prefix (different longitudes for near physical locations). Points close to the North and South poles will have very different geohashes (different longitudes for near physical locations).

Two close locations on either side of the Equator (or Greenwich meridian) will not have a long common prefix since they belong to different 'halves' of the world. Put simply, one location's binary latitude (or longitude) will be 011111... and the other 100000...., so they will not have a common prefix and most bits will be flipped. This can also be seen as a consequence of relying on the Z-order curve (which could more appropriately be called an N-order visit in this case) for ordering the points, as two points close by might be visited at very different times. However, two points with a long common prefix will be close by.

In order to do a proximity search, one could compute the southwest corner (low geohash with low latitude and longitude) and northeast corner (high geohash with high latitude and longitude) of a bounding box and search for geohashes between those two. This search will retrieve all points in the z-order curve between the two corners, which can be far too many points. This method also breaks down at the 180 meridians and the poles. Solr uses a filter list of prefixes, by computing the prefixes of the nearest squares close to the geohash .

Non-linearity

Since a geohash (in this implementation) is based on coordinates of longitude and latitude the distance between two geohashes reflects the distance in latitude/longitude coordinates between two points, which does not translate to actual distance, see Haversine formula.

Example of non-linearity for latitude-longitude system:

Note that these limitations are not due to geohashing, and not due to latitude-longitude coordinates, but due to the difficulty of mapping coordinates on a sphere (non linear and with wrapping of values, similar to modulo arithmetic) to two dimensional coordinates and the difficulty of exploring a two dimensional space uniformly. The first is related to Geographical coordinate system and Map projection, and the other to Hilbert curve and z-order curve. Once a coordinate system is found that represents points linearly in distance and wraps up at the edges, and can be explored uniformly, applying geohashing to those coordinates will not suffer from the limitations above.

While it is possible to apply geohashing to an area with a Cartesian coordinate system, it would then only apply to the area where the coordinate system applies.

Despite those issues, there are possible workarounds, and the algorithm has been successfully used in Elasticsearch, [6] MongoDB, [7] HBase, Redis, [8] and Accumulo [9] to implement proximity searches.

Similar indexing systems

It is possible to use same base32-Geohash codes in different indexing curves. For quadrilateral tiling the Hilbert curve is the best alternative for Morton curve, used for example in the S2-geometry.
Codes with even number of digits (2, 4, ...) are mapped to regular grids, but codes of odd number (1, 3, ...) must be mapped to an irregular intermediary grid, with cells indexed by degenerated curves. Comparing-Geoash-Hilbert.png
It is possible to use same base32-Geohash codes in different indexing curves. For quadrilateral tiling the Hilbert curve is the best alternative for Morton curve, used for example in the S2-geometry.
Codes with even number of digits (2, 4, ...) are mapped to regular grids, but codes of odd number (1, 3, ...) must be mapped to an irregular intermediary grid, with cells indexed by degenerated curves.

An alternative to storing Geohashes as strings in a database are Locational codes, which are also called spatial keys and similar to QuadTiles. [10] [11]

In some geographical information systems and Big Data spatial databases, a Hilbert curve based indexation can be used as an alternative to Z-order curve, like in the S2 Geometry library. [12]

In 2019 a front-end was designed by QA Locate [13] in what they called GeohashPhrase [14] to use phrases to code Geohashes for easier communication via spoken English language. There were plans to make GeohashPhrase open source. [15]

Licensing

The Geohash algorithm was put in the public domain by its inventor in a public announcement on February 26, 2008. [16]

While comparable algorithms have been successfully patented [17] and had copyright claimed upon, [18] [19] GeoHash is based on an entirely different algorithm and approach.

Formal Standard

Geohash is standardized as CTA-5009 [20] .  This standard follows the Wikipedia article as of the 2023 version but provides additional detail in a formal (normative) reference. In the absence of an official specification since the creation of Geohash, the CTA WAVE organization published CTA-5009 to aid in broader adoption and compatibility across implementers in the industry.

See also

Related Research Articles

Elliptic-curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. ECC allows smaller keys compared to non-EC cryptography to provide equivalent security.

<span class="mw-page-title-main">Latitude</span> Geographic coordinate specifying north–south position

In geography, latitude is a coordinate that specifies the north–south position of a point on the surface of the Earth or another celestial body. Latitude is given as an angle that ranges from −90° at the south pole to 90° at the north pole, with 0° at the Equator. Lines of constant latitude, or parallels, run east–west as circles parallel to the equator. Latitude and longitude are used together as a coordinate pair to specify a location on the surface of the Earth.

<span class="mw-page-title-main">Geographic coordinate system</span> System to specify locations on Earth

The geographic coordinate system (GCS) is a spherical or geodetic coordinate system for measuring and communicating positions directly on the Earth as latitude and longitude. It is the simplest, oldest and most widely used of the various spatial reference systems that are in use, and forms the basis for most others. Although latitude and longitude form a coordinate tuple like a cartesian coordinate system, the geographic coordinate system is not cartesian because the measurements are angles and are not on a planar surface.

<span class="mw-page-title-main">Projected coordinate system</span> Cartesian geographic coordinate system

A projected coordinate system – also called a projected coordinate reference system, planar coordinate system, or grid reference system – is a type of spatial reference system that represents locations on Earth using Cartesian coordinates (x, y) on a planar surface created by a particular map projection. Each projected coordinate system, such as "Universal Transverse Mercator WGS 84 Zone 26N," is defined by a choice of map projection (with specific parameters), a choice of geodetic datum to bind the coordinate system to real locations on the earth, an origin point, and a choice of unit of measure. Hundreds of projected coordinate systems have been specified for various purposes in various regions.

<span class="mw-page-title-main">Ordnance Survey National Grid</span> System of geographic grid references used in Great Britain

The Ordnance Survey National Grid reference system (OSGB), also known as British National Grid (BNG), is a system of geographic grid references used in Great Britain, distinct from latitude and longitude.

In geodesy, conversion among different geographic coordinate systems is made necessary by the different geographic coordinate systems in use across the world and over time. Coordinate conversion is composed of a number of different types of conversion: format change of geographic coordinates, conversion of coordinate systems, or transformation to different geodetic datums. Geographic coordinate conversion has applications in cartography, surveying, navigation and geographic information systems.

The Maidenhead Locator System is a geocode system used by amateur radio operators to succinctly describe their geographic coordinates, which replaced the deprecated QRA locator, which was limited to European contacts. Its purpose is to be concise, accurate, and robust in the face of interference and other adverse transmission conditions. The Maidenhead Locator System can describe locations anywhere in the world.

A geocode is a code that represents a geographic entity. It is a unique identifier of the entity, to distinguish it from others in a finite set of geographic entities. In general the geocode is a human-readable and short identifier.

Base32 is an encoding method based on the base-32 numeral system. It uses an alphabet of 32 digits, each of which represents a different combination of 5 bits (25). Since base32 is not very widely adopted, the question of notation—which characters to use to represent the 32 digits—is not as settled as in the case of more well-known numeral systems (such as hexadecimal), though RFCs and unofficial and de-facto standards exist. One way to represent Base32 numbers in human-readable form is using digits 0–9 followed by the twenty-two upper-case letters A–V. However, many other variations are used in different contexts. Historically, Baudot code could be considered a modified (stateful) base32 code.

<span class="mw-page-title-main">Military Grid Reference System</span> NATO global coordinate reference system

The Military Grid Reference System (MGRS) is the geocoordinate standard used by NATO militaries for locating points on Earth. The MGRS is derived from the Universal Transverse Mercator (UTM) grid system and the Universal Polar Stereographic (UPS) grid system, but uses a different labeling convention. The MGRS is used as geocode for the entire Earth.

<span class="mw-page-title-main">Z-order curve</span> Mapping function that preserves data point locality

In mathematical analysis and computer science, functions which are Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It is named in France after Henri Lebesgue, who studied it in 1904, and named in the United States after Guy Macdonald Morton, who first applied the order to file sequencing in 1966. The z-value of a point in multidimensions is simply calculated by interleaving the binary representations of its coordinate values. Once the data are sorted into this ordering, any one-dimensional data structure can be used, such as simple one dimensional arrays, binary search trees, B-trees, skip lists or hash tables. The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree.

Address geocoding, or simply geocoding, is the process of taking a text-based description of a location, such as an address or the name of a place, and returning geographic coordinates, frequently latitude/longitude pair, to identify a location on the Earth's surface. Reverse geocoding, on the other hand, converts geographic coordinates to a description of a location, usually the name of a place or an addressable location. Geocoding relies on a computer representation of address points, the street / road network, together with postal and administrative boundaries.

<span class="mw-page-title-main">Universal Transverse Mercator coordinate system</span> Map projection system

The Universal Transverse Mercator (UTM) is a map projection system for assigning coordinates to locations on the surface of the Earth. Like the traditional method of latitude and longitude, it is a horizontal position representation, which means it ignores altitude and treats the earth surface as a perfect ellipsoid. However, it differs from global latitude/longitude in that it divides earth into 60 zones and projects each to the plane as a basis for its coordinates. Specifying a location means specifying the zone and the x, y coordinate in that plane. The projection from spheroid to a UTM zone is some parameterization of the transverse Mercator projection. The parameters vary by nation or region or mapping system.

<span class="mw-page-title-main">United States National Grid</span> Multi-purpose grid reference system used in the United States

The United States National Grid (USNG) is a multi-purpose location system of grid references used in the United States. It provides a nationally consistent "language of location", optimized for local applications, in a compact, user friendly format. It is similar in design to the national grid reference systems used in other countries. The USNG was adopted as a national standard by the Federal Geographic Data Committee (FGDC) of the US Government in 2001.

World Meteorological Organization (WMO) squares is a system of geocodes that divides a world map with latitude-longitude gridlines into grid cells of 10° latitude by 10° longitude, each with a unique, 4-digit numeric identifier. On the plate carrée projection, the grid cells appear square; however, if the Mercator projection is used, the grid cells appear 'stretched' vertically nearer the tops and bottoms of the map. On the actual surface of the Globe, the cells are approximately "square" only adjacent to the Equator, and become progressively narrower and tapered as they approach the poles, and cells adjoining the poles are unique in possessing three faces rather than four.

C-squares is a system of spatially unique, location-based identifiers (geocodes) for areas on the surface of the earth, represented as cells from a latitude- and longitude-based Discrete Global Grid at a hierarchical set of resolution steps, obtained by progressively subdividing 10×10 degree World Meteorological Organization squares; the term "c-square" is also available for use to designate any component cell of the grid. Individual cell identifiers incorporate literal values of latitude and longitude in an interleaved notation, together with additional digits that support intermediate grid resolutions of 5, 0.5, 0.05 degrees, etc.

The Geohash-36 geocode is an opensource compression algorithm for world coordinate data. It was developed as a variation of the OpenPostcode format developed as a candidate geolocation postcode for the Republic of Ireland. It is calculated differently and uses a more concise base 36 representation rather than other geocodes that adopted base 32.

<span class="mw-page-title-main">Discrete global grid</span> Partition of Earths surface into subdivided cells

A discrete global grid (DGG) is a mosaic that covers the entire Earth's surface. Mathematically it is a space partitioning: it consists of a set of non-empty regions that form a partition of the Earth's surface. In a usual grid-modeling strategy, to simplify position calculations, each region is represented by a point, abstracting the grid as a set of region-points. Each region or region-point in the grid is called a cell.

The Open Location Code (OLC) is a geocode based in a system of regular grids for identifying an area anywhere on the Earth. It was developed at Google's Zürich engineering office, and released late October 2014. Location codes created by the OLC system are referred to as "plus codes".

The isoazimuth is the locus of the points on the Earth's surface whose initial orthodromic course with respect to a fixed point is constant.

References

  1. 1 2 Evidences at the Wayback Machine:
  2. 1 2 G. M. Morton (1966)"A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing" Archived 2019-01-25 at the Wayback Machine . Report in IBM Canada.
  3. The "Geohash binary 64 bits" have classic solutions, as yinqiwen/geohash-int, and optimized solutions, as mmcloughlin/geohash-assembly.
  4. Vukovic, Tibor (2016). Hilbert-Geohash - Hashing Geographical Point Data Using the Hilbert Space-Filling Curve. 70 (Thesis). hdl: 11250/2404058 .
  5. geo_shape Datatype in Elasticsearch
  6. Geospatial Indexing in MongoDB
  7. Redis-commands Guide
  8. Spatio-temporal Indexing in Non-relational Distributed Databases
  9. Spatial Keys
  10. QuadTiles
  11. "S2 Geometry Library" for optimized spatial indexation, https://s2geometry.io
  12. "QA Locate | The Power of Precision Location Intelligence". QA Locate. Retrieved 2020-06-10.
  13. "GeohashPhrase". QA Locate. 2019-09-17. Retrieved 2020-06-10.
  14. thelittlenag (2019-11-11). "At QA Locate we've been working on a solution that we call GeohashPhrase | Hacker News". news.ycombinator.com. Retrieved 2020-06-10.
  15. geohash.org announcement post in groundspeak.com forum. See also Wayback of 2018 at https://web.archive.org/web/20180309054335/https://forums.geocaching.com/GC/index.php?/topic/186412-geohashorg/ web.archive.org/web/20180309054335
  16. Compact text encoding of latitude/longitude coordinates - Patent 20050023524
  17. Does Microsoft Infringe the Natural Area Coding System? Archived 2010-12-28 at the Wayback Machine
  18. "The Natural Area Coding System - Legal and Licensing". Archived from the original on 2019-05-23. Retrieved 2008-02-26.
  19. "Fast and Readable Geographical Hashing (CTA-5009)". Consumer Technology Association®. Retrieved 2024-03-04.