Geohash-36

The Geohash-36 geocode is an open-source compression algorithm for world coordinate data. It was developed as a variation of the OpenPostcode format, itself a candidate geolocation postcode for the Republic of Ireland. [1] It is calculated differently from OpenPostcode and uses a more concise base-36 representation, whereas other geocodes adopted base 32. [2]

Despite the name, Geohash-36 has no algorithmic relationship (it does not use a Z-order curve) or typological relationship with Geohash. The name is a publicity strategy that relates it to a popular base-32 geocode. The encode and decode functions are not mathematically similar to the Geohash functions.

Coding Method

Designed for URLs and electronic storage and communication rather than human memory and conversation, it is case-sensitive and uses a 36-character alphabet: "23456789bBCdDFgGhHjJKlLMnNPqQrRtTVWX".

Character Conversion:

Decimal      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Geohash-36   2  3  4  5  6  7  8  9  b  B  C  d  D  F  g  G  h  H

Decimal     18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Geohash-36   j  J  K  l  L  M  n  N  P  q  Q  r  R  t  T  V  W  X

Characters are chosen to avoid vowels, vowel-like numbers, character confusion, and to use lowercase characters which are generally distinct from their uppercase equivalents in standard typefaces.

The code can be of varying length and thus precision. Each character represents a further subdivision into a 6 by 6 grid, starting at the north-west (top-left) cell and continuing, row by row, to the south-east (bottom-right) cell. Neighbouring coordinates have largely similar encodings and generally vary in the rightmost characters only; however, edge cases exist where neighbouring coordinates lie on opposite sides of a grid division and therefore differ in earlier characters. Codes sort logically but not in ordinary coordinate order.
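The subdivision described above can be sketched in Python. This is a minimal illustration, not the reference implementation: the function name `encode`, the default length of 10, and the clamping at the grid edge are assumptions, and the alphabet is the one listed at the start of this section.

```python
ALPHABET = "23456789bBCdDFgGhHjJKlLMnNPqQrRtTVWX"

def encode(lat, lon, length=10):
    """Encode a latitude/longitude pair by repeated 6x6 subdivision."""
    north, west = 90.0, -180.0      # top-left corner of the current cell
    height, width = 180.0, 360.0    # size of the current cell in degrees
    chars = []
    for _ in range(length):
        height /= 6
        width /= 6
        # Rows count southward from the north edge, columns eastward from the west edge.
        row = min(int((north - lat) / height), 5)  # clamp at lat == -90 (assumption)
        col = min(int((lon - west) / width), 5)    # clamp at lon == 180 (assumption)
        chars.append(ALPHABET[row * 6 + col])      # row-major index into the alphabet
        north -= row * height
        west += col * width
    return "".join(chars)

print(encode(40.689167, -74.044444))  # 9LVB4BH89g
```

Passing a smaller `length` yields a prefix of the longer code, which is why codes can be shortened at the cost of precision.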

Because the alphabet contains no vowels, codes avoid the unintended English-language words that may appear in the original Geohash code. As vowels are otherwise unused, an optional altitude component, encoded in metres, can be introduced with a prefixing "A" character (or a lowercase "a" for altitudes below sea level).

An optional checksum is represented by a letter of the lowercase English alphabet. It identifies the code as a Geohash-36 and provides a check for incorrect or transposed characters. It is calculated as the sum of each character's value multiplied by its position (reading from left to right), taken modulo 26; the altitude delimiters "A" and "a" are valued at zero.
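A possible reading of the checksum rule is sketched below. The description does not pin down which end of the code has position 1; this sketch gives the rightmost character weight 1 and increases the weight toward the left, an assumption chosen because it reproduces the "-m" suffix of both example codes in this article. The function name `checksum` is illustrative.

```python
ALPHABET = "23456789bBCdDFgGhHjJKlLMnNPqQrRtTVWX"

def checksum(code):
    """Return the lowercase checksum letter for a code given without its suffix."""
    total = 0
    n = len(code)
    for i, ch in enumerate(code):
        # Altitude delimiters "A"/"a" are valued at zero, per the description.
        value = 0 if ch in "Aa" else ALPHABET.index(ch)
        # Assumed weighting: leftmost character has weight n, rightmost has weight 1.
        total += value * (n - i)
    return chr(ord("a") + total % 26)

print(checksum("9LVB4BH89g"))  # m
```

Because each character is weighted by its position, swapping two different adjacent characters changes the weighted sum, which is how transpositions can be caught.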

Efficiency

Compared to storing GPS coordinates using the DECIMAL datatype in SQL, Geohash-36 does not save significantly on database bytes. Using DECIMAL(8,5) and DECIMAL(7,5) requires 10 bytes [3] and is accurate to a cell of about 1.1 metres by 1.1 metres at the equator (or better further from the equator). An equivalent 10 bytes of Geohash-36 code is accurate to approximately one sixth of a square metre. [4]
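The cell size behind such figures can be estimated with a little arithmetic. The sketch below assumes roughly 111,320 metres per degree at the equator (an approximation not taken from the source) and scales the east-west side by the cosine of the latitude; `cell_size_m` is a hypothetical helper name.

```python
import math

METERS_PER_DEGREE = 111_320.0  # assumed approximate value at the equator

def cell_size_m(length, lat=0.0):
    """Approximate (north-south, east-west) cell size in metres for a code length."""
    ns = 180.0 / 6 ** length * METERS_PER_DEGREE
    ew = 360.0 / 6 ** length * METERS_PER_DEGREE * math.cos(math.radians(lat))
    return ns, ew

ns, ew = cell_size_m(10)
print(f"{ns:.2f} m x {ew:.2f} m")  # 0.33 m x 0.66 m
```

Under these assumptions a 10-character cell shrinks in the east-west direction at higher latitudes; at about 40 degrees latitude its area comes to roughly 0.17 square metres, in line with the one-sixth figure.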

The Statue of Liberty, at coordinates 40.689167, −74.044444, is encoded as 9LVB4BH89g-m. The reverse decoding equates to 40.689168, −74.044445.

The Shard building, London, at coordinates 51.504444, −0.086667, is encoded as bdrdC26BqH-m (which decodes to 51.504444, −0.086666), or may be successfully shortened to bdrdC26B. [5]
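Decoding reverses the subdivision. The following is a minimal sketch, assuming that the decoder reports the centre of the final cell (an assumption that is consistent with the decoded values quoted above) and that any checksum suffix such as "-m" has already been stripped; `decode` is an illustrative name.

```python
ALPHABET = "23456789bBCdDFgGhHjJKlLMnNPqQrRtTVWX"

def decode(code):
    """Return the (lat, lon) centre of the cell named by a code without checksum."""
    north, west = 90.0, -180.0      # top-left corner of the current cell
    height, width = 180.0, 360.0    # size of the current cell in degrees
    for ch in code:
        # Each character is a row-major index into a 6x6 grid.
        row, col = divmod(ALPHABET.index(ch), 6)
        height /= 6
        width /= 6
        north -= row * height
        west += col * width
    # Assumption: report the centre of the final cell.
    return north - height / 2, west + width / 2

lat, lon = decode("9LVB4BH89g")
print(round(lat, 6), round(lon, 6))  # 40.689168 -74.044445
```

A shortened code decodes the same way, but to the centre of a larger cell, so the result lies further from the original coordinates.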

Implementations

C

Ruby

References

  1. "DCENR Postcodes". Retrieved 26 June 2012.
  2. "Geohash Tips & Tricks". Retrieved 26 June 2012.
  3. "decimal and numeric (Transact-SQL)". MSDN. Retrieved 26 June 2012.
  4. "Geohash-36". Archived from the original on 27 December 2012. Retrieved 26 June 2012.
  5. "Geo36.org". Retrieved 26 June 2012.
  6. "Geohashes". Retrieved 5 June 2024.