Geohash-36

Last updated

The Geohash-36 geocode is an opensource compression algorithm for world coordinate data. It was developed as a variation of the OpenPostcode format developed as a candidate geolocation postcode for the Republic of Ireland. [1] It is calculated differently and uses a more concise base 36 representation rather than other geocodes that adopted base 32. [2]

Contents

Despite the name, there are no algorithmic (not use Z-order curve) or typological relationship with Geohash. It is a publicity strategy to relate to a popular geocode of base 32. The encode/decode functions are not mathematically-similar to Geohash functions.

Coding Method

Designed for URLs and electronic storage and communication rather than human memory and conversation, it is case-sensitive, using a 36 character alphabet: "23456789bBCdDFgGhHjJKlLMnNPqQrRtTVWX".

Character Conversion:

Decimal01234567891011121314151617
Geohash-3623456789bBCdDFgGhH
 
Decimal181920212223242526272829303132333435
Geohash-36jJKlLMnNPqQrRtTVWX

Characters are chosen to avoid vowels, vowel-like numbers, character confusion, and to use lowercase characters which are generally distinct from their uppercase equivalents in standard typefaces.

The code can be of varying length and thus precision. Each character represents a further subdivision in a 6 by 6 grid - starting at the North-West (top-left) coordinate and continuing, row by row, to the South-East (bottom-right). Neighbouring coordinates have largely similar encodings and generally vary at the rightmost characters only; however extreme edge cases exist where neighbouring coordinates are on opposing sides of a grid division. Codes sort logically but not in ordinary coordinate order.

Without vowels, unintended English-language words are avoided that may appear in the original Geohash code. As vowels are not used, an altitude component of encoded meters is optional with a prefixing "A" character (below sea-level prefixed by a lowercase "a").

An optional checksum is represented using the lowercase English alphabet. It confirms the code as a Geohash-36 and provides a check for incorrect or transposed characters. It is calculated as modulus 26 of the sum of each character value (the altitude delimiters of "A" or "a" are valued at zero) multiplied by its position reading from left to right.

Efficiency

Compared to storing GPS coordinates using the Decimal datatype in SQL the Geohash-36 does not save significantly on database bytes. Using DECIMAL(8,5) and DECIMAL(7,5) requires 10-bytes [3] and is accurate to about 1.1 metre squared (or better further from the equator). An equivalent 10-bytes of the Geohash-36 code is accurate to approximately a 6th of square meter. [4]

The Statue of Liberty, at coordinates 40.689167, −74.044444, is encoded as 9LVB4BH89g-m. The reverse decoding equates to 40.689168, −74.044445.

The Shard building, London, at coordinates 51.504444, −0.086667 is encoded as bdrdC26BqH-m (decodes to 51.504444, −0.086666), or may be successfully shorted to bdrdC26B. [5]

Implementations

C

Ruby

See also

Related Research Articles

In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication channel or storage in a storage medium. An early example is an invention of language, which enabled a person, through speech, to communicate what they thought, saw, heard, or felt to others. But speech limits the range of communication to the distance a voice can carry and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time.

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

In mathematics and computing, the hexadecimal numeral system is a positional numeral system that represents numbers using a radix (base) of 16. Unlike the decimal system representing numbers using 10 symbols, hexadecimal uses 16 distinct symbols, most often the symbols "0"–"9" to represent values 0 to 9, and "A"–"F" to represent values from 10 to 15.

<span class="mw-page-title-main">Huffman coding</span> Technique to compress data

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

UTF-8 is a variable-length character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from UnicodeTransformation Format – 8-bit.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data in sequences of 24 bits that can be represented by four 6-bit Base64 digits.

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München is encoded as Mnchen-3ya.

A geocode is a code that represents a geographic entity. It is a unique identifier of the entity, to distinguish it from others in a finite set of geographic entities. In general the geocode is a human-readable and short identifier.

The Natural Area Code is a proprietary geocode system for identifying an area anywhere on the Earth, or a volume of space anywhere around the Earth. The use of thirty alphanumeric characters instead of only ten digits makes a NAC shorter than its numerical latitude/longitude equivalent.

Base32 is the base-32 numeral system. It uses a set of 32 digits, each of which can be represented by 5 bits (25). One way to represent Base32 numbers in a human-readable way is by using a standard 32-character set, such as the twenty-two upper-case letters A–V and the digits 0-9. However, many other variations are used in different contexts.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms Improvements to Soundex are the basis for many modern phonetic algorithms.

The move-to-front (MTF) transform is an encoding of data designed to improve the performance of entropy encoding techniques of compression. When efficiently implemented, it is fast enough that its benefits usually justify including it as an extra step in data compression algorithm.

Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data, it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data.

Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer which encodes a geographic location into a short string of letters and digits. Similar ideas were introduced by G.M. Morton in 1966. It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves.

The mapcode system is an open-source geocode system consisting of two groups of letters and digits, separated by a dot. It represents a location on the surface of the Earth, within the context of a separately specified country or territory. For example, the entrance to the elevator of the Eiffel Tower in Paris is “France 4J.Q2”. As with postal addresses, it is often unnecessary to explicitly mention the country.

<span class="mw-page-title-main">Discrete global grid</span> Partition of Earths surface into subdivided cells

A discrete global grid (DGG) is a mosaic that covers the entire Earth's surface. Mathematically it is a space partitioning: it consists of a set of non-empty regions that form a partition of the Earth's surface. In a usual grid-modeling strategy, to simplify position calculations, each region is represented by a point, abstracting the grid as a set of region-points. Each region or region-point in the grid is called a cell.

The Open Location Code (OLC) is a geocode based in a system of regular grids for identifying an area anywhere on the Earth. It was developed at Google's Zürich engineering office, and released late October 2014. Location codes created by the OLC system are referred to as "plus codes".

Several mutually incompatible versions of the Extended Binary Coded Decimal Interchange Code (EBCDIC) have been used to represent the Japanese language on computers, including variants defined by Hitachi, Fujitsu, IBM and others. Some are variable-width encodings, employing locking shift codes to switch between single-byte and double-byte modes. Unlike other EBCDIC locales, the lowercase basic Latin letters are often not preserved in their usual locations.

References

  1. "DCENR Postcodes" . Retrieved 26 June 2012.
  2. "Geohash Tips & Tricks" . Retrieved 26 June 2012.
  3. "MSDN "decimal and numeric (Transact-SQL)"" . Retrieved 26 June 2012.
  4. "Geohash-36". Archived from the original on 27 December 2012. Retrieved 26 June 2012.
  5. "Geo36.org" . Retrieved 26 June 2012.