Ascii85

Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data (making the encoded size 1⁄4 larger than the original, assuming eight bits per ASCII character), it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data (a 1⁄3 increase, assuming eight bits per ASCII character).

Its main modern uses are in Adobe's PostScript and Portable Document Format file formats, as well as in the patch encoding for binary files used by Git. [1]

Overview

The basic need for a binary-to-text encoding comes from a need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only English language human-readable text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), and may require line breaks at certain maximum intervals, and may not maintain whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.

Eighty-five is the minimum integer value of n such that n⁵ ≥ 256⁴; so any sequence of 4 bytes can be encoded as 5 symbols, as long as at least 85 distinct symbols are available. (Five radix-85 digits can represent the integers from 0 to 4,437,053,124 inclusive, which suffice to represent all 4,294,967,296 possible 4-byte sequences.)
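The bound can be checked by brute force. The following is a small illustrative sketch; the function name min_radix is a hypothetical helper introduced here, not part of any specification.

```python
def min_radix(digits: int, nbytes: int) -> int:
    """Smallest radix n such that n**digits covers every nbytes-byte value."""
    target = 256 ** nbytes          # number of distinct byte sequences
    n = 1
    while n ** digits < target:
        n += 1
    return n


print(min_radix(5, 4))        # 85
print(84 ** 5 < 256 ** 4)     # True: 84 symbols would not be enough
print(85 ** 5 - 1)            # 4437053124, the largest five-digit radix-85 value
```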

Encoding

When encoding, each group of 4 bytes is taken as a 32-bit binary number, most significant byte first (Ascii85 uses a big-endian convention). This is converted, by repeatedly dividing by 85 and taking the remainder, into 5 radix-85 digits. Then each digit (again, most significant first) is encoded as an ASCII printable character by adding 33 to it, giving the ASCII characters 33 (!) through 117 (u).

Because all-zero data is quite common, an exception is made for the sake of data compression, and an all-zero group is encoded as a single character z instead of !!!!!.
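The group encoding described above can be sketched in a few lines. This is a minimal illustration for a single complete 4-byte group; encode_group is a hypothetical name, and handling of a short final group is deliberately left out.

```python
def encode_group(group: bytes) -> str:
    """Encode exactly 4 bytes as 5 Ascii85 characters, or 'z' for all zeros."""
    if len(group) != 4:
        raise ValueError("expected exactly 4 bytes")
    value = int.from_bytes(group, "big")   # most significant byte first
    if value == 0:
        return "z"                         # short form for an all-zero group
    digits = []
    for _ in range(5):
        value, remainder = divmod(value, 85)
        digits.append(remainder)           # collected least significant first
    # Emit the most significant digit first, adding 33 to reach '!'..'u'.
    return "".join(chr(d + 33) for d in reversed(digits))


print(encode_group(b"Man "))              # 9jqo^
print(encode_group(b"\x00\x00\x00\x00"))  # z
```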

Groups of characters that decode to a value greater than 2³² − 1 (encoded as s8W-!) will cause a decoding error, as will z characters in the middle of a group. White space between the characters is ignored and may occur anywhere to accommodate line-length limitations.
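A decoder for a single group might enforce these rules as follows. This is a sketch under the assumption that whitespace has already been stripped; decode_group is a hypothetical name.

```python
def decode_group(chars: str) -> bytes:
    """Decode one 5-character Ascii85 group (or a lone 'z') to 4 bytes."""
    if chars == "z":
        return b"\x00\x00\x00\x00"
    if len(chars) != 5:
        raise ValueError("expected 5 characters or a lone 'z'")
    value = 0
    for ch in chars:
        digit = ord(ch) - 33
        if not 0 <= digit <= 84:
            # This also rejects a 'z' in the middle of a group.
            raise ValueError(f"invalid Ascii85 character: {ch!r}")
        value = value * 85 + digit
    if value > 2**32 - 1:
        raise ValueError("group decodes to a value above 2**32 - 1")
    return value.to_bytes(4, "big")


print(decode_group("9jqo^"))   # b'Man '
print(decode_group("s8W-!"))   # b'\xff\xff\xff\xff'
```

The group s8W-" (one step above s8W-!) represents 2³² and is rejected by the range check.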

Limitations

The original specification only allows a stream that is a multiple of 4 bytes to be encoded.

Encoded data may contain characters that have special meaning in many programming languages and in some text-based protocols, such as left-angle-bracket <, backslash \, and the single and double quotes ' & ". Other base-85 encodings like Z85 and RFC 1924 are designed to be safe in source code. [2]

History

btoa version

The original btoa program always encoded full groups (padding the source as necessary), with a prefix line of "xbtoa Begin", and suffix line of "xbtoa End", followed by the original file length (in decimal and hexadecimal) and three 32-bit checksums. The decoder needs to use the file length to see how much of the group was padding. The initial proposal for btoa encoding used an encoding alphabet starting at the ASCII space character through "t" inclusive, but this was replaced with an encoding alphabet of "!" to "u" to avoid "problems with some mailers (stripping off trailing blanks)". [3] This program also introduced the special "z" short form for an all-zero group. Version 4.2 added a "y" exception for a group of all ASCII space characters (0x20202020).

ZMODEM version

"ZMODEM Pack-7 encoding" encodes groups of 4 octets into groups of 5 printable ASCII characters in a similar, or possibly in the same way as Ascii85 does. When a ZMODEM program sends pre-compressed 8-bit data files over 7-bit data channels, it uses "ZMODEM Pack-7 encoding". [4]

Adobe version

Adobe adopted the basic btoa encoding, but with slight changes, and gave it the name Ascii85. The characters used are the ASCII characters 33 (!) through 117 (u) inclusive (to represent the base-85 digits 0 through 84), together with the letter z (as a special case to represent a 32-bit 0 value), and white space is ignored. Adobe uses the delimiter "~>" to mark the end of an Ascii85-encoded string and represents the length by truncating the final group: If the last block of source bytes contains fewer than 4 bytes, the block is padded with up to 3 null bytes before encoding. After encoding, as many bytes as were added as padding are removed from the end of the output.

The reverse is applied when decoding: the last block is padded to 5 characters with the Ascii85 character u, and as many bytes as there were padding characters added are omitted from the end of the output (see example).

The padding is not arbitrary. Converting from binary to Base64 only regroups bits and does not change them or their order (a high bit in the binary does not affect the low bits in the Base64 representation). In converting a binary number to base 85 (85 is not a power of two), high bits do affect the low-order base-85 digits, and conversely. Padding the binary low (with zero bits) while encoding and padding the base-85 value high (with u characters) in decoding ensures that the high-order bits are preserved (the zero padding in the binary gives enough room so that a small addition is trapped and there is no "carry" to the high bits).
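The effect of the padding choice can be seen on the truncated group /c, which encodes the single byte 46. The sketch below compares padding with the lowest digit (!) against padding with the highest digit (u); digits_to_int is a hypothetical helper.

```python
def digits_to_int(chars: str) -> int:
    """Interpret a string of Ascii85 characters as a radix-85 number."""
    value = 0
    for ch in chars:
        value = value * 85 + (ord(ch) - 33)
    return value


truncated = "/c"   # two characters left after truncating the final group

low = digits_to_int(truncated + "!!!")    # pad with digit 0
high = digits_to_int(truncated + "uuu")   # pad with digit 84

print(low.to_bytes(4, "big")[0])    # 45: the high byte comes out one too small
print(high.to_bytes(4, "big")[0])   # 46: the original byte is recovered
```

Truncation discards low-order digits, so the truncated value is never above the original. Padding with u adds back less than 85³, which is smaller than 2²⁴, so the result cannot spill over into the byte that is kept.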

In Ascii85-encoded blocks, whitespace and line-break characters may be present anywhere, including in the middle of a 5-character block, but they must be silently ignored.

Adobe's specification does not support the y exception.
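Putting the Adobe rules together, a complete encoder and decoder might look like the following. This is an illustrative sketch rather than a reference implementation; the names a85_encode and a85_decode are hypothetical, and only the trailing "~>" delimiter described above is handled.

```python
def a85_encode(data: bytes) -> str:
    """Encode bytes in the Adobe style, truncating a short final group."""
    out = []
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        pad = 4 - len(chunk)                      # 0 for a full group
        value = int.from_bytes(chunk + b"\x00" * pad, "big")
        if value == 0 and pad == 0:
            out.append("z")                       # only for a full zero group
            continue
        chars = []
        for _ in range(5):
            value, rem = divmod(value, 85)
            chars.append(chr(rem + 33))
        out.append("".join(reversed(chars))[:5 - pad])
    return "".join(out) + "~>"


def a85_decode(text: str) -> bytes:
    """Decode Adobe-style Ascii85, ignoring whitespace anywhere."""
    body = "".join(text.split())
    if body.endswith("~>"):
        body = body[:-2]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] == "z":                        # z between groups
            out += b"\x00" * 4
            i += 1
            continue
        chunk = body[i:i + 5]
        i += 5
        pad = 5 - len(chunk)
        if pad == 4:
            raise ValueError("a final group of one character is not valid")
        value = 0
        for ch in chunk + "u" * pad:              # pad high with 'u'
            digit = ord(ch) - 33
            if not 0 <= digit <= 84:
                raise ValueError(f"invalid character: {ch!r}")
            value = value * 85 + digit
        if value > 0xFFFFFFFF:
            raise ValueError("group value exceeds 2**32 - 1")
        out += value.to_bytes(4, "big")[:4 - pad]
    return bytes(out)


print(a85_encode(b"sure."))           # F*2M7/c~>
print(a85_decode("F*2M7\n/ c~>"))     # b'sure.'
```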

Example for Ascii85

A quote from Thomas Hobbes's Leviathan:

Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.

If this is initially encoded using US-ASCII, it can be reencoded in Ascii85 as follows:

9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,O< DJ+*.@<*K0@<6L(Df-\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKYi( DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIal( DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G>u D.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c 
First 4-byte group ("Man " with its trailing space):

Text content   M a n (space)
ASCII          77 97 110 32
Bit pattern    01001101 01100001 01101110 00100000
32-bit value   1,298,230,816 = 24×85⁴ + 73×85³ + 80×85² + 78×85 + 61
Base 85 (+33)  24 (57) 73 (106) 80 (113) 78 (111) 61 (94)
ASCII          9 j q o ^

...

Last complete 4-byte group ("sure"):

Text content   s u r e
ASCII          115 117 114 101
Bit pattern    01110011 01110101 01110010 01100101
32-bit value   1,937,076,837 = 37×85⁴ + 9×85³ + 17×85² + 44×85 + 22
Base 85 (+33)  37 (70) 9 (42) 17 (50) 44 (77) 22 (55)
ASCII          F * 2 M 7


Since the last 4-tuple is incomplete, it must be padded with three zero bytes:

Text content   . \0 \0 \0
ASCII          46 0 0 0
Bit pattern    00101110 00000000 00000000 00000000
32-bit value   771,751,936 = 14×85⁴ + 66×85³ + 56×85² + 74×85 + 46
Base 85 (+33)  14 (47) 66 (99) 56 (89) 74 (107) 46 (79)
ASCII          / c Y k O

Since three bytes of padding had to be added, the three final characters 'YkO' are omitted from the output.

Decoding is done inversely, except that the last 5-tuple is padded with 'u' characters:

ASCII          / c u u u
Base 85 (+33)  14 (47) 66 (99) 84 (117) 84 (117) 84 (117)
32-bit value   771,955,124 = 14×85⁴ + 66×85³ + 84×85² + 84×85 + 84
Bit pattern    00101110 00000011 00011001 10110100
ASCII          46 3 25 180
Text content   . [ETX] [EM] ´ (Extended ASCII)

Since the input had to be padded with three 'u' bytes, the last three bytes of the output are ignored and we end up with the original period.

The input sentence does not contain 4 consecutive zero bytes, so the example does not show the use of the 'z' abbreviation.
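The individual groups from this example can be cross-checked with Python's standard library, whose base64 module provides a85encode and a85decode. Note that with adobe=True Python also emits a leading "<~" in addition to the trailing "~>".

```python
import base64

first = base64.a85encode(b"Man ")            # first group of the example
last_full = base64.a85encode(b"sure")        # last complete group
tail = base64.a85encode(b".")                # short final group, truncated
zeros = base64.a85encode(b"\x00" * 4)        # the 'z' abbreviation
framed = base64.a85encode(b".", adobe=True)  # with Adobe-style framing

print(first)      # b'9jqo^'
print(last_full)  # b'F*2M7'
print(tail)       # b'/c'
print(zeros)      # b'z'
print(framed)     # b'<~/c~>'
print(base64.a85decode(b"F*2M7/c"))   # b'sure.'
```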

Compatibility

The Ascii85 encoding is compatible with 7-bit and 8-bit MIME, while having less overhead than Base64.

One potential compatibility issue of Ascii85 is that some of the characters it uses are significant in markup languages such as XML or SGML. To include Ascii85 data in these documents, it may be necessary to escape the quotes, angle brackets, and ampersands.
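As one possible approach, the escaping can be delegated to a general-purpose routine. The sketch below uses html.escape from Python's standard library on a fragment of the encoded example text; whether this is the right escaping for a given XML or SGML application is an assumption to verify against that format's rules.

```python
from html import escape, unescape

fragment = "F(f,q/0JhKF<GL>Cj@.4Gp$d7"   # part of the encoded example text
escaped = escape(fragment)                # replaces &, <, >, " and '

print(escaped)                        # F(f,q/0JhKF&lt;GL&gt;Cj@.4Gp$d7
print(unescape(escaped) == fragment)  # True
```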

RFC 1924 version

Published on April 1, 1996, informational RFC 1924: "A Compact Representation of IPv6 Addresses" by Robert Elz suggests a base-85 encoding of IPv6 addresses. This differs from the scheme used above in that he proposes a different set of 85 ASCII characters, and proposes to do all arithmetic on the 128-bit number, converting it to a single 20-digit base-85 number (internal whitespace not allowed), rather than breaking it into four 32-bit groups.

The proposed character set is, in order, 0–9, A–Z, a–z, and then the 23 characters !#$%&()*+-;<=>?@^_`{|}~. The highest possible representable address, 2¹²⁸ − 1 = 74×85¹⁹ + 53×85¹⁸ + 5×85¹⁷ + ..., would be encoded as =r54lj&NUUO~Hi%c2ym0.

This character set excludes the characters " ' , . / : [ \ ] and is therefore suitable for use in JSON strings (where " and \ would require escaping). However, for SGML-based protocols, notably including XML, string escapes may still be required (to accommodate <, > and &).
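The 128-bit conversion can be sketched with Python's arbitrary-precision integers and the standard ipaddress module. The function names ipv6_to_base85 and base85_to_ipv6 are hypothetical; the alphabet is the one listed above.

```python
import ipaddress

RFC1924_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)


def ipv6_to_base85(address: str) -> str:
    """Encode an IPv6 address as one 20-digit base-85 number."""
    value = int(ipaddress.IPv6Address(address))   # the full 128-bit integer
    digits = []
    for _ in range(20):
        value, rem = divmod(value, 85)
        digits.append(RFC1924_ALPHABET[rem])
    return "".join(reversed(digits))              # most significant first


def base85_to_ipv6(text: str) -> str:
    """Decode a 20-digit base-85 string back to compressed IPv6 notation."""
    if len(text) != 20:
        raise ValueError("expected exactly 20 characters")
    value = 0
    for ch in text:
        value = value * 85 + RFC1924_ALPHABET.index(ch)
    # IPv6Address raises an error if the value does not fit in 128 bits.
    return str(ipaddress.IPv6Address(value))


encoded = ipv6_to_base85("ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff")
print(encoded[:3])               # =r5  (digits 74, 53, 5)
print(base85_to_ipv6(encoded))   # ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff
```

Because the whole address is treated as a single number, this differs from applying a 4-byte group encoder four times, even with the same alphabet.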

References

  1. Hamano, Junio C (May 5, 2006). "[PATCH] binary patch". git. Archived from the original on 2020-07-26.
  2. "32/Z85" on ZeroMQ RFC
  3. Orost, Joe (Mar 26, 1991). "Re: COMPRESSING of binary data into mailable ASCII Re: Encoding of binary data into mailable ASCII". Google Groups. Retrieved 11 April 2015.
  4. Chuck Forsberg. "Recent Developments in ZMODEM". omen.com. Archived from the original on 2015-09-24. Retrieved 2013-05-14. "ZMODEM Pack-7 packs 4 bytes into 5 printing characters."