This article has multiple issues. Please help improve it or discuss these issues on the talk page . (Learn how and when to remove these messages)
|
A binary-to-text encoding is encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data (such as email or NNTP) or is not 8-bit clean. PGP documentation ( RFC 4880) uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.
The basic need for a binary-to-text encoding comes from a need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only English language human-readable text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), and may require line breaks at certain maximum intervals, and may not maintain whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.
The ASCII text-encoding standard uses 7 bits to encode characters. With this it is possible to encode 128 (i.e. 27) unique values (0–127) to represent the alphabetic, numeric, and punctuation characters commonly used in English, plus a selection of Control characters which do not represent printable characters. For example, the capital letter A is represented in 7 bits as 100 00012, 0x41 (1018) , the numeral 2 is 011 00102 0x32 (628), the character } is 111 11012 0x7D (1758), and the Control character RETURN is 000 11012 0x0D (158).
In contrast, most computers store data in memory organized in eight-bit bytes. Files that contain machine-executable code and non-textual data typically contain all 256 possible eight-bit byte values. Many computer programs came to rely on this distinction between seven-bit text and eight-bit binary data, and would not function properly if non-ASCII characters appeared in data that was expected to include only ASCII text. For example, if the value of the eighth bit is not preserved, the program might interpret a byte value above 127 as a flag telling it to perform some function.
It is often desirable, however, to be able to send non-textual data through text-based systems, such as when one might attach an image file to an e-mail message. To accomplish this, the data is encoded in some way, such that eight-bit data is encoded into seven-bit ASCII characters (generally using only alphanumeric and punctuation characters—the ASCII printable characters). Upon safe arrival at its destination, it is then decoded back to its eight-bit form. This process is referred to as binary to text encoding. Many programs perform this conversion to allow for data-transport, such as PGP and GNU Privacy Guard.
Binary-to-text encoding methods are also used as a mechanism for encoding plain text. For example:
By using a binary-to-text encoding on messages that are already plain text, then decoding on the other end, one can make such systems appear to be completely transparent. This is sometimes referred to as 'ASCII armoring'. For example, the ViewState component of ASP.NET uses base64 encoding to safely transmit text via HTTP POST, in order to avoid delimiter collision.
The table below compares the most used forms of binary-to-text encodings. The efficiency listed is the ratio between the number of bits in the input and the number of bits in the encoded output.
Encoding | Data type | Efficiency | Programming language implementations | Comments |
---|---|---|---|---|
Ascii85 | Arbitrary | 80% | awk Archived 2014-12-29 at the Wayback Machine , C, C (2), C#, F#, Go, Java Perl, Python, Python (2) | There exist several variants of this encoding, Base85, btoa, etc. |
Base32 | Arbitrary | 62.5% | ANSI C, Delphi, Go, Java, C# F#, Python | |
Base36 | Integer | ~64% | bash, C, C++, C#, Java, Perl, PHP, Python, Visual Basic, Swift, many others | Uses the Arabic numerals 0–9 and the Latin letters A–Z (the ISO basic Latin alphabet). Commonly used by URL redirection systems like TinyURL or SnipURL/Snipr as compact alphanumeric identifiers. |
Base45 | Arbitrary | ~67% (97% [lower-alpha 1] ) | Go, Python | Defined in IETF Specification RFC 9285 for including binary data compactly in a QR code. [1] |
Base56 | Integer | — | PHP, Python, Go | A variant of Base58 encoding which further sheds the '1' and the lowercase 'o' characters in order to minimise the risk of fraud and human-error. [2] |
Base58 | Integer | ~73% | C, C++, Python, C#, Java | Similar to Base64, but modified to avoid both non-alphanumeric characters (+ and /) and letters that might look ambiguous when printed (0 – zero, I – capital i, O – capital o and l – lower-case L). Base58 is used to represent bitcoin addresses.[ citation needed ] Some messaging and social media systems break lines on non-alphanumeric strings. This is avoided by not using URI reserved characters such as +. For SegWit, it was replaced by Bech32, see below. |
Base62 | Arbitrary | ~74% | Rust, Python | Similar to Base64, but contains only alphanumeric characters. |
Base64 | Arbitrary | 75% | awk Archived 2014-12-29 at the Wayback Machine , C, C (2), Delphi, Go, Python, many others | An early and still-popular encoding, first specified as part of RFC 989 in 1987 |
Base85 | Arbitrary | 80% | C, Python, Python (2) | Revised version of Ascii85. |
Base91 [3] | Arbitrary | 81% | C# F# | Constant width variant |
basE91 [4] | Arbitrary | 81% | C, Java, PHP, 8086 Assembly, AWK C#, F#, Rust | Variable width variant |
Base94 [5] | Arbitrary | 82% | Python, C, Rust | |
Base122 [6] | Arbitrary | 87.5% | JavaScript, Python, Java, Base125 Python and Javascript, Go, C | |
BaseXML [7] | Arbitrary | 83.5% | C Python JavaScript | |
Bech32 | Arbitrary | 62.5% + at least 8 chars (label, separator, 6-char ECC) | C, C++, JavaScript, Go, Python, Haskell, Ruby, Rust | Specification. [8] Used in Bitcoin and the Lightning Network. [9] The data portion is encoded like Base32 with the possibility to check and correct up to 6 mistyped characters using the 6-character BCH code at the end, which also checks/corrects the Human Readable Part. The Bech32m variant has a subtle change that makes it more resilient to changes in length. [10] |
BinHex | Arbitrary | 75% | Perl, C, C (2) | MacOS Classic |
Decimal | Integer | ~42% | Most languages | Usually the default representation for input/output from/to humans. |
Hexadecimal (Base16) | Arbitrary | 50% | Most languages | Exists in uppercase and lowercase variants |
Intel HEX | Arbitrary | ≲50% | C library, C++ | Typically used to program EPROM, NOR flash memory chips |
MIME | Arbitrary | See Quoted-printable and Base64 | See Quoted-printable and Base64 | Encoding container for e-mail-like formatting |
Percent-encoding | Text (URIs), Arbitrary (RFC1738) | ~40% [lower-alpha 2] (33–70% [lower-alpha 3] ) | C, Python, probably many others | |
Quoted-printable | Text | ~33–100% [lower-alpha 4] | Probably many | Preserves line breaks; cuts lines at 76 characters |
S-record (Motorola hex) | Arbitrary | 49.6% | C library, C++ | Typically used to program EPROM, NOR flash memory chips. 49.6% assumes 255 binary bytes per record. |
Tektronix hex | Arbitrary | Typically used to program EPROM, NOR flash memory chips. | ||
TxMS | Hexadecimal | ~32% | Node.js (and CLI) | Used to transmit Blockchain transactions via SMS using UTF-16BE. |
Uuencoding | Arbitrary | ~60% (up to 70%) | Perl, C, Delphi, Java, Python, probably many others | An early encoding developed in 1980 for Unix-to-Unix Copy. Largely replaced by MIME and yEnc |
Xxencoding | Arbitrary | ~75% (similar to Uuencoding) | C, Delphi | Proposed (and occasionally used) as replacement for Uuencoding to avoid character set translation problems between ASCII and the EBCDIC systems that could corrupt Uuencoded data |
z85 (ZeroMQ spec:32/Z85) | Binary & ASCII | 80% (similar to Ascii85/Base85) | C (original), C#, Dart, Erlang, Go, Lua, Ruby, Rust and others | Specifies a subset of ASCII similar to Ascii85, omitting a few characters that may cause program bugs (` \ " ' _ , ; ). The format conforms to ZeroMQ spec:32/Z85. |
RFC 1751 (S/KEY) | Arbitrary | 33% | C, [11] Python | "A Convention for Human-readable 128-bit Keys". A series of small English words is easier for humans to read, remember, and type in than decimal or other binary-to-text encoding systems. [12] Each 64-bit number is mapped to six short words, of one to four characters each, from a public 2048-word dictionary. [11] |
The 95 isprint codes 32 to 126 are known as the ASCII printable characters.
Some older and today uncommon formats include BOO, BTOA, and USR encoding.
Most of these encodings generate text containing only a subset of all ASCII printable characters: for example, the base64 encoding generates text that only contains upper case and lower case letters, (A–Z, a–z), numerals (0–9), and the "+", "/", and "=" symbols.
Some of these encoding (quoted-printable and percent encoding) are based on a set of allowed characters and a single escape character. The allowed characters are left unchanged, while all other characters are converted into a string starting with the escape character. This kind of conversion allows the resulting text to be almost readable, in that letters and digits are part of the allowed characters, and are therefore left as they are in the encoded text. These encodings produce the shortest plain ASCII output for input that is mostly printable ASCII.
Some other encodings (base64, uuencoding) are based on mapping all possible sequences of six bits into different printable characters. Since there are more than 26 = 64 printable characters, this is possible. A given sequence of bytes is translated by viewing it as a stream of bits, breaking this stream in chunks of six bits and generating the sequence of corresponding characters. The different encodings differ in the mapping between sequences of bits and characters and in how the resulting text is formatted.
Some encodings (the original version of BinHex and the recommended encoding for CipherSaber) use four bits instead of six, mapping all possible sequences of 4 bits onto the 16 standard hexadecimal digits. Using 4 bits per encoded character leads to a 50% longer output than base64, but simplifies encoding and decoding—expanding each byte in the source independently to two encoded bytes is simpler than base64's expanding 3 source bytes to 4 encoded bytes.
Out of PETSCII's first 192 codes, 164 have visible representations when quoted: 5 (white), 17–20 and 28–31 (colors and cursor controls), 32–90 (ascii equivalent), 91–127 (graphics), 129 (orange), 133–140 (function keys), 144–159 (colors and cursor controls), and 160–192 (graphics). [13] This theoretically permits encodings, such as base128, between PETSCII-speaking machines.
ASCII, an acronym for American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. ASCII has just 128 code points, of which only 95 are printable characters, which severely limit its scope. The set of available punctuation had significant impact on the syntax of computer languages and text markup. ASCII hugely influenced the design of character sets used by modern computers, including Unicode which has over a million code points, but the first 128 of these are the same as ASCII.
In computing and telecommunications, a control character or non-printing character (NPC) is a code point in a character set that does not represent a written character or symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly graphic characters, also known as printing characters, except perhaps for "space" characters. In the ASCII standard there are 33 control characters, such as code 7, BEL, which rings a terminal bell.
Multipurpose Internet Mail Extensions (MIME) is a standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message bodies may consist of multiple parts, and header information may be specified in non-ASCII character sets. Email messages with MIME formatting are typically transmitted with standard protocols, such as the Simple Mail Transfer Protocol (SMTP), the Post Office Protocol (POP), and the Internet Message Access Protocol (IMAP).
In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.
8-bit clean is an attribute of computer systems, communication channels, and other devices and software, that process 8-bit character encodings without treating any byte as an in-band control code.
A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system.
In computer programming, Base64 is a group of binary-to-text encoding schemes that transforms binary data into a sequence of printable characters, limited to a set of 64 unique characters. More specifically, the source binary data is taken 6 bits at a time, then this group of 6 bits is mapped to one of 64 unique characters.
UTF-7 is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.
uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, for encoding binary data for transmission in email systems.
In computing, a human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans, resulting in human-readable data. It is often encoded as ASCII or Unicode text, rather than as binary data.
Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Historically, because of the wide range of systems and protocols that could be used to transfer messages, e-mail was often assumed to be non-8-bit-clean – however, modern SMTP servers are in most cases 8-bit clean and support 8BITMIME
extension. It can also be used with data that contains non-permitted octets or line lengths exceeding SMTP limits. It is defined as a MIME content transfer encoding for use in e-mail.
Base32 is an encoding method based on the base-32 numeral system. It uses an alphabet of 32 digits, each of which represents a different combination of 5 bits (25). Since base32 is not very widely adopted, the question of notation—which characters to use to represent the 32 digits—is not as settled as in the case of more well-known numeral systems (such as hexadecimal), though RFCs and unofficial and de-facto standards exist. One way to represent Base32 numbers in human-readable form is using digits 0–9 followed by the twenty-two upper-case letters A–V. However, many other variations are used in different contexts. Historically, Baudot code could be considered a modified (stateful) base32 code.
yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit encoding method. yEnc's overhead is often as little as 1–2%, compared to 33–40% overhead for 6-bit encoding methods like uuencode and Base64. yEnc was initially developed by Jürgen Helbing, and its first release was early 2001. By 2003 yEnc became the de facto standard encoding system for binary files on Usenet. The name yEncode is a wordplay on "Why encode?", since the idea is to only encode characters if it is absolutely required to adhere to the message format standard.
A binary file is a computer file that is not a text file. The term "binary file" is often used as a term meaning "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document files containing formatted text, such as older Microsoft Word document files, contain the text of the document but also contain formatting information in binary form.
A FourCC is a sequence of four bytes used to uniquely identify data formats. It originated from the OSType or ResType metadata system used in classic Mac OS and was adopted for the Amiga/Electronic Arts Interchange File Format and derivatives. The idea was later reused to identify compressed data types in QuickTime and DirectShow.
Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data, it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data.
Many email clients now offer some support for Unicode. Some clients will automatically choose between a legacy encoding and Unicode depending on the mail's content, either automatically or when the user requests it.
This article compares Unicode encodings in two types of environments: 8-bit clean environments, and environments that forbid the use of byte values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. The Standard Compression Scheme for Unicode and the Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.
The HZ character encoding is an encoding of GB 2312 that was formerly commonly used in email and USENET postings. It was designed in 1989 by Fung Fung Lee of Stanford University, and subsequently codified in 1995 into RFC 1843.
A six-bit character code is a character encoding designed for use on computers with word lengths a multiple of 6. Six bits can only encode 64 distinct characters, so these codes generally include only the upper-case letters, the numerals, some punctuation characters, and sometimes control characters. The 7-track magnetic tape format was developed to store data in such codes, along with an additional parity bit.
Even in Byte mode, a typical QR code reader tries to interpret a byte sequence as text encoded in UTF-8 or ISO/IEC 8859-1. ... Such data has to be converted into an appropriate text before that text could be encoded as a QR code. ... Base45 ... offers a more compact QR code encoding.