Quoted-printable

Last updated

Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters (alphanumeric and the equals sign =) to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Historically, because of the wide range of systems and protocols that could be used to transfer messages, e-mail was often assumed to be non-8-bit-clean – however, modern SMTP servers are in most cases 8-bit clean and support 8BITMIME extension. It can also be used with data that contains non-permitted octets or line lengths exceeding SMTP limits. It is defined as a MIME content transfer encoding for use in e-mail.

Contents

QP works by using the equals sign = as an escape character. It also limits line length to 76, as some software has limits on line length.

Introduction

MIME defines mechanisms for sending other kinds of information in e-mail, including text in languages other than English, using character encodings other than ASCII. However, these encodings often use byte values outside the ASCII range so they need to be encoded further before they are suitable for use in a non-8-bit-clean environment. Quoted-Printable encoding is one method used for mapping arbitrary bytes into sequences of ASCII characters. So, Quoted-Printable is not a character encoding scheme itself, but a data coding layer to be used under some byte-oriented character encoding. QP encoding is reversible, meaning the original bytes and hence the non-ASCII characters they represent can be identically recovered.

Quoted-Printable and Base64 are the two MIME content transfer encodings, if the trivial "7bit" and "8bit" encoding are not counted. If the text to be encoded does not contain many non-ASCII characters, then Quoted-Printable results in a fairly readable [1] and compact encoded result. On the other hand, if the input has many 8-bit characters, then Quoted-Printable becomes both unreadable and extremely inefficient. Base64 is not human-readable, but has a uniform overhead for all data and is the more sensible choice for binary formats or text in a script other than the Latin script.

Quoted-printable encoding

Any 8-bit byte value may be encoded with 3 characters: an = followed by two hexadecimal digits (0–9 or A–F) representing the byte's numeric value. For example, an ASCII form feed character (decimal value 12) can be represented by =0C, and an ASCII equal sign (decimal value 61) must be represented by =3D. All characters except printable ASCII characters or end of line characters (but also =) must be encoded in this fashion.

All printable ASCII characters (decimal values between 33 and 126) may be represented by themselves, except = (decimal 61, hexadecimal 3D, therefore =3D).

ASCII tab and space characters, decimal values 9 and 32, may be represented by themselves, except if these characters would appear at the end of the encoded line. In that case, they would need to be escaped as =09 (tab) or =20 (space), or be followed by a = (soft line break) as the last character of the encoded line. This last solution is valid because it prevents the tab or space from being the last character of the encoded line.

If the data being encoded contains meaningful line breaks, they must be encoded as an ASCII CR LF sequence, not as their original byte values, neither directly nor via = signs. Conversely, if byte values 13 and 10 have meanings other than end of line (in media types, [2] for example), then they must be encoded as =0D and =0A respectively.

Lines of Quoted-Printable encoded data must not be longer than 76 characters. To satisfy this requirement without altering the encoded text, soft line breaks may be added as desired. A soft line break consists of an = at the end of an encoded line, and does not appear as a line break in the decoded text. These soft line breaks also allow encoding text without line breaks (or containing very long lines) for an environment where line size is limited, such as the 1000 characters per line limit of some SMTP software, as allowed by RFC 2821.

A slightly modified version of Quoted-Printable is used in message headers; see MIME#Encoded-Word.

Example

The following example is a French text (encoded in UTF-8), with a high frequency of letters with diacritical marks (such as the é).

J'interdis aux marchands de vanter trop leurs marchandises. Car ils se font=  vite p=C3=A9dagogues et t'enseignent comme but ce qui n'est par essence qu= 'un moyen, et te trompant ainsi sur la route =C3=A0 suivre les voil=C3=A0 b= ient=C3=B4t qui te d=C3=A9gradent, car si leur musique est vulgaire ils te = fabriquent pour te la vendre une =C3=A2me vulgaire.     =E2=80=94=E2=80=89Antoine de Saint-Exup=C3=A9ry, Citadelle (1948)

This encodes the following quotation:

J'interdis aux marchands de vanter trop leurs marchandises. Car ils se font vite pédagogues et t'enseignent comme but ce qui n'est par essence qu'un moyen, et te trompant ainsi sur la route à suivre les voilà bientôt qui te dégradent, car si leur musique est vulgaire ils te fabriquent pour te la vendre une âme vulgaire.

Antoine de Saint-Exupéry, Citadelle (1948)

See also

Notes

  1. This implies that an ASCII compatible encoding is used. A QP-encoded text in e.g. EBCDIC would not be readable of course.
  2. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. November 1996. RFC 2045 # 6.7 Quoted-Printable Content-Transfer-Encoding, part "(4) (Line Breaks)". Retrieved March 18, 2013.

Related Research Articles

<span class="mw-page-title-main">ASCII</span> American character encoding standard

ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope. Many computer systems instead use Unicode, which has millions of code points, but the first 128 of these are the same as the ASCII set.

In computing and telecommunication, a control character or non-printing character (NPC) is a code point in a character set that does not represent a written character or symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly graphic characters, also known as printing characters, except perhaps for "space" characters. In the ASCII standard there are 33 control characters, such as code 7, BEL, which rings a terminal bell.

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message bodies may consist of multiple parts, and header information may be specified in non-ASCII character sets. Email messages with MIME formatting are typically transmitted with standard protocols, such as the Simple Mail Transfer Protocol (SMTP), the Post Office Protocol (POP), and the Internet Message Access Protocol (IMAP).

<span class="mw-page-title-main">Plain text</span> Term for computer data consisting only of unformatted characters of readable material

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

In computing and telecommunication, an escape character is a character that invokes an alternative interpretation on the following characters in a character sequence. An escape character is a particular case of metacharacters. Generally, the judgement of whether something is an escape character or not depends on the context.

In computer science, an escape sequence is a combination of characters that has a meaning other than the literal characters contained therein; it is marked by one or more preceding characters.

8-bit clean is an attribute of computer systems, communication channels, and other devices and software, that process 8-bit character encodings without treating any byte as an in-band control code.

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system. In operating systems such as CP/M and MS-DOS, where the operating system does not keep track of the file size in bytes, the end of a text file is denoted by placing one or more special characters, known as an end-of-file (EOF) marker, as padding after the last line in a text file. On modern operating systems such as Microsoft Windows and Unix-like systems, text files do not contain any special EOF character, because file systems on those operating systems keep track of the file size in bytes. Most text files need to have end-of-line delimiters, which are done in a few different ways depending on operating system. Some operating systems with record-orientated file systems may not use new line delimiters and will primarily store text files with lines separated as fixed or variable length records.

In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data in sequences of 24 bits that can be represented by four 6-bit Base64 digits.

UTF-7 is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, for encoding binary data for transmission in email systems.

<span class="mw-page-title-main">Human-readable medium and data</span> Presentation of data for humans to read

In computing, a human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans, resulting in human-readable data. It is often encoded as ASCII or Unicode text, rather than as binary data.

<span class="mw-page-title-main">Comma-separated values</span> File format used to store data

Comma-separated values (CSV) is a text file format that uses commas to separate values. A CSV file stores tabular data in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.

yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit encoding method. yEnc's overhead is often as little as 1–2%, compared to 33–40% overhead for 6-bit encoding methods like uuencode and Base64. yEnc was initially developed by Jürgen Helbing, and its first release was early 2001. By 2003 yEnc became the de facto standard encoding system for binary files on Usenet. The name yEncode is a wordplay on "Why encode?", since the idea is to only encode characters if it is absolutely required to adhere to the message format standard.

<span class="mw-page-title-main">Binary file</span> Non-human-readable computer file encoded in binary form

A binary file is a computer file that is not a text file. The term "binary file" is often used as a term meaning "non-text file". Many binary file formats contain parts that can be interpreted as text; for example, some computer document files containing formatted text, such as older Microsoft Word document files, contain the text of the document but also contain formatting information in binary form.

Ascii85, also called Base85, is a form of binary-to-text encoding developed by Paul E. Rutter for the btoa utility. By using five ASCII characters to represent four bytes of binary data, it is more efficient than uuencode or Base64, which use four characters to represent three bytes of data.

Many email clients now offer some support for Unicode. Some clients will automatically choose between a legacy encoding and Unicode depending on the mail's content, either automatically or when the user requests it.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

A binary-to-text encoding is encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.

International email arises from the combined provision of internationalized domain names (IDN) and email address internationalization (EAI). The result is email that contains international characters, encoded as UTF-8, in the email header and in supporting mail transfer protocols. The most significant aspect of this is the allowance of email addresses in most of the world's writing systems, at both interface and transport levels.