8-bit clean

Last updated

8-bit clean is an attribute of computer systems, communication channels, and other devices and software, that process 8-bit character encodings without treating any byte as an in-band control code.

Contents

History

Until the early 1990s, many programs and data transmission channels were character-oriented and treated some characters, e.g., ETX, as control characters. Others assumed a stream of seven-bit characters, with values between 0 and 127; for example, the ASCII standard used only seven bits per character, avoiding an 8-bit representation in order to save on data transmission costs. On computers and data links using 8-bit bytes this left the top bit of each byte free for use as a parity, flag bit, or meta data control bit. 7-bit systems and data links are unable to directly handle more complex character codes which are commonplace in non-English-speaking countries with larger alphabets.

Binary files of octets cannot be transmitted through 7-bit data channels directly. To work around this, binary-to-text encodings have been devised which use only 7-bit ASCII characters. Some of these encodings are uuencoding, Ascii85, SREC, BinHex, kermit and MIME's Base64. EBCDIC-based systems cannot handle all characters used in UUencoded data. However, the base64 encoding does not have this problem.

SMTP and NNTP 8-bit cleanness

Historically, various media were used to transfer messages, some of them only supporting 7-bit data, so an 8-bit message had high chances to be garbled during transmission in the 20th century. But some implementations really did not care about formal discouraging of 8-bit data and allowed high bit set bytes to pass through. Such implementations are said to be 8-bit clean. In general, a communications protocol is said to be 8-bit clean if it correctly passes through the high bit of each byte in the communication process.

Many early communications protocol standards, such as RFC   780 , 788 , 821 , 2821 , 5321 (for SMTP), RFC   977 (for NNTP) and RFC   1056, were designed to work over such "7-bit" communication links. They specifically require the use of ASCII character set "transmitted as an 8-bit byte with the high-order bit cleared to zero" and some of these [1] explicitly restrict all data to 7-bit characters.

For the first few decades of email networks (1971 to the early 1990s), most email messages were plain text in the 7-bit US-ASCII character set. [2]

The RFC   788 definition of SMTP, like its predecessor RFC   780, limits Internet Mail to lines (1000 characters or less) of 7-bit US-ASCII characters. [3] [4] [5] [6]

Later the format of email messages was re-defined in order to support messages that are not entirely US-ASCII text (text messages in character sets other than US-ASCII, and non-text messages, such as audio and images). [6] The header field Content-Transfer-Encoding=binary [lower-alpha 1] requires an 8-bit clean transport.

RFC   3977 [7] specifies "NNTP operates over any reliable bi-directional 8-bit-wide data stream channel." and changes the character set for commands to UTF-8. However, RFC   5536 [8] still limits the character set to ASCII, including RFC   2047 [9] and RFC   2231 [10] MIME encoding of non-ASCII data.

The Internet community generally adds features by extension, allowing communication in both directions between upgraded machines and not-yet-upgraded machines, rather than declaring formerly standards-compliant legacy software to be "broken" and insisting that all software worldwide be upgraded to the latest standard. In the mid-1990s, people[ who? ] objected to "just send 8 bits (to RFC   821 SMTP servers)", perhaps because of a perception that "just send 8 bits" is an implicit declaration that ISO 8859-1 become the new "standard encoding", forcing everyone in the world to use the same character set.[ original research? ] Instead, the recommended way to take advantage of 8-bit-clean links between machines is to use the ESMTP ( RFC   1869) 8BITMIME extension [11] [12] for message bodies and the SMTP SMTPUTF8 [13] extension for message headers. Despite this, some mail transfer agents, notably Exim and qmail, relay mail to servers that do not advertise 8BITMIME without performing the conversion to 7-bit MIME (typically quoted-printable, "Q-P conversion") required by RFC   6152. This "just-send-8" attitude does not in fact cause problems in practice, since virtually all modern email servers are 8-bit clean. [14]

See also

Notes

  1. The header field Content-Transfer-Encoding=8BIT does not designate 8-bit clean, since CRLF has special significance.

Related Research Articles

<span class="mw-page-title-main">Email</span> Mail sent using electronic means

Electronic mail is a method of transmitting and receiving messages using electronic devices. It was conceived in the late–20th century as the digital version of, or counterpart to, mail. Email is a ubiquitous and very widely used communication medium; in current use, an email address is often treated as a basic and necessary part of many processes in business, commerce, government, education, entertainment, and other spheres of daily life in most countries.

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message bodies may consist of multiple parts, and header information may be specified in non-ASCII character sets. Email messages with MIME formatting are typically transmitted with standard protocols, such as the Simple Mail Transfer Protocol (SMTP), the Post Office Protocol (POP), and the Internet Message Access Protocol (IMAP).

The Simple Mail Transfer Protocol (SMTP) is an Internet standard communication protocol for electronic mail transmission. Mail servers and other message transfer agents use SMTP to send and receive mail messages. User-level email clients typically use SMTP only for sending messages to a mail server for relaying, and typically submit outgoing email to the mail server on port 587 or 465 per RFC 8314. For retrieving messages, IMAP is standard, but proprietary servers also often implement proprietary protocols, e.g., Exchange ActiveSync.

<span class="mw-page-title-main">Email client</span> Computer program used to access and manage a users email

An email client, email reader or, more formally, message user agent (MUA) or mail user agent is a computer program used to access and manage a user's email.

In computer programming, Base64 is a group of binary-to-text encoding schemes that transforms binary data into a sequence of printable characters, limited to a set of 64 unique characters. More specifically, the source binary data is taken 6 bits at a time, then this group of 6 bits is mapped to one of 64 unique characters.

An email address identifies an email box to which messages are delivered. While early messaging systems used a variety of formats for addressing, today, email addresses follow a set of specific rules originally standardized by the Internet Engineering Task Force (IETF) in the 1980s, and updated by RFC 5322 and 6854. The term email address in this article refers to just the addr-spec in Section 3.4 of RFC 5322. The RFC defines address more broadly as either a mailbox or group. A mailbox value can be either a name-addr, which contains a display-name and addr-spec, or the more common addr-spec alone.

UTF-7 is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

uuencoding is a form of binary-to-text encoding that originated in the Unix programs uuencode and uudecode written by Mary Ann Horton at the University of California, Berkeley in 1980, for encoding binary data for transmission in email systems.

Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Historically, because of the wide range of systems and protocols that could be used to transfer messages, e-mail was often assumed to be non-8-bit-clean – however, modern SMTP servers are in most cases 8-bit clean and support 8BITMIME extension. It can also be used with data that contains non-permitted octets or line lengths exceeding SMTP limits. It is defined as a MIME content transfer encoding for use in e-mail.

yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit encoding method. yEnc's overhead is often as little as 1–2%, compared to 33–40% overhead for 6-bit encoding methods like uuencode and Base64. yEnc was initially developed by Jürgen Helbing, and its first release was early 2001. By 2003 yEnc became the de facto standard encoding system for binary files on Usenet. The name yEncode is a wordplay on "Why encode?", since the idea is to only encode characters if it is absolutely required to adhere to the message format standard.

Email authentication, or validation, is a collection of techniques aimed at providing verifiable information about the origin of email messages by validating the domain ownership of any message transfer agents (MTA) who participated in transferring and possibly modifying a message.

Many email clients now offer some support for Unicode. Some clients will automatically choose between a legacy encoding and Unicode depending on the mail's content, either automatically or when the user requests it.

Sieve is a programming language that can be used for email filtering. It owes its creation to the CMU Cyrus Project, creators of Cyrus IMAP server.

A binary-to-text encoding is encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data or is not 8-bit clean. PGP documentation uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.

Half-width kana are katakana characters displayed compressed at half their normal width, instead of the usual square (1:1) aspect ratio. For example, the usual (full-width) form of the katakana ka is カ while the half-width form is カ. Half-width hiragana is included in Unicode, and it is usable on Web or in e-books via CSS's font-feature-settings: "hwid" 1 with Adobe-Japan1-6 based OpenType fonts. Half-width kanji is usable on modern computers, and is used in some receipt printers, electric bulletin board and old computers.

Keith Moore is the author and co-author of several IETF RFCs related to the MIME and SMTP protocols for electronic mail, among others:

DomainKeys Identified Mail (DKIM) is an email authentication method designed to detect forged sender addresses in email, a technique often used in phishing and email spam.

Jakarta Mail is a Jakarta EE API used to send and receive email via SMTP, POP3 and IMAP. Jakarta Mail is built into the Jakarta EE platform, but also provides an optional package for use in Java SE.

Ned Freed was an IETF participant and Request for Comments author who contributed to a significant number of Internet Protocol standards, mostly related to email. He is best known as the co-inventor of email MIME attachments, with Nathaniel Borenstein.

International email arises from the combined provision of internationalized domain names (IDN) and email address internationalization (EAI). The result is email that contains international characters, encoded as UTF-8, in the email header and in supporting mail transfer protocols. The most significant aspect of this is the allowance of email addresses in most of the world's writing systems, at both interface and transport levels.

References

  1. RFC   780: Appendix A, RFC   788: 4.5.2., RFC   821: Appendix B, RFC   1056: 4.
  2. John Beck. "Email Explained". 2011.
  3. Jonathan B. Postel (November 1981). "4.5.3. SIZES". SIMPLE MAIL TRANSFER PROTOCOL. doi: 10.17487/RFC0788 . RFC 788. The maximum total length of a text line including the <CRLF> is 1000 characters (but not counting the leading dot duplicated for transparency).
  4. G. Vaudreuil (February 1993). "2. The Problem". Transition of Internet Mail from Just-Send-8 to 8bit-SMTP/MIME. doi: 10.17487/RFC1428 . RFC 1428. SMTP as defined in RFC 821 limits the sending of Internet Mail to US-ASCII characters.
  5. Dan Sugalski. "E-mail with Attachments". "The Perl Journal". Summer 1999. "When mail was standardized way back in 1982 with RFC822, ... The only limits placed on the body were the character set (7-bit ASCII) and the maximum line length (1000 characters)."
  6. 1 2 N. Freed; N. Borenstein (November 1996). "Abstract". Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. doi: 10.17487/RFC2045 . RFC 2045. Multipurpose Internet Mail Extensions, or MIME, redefines the format of messages
  7. C. Feather (October 2006). Network News Transfer Protocol (NNTP). doi: 10.17487/RFC3977 . RFC 3977.
  8. C. Lindsey; D. Kohn (November 2009). K. Murchison (ed.). Netnews Article Format. doi: 10.17487/RFC5536 . RFC 5536.
  9. K. Moore (November 1996). MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text. doi: 10.17487/RFC2047 . RFC 2047.
  10. N. Freed; K. Moore (November 1997). MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations. doi: 10.17487/RFC2231 . RFC 2231.
  11. Theodore Ts'o; Keith Moore; Mark Crispin (12 September 1994). "8-bit transmission in NNTP". IETF-SMTP mail list. Archived from the original on 20 March 2012. Retrieved 3 April 2010.
  12. "comp.mail.mime FAQ, part 3 'What's ESMTP, and how does it affect MIME?'". Usenet FAQs. 8 August 1997. Archived from the original on 18 January 2012. Retrieved 3 April 2010.
  13. J. Yao; W. Mao (February 2012). SMTP Extension for Internationalized Email. doi: 10.17487/RFC8531 . RFC 8531.
  14. "The 8BITMIME extension".