Unicode and email

Last updated

Many email clients now offer some support for Unicode. Some clients will automatically choose between a legacy encoding and Unicode depending on the mail's content, either automatically [1] or when the user requests it. [2]

Contents

Technical requirements for sending of messages containing non-ASCII characters by email include

If the sender's or recipient's email address contains non-ASCII characters, sending of a message requires also encoding of these to a format that can be understood by mail servers.

Unicode support in protocols

Unicode support in message header

To use Unicode in certain email header fields, e.g. subject lines, sender and recipient names, the Unicode text has to be encoded using a MIME "Encoded-Word" with a Unicode encoding as the charset. To use Unicode in domain part of email addresses, IDNA encoding must traditionally be used. Alternatively, SMTPUTF8 [3] allows the use of UTF-8 encoding in email addresses (both in a local part and in domain name) as well as in a mail header section. Various standards had been created to retrofit the handling of non-ASCII data to the originally ASCII-only email protocol:

Unicode support in message bodies

As with all encodings apart from US-ASCII, when using Unicode text in email, MIME must be used to specify that a Unicode transformation format is being used for the text.

UTF-7, an obsolete encoding, had an advantage over Unicode encodings, on obsolete non-8bit-clean networks, in that it does not require a transfer encoding to fit within the seven-bit limits of legacy Internet mail servers. On the other hand, UTF-16 must be transfer encoded to fit SMTP data format. Although not strictly required, UTF-8 is usually also transfer encoded to avoid problems across seven-bit mail servers. MIME transfer encoding of UTF-8 makes it either unreadable as a plain text (in the case of base64) or, for some languages and types of text, heavily size inefficient (in the case of quoted-printable).

Some document formats, such as HTML, PostScript and Rich Text Format have their own 7-bit encoding schemes for non-ASCII characters and can thus be sent without using any special email encodings. E.g. HTML email can use HTML entities to use characters from anywhere in Unicode even if the HTML source text for the email is in a legacy encoding (e.g. 7-bit ASCII). For details of this see Unicode and HTML.

See also

Related Research Articles

<span class="mw-page-title-main">Email</span> Mail sent using electronic means

Electronic mail is a method of transmitting and receiving messages using electronic devices. It was conceived in the late–20th century as the digital version of, or counterpart to, mail. Email is a ubiquitous and very widely used communication medium; in current use, an email address is often treated as a basic and necessary part of many processes in business, commerce, government, education, entertainment, and other spheres of daily life in most countries.

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message bodies may consist of multiple parts, and header information may be specified in non-ASCII character sets. Email messages with MIME formatting are typically transmitted with standard protocols, such as the Simple Mail Transfer Protocol (SMTP), the Post Office Protocol (POP), and the Internet Message Access Protocol (IMAP).

The Simple Mail Transfer Protocol (SMTP) is an Internet standard communication protocol for electronic mail transmission. Mail servers and other message transfer agents use SMTP to send and receive mail messages. User-level email clients typically use SMTP only for sending messages to a mail server for relaying, and typically submit outgoing email to the mail server on port 587 or 465 per RFC 8314. For retrieving messages, IMAP is standard, but proprietary servers also often implement proprietary protocols, e.g., Exchange ActiveSync.

<span class="mw-page-title-main">Unicode</span> Character encoding standard

Unicode, formally The Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines as of the current version (15.0) 149,186 characters covering 161 modern and historic scripts, as well as symbols, 3664 emoji, and non-visual control and formatting codes.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from UnicodeTransformation Format – 8-bit.

8-bit clean is an attribute of computer systems, communication channels, and other devices and software, that handle 8-bit character encodings correctly. Such encoding include the ISO 8859 series and the UTF-8 encoding of Unicode.

The byte order mark (BOM) is a particular usage of the special Unicode character, U+FEFFBYTE ORDER MARK, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

In computer programming, Base64 is a group of binary-to-text encoding schemes that represent binary data in sequences of 24 bits that can be represented by four 6-bit Base64 digits.

An email address identifies an email box to which messages are delivered. While early messaging systems used a variety of formats for addressing, today, email addresses follow a set of specific rules originally standardized by the Internet Engineering Task Force (IETF) in the 1980s, and updated by RFC 5322 and 6854. The term email address in this article refers to just the addr-spec in Section 3.4 of RFC 5322. The RFC defines address more broadly as either a mailbox or group. A mailbox value can be either a name-addr, which contains a display-name and addr-spec, or the more common addr-spec alone.

UTF-7 is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München is encoded as Mnchen-3ya.

<span class="mw-page-title-main">Internationalized domain name</span> Type of Internet domain name

An internationalized domain name (IDN) is an Internet domain name that contains at least one label displayed in software applications, in whole or in part, in non-latin script or alphabet or in the Latin alphabet-based characters with diacritics or ligatures. These writing systems are encoded by computers in multibyte Unicode. Internationalized domain names are stored in the Domain Name System (DNS) as ASCII strings using Punycode transcription.

Quoted-Printable, or QP encoding, is a binary-to-text encoding system using printable ASCII characters to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean. Historically, because of the wide range of systems and protocols that could be used to transfer messages, e-mail was often assumed to be non-8-bit-clean – however, modern SMTP servers are in most cases 8-bit clean and support 8BITMIME extension. It can also be used with data that contains non-permitted octets or line lengths exceeding SMTP limits. It is defined as a MIME content transfer encoding for use in e-mail.

An email attachment is a computer file sent along with an email message. One or more files can be attached to any email message, and be sent along with it to the recipient. This is typically used as a simple method to share documents and images.

A bounce message or just "bounce" is an automated message from an email system, informing the sender of a previous message that the message has not been delivered. The original message is said to have "bounced".

The following tables compare general and technical features of notable email client programs.

Email authentication, or validation, is a collection of techniques aimed at providing verifiable information about the origin of email messages by validating the domain ownership of any message transfer agents (MTA) who participated in transferring and possibly modifying a message.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

DomainKeys Identified Mail (DKIM) is an email authentication method designed to detect forged sender addresses in email, a technique often used in phishing and email spam.

International email arises from the combined provision of internationalized domain names (IDN) and email address internationalization (EAI). The result is email that contains international characters, encoded as UTF-8, in the email header and in supporting mail transfer protocols. The most significant aspect of this is the allowance of email addresses in most of the world's writing systems, at both interface and transport levels.

References

  1. "wanderlust/apel". GitHub. Retrieved 2018-09-05.
  2. "Setting Outlook to Use UTF-8" . Retrieved 2018-09-05.
  3. 1 2 Jiankang, Yao; Wei, Mao (February 2012). "SMTP Extension for Internationalized Email". tools.ietf.org. Retrieved 2018-09-05.
  4. Moore, Keith (November 1996). "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text". tools.ietf.org. Retrieved 2018-09-05.
  5. Klensin, John C (August 2010). "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework". tools.ietf.org. Retrieved 2018-09-05.
  6. Abel, Yang; Shawn, Steele (February 2012). "Internationalized Email Headers". tools.ietf.org. Retrieved 2018-09-05.