International email arises from the combined provision of internationalized domain names (IDN) [1] and email address internationalization (EAI). [2] The result is email that contains international characters (characters which do not exist in the ASCII character set), encoded as UTF-8, in the email header and in supporting mail transfer protocols. The most significant aspect of this is the allowance of email addresses (also known as email identities) in most of the world's writing systems, at both interface and transport levels.
Traditional email addresses are limited to characters from the English alphabet and a few other special characters. [3]
The following are valid traditional email addresses:
stellyamburrr985@gmail.com (English, ASCII)
Abc.123@example.com (English, ASCII)
user+mailbox/department=shipping@example.com (English, ASCII)
!#$%&'*+-/=?^_`.{|}~@example.com (English, ASCII)
"Abc@def"@example.com (English, ASCII)
"Fred\ Bloggs"@example.com (English, ASCII)
"Joe.\\Blow"@example.com (English, ASCII)
A Russian user might wish to use иван.сергеев@пример.рф as their identifier but be forced to use a transcription such as ivan.sergeev@example.ru, or even some completely unrelated identifier. The same is true of Chinese, Japanese, and other users whose languages do not use Latin script, and also of users from non-English-speaking European countries whose desired addresses might contain diacritics (e.g. André or Płużyna). As a result, email users are forced to identify themselves in non-native scripts, which can cause errors because transliteration is ambiguous (for example, иван.сергеев may become ivan.sergeev, ivan.sergeyev, or something else). Alternatively, developers of email systems must compensate by converting identifiers between their native scripts and ASCII at the user interface layer.
International email, by contrast, uses Unicode characters encoded as UTF-8, so addresses can be written in most of the world's writing systems. [4]
The following are all valid international email addresses:
用户@例子.广告 (Chinese, Unicode)
ಬೆಂಬಲ@ಡೇಟಾಮೇಲ್.ಭಾರತ (Kannada, Unicode)
अजय@डाटा.भारत (Hindi, Unicode)
квіточка@пошта.укр (Ukrainian, Unicode)
χρήστης@παράδειγμα.ελ (Greek, Unicode)
Dörte@Sörensen.example.com (German, Unicode)
коля@пример.рф (Russian, Unicode)
Although the traditional email header format allows non-ASCII characters in the value portion of some header fields using MIME encoded-words (e.g. in display names or in a Subject header field), MIME encoding must not be used for other information in a header, such as an email address, or for header fields like Message-ID or Received. Moreover, MIME encoding requires extra processing to convert the data to and from its encoded-word representation, and it harms the readability of the header section.
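As a rough illustration of this restriction, the sketch below uses Python's standard `email.utils.formataddr` helper, which wraps a non-ASCII display name in a MIME encoded-word but rejects a non-ASCII address. The helper names `encode_mailbox` and `decode_display_name` and the sample mailbox are assumptions made for this example, and the behaviour shown is that of CPython's standard library, not something taken from the cited standards.

```python
from email.header import decode_header, make_header
from email.utils import formataddr, parseaddr


def encode_mailbox(display_name: str, address: str) -> str:
    """Build a traditional (ASCII-only) mailbox string.

    A non-ASCII display name is wrapped in a MIME encoded-word;
    the address itself must already be ASCII.
    """
    return formataddr((display_name, address), charset="utf-8")


def decode_display_name(mailbox: str) -> str:
    """Recover the human-readable display name from a mailbox string."""
    raw_name, _address = parseaddr(mailbox)
    return str(make_header(decode_header(raw_name)))


# Hypothetical example data.
mailbox = encode_mailbox("Dörte Sörensen", "dorte@example.com")
print(mailbox)                       # ASCII only; the name is an encoded-word
print(decode_display_name(mailbox))  # Dörte Sörensen

try:
    encode_mailbox("Dörte Sörensen", "Dörte@example.com")
except UnicodeEncodeError:
    print("a non-ASCII address cannot be expressed as an encoded-word")
```

The round trip through `decode_display_name` is the extra processing step the paragraph refers to: the header is unreadable to a human until it has been decoded.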
The 2012 standards RFC 6532 and RFC 6531 allow Unicode characters in header content using UTF-8 encoding, and their transmission via SMTP, but in practice support is only slowly rolling out. [5]
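The difference between the two header styles can be sketched with Python's standard `email` package, which ships both a legacy policy (`email.policy.SMTP`) and a UTF-8 header policy (`email.policy.SMTPUTF8`). The `build_message` helper and the subject text are assumptions made for this example. Nothing is sent over the network here: real delivery would also need a server that advertises the SMTPUTF8 extension.

```python
from email import policy
from email.message import EmailMessage


def build_message(msg_policy) -> EmailMessage:
    """Create a small message whose Subject contains Cyrillic text."""
    msg = EmailMessage(policy=msg_policy)
    msg["Subject"] = "Привет"
    msg.set_content("Hello")
    return msg


# Legacy serialization: the header must stay ASCII, so the Subject
# is emitted as a MIME encoded-word.
legacy_bytes = build_message(policy.SMTP).as_bytes()

# RFC 6532 style serialization: headers carry raw UTF-8, which also
# makes a non-ASCII address representable.
utf8_msg = build_message(policy.SMTPUTF8)
utf8_msg["From"] = "коля@пример.рф"  # sample address used in this article
utf8_bytes = utf8_msg.as_bytes()

print(legacy_bytes.decode("ascii"))
print(utf8_bytes.decode("utf-8"))
```

In Python's `smtplib`, `SMTP.send_message` is documented to switch to UTF-8 serialization itself when an address is non-ASCII and the server supports the extension, so application code rarely needs to pick the policy by hand.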
Domain internationalization works by downgrading. UTF-8 parts, known as U-Labels, are transformed into ASCII A-Labels via an ad-hoc method called IDNA. For example, sörensen.example.com is encoded as xn--srensen-90a.example.com. In 2003, when the need was addressed, this seemed easier than verifying that all DNS software could handle UTF-8 strings, although in theory DNS can transport binary data. The encoding must be applied before issuing DNS queries.
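The conversion can be tried with Python's built-in `idna` codec, as in the minimal sketch below. The function names `to_a_label` and `to_u_label` are made up for this example. The built-in codec implements the original 2003 IDNA specification (RFC 3490), so software that needs the later IDNA 2008 rules would have to use a separate library.

```python
def to_a_label(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (A-label) form."""
    return domain.encode("idna").decode("ascii")


def to_u_label(domain: str) -> str:
    """Convert an ASCII (A-label) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


ascii_domain = to_a_label("sörensen.example.com")
print(ascii_domain)              # xn--srensen-90a.example.com
print(to_u_label(ascii_domain))  # sörensen.example.com
```

Labels that are already ASCII, such as `example` and `com`, pass through unchanged; only the label containing `ö` gains the `xn--` prefix.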
Since traditional email standards constrain all email header values to ASCII characters only, the presence of UTF-8 characters in email headers may reduce the stability and reliability of transporting such email, because some email servers do not support these characters. Compliance with UTF-8 strings must be checked software package by software package (see Adoption below). The IETF proposed an experimental method by which email could be downgraded into the legacy all-ASCII format that all standard email servers support. [2] [6] This proposal was deemed too cumbersome: the meaning of the left-hand side of an email address is local to the target server, so there is no way to check whether xn--something is a valid user name in some domain. The method was obsoleted in 2012. [7]
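The asymmetry between the two halves of an address can be sketched as follows: the domain can always be rewritten in ASCII form, whereas the local part has to be passed through unchanged and, if it is not ASCII, needs a UTF-8-capable transport. `WireAddress` and `prepare_address` are hypothetical names invented for this illustration, and Python's built-in `idna` codec (IDNA 2003) stands in for whatever IDNA implementation a real mail system would use.

```python
from typing import NamedTuple


class WireAddress(NamedTuple):
    address: str          # address with the domain in A-label form
    needs_smtputf8: bool  # True if the local part is still non-ASCII


def prepare_address(address: str) -> WireAddress:
    """Downgrade only the domain of an address (hypothetical helper).

    The local part is left untouched because only the receiving
    server knows how to interpret it.
    """
    # The domain cannot contain "@", so split at the last one.
    local_part, at, domain = address.rpartition("@")
    if not at or not local_part or not domain:
        raise ValueError(f"not an email address: {address!r}")
    ascii_domain = domain.encode("idna").decode("ascii")
    return WireAddress(f"{local_part}@{ascii_domain}", not local_part.isascii())


print(prepare_address("user@sörensen.example.com"))
print(prepare_address("Dörte@sörensen.example.com"))
```

The first address becomes fully ASCII and could travel over a legacy server, while the second still carries `ö` in its local part and therefore cannot be downgraded by the sender.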
The set of Internet RFC documents RFC 6530, RFC 6531, RFC 6532, and RFC 6533, all of them published in February 2012, define mechanisms and protocol extensions needed to fully support internationalized email addresses. These changes include an SMTP extension and extension of email header syntax to accommodate UTF-8 data. The document set also includes discussion of key assumptions and issues in deploying fully internationalized email.
Unicode also has recommended Email Security Profiles for Identifiers.
The ICANN-sponsored Universal Acceptance Working Group is working to make EAI accepted in more places and publishes annual reports on acceptance.