Nameprep

Last updated

Nameprep is the process of case-folding a string to lowercase and removal of some generally invisible code points before it is suitable to represent a domain name, or other such canonical name. It is used by the Internationalizing Domain Names in Applications (IDNA) standard, using the Unicode standard for NFKC normalization.

Nameprep is defined in RFC 3491, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)" [1] , as a profile of stringprep, which is described in RFC 3454, "Preparation of Internationalized Strings ("stringprep")". [2]

It does not map lookalike characters to a single character nor prohibit the use of lookalike characters. There are good reasons for this, such as the fact that same sets of characters may be lookalikes in some fonts but not in others, and the fact that any decision on which character to map to will obviously provide a bias towards users of one script; but it also has potentially grave implications for security if not considered by the designers and administrators of systems based on nameprep (the best known example[ which? ][ citation needed ] of this being VeriSign's handling of IDNA names in .com and .net).

See also

Related Research Articles

The Domain Name System (DNS) is a hierarchical and distributed naming system for computers, services, and other resources in the Internet or other Internet Protocol (IP) networks. It associates various information with domain names assigned to each of the associated entities. Most prominently, it translates readily memorized domain names to the numerical IP addresses needed for locating and identifying computer services and devices with the underlying network protocols. The Domain Name System has been an essential component of the functionality of the Internet since 1985.

A top-level domain (TLD) is one of the domains at the highest level in the hierarchical Domain Name System of the Internet after the root domain. The top-level domain names are installed in the root zone of the name space. For all domains in lower levels, it is the last part of the domain name, that is, the last non empty label of a fully qualified domain name. For example, in the domain name www.example.com, the top-level domain is .com. Responsibility for management of most top-level domains is delegated to specific organizations by the ICANN, an Internet multi-stakeholder community, which operates the Internet Assigned Numbers Authority (IANA), and is in charge of maintaining the DNS root zone.

<span class="mw-page-title-main">Unicode</span> Character encoding standard

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.

<span class="mw-page-title-main">Domain name</span> Identification string in the Internet

In the Internet, a domain name is a string that identifies a realm of administrative autonomy, authority or control. Domain names are often used to identify services provided through the Internet, such as websites, email services and more. As of December 2023, 359.8 million domain names had been registered. Domain names are used in various networking contexts and for application-specific naming and addressing purposes. In general, a domain name identifies a network domain or an Internet Protocol (IP) resource, such as a personal computer used to access the Internet, or a server computer.

An email address identifies an email box to which messages are delivered. While early messaging systems used a variety of formats for addressing, today, email addresses follow a set of specific rules originally standardized by the Internet Engineering Task Force (IETF) in the 1980s, and updated by RFC 5322 and 6854. The term email address in this article refers to just the addr-spec in Section 3.4 of RFC 5322. The RFC defines address more broadly as either a mailbox or group. A mailbox value can be either a name-addr, which contains a display-name and addr-spec, or the more common addr-spec alone.

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München is encoded as Mnchen-3ya.

<span class="mw-page-title-main">Internationalized domain name</span> Type of Internet domain name

An internationalized domain name (IDN) is an Internet domain name that contains at least one label displayed in software applications, in whole or in part, in non-Latin script or alphabet or in the Latin alphabet-based characters with diacritics or ligatures. These writing systems are encoded by computers in multibyte Unicode. Internationalized domain names are stored in the Domain Name System (DNS) as ASCII strings using Punycode transcription.

In computer networking, a hostname is a label that is assigned to a device connected to a computer network and that is used to identify the device in various forms of electronic communication, such as the World Wide Web. Hostnames may be simple names consisting of a single word or phrase, or they may be structured. Each hostname usually has at least one numeric network address associated with it for routing packets for performance and other reasons.

The Internationalized Resource Identifier (IRI) is an internet protocol standard which builds on the Uniform Resource Identifier (URI) protocol by greatly expanding the set of permitted characters. It was defined by the Internet Engineering Task Force (IETF) in 2005 in RFC 3987. While URIs are limited to a subset of the US-ASCII character set, IRIs may additionally contain most characters from the Universal Character Set, including Chinese, Japanese, Korean, and Cyrillic characters.

<span class="mw-page-title-main">Homoglyph</span> Different glyphs which are visually similar

In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar but may have differing meaning. The designation is also applied to sequences of characters sharing these properties.

The internationalized domain name (IDN) homograph attack is a way a malicious party may deceive computer users about what remote system they are communicating with, by exploiting the fact that many different characters look alike

Many email clients now offer some support for Unicode. Some clients will automatically choose between a legacy encoding and Unicode depending on the mail's content, either automatically or when the user requests it.

This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.

A whitespace character is a character data element that represents white space when text is rendered for display by a computer.

ThaiURL is a technology enabling the use of Thai domain names in applications that have been modified to support this technology. It is one of several such systems that were marketed before the advent of IDNA.

International email arises from the combined provision of internationalized domain names (IDN) and email address internationalization (EAI). The result is email that contains international characters, encoded as UTF-8, in the email header and in supporting mail transfer protocols. The most significant aspect of this is the allowance of email addresses in most of the world's writing systems, at both interface and transport levels.

<span class="mw-page-title-main">Web typography</span> Publishing considerations for the Web

Web typography, like typography generally, is the design of pages – their layout and typeface choices. Unlike traditional print-based typography, pages intended for display on the World Wide Web have additional technical challenges and – given its ability to change the presentation dynamically – additional opportunities. Early web page designs were very simple due to technology limitations; modern designs use Cascading Style Sheets (CSS), JavaScript and other techniques to deliver the typographer's and the client's vision.

The Arabic name امارات, romanized as emarat, is the internationalized country code top-level domain for the United Arab Emirates. The ASCII name of this domain in the Domain Name System of the Internet is xn--mgbaam7a8h, using the Internationalizing Domain Names in Applications (IDNA) procedure in the translation of the Unicode representation of the script version. The domain was installed in the Domain Name System on 5 May 2010.

A uniform resource locator (URL), colloquially known as an address on the Web, is a reference to a resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI), although many people use the two terms interchangeably. URLs occur most commonly to reference web pages (HTTP/HTTPS) but are also used for file transfer (FTP), email (mailto), database access (JDBC), and many other applications.

ICANN created an experimental set of top-level internationalized domain names in October 2007 for the purpose of testing the use of Internationalizing Domain Names in Applications (IDNA) in the root zone and within those domains. ICANN announced that these domains would be removed from the root zone of the Domain Name System effective on 31 October 2013. They no longer resolve but are still listed in the IANA Root Zone Database. Each of these TLDs names encoded a word meaning "test" in the respective language.

References

  1. "RFC 3491" . Retrieved 8 March 2024.
  2. "RFC 3454" . Retrieved 8 March 2024.