An IETF BCP 47 language tag is a standardized code that is used to identify human languages on the Internet. [1] The tag structure has been standardized by the Internet Engineering Task Force (IETF) [1] in Best Current Practice (BCP) 47; [1] the subtags are maintained by the IANA Language Subtag Registry. [2] [3] [4]
To distinguish language variants for countries, regions, or writing systems (scripts), IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49. For example, the tag en stands for English; es-419 for Latin American Spanish; rm-sursilv for Romansh Sursilvan; sr-Cyrl for Serbian written in Cyrillic script; nan-Hant-TW for Min Nan Chinese using traditional Han characters, as spoken in Taiwan; yue-Hant-HK for Cantonese using traditional Han characters, as spoken in Hong Kong; and gsw-u-sd-chzh for Zürich German.
IETF language tags are used by computing standards such as HTTP, [5] HTML, [6] XML [7] and PNG. [8]
IETF language tags were first defined in RFC 1766, edited by Harald Tveit Alvestrand, published in March 1995. The tags used ISO 639 two-letter language codes and ISO 3166 two-letter country codes, and allowed registration of whole tags that included variant or script subtags of three to eight letters.
In January 2001, this was updated by RFC 3066, which added the use of ISO 639-2 three-letter codes, permitted subtags with digits, and adopted the concept of language ranges from HTTP/1.1 to help with matching of language tags.
The next revision of the specification came in September 2006 with the publication of RFC 4646 (the main part of the specification), edited by Addison Phillips and Mark Davis, and RFC 4647 [9] (which deals with matching behaviour). RFC 4646 introduced a more structured format for language tags, added the use of ISO 15924 four-letter script codes and UN M.49 three-digit geographical region codes, and replaced the old registry of tags with a new registry of subtags. The small number of previously defined tags that did not conform to the new structure were grandfathered in order to maintain compatibility with RFC 3066.
The current version of the specification, RFC 5646, [10] was published in September 2009. The main purpose of this revision was to incorporate three-letter codes from ISO 639-3 and 639-5 into the Language Subtag Registry, in order to increase the interoperability between ISO 639 and BCP 47. [11]
Each language tag is composed of one or more "subtags" separated by hyphens (-). Each subtag is composed of basic Latin letters or digits only.
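With the exceptions of private-use language tags beginning with an x- prefix and grandfathered language tags (including those starting with an i- prefix and those previously registered in the old Language Tag Registry), subtags occur in the following order: primary language, extended language (extlang), script, region, variant, extension, and private use. Only the primary language subtag is required.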
Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards.
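The casing convention can be applied mechanically, without consulting the Registry, by looking only at the position and length of each subtag. The following is a minimal sketch of that idea, not a validator; the function name format_case is an illustrative choice, and the rule it follows (two-letter subtags uppercase and four-letter subtags title case, unless they come first or follow a single-character subtag) is the formatting convention described in RFC 5646.

```python
def format_case(tag):
    """Return the tag with the recommended casing; does not check validity."""
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        subtag = subtag.lower()
        if index > 0 and not after_singleton:
            if len(subtag) == 4 and subtag.isalpha():
                subtag = subtag.capitalize()   # script, e.g. "Hant"
            elif len(subtag) == 2 and subtag.isalpha():
                subtag = subtag.upper()        # region, e.g. "TW"
        if len(subtag) == 1:
            # Everything after a singleton (extension or private use) stays lowercase.
            after_singleton = True
        formatted.append(subtag)
    return "-".join(formatted)


print(format_case("ZH-hant-tw"))
print(format_case("gsw-U-SD-CHZH"))
```

Because comparison is case-insensitive, this formatting is purely cosmetic; it only makes tags easier to read and compare by eye.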
Optional script and region subtags should be omitted when they add no distinguishing information to a language tag. For example, es is preferred over es-Latn, as Spanish is fully expected to be written in the Latin script; ja is preferred over ja-JP, as Japanese as used in Japan does not differ markedly from Japanese as used elsewhere.
Not all linguistic regions can be represented with a valid region subtag: the subnational regional dialects of a primary language are registered as variant subtags. For example, the valencia variant subtag for the Valencian variant of Catalan is registered in the Language Subtag Registry with the prefix ca. As this dialect is spoken almost exclusively in Spain, the region subtag ES can normally be omitted.
Furthermore, there are script subtags that do not refer to traditional scripts such as Latin, or even to scripts at all, and these usually begin with a Z. For example, Zsye refers to emoji, Zmth to mathematical notation, Zxxx to unwritten documents and Zyyy to undetermined scripts.
IETF language tags have been used as locale identifiers in many applications. It may be necessary for these applications to establish their own strategy for defining, encoding and matching locales if the strategy described in RFC 4647 is not adequate.
The use, interpretation and matching of IETF language tags is currently defined in RFC 5646 and RFC 4647. The Language Subtag Registry lists all currently valid public subtags. Private-use subtags are not included in the Registry as they are implementation-dependent and subject to private agreements between third parties using them. These private agreements are out of scope of BCP 47.
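As an illustration of the matching schemes in RFC 4647, the sketch below implements simplified versions of two of them: basic filtering (a range matches a tag if it is equal to the tag or is a prefix of it ending at a hyphen) and lookup (the range is progressively truncated until it equals an available tag). The function names and the default parameter are illustrative assumptions, and wildcard handling in lookup is left out.

```python
def basic_filter(language_range, tags):
    """Return every tag matched by a basic language range such as "de" or "*"."""
    wanted = language_range.lower()
    if wanted == "*":
        return list(tags)
    return [t for t in tags
            if t.lower() == wanted or t.lower().startswith(wanted + "-")]


def lookup(language_range, tags, default=None):
    """Return the single best tag by truncating the range from the right."""
    available = {t.lower(): t for t in tags}
    parts = language_range.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in available:
            return available[candidate]
        parts.pop()
        # A single-character subtag left at the end is removed as well.
        if parts and len(parts[-1]) == 1:
            parts.pop()
    return default


print(basic_filter("de", ["de", "de-CH", "den", "en-DE"]))
print(lookup("zh-Hant-CN-x-private1", ["zh", "zh-Hant"]))
```

Filtering returns a set of matches and suits selecting content, while lookup returns exactly one result and suits choosing a single resource such as a user-interface translation.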
The following is a list of some of the more commonly used primary language subtags. The list represents only a small subset (less than 2 percent) of primary language subtags; for full information, the Language Subtag Registry should be consulted directly.
English name | Native name | Subtag |
---|---|---|
Afrikaans | Afrikaans | af |
Amharic | አማርኛ | am |
Arabic | العربية | ar |
Mapudungun | Mapudungun | arn |
Moroccan Arabic | الدارجة المغربية | ary |
Assamese | অসমীয়া | as |
Azerbaijani | Azərbaycan | az |
Bashkir | Башҡорт | ba |
Belarusian | беларуская | be |
Bulgarian | български | bg |
Bengali | বাংলা | bn |
Tibetan | བོད་ཡིག | bo |
Breton | brezhoneg | br |
Bosnian | bosanski/босански | bs |
Catalan | català | ca |
Central Kurdish | کوردیی ناوەندی | ckb |
Corsican | Corsu | co |
Czech | čeština | cs |
Welsh | Cymraeg | cy |
Danish | dansk | da |
German | Deutsch | de |
Lower Sorbian | dolnoserbšćina | dsb |
Divehi | ދިވެހިބަސް | dv |
Greek | Ελληνικά | el |
English | English | en |
Spanish | español | es |
Estonian | eesti | et |
Basque | euskara | eu |
Persian | فارسى | fa |
Finnish | suomi | fi |
Filipino | Filipino | fil |
Faroese | føroyskt | fo |
French | français | fr |
Frisian | Frysk | fy |
Irish | Gaeilge | ga |
Scottish Gaelic | Gàidhlig | gd |
Gilbertese | Taetae ni Kiribati | gil |
Galician | galego | gl |
Swiss German | Schweizerdeutsch | gsw |
Gujarati | ગુજરાતી | gu |
Hausa | Hausa | ha |
Hebrew | עברית | he |
Hindi | हिंदी | hi |
Croatian | hrvatski | hr |
Serbo-Croatian | srpskohrvatski/српскохрватски | sh |
Upper Sorbian | hornjoserbšćina | hsb |
Hungarian | magyar | hu |
Armenian | Հայերեն | hy |
Indonesian | Bahasa Indonesia | id |
Igbo | Igbo | ig |
Yi | ꆈꌠꁱꂷ | ii |
Icelandic | íslenska | is |
Italian | italiano | it |
Inuktitut | Inuktitut /ᐃᓄᒃᑎᑐᑦ (ᑲᓇᑕ) | iu |
Japanese | 日本語 | ja |
Georgian | ქართული | ka |
Kazakh | Қазақша | kk |
Greenlandic | kalaallisut | kl |
Khmer | ខ្មែរ | km |
Kannada | ಕನ್ನಡ | kn |
Korean | 한국어 | ko |
Konkani | कोंकणी | kok |
Kurdish | Kurdî/کوردی | ku |
Kyrgyz | Кыргыз | ky |
Luxembourgish | Lëtzebuergesch | lb |
Lao | ລາວ | lo |
Lithuanian | lietuvių | lt |
Latvian | latviešu | lv |
Maori | Reo Māori | mi |
Macedonian | македонски јазик | mk |
Malayalam | മലയാളം | ml |
Mongolian | Монгол хэл/ᠮᠤᠨᠭᠭᠤᠯ ᠬᠡᠯᠡ | mn |
Mohawk | Kanien'kéha | moh |
Marathi | मराठी | mr |
Malay | Bahasa Malaysia | ms |
Maltese | Malti | mt |
Burmese | မြန်မာဘာသာ | my |
Norwegian (Bokmål) | norsk (bokmål) | nb |
Nepali | नेपाली (नेपाल) | ne |
Dutch | Nederlands | nl |
Norwegian (Nynorsk) | norsk (nynorsk) | nn |
Norwegian | norsk | no |
Occitan | occitan | oc |
Odia | ଓଡ଼ିଆ | or |
Papiamento | Papiamentu | pap |
Punjabi | ਪੰਜਾਬੀ / پنجابی | pa |
Polish | polski | pl |
Dari | درى | prs |
Pashto | پښتو | ps |
Portuguese | português | pt |
K'iche | K'iche | quc |
Quechua | runasimi | qu |
Romansh | Rumantsch | rm |
Romanian | română | ro |
Russian | русский | ru |
Kinyarwanda | Kinyarwanda | rw |
Sanskrit | संस्कृत | sa |
Yakut | саха | sah |
Sami (Northern) | davvisámegiella | se |
Sinhala | සිංහල | si |
Slovak | slovenčina | sk |
Slovenian | slovenski | sl |
Sami (Southern) | åarjelsaemiengiele | sma |
Sami (Lule) | julevusámegiella | smj |
Sami (Inari) | sämikielâ | smn |
Sami (Skolt) | sääʹmǩiõll | sms |
Albanian | shqip | sq |
Serbian | srpski/српски | sr |
Sesotho | Sesotho | st |
Swedish | svenska | sv |
Kiswahili | Kiswahili | sw |
Syriac | ܣܘܪܝܝܐ | syc |
Tamil | தமிழ் | ta |
Telugu | తెలుగు | te |
Tajik | Тоҷикӣ | tg |
Thai | ไทย | th |
Turkmen | türkmençe | tk |
Tswana | Setswana | tn |
Turkish | Türkçe | tr |
Tatar | Татарча | tt |
Tamazight | Tamazight | tzm |
Uyghur | ئۇيغۇرچە | ug |
Ukrainian | українська | uk |
Urdu | اُردو | ur |
Uzbek | Uzbek/Ўзбек | uz |
Vietnamese | Tiếng Việt | vi |
Wolof | Wolof | wo |
Xhosa | isiXhosa | xh |
Yoruba | Yoruba | yo |
Chinese | 中文 | zh |
Zulu | isiZulu | zu |
Although some types of subtags are derived from ISO or UN core standards, they do not follow these standards absolutely, as this could lead to the meaning of language tags changing over time. In particular, a subtag derived from a code assigned by ISO 639, ISO 15924, ISO 3166, or UN M.49 remains a valid (though deprecated) subtag even if the code is withdrawn from the corresponding core standard. If the standard later assigns a new meaning to the withdrawn code, the corresponding subtag will still retain its old meaning.
This stability was introduced in RFC 4646.
RFC 4646 defined the concept of an "extended language subtag" (sometimes referred to as extlang), although no such subtags were registered at that time. [13] [14]
RFC 5645 and RFC 5646 added primary language subtags corresponding to ISO 639-3 codes for all languages that did not already exist in the Registry. In addition, codes for languages encompassed by certain macrolanguages were registered as extended language subtags. Sign languages were also registered as extlangs, with the prefix sgn. These languages may be represented either with the subtag for the encompassed language alone (cmn for Mandarin) or with a language-extlang combination (zh-cmn). The first option is preferred for most purposes. The second option is called "extlang form" and is new in RFC 5646.
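Converting extlang form to the preferred form amounts to dropping the primary language subtag when the second subtag is an extlang registered with that prefix. The sketch below assumes a tiny hand-written excerpt of extlang prefixes; the table EXTLANG_PREFIX and the function name are illustrative, and a real implementation would load the prefixes from the Language Subtag Registry.

```python
# Illustrative excerpt only: extlang subtag -> its registered prefix.
EXTLANG_PREFIX = {
    "cmn": "zh",   # Mandarin Chinese
    "yue": "zh",   # Cantonese
    "nan": "zh",   # Min Nan Chinese
    "hak": "zh",   # Hakka Chinese
    "ase": "sgn",  # American Sign Language
}


def drop_extlang_prefix(tag):
    """Turn extlang form (zh-cmn) into the preferred form (cmn)."""
    parts = tag.split("-")
    if len(parts) >= 2:
        primary, second = parts[0].lower(), parts[1].lower()
        if EXTLANG_PREFIX.get(second) == primary:
            return "-".join(parts[1:])
    return tag


print(drop_extlang_prefix("zh-cmn-Hans-CN"))
```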
Whole tags that were registered prior to RFC 4646 and are now classified as "grandfathered" or "redundant" (depending on whether they fit the new syntax) are deprecated in favor of the corresponding ISO 639-3–based language subtag, if one exists. To list a few examples, nan is preferred over zh-min-nan for Min Nan Chinese; hak is preferred over i-hak and zh-hakka for Hakka Chinese; and ase is preferred over sgn-US for American Sign Language.
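Replacing a deprecated whole tag is a plain table lookup, since these tags are matched as complete strings rather than subtag by subtag. The mapping below contains only the examples mentioned above; the names PREFERRED_TAG and preferred_tag are illustrative, and the full set of preferred values would come from the Language Subtag Registry.

```python
# Only the examples from the text; keys are lowercased for case-insensitive lookup.
PREFERRED_TAG = {
    "zh-min-nan": "nan",
    "i-hak": "hak",
    "zh-hakka": "hak",
    "sgn-us": "ase",
}


def preferred_tag(tag):
    """Return the preferred replacement for a deprecated whole tag, if known."""
    return PREFERRED_TAG.get(tag.lower(), tag)


print(preferred_tag("zh-min-nan"))
```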
Windows Vista and later versions of Microsoft Windows have RFC 4646 support. [15]
ISO 639-5 defines language collections with alpha-3 codes differently from how they were initially encoded in ISO 639-2 (including one code already present in ISO 639-1: Bihari, coded inclusively as bh in ISO 639-1 and bih in ISO 639-2). Specifically, ISO 639-5 defines all language collections as inclusive, whereas some of them were previously defined exclusively. Language collections therefore have a broader scope than before, and in some cases encompass languages that were already encoded separately within ISO 639-2.
For example, the ISO 639-2 code afa was previously associated with the name "Afro-Asiatic (Other)", excluding languages such as Arabic that already had their own code. In ISO 639-5, this collection is named "Afro-Asiatic languages" and includes all such languages. ISO 639-2 changed the exclusive names in 2009 to match the inclusive ISO 639-5 names. [16]
To avoid breaking implementations that may still depend on the older (exclusive) definition of these collections, ISO 639-5 defines a grouping type attribute for all collections that were already encoded in ISO 639-2 (such grouping type is not defined for the new collections added only in ISO 639-5).
BCP 47 defines a "Scope" property to identify subtags for language collections. However, it does not define any given collection as inclusive or exclusive, and does not use the ISO 639-5 grouping type attribute, although the description fields in the Language Subtag Registry for these subtags match the ISO 639-5 (inclusive) names. As a consequence, BCP 47 language tags that include a primary language subtag for a collection may be ambiguous as to whether the collection is intended to be inclusive or exclusive.
ISO 639-5 does not define precisely which languages are members of these collections; only the hierarchical classification of collections is defined, using the inclusive definition of these collections. Because of this, RFC 5646 does not recommend the use of subtags for language collections for most applications, although they are still preferred over subtags whose meaning is even less specific, such as "Multiple languages" and "Undetermined".
In contrast, the classification of individual languages within their macrolanguage is standardized, in both ISO 639-3 and the Language Subtag Registry.
Script subtags were first added to the Language Subtag Registry when RFC 4646 was published, from the list of codes defined in ISO 15924. They are encoded in the language tag after primary and extended language subtags, but before other types of subtag, including region and variant subtags.
Some primary language subtags are defined with a property named "Suppress-Script" which indicates the cases where a single script can usually be assumed by default for the language, even if it can be written with another script. When this is the case, it is preferable to omit the script subtag, to improve the likelihood of successful matching. A different script subtag can still be appended to make the distinction when necessary. For example, yi is preferred over yi-Hebr in most contexts, because the Hebrew script subtag is assumed for the Yiddish language.
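The Suppress-Script rule can be sketched as a small normalization step: if the script subtag equals the default script recorded for the primary language, it is removed. The table SUPPRESS_SCRIPT below is a hand-written excerpt and the function name is illustrative; the sketch also assumes that the script, if present, is the second subtag (that is, the tag has no extended language subtag).

```python
# Illustrative excerpt: primary language -> script assumed by default.
SUPPRESS_SCRIPT = {"yi": "Hebr", "es": "Latn", "en": "Latn"}


def strip_default_script(tag):
    """Drop the script subtag when it matches the language's default script."""
    parts = tag.split("-")
    if len(parts) >= 2 and len(parts[1]) == 4 and parts[1].isalpha():
        default = SUPPRESS_SCRIPT.get(parts[0].lower(), "")
        if default.lower() == parts[1].lower():
            del parts[1]
    return "-".join(parts)


print(strip_default_script("yi-Hebr"))
print(strip_default_script("yi-Latn"))
```

A non-default script such as Latn for Yiddish is left in place, because in that case the script subtag carries real information.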
As another example, zh-Hans-SG may be considered equivalent to zh-Hans, because the region code is probably not significant; the written form of Chinese used in Singapore uses the same simplified Chinese characters as in other countries where Chinese is written. However, the script subtag is maintained because it is significant.
ISO 15924 includes some codes for script variants (for example, Hans and Hant for simplified and traditional forms of Chinese characters) that are unified within Unicode and ISO/IEC 10646. These script variants are most often encoded for bibliographic purposes, but are not always significant from a linguistic point of view (for example, Latf and Latg script codes for the Fraktur and Gaelic variants of the Latin script, which are mostly encoded with regular Latin letters in Unicode and ISO/IEC 10646). They may occasionally be useful in language tags to expose orthographic or semantic differences, with different analysis of letters, diacritics, and digraphs/trigraphs as default grapheme clusters, or differences in letter casing rules.
Two-letter region subtags are based on codes assigned, or "exceptionally reserved", in ISO 3166-1. If the ISO 3166 Maintenance Agency were to reassign a code that had previously been assigned to a different country, the existing BCP 47 subtag corresponding to that code would retain its meaning, and a new region subtag based on UN M.49 would be registered for the new country. UN M.49 is also the source for numeric region subtags for geographical regions, such as 005 for South America. The UN M.49 codes for economic regions are not allowed.
Region subtags are used to specify the variety of a language "as used in" a particular region. They are appropriate when the variety is regional in nature, and can be captured adequately by identifying the countries involved, as when distinguishing British English (en-GB) from American English (en-US). When the difference is one of script or script variety, as for simplified versus traditional Chinese characters, it should be expressed with a script subtag instead of a region subtag; in this example, zh-Hans and zh-Hant should be used instead of zh-CN/zh-SG/zh-MY and zh-TW/zh-HK/zh-MO.
When a distinct language subtag exists for a language that could be considered a regional variety, it is often preferable to use the more specific subtag instead of a language-region combination. For example, ar-DZ (Arabic as used in Algeria) may be better expressed as arq for Algerian Spoken Arabic.
Disagreements about language identification may extend to BCP 47 and to the core standards that inform it. For example, some speakers of Punjabi believe that the ISO 639-3 distinction between [pan] "Panjabi" and [pnb] "Western Panjabi" is spurious (i.e. they feel the two are the same language); that sub-varieties of the Arabic script should be encoded separately in ISO 15924 (as, for example, the Fraktur and Gaelic styles of the Latin script are); and that BCP 47 should reflect these views and/or overrule the core standards with regard to them.
BCP 47 delegates this type of judgment to the core standards, and does not attempt to overrule or supersede them. Variant subtags and (theoretically) primary language subtags may be registered individually, but not in a way that contradicts the core standards. [17]
Extension subtags (not to be confused with extended language subtags) allow additional information to be attached to a language tag that does not necessarily serve to identify a language. One use for extensions is to encode locale information, such as calendar and currency.
Extension subtags are composed of multiple hyphen-separated character strings, starting with a single character (other than x), called a singleton. Each extension is described in its own IETF RFC, which identifies a Registration Authority to manage the data for that extension. IANA is responsible for allocating singletons.
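Because every extension starts with a single-character subtag, a tag can be split into its language part and its extensions by scanning for singletons. The following sketch makes that concrete; the function name split_extensions is illustrative, the private-use sequence introduced by x is returned under the key "x" for convenience, and tags that consist only of private-use subtags are simply returned as the base.

```python
def split_extensions(tag):
    """Split a tag into (language part, {singleton: [subtags...]})."""
    base = []
    sections = {}
    current = None
    for part in tag.split("-"):
        if current == "x":
            sections["x"].append(part)      # everything after x is private use
        elif len(part) == 1 and base:
            current = part.lower()          # a singleton opens a new section
            sections[current] = []
        elif current is None:
            base.append(part)
        else:
            sections[current].append(part)
    return "-".join(base), sections


print(split_extensions("en-US-u-ca-gregory-x-a-b"))
```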
Two extensions have been assigned as of January 2014.
Extension T allows a language tag to include information on how the tagged data was transliterated, transcribed, or otherwise transformed. For example, the tag en-t-ja could be used for content in English that was translated from the original Japanese. Additional substrings could indicate that the translation was done mechanically, or in accordance with a published standard.
Extension T is described in the informational RFC 6497, published in February 2012. [18] The Registration Authority is the Unicode Consortium.
Extension U allows a wide variety of locale attributes found in the Common Locale Data Repository (CLDR) to be embedded in language tags. These attributes include country subdivisions, calendar and time zone data, collation order, currency, number system, and keyboard identification.
One example is gsw-u-sd-chzh, which represents Swiss German as used in the canton of Zürich.
Extension U is described in the informational RFC 6067, published in December 2010. [19] The Registration Authority is the Unicode Consortium.
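Within Extension U, two-character subtags act as keys and the longer subtags that follow each key form its value. A minimal sketch of extracting those key and value pairs is shown below; the function name parse_u_extension is illustrative, attributes that may appear before the first key are ignored, and a key without a value is returned with an empty string.

```python
def parse_u_extension(tag):
    """Return the Extension U keywords of a tag as a {key: value} dict."""
    keywords = {}
    in_u = False
    key = None
    for part in tag.lower().split("-")[1:]:
        if len(part) == 1:
            if part == "x":
                break                 # private use: nothing after it is parsed
            in_u = (part == "u")      # any other singleton starts a new extension
            key = None
            continue
        if not in_u:
            continue
        if len(part) == 2:
            key = part                # two-character subtag: a new key
            keywords[key] = []
        elif key is not None:
            keywords[key].append(part)
    return {k: "-".join(v) for k, v in keywords.items()}


print(parse_u_extension("gsw-u-sd-chzh"))
print(parse_u_extension("he-IL-u-ca-hebrew-tz-jeruslm"))
```

In these examples sd is the key for a country subdivision, ca for the calendar and tz for the time zone.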