Locale (computer software)

In computing, a locale is a set of parameters that defines the user's language, region, and any special variant preferences that the user wants to see in their user interface. A locale identifier usually consists of at least a language code and a country/region code. Locale is an important aspect of internationalization (i18n).

General locale settings

These settings usually include display (output) format settings such as number, date and time, and currency formats, text collation, and character classification.

Locale settings govern how output is formatted for a given locale, so time zone information and daylight saving time are not usually part of them. Input format settings are less common and are mostly defined on a per-application basis.

Programming and markup language support

In many programming and markup language environments, and in other (nowadays) Unicode-based environments, locales are defined in a format similar to BCP 47. They are usually defined with just ISO 639 (language) and ISO 3166-1 alpha-2 (two-letter country) codes.
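
As a rough illustration of the language-plus-country form described above, the following Python sketch splits a simple tag such as en-AU into its two parts. The function name split_simple_tag is hypothetical, and the pattern deliberately covers only a language code with an optional two-letter region; it is not a full BCP 47 parser (scripts, variants, and extensions are ignored).

```python
import re

# Language: 2-3 letters (ISO 639 style); region: 2 letters (ISO 3166-1 alpha-2).
# Assumption: only this simple subset is handled, not the whole BCP 47 grammar.
_SIMPLE_TAG = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{2}))?$")


def split_simple_tag(tag):
    """Split 'language[-REGION]' into a (language, region) tuple."""
    match = _SIMPLE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language-region tag: {tag!r}")
    language, region = match.groups()
    # Conventional casing: lowercase language, uppercase region.
    return (language.lower(), region.upper() if region else None)


print(split_simple_tag("en-AU"))
print(split_simple_tag("cs"))
```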

International standards

In standard C and C++, locale is defined in "categories" of LC_COLLATE (text collation), LC_CTYPE (character class), LC_MONETARY (currency format), LC_NUMERIC (number format), and LC_TIME (time format). The special LC_ALL category can be used to set all locale settings. [1]

There are no standard locale names associated with the C and C++ standards besides the "minimal locale" name "C", although the POSIX format is a commonly used baseline.
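
The categories above can be explored from Python, whose standard locale module wraps the C setlocale function. The sketch below selects the minimal "C" locale for all categories through LC_ALL and then queries each category individually. The helper name current_settings is hypothetical, and only the "C" locale is used because it is the one name expected to exist everywhere.

```python
import locale

# The five categories named in the text, as exposed by Python's locale module.
CATEGORIES = {
    "LC_COLLATE": locale.LC_COLLATE,
    "LC_CTYPE": locale.LC_CTYPE,
    "LC_MONETARY": locale.LC_MONETARY,
    "LC_NUMERIC": locale.LC_NUMERIC,
    "LC_TIME": locale.LC_TIME,
}


def current_settings():
    """Return the locale name currently in effect for each category."""
    # Calling setlocale() without a locale argument only queries the setting.
    return {name: locale.setlocale(cat) for name, cat in CATEGORIES.items()}


# LC_ALL sets every category at once; "C" is the minimal locale.
locale.setlocale(locale.LC_ALL, "C")
settings = current_settings()
for name, value in settings.items():
    print(f"{name}={value}")
```

Setting a single category instead of LC_ALL (for example, only LC_NUMERIC) changes number formatting while leaving collation and the other categories untouched.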

POSIX platforms

On POSIX platforms such as Unix, Linux and others, locale identifiers are defined in a way similar to the BCP 47 definition of language tags, but the locale variant modifier is defined differently, and the character set is optionally included as part of the identifier. The POSIX or "XPG" format is [language[_territory][.codeset][@modifier]]. (For example, Australian English using the UTF-8 encoding is en_AU.UTF-8.) [2] Separately, ISO/IEC 15897 describes a different form, language_territory+audience+application,sponsor_version, though it is doubtful whether this form is used in practice. [3]
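
The XPG format can be taken apart with a regular expression. The following Python sketch is one possible reading of the pattern language[_territory][.codeset][@modifier]; the names PosixLocale and parse_posix_locale are hypothetical, and the character classes accepted for each part are assumptions rather than something the standard prescribes.

```python
import re
from typing import NamedTuple, Optional


class PosixLocale(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# Assumption: each part is a run of ASCII letters, digits, and a few
# punctuation characters; real systems may accept more or less than this.
_XPG = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9_=-]+))?$"
)


def parse_posix_locale(name: str) -> PosixLocale:
    """Split an XPG-style locale name into its four parts."""
    match = _XPG.match(name)
    if match is None:
        raise ValueError(f"not an XPG-style locale name: {name!r}")
    # Optional parts that are absent come back as None.
    return PosixLocale(**match.groupdict())


print(parse_posix_locale("en_AU.UTF-8"))
print(parse_posix_locale("sr_RS@latin"))
```

Only the language part is mandatory, so a bare name such as "C" parses with the other three fields set to None.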

The following example shows the output of the locale command for the Czech language (cs), Czech Republic (CZ), with explicit UTF-8 encoding:

$ locale
LANG=cs_CZ.UTF-8
LC_CTYPE="cs_CZ.UTF-8"
LC_NUMERIC="cs_CZ.UTF-8"
LC_TIME="cs_CZ.UTF-8"
LC_COLLATE="cs_CZ.UTF-8"
LC_MONETARY="cs_CZ.UTF-8"
LC_MESSAGES="cs_CZ.UTF-8"
LC_PAPER="cs_CZ.UTF-8"
LC_NAME="cs_CZ.UTF-8"
LC_ADDRESS="cs_CZ.UTF-8"
LC_TELEPHONE="cs_CZ.UTF-8"
LC_MEASUREMENT="cs_CZ.UTF-8"
LC_IDENTIFICATION="cs_CZ.UTF-8"
LC_ALL=
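
Output of this shape is a list of NAME=value pairs, some of them double-quoted, and is easy to read into a dictionary. The Python sketch below assumes one pair per line; the function name parse_locale_output is hypothetical, and the sample string is an abbreviated copy of the output above rather than a live call to the locale command.

```python
from typing import Dict


def parse_locale_output(text: str) -> Dict[str, str]:
    """Turn NAME=value lines into a dict, dropping surrounding double quotes."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip anything that is not a NAME=value pair
        settings[name.strip()] = value.strip().strip('"')
    return settings


# Abbreviated sample taken from the example output in the text.
sample = 'LANG=cs_CZ.UTF-8\nLC_CTYPE="cs_CZ.UTF-8"\nLC_TIME="cs_CZ.UTF-8"\nLC_ALL='
parsed = parse_locale_output(sample)
print(parsed)
```

An empty LC_ALL, as in the example, becomes an empty string in the resulting dictionary.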

Specifics for Microsoft platforms

Windows uses specific language and territory strings. The locale identifier (LCID) for unmanaged code on Microsoft Windows is a number such as 1033 for English (United States), 2057 for English (United Kingdom), or 1041 for Japanese (Japan). These numbers consist of a language code (lower 10 bits) and a culture code (upper bits), and are therefore often written in hexadecimal notation, such as 0x0409, 0x0809 or 0x0411. Microsoft began introducing managed code application programming interfaces (APIs) for .NET that use this format. One of the first to be generally released was a function to mitigate issues with internationalized domain names, [4] with more appearing in Windows Vista Beta 1.
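
The bit layout described above explains why the hexadecimal form is convenient. The following Python sketch splits the three example identifiers into the lower 10 bits and the bits above them. The function name split_lcid is hypothetical, and the sketch assumes the identifier fits in 16 bits, as all the examples in the text do.

```python
def split_lcid(lcid):
    """Return (language_code, culture_code) for a 16-bit locale identifier."""
    language_code = lcid & 0x3FF         # lower 10 bits
    culture_code = (lcid >> 10) & 0x3F   # remaining upper 6 bits (assumes 16-bit value)
    return (language_code, culture_code)


for lcid in (1033, 2057, 1041):
    language_code, culture_code = split_lcid(lcid)
    print(f"{lcid} = 0x{lcid:04x}: language 0x{language_code:02x}, culture 0x{culture_code:02x}")
```

The two English identifiers share the language code 0x09 and differ only in the upper bits, which is visible at a glance in 0x0409 and 0x0809 but not in 1033 and 2057.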

Starting with Windows Vista, new functions [5] that use BCP 47 locale names have been introduced to replace nearly all LCID-based APIs.

A POSIX-like locale name format of language[_country-region[.code-page]] is available in the UCRT (Universal C Run Time) of Windows 10 and 11. [6]

References

  1. "LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, LC_TIME - cppreference.com". en.cppreference.com.
  2. "Environment Variables". pubs.opengroup.org.
  3. "ISO/IEC JTC1/SC22 N610 [draft ISO/IEC 15897:1998(E)] Information technology — Procedures for registration of cultural elements" (PDF). 1998-11-17. Retrieved 8 June 2023. For Narrative Cultural Specifications and POSIX Locales the token identifier will be: 8_9+11+12,13_14
  4. "DownlevelGetLocaleScripts function (Windows)". MSDN. Microsoft. Retrieved 2017-12-11.
  5. "Locale Names (Windows)". MSDN. Microsoft. Retrieved 2017-12-11.
  6. "Locale Names, Languages, and Country-Region Strings". learn.microsoft.com. 19 October 2022.