Common Locale Data Repository

Last updated
Common Locale Data Repository
Developed by Unicode Consortium
Initial releaseCLDR 1.0
(19 December 2003;21 years ago (2003-12-19) [1] )
Latest release
46.1 [2]   OOjs UI icon edit-ltr-progressive.svg
19 December 2024;17 days ago (19 December 2024)
Container for XML [3]
Website cldr.unicode.org

The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).

Contents

CLDR is maintained by a technical committee which includes employees from IBM, Apple, Google, Microsoft, and some government-based organizations. The committee is chaired by John Emmons, of IBM; Mark Davis, of Google, is vice-chair. [4]

Details

Among the types of data that CLDR includes are the following:

The information is currently used in International Components for Unicode, Apple's macOS, LibreOffice, MediaWiki, and IBM's AIX, among other applications and operating systems.

CLDR overlaps somewhat with ISO/IEC 15897 (POSIX locales). POSIX locale information can be derived from CLDR by using some of CLDR's conversion tools.

The CLDR covers 400+ languages. [5]

Related Research Articles

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

<span class="mw-page-title-main">Standard Generalized Markup Language</span> Markup language

The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":

Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.

<span class="mw-page-title-main">XML</span> Markup language by the W3C for encoding of data

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.

In computing, internationalization and localization (American) or internationalisation and localisation (British), often abbreviated i18n and l10n respectively, are means of adapting to different languages, regional peculiarities and technical requirements of a target locale.

A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists stored as data within a computer file system.

<span class="mw-page-title-main">Human-readable medium and data</span> Presentation of data for humans to read

In computing, a human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans, resulting in human-readable data. It is often encoded as ASCII or Unicode text, rather than as binary data.

In computing, a locale is a set of parameters that defines the user's language, region and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language code and a country/region code. Locale is an important aspect of i18n.

<span class="mw-page-title-main">Comma-separated values</span> File format used to store data

Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc.

Most countries of the world have different names in different languages. Some countries have also undergone name changes for political or other reasons. Countries are listed alphabetically by their most common name in English. Each English name is followed by its most common equivalents in other languages, listed in English alphabetical order by name and by language. Historical and/or alternative versions, where included, are noted as such. Foreign names that are the same as their English equivalents are also listed. See also: List of alternative country names

ISO 15924, Codes for the representation of names of scripts, is an international standard defining codes for writing systems or scripts. Each script is given both a four-letter code and a numeric code.

International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and sponsored, supported, and used by IBM and many other companies. ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.

A whitespace character is a character data element that represents white space when text is rendered for display by a computer.

Globalize is a cross-platform JavaScript library for internationalization and localization that uses the Unicode Common Locale Data Repository (CLDR).

tz database Collaborative compilation of information about the worlds time zones

The tz database is a collaborative compilation of information about the world's time zones and rules for observing daylight saving time, primarily intended for use with computer programs and operating systems. Paul Eggert has been its editor and maintainer since 2005, with the organizational backing of ICANN. The tz database is also known as tzdata, the zoneinfo database or the IANA time zone database, and occasionally as the Olson database, referring to the founding contributor, Arthur David Olson.

Mark Edward Davis is an American specialist in the internationalization and localization of software and the co-founder and chief technical officer of the Unicode Consortium, previously serving as its president until 2022.

Data Format Description Language is a modeling language for describing general text and binary data in a standard way. It was published as an Open Grid Forum Recommendation in February 2021, and in April 2024 was published as an ISO standard.

The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.

References

  1. "CLDR Releases/Downloads". cldr.unicode.org.
  2. "Release 46.1". 19 December 2024. Retrieved 22 December 2024.
  3. Updating DTDs, CLDR makes special use of XML because of the way it is structured. In particular, the XML is designed so that you can read in a CLDR XML file and interpret it as an unordered list of <path,value> pairs, called a CLDRFile internally. These path/value pairs can be added to or deleted, and then the CLDRFile can be written back out to disk, resulting in a valid XML file. That is a very powerful mechanism, and also allows for the CLDR inheritance model.
  4. "Unicode CLDR - CLDR Process".
  5. "Locale Coverage".