Developed by | Unicode Consortium |
---|---|
Initial release | CLDR 1.0 (19 December 2003 [1] ) |
Latest release | |
Container for | XML [3] |
Website | cldr |
The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).
Among the types of data that CLDR includes are the following:
The information is currently used in International Components for Unicode, Apple's macOS, LibreOffice, MediaWiki, and IBM's AIX, among other applications and operating systems.
CLDR overlaps somewhat with ISO/IEC 15897 (POSIX locales). POSIX locale information can be derived from CLDR by using some of CLDR's conversion tools.
CLDR is maintained by a technical committee which includes employees from IBM, Apple, Google, Microsoft, and some government-based organizations. The committee is chaired by John Emmons, of IBM; Mark Davis, of Google, is vice-chair. [4]
The CLDR covers 400+ languages. [5]
In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.
The Standard Generalized Markup Language is a standard for defining generalized markup languages for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two postulates":
Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.
Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The World Wide Web Consortium's XML 1.0 Specification of 1998 and several other related specifications—all of them free open standards—define XML.
In computing, internationalization and localization (American) or internationalisation and localisation (British), often abbreviated i18n and l10n respectively, are means of adapting computer software to different languages, regional peculiarities and technical requirements of a target locale.
In computing, a human-readable medium or human-readable format is any encoding of data or information that can be naturally read by humans, resulting in human-readable data. It is often encoded as ASCII or Unicode text, rather than as binary data.
In computing, a locale is a set of parameters that defines the user's language, region and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language code and a country/region code. Locale is an important aspect of i18n.
Comma-separated values (CSV) is a text file format that uses commas to separate values. A CSV file stores tabular data in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.
The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently byte-by-byte compared in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc.
In word processing and digital typesetting, a non-breaking space, also called NBSP, required space, hard space, or fixed space, is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space. Non-breaking space characters with other widths also exist.
ISO 15924, Codes for the representation of names of scripts, is an international standard defining codes for writing systems or scripts. Each script is given both a four-letter code and a numeric code.
International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and sponsored, supported, and used by IBM and many other companies. ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays. It is a common data format with diverse uses in electronic data interchange, including that of web applications with servers.
In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE represents a blank space punctuation character in text, used as a word divider in Western scripts.
Globalize is a cross-platform JavaScript library for internationalization and localization that uses the Unicode Common Locale Data Repository (CLDR).
The tz database is a collaborative compilation of information about the world's time zones, primarily intended for use with computer programs and operating systems. Paul Eggert has been its editor and maintainer since 2005, with the organizational backing of ICANN. The tz database is also known as tzdata, the zoneinfo database or the IANA time zone database, and occasionally as the Olson database, referring to the founding contributor, Arthur David Olson.
Mark Edward Davis is an American specialist in the internationalization and localization of software and the co-founder and president of the Unicode Consortium.
Data Format Description Language, published as an Open Grid Forum Recommendation in February 2021, is a modeling language for describing general text and binary data in a standard way. A DFDL model or schema allows any text or binary data to be read from its native format and to be presented as an instance of an information set.. The same DFDL schema also allows data to be taken from an instance of an information set and written out to its native format.
The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.