Unicode collation algorithm

Last updated

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc. [1]

Contents

Unicode Technical Report #10 also specifies the Default Unicode Collation Element Table (DUCET). This data file specifies a default collation ordering. The DUCET is customizable for different languages, [1] [2] and some such customizations can be found in the Unicode Common Locale Data Repository (CLDR). [3]

An open source implementation of UCA is included with the International Components for Unicode, ICU. [4] [5] ICU supports tailoring, and the collation tailorings from CLDR are included in ICU. [6] [2]

See also

Related Research Articles

Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.

UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit. Almost every webpage is stored in UTF-8.

<span class="mw-page-title-main">Sorting</span> Action of arranging objects into order

Sorting refers to ordering data in an increasing or decreasing manner according to some linear relationship among the data items.

  1. ordering: arranging items in a sequence ordered by some criterion;
  2. categorizing: grouping items with similar properties.

Big-5 or Big5 is a Chinese character encoding method used in Taiwan, Hong Kong, and Macau for traditional Chinese characters.

<span class="mw-page-title-main">Windows-1252</span> Windows character set for Latin alphabet

Windows-1252 or CP-1252 is a legacy single-byte character encoding that is used by default in Microsoft Windows throughout the Americas, Western Europe, Oceania, and much of Africa.

In computing, a locale is a set of parameters that defines the user's language, region and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language code and a country/region code. Locale is an important aspect of i18n.

<span class="mw-page-title-main">GB 18030</span> Official Chinese character encoding

GB 18030 is a Chinese government standard, described as Information Technology — Chinese coded character set and defines the required language and character support necessary for software in China. GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC) superseding GB2312. As a Unicode Transformation Format, GB18030 supports both simplified and traditional Chinese characters. It is also compatible with legacy encodings including GB/T 2312, CP936, and GBK 1.0.

The European ordering rules define an ordering for strings written in languages that are written with the Latin, Greek and Cyrillic alphabets. The standard covers languages used by the European Union, the European Free Trade Association, and parts of the former Soviet Union. It is a tailoring of the Common Tailorable Template of ISO/IEC 14651. EOR can in turn be tailored for different (European) languages. But in inter-European contexts, EOR can be used without further tailoring.

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the range 128–255 to offsets within particular blocks of 128 characters. The initial conditions of the encoder mean that existing strings in ASCII and ISO-8859-1 that do not contain C0 control codes other than NULL TAB CR and LF can be treated as SCSU strings. Since most alphabets do reside in blocks of contiguous Unicode codepoints, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character, most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.

International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and sponsored, supported, and used by IBM and many other companies. ICU has been included as a standard component with Microsoft Windows since Windows 10 version 1703.

The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).

ISO 11940 is an ISO standard for the transliteration of Thai characters, published in 1998 and updated in September 2003 and confirmed in 2008. An extension to this standard named ISO 11940-2 defines a simplified transcription based on it.

Specials is a short Unicode block of characters allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF, containing these code points:

Mark Edward Davis is an American specialist in the internationalization and localization of software and the co-founder and chief technical officer of the Unicode Consortium, previously serving as its president until 2022.

ISO/IEC 14651:2016, Information technology -- International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering, is an International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) standard specifying an algorithm that can be used when comparing two strings. This comparison can be used when collating a set of strings. The standard also specifies a data file, the Common Tailorable Template (CTT), which outlines the comparison order. This order is meant to be tailored for different languages, making the CTT a template rather than a default. In many cases, however, the empty tailoring—where no weightings are changed—is appropriate, as different languages have incompatible ordering requirements. One such tailoring is European ordering rules (EOR), which in turn is supposed to be tailored for different European languages.

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique is recognised to be unreliable and is only used when specific metadata, such as a HTTP Content-Type: header is either not available, or is assumed to be untrustworthy.

The regional indicator symbols are a set of 26 alphabetic Unicode characters (A–Z) intended to be used to encode ISO 3166-1 alpha-2 two-letter country codes in a way that allows optional special treatment.

Kangxi Radicals is a Unicode block. In version 3.0 (1999), this separate Kangxi Radicals block was introduced which encodes the 214 radicals in sequence, at U+2F00–2FD5. These are specific code points intended to represent the radical qua radical, as opposed to the character consisting of the unaugmented radical; thus, U+2F00 represents radical 1 while U+4E00 represents the character meaning "one". In addition, the CJK Radicals Supplement block (2E80–2EFF) was introduced, encoding alternative forms taken by Kangxi radicals as they appear within specific characters. For example, ⺁ "CJK RADICAL CLIFF" (U+2E81) is a variant of ⼚ radical 27 (U+2F1A), itself identical in shape to the character consisting of unaugmented radical 27, 厂 "cliff" (U+5382).

Microsoft Windows code page 932, also called Windows-31J amongst other names, is the Microsoft Windows code page for the Japanese language, which is an extended variant of the Shift JIS Japanese character encoding. It contains standard 7-bit ASCII codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so characters use either 8 or 16 bits for encoding.

IBM code page 936 is a character encoding for Simplified Chinese including 1880 user-defined characters (UDC), which was superseded in 1993. It is a combination of the single-byte Code page 903 and the double-byte Code page 928. Code page 946 uses the same double-byte component, but an extended single-byte component.

References

  1. 1 2 Whistler, Ken; Scherer, Markus; Davis, Mark (2022-08-26). "UTS #10: Unicode Collation Algorithm". Unicode . Retrieved 2023-08-16.
  2. 1 2 Hosken, Martin (2021-09-23). Unicode Sort Tailoring: Tutorial (PDF) (1.3 ed.). SIL Writing Systems Technology. pp. 2–3. Retrieved 2023-08-16.
  3. "CLDR Releases/Downloads". Unicode CLDR . Retrieved 2023-08-16.
  4. "ICU - International Components for Unicode". Unicode . Retrieved 2023-08-16.
  5. "Collations". SyBooks Online. Retrieved 2023-08-16.
  6. "Customization". ICU Documentation. Retrieved 2023-08-16.

Tools