International Components for Unicode

Developer(s): Unicode Consortium
Initial release: 1999
Stable release: 75.1 [1] / 16 April 2024
Written in: C/C++ (C++11) and Java 8+
Operating system: Cross-platform
Type: Libraries for Unicode and internationalization
License: Unicode License
Website: icu.unicode.org

International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and is sponsored, supported, and used by IBM and many other companies. [2] ICU has been included as a standard component of Microsoft Windows since Windows 10 version 1703. [3]

ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU formerly provided a complex text layout service for Arabic, Hebrew, Indic, and Thai, but it was deprecated in version 54 and removed completely in version 58 in favor of HarfBuzz. [4]
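Three of the services listed above (word boundaries, language-sensitive collation, and normalization) can be illustrated with a short Java sketch. ICU4J is a separate dependency, so this sketch assumes only the JDK and uses its `java.text` classes as a stand-in; it is not ICU's own API, and ICU's behavior and coverage may differ. The class and method names below are hypothetical.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: uses JDK java.text classes, not ICU4J.
public class TextServicesSketch {

    // Split text into words using locale-aware word boundaries.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Keep segments that contain a letter or digit; skip spaces and punctuation.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    // Return a copy of the list sorted with language-sensitive collation.
    static List<String> sorted(List<String> items, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(items);
        copy.sort(collator);
        return copy;
    }

    // Normalize a string to Unicode Normalization Form C (composed form).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(words("one two three", Locale.ENGLISH));
        // Plain code point order would put "Zebra" first; collation does not.
        System.out.println(sorted(Arrays.asList("banana", "Zebra", "apple"), Locale.ENGLISH));
        // "e" followed by a combining acute accent composes to a single character.
        System.out.println(nfc("e\u0301").length());
    }
}
```

The collation call is the part most worth noticing: sorting by raw code point values places every uppercase ASCII letter before every lowercase one, which is rarely what a user expects.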

ICU provides more extensive internationalization facilities than the standard libraries for C and C++. ICU 75, released in April 2024, requires C++17 (up from C++11) or C11 (up from C99), depending on which language is used. ICU has historically used UTF-16, and ICU4J still uses only UTF-16; for C/C++, UTF-8 is also supported, [5] [6] including the correct handling of "illegal UTF-8". [7]
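The practical difference between UTF-16 and UTF-8 is the size of the code unit, so the same text has a different length in each encoding. The following is a minimal sketch using only the JDK (whose `String` is UTF-16 based); the class and method names are hypothetical, and it does not call ICU.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: counts encoding units for one string using the JDK.
public class EncodingUnitsSketch {

    // Number of UTF-16 code units (this is what String.length() reports).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (surrogate pairs count once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0061 (a), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji as a surrogate pair)
        String s = "a\u00e9\u20ac\uD83D\uDE00";
        System.out.println("UTF-16 code units: " + utf16Units(s)); // 1 + 1 + 1 + 2 = 5
        System.out.println("Code points:       " + codePoints(s)); // 4
        System.out.println("UTF-8 bytes:       " + utf8Bytes(s));  // 1 + 2 + 3 + 4 = 10
    }
}
```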

ICU 73.2 made significant changes for GB18030-2022 compliance, i.e. for Chinese (the updated GB18030 Unicode Transformation Format standard is slightly incompatible with its predecessor). It has "a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005". It also has a number of other changes, such as improved Japanese and Korean short-text line breaking, and in "English, the name “Türkiye” is now used for the country instead of “Turkey” (the alternate spelling is also available in the data)." [8]

ICU 74 "updates to Unicode 15.1, including new characters, emoji, security mechanisms, and corresponding APIs and implementations. [..] ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements." [9] The many changes include person name formatting, improved language support (e.g. for Low German), and a new spoof checker API that follows the Unicode 15.1.0 version of UTS #39: Unicode Security Mechanisms.

Older version details

ICU 72 updated to Unicode 15 (ICU 74 later updated to 15.1). "In many formatting patterns, ASCII spaces are replaced with Unicode spaces (e.g., a "thin space")." ICU4J now requires Java 8, but "Most of the ICU 72 library code should still work with Java 7 / Android API level 21, but we no longer test with Java 7." [10]

ICU 71 added, among other things, phrase-based line breaking for Japanese (earlier methods did not work well for short Japanese text, such as titles and headings) and support for Hindi written in Latin letters (hi_Latn), also referred to as "Hinglish".

ICU 70 added, among other things, support for emoji properties of strings, and it can be built and used with C++20 compilers ("ICU operator==() and operator!=() functions now return bool instead of UBool, as an adjustment for incompatible changes in C++20"). [11] As of that version, the minimum Windows version is Windows 7.

ICU 67 handles the United Kingdom's departure from the EU. ICU 64.2 added support for Unicode 12.1, i.e. the single new symbol for the current Japanese Reiwa era (support for it has also been backported to older ICU versions down to ICU 4.8.2). ICU 58 (with Unicode 9.0 support) is the last version to support older platforms such as Windows XP and Windows Vista. Support for AIX, Solaris, and z/OS may also be limited in later versions (i.e. building depends on compiler support). [12]

Origin and development

After Taligent became part of IBM in early 1996, Sun Microsystems decided that the new Java language should have better support for internationalization. Because Taligent had experience with such technologies and was geographically close, its Text and International group was asked to contribute the international classes to the Java Development Kit as part of the JDK 1.1 internationalization APIs. [13] A large portion of this code still exists in the java.text and java.util packages. Further internationalization features were added with each later release of Java.

The Java internationalization classes were then ported to C++ and C [14] as part of a library known as ICU4C ("ICU for C"). The ICU project also provides ICU4J ("ICU for Java"), which adds features not present in the standard Java libraries. ICU4C and ICU4J are very similar, though not identical; for example, ICU4C includes a Regular Expression API, while ICU4J does not. Both frameworks have been enhanced over time to support new facilities and new features of Unicode and the Common Locale Data Repository (CLDR).

ICU was released as an open-source project in 1999 under the name IBM Classes for Unicode. It was later renamed International Components for Unicode. [15] In May 2016, the ICU project joined the Unicode Consortium as technical committee ICU-TC, and the library sources are now distributed under the Unicode License. [16]

MessageFormat

A part of ICU is the MessageFormat class, a formatting system that allows for any number of arguments to control the plural form (plural, selectordinal) or more general switch-case-style selection (select) for things like grammatical gender. These statements can be nested. [17] ICU MessageFormat was created by adding the plural and selection system to an identically-named system in Java SE.
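For illustration, the Java SE predecessor mentioned above can be run without any extra dependency. The sketch below uses `java.text.MessageFormat` with a nested `choice` argument, which is the older range-based mechanism, not ICU's `plural` or `select`. The class name, method name, and message text are hypothetical. An ICU pattern for the same message would look roughly like `{count, plural, one {# file} other {# files}}`; that form is shown only for comparison and is not executed here.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Illustrative only: java.text.MessageFormat from Java SE, not ICU's MessageFormat.
public class MessageFormatSketch {

    // The choice argument selects a sub-message by numeric range:
    //   0#  matches values from 0 up to (not including) 1
    //   1#  matches exactly 1
    //   1<  matches values greater than 1
    // The last branch nests another format element for the same argument.
    private static final String PATTERN =
        "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.";

    static String describe(int fileCount) {
        MessageFormat format = new MessageFormat(PATTERN, Locale.ENGLISH);
        return format.format(new Object[] { fileCount });
    }

    public static void main(String[] args) {
        System.out.println(describe(0)); // There are no files.
        System.out.println(describe(1)); // There is one file.
        System.out.println(describe(3)); // There are 3 files.
    }
}
```

Numeric ranges work for English but cannot express the plural categories of many other languages, which is the gap that the plural and selection system described above addresses.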

Alternatives

An alternative to using ICU directly from C++ is Boost.Locale, a C++ wrapper for ICU (which also allows other backends [18] ). The argument for using it rather than ICU directly is that the ICU API "is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead mostly mimicking the Java API." [19] [20] Another claim, that ICU supports only UTF-16 (and should therefore be avoided), is no longer true, since ICU now also supports UTF-8 for C and C++. [5]

Related Research Articles

Unicode: Character encoding standard

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit.

UTF-16: Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as "UCS-2" (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as for personal and place names.

Pango: Library for text rendering

Pango is a text layout engine library which works with the HarfBuzz shaping engine for displaying multi-language text.

Windows-1252: Windows character set for Latin alphabet

Windows-1252 or CP-1252 is a single-byte character encoding of the Latin alphabet that was used by default in Microsoft Windows for English and many Romance and Germanic languages including Spanish, Portuguese, French, and German. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa.

In computing, a locale is a set of parameters that defines the user's language, region and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language code and a country/region code. Locale is an important aspect of i18n.

GB 18030: Official Chinese character encoding

GB 18030 is a Chinese government standard, described as Information Technology — Chinese coded character set and defines the required language and character support necessary for software in China. GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC) superseding GB2312. As a Unicode Transformation Format, GB18030 supports both simplified and traditional Chinese characters. It is also compatible with legacy encodings including GB/T 2312, CP936, and GBK 1.0.

Windows-1251 is an 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Ukrainian, Belarusian, Bulgarian, Serbian Cyrillic, Macedonian and other languages.

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the range 128–255 to offsets within particular blocks of 128 characters. The initial conditions of the encoder mean that existing strings in ASCII and ISO-8859-1 that do not contain C0 control codes other than NULL TAB CR and LF can be treated as SCSU strings. Since most alphabets do reside in blocks of contiguous Unicode codepoints, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character, most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.

In Unix and Unix-like operating systems, iconv is a command-line program and a standardized application programming interface (API) used to convert between different character encodings. "It can convert from any of these encodings to any other, through Unicode conversion."

Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use the Latin script. It is primarily used by Czech, though Czech has now moved to UTF-8 and mostly abandoned this legacy encoding. It is also used for Polish, Slovak, Hungarian, Slovene, Serbo-Croatian, Romanian, Rotokas and Albanian. It may also be used with the German language, though it's missing uppercase ẞ. German-language texts encoded with Windows-1250 and Windows-1252 are identical.

Windows-1256 is a code page used under Microsoft Windows to write Arabic and other languages that use Arabic script, such as Persian and Urdu.

The Common Locale Data Repository (CLDR) is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications. CLDR contains locale-specific information that an operating system will typically provide to applications. CLDR is written in the Locale Data Markup Language (LDML).

Windows code pages are sets of characters or code pages used in Microsoft Windows from the 1980s and 1990s. Windows code pages were gradually superseded when Unicode was implemented in Windows, although they are still supported both within Windows and other platforms, and still apply when Alt code shortcuts are used.

Globalize is a cross-platform JavaScript library for internationalization and localization that uses the Unicode Common Locale Data Repository (CLDR).

tz database: Collaborative compilation of information about the world's time zones

The tz database is a collaborative compilation of information about the world's time zones, primarily intended for use with computer programs and operating systems. Paul Eggert has been its editor and maintainer since 2005, with the organizational backing of ICANN. The tz database is also known as tzdata, the zoneinfo database or the IANA time zone database, and occasionally as the Olson database, referring to the founding contributor, Arthur David Olson.

Mark Edward Davis is an American specialist in the internationalization and localization of software and the co-founder and president of the Unicode Consortium.

Microsoft was one of the first companies to implement Unicode in their products. Windows NT was the first operating system that used "wide characters" in system calls. Using the UCS-2 encoding scheme at first, it was upgraded to the variable-width encoding UTF-16 starting with Windows 2000, allowing a representation of additional planes with surrogate pairs. However Microsoft did not support UTF-8 in its API until May 2019.

A number of text encoding standards are used on the World Wide Web. The same encodings are used in local files, in fact many more, at least historically. Exact measurements for the prevalence of each are not possible, because of privacy reasons, but rather accurate estimates are available for public web sites, and statistics may reflect use in local files. Attempts at measuring encoding popularity may utilize counts of numbers of (web) documents, or counts weighed by actual use or visibility of those documents.

References

  1. "Release ICU 75.1 · unicode-org/icu". Retrieved 21 April 2024.
  2. "ICU - International Components for Unicode". site.icu-project.org. Archived from the original on 2021-08-27. Retrieved 2011-11-14.
  3. Chen, Raymond (27 May 2021). "How can I convert between IANA time zones and Windows registry-based time zones?". The Old New Thing. Microsoft.
  4. "Layout Engine - ICU User Guide". userguide.icu-project.org.
  5. "UTF-8". ICU Documentation. Retrieved 2022-05-24.
  6. "UTF-8 - ICU User Guide". userguide.icu-project.org. Retrieved 2018-04-03.
  7. "#13311 (change illegal-UTF-8 handling to Unicode "best practice")". bugs.icu-project.org. Retrieved 2018-04-03.
  8. "ICU - International Components for Unicode - ICU 73". icu.unicode.org. Retrieved 2023-09-24.
  9. "ICU - International Components for Unicode - ICU 74". icu.unicode.org. Retrieved 2023-11-29.
  10. "ICU - International Components for Unicode - ICU 72". icu.unicode.org. Retrieved 2023-01-24.
  11. "ICU - International Components for Unicode - ICU 70". icu.unicode.org. Retrieved 2023-01-24.
  12. "Download ICU 64 - ICU - International Components for Unicode". site.icu-project.org. Retrieved 2019-10-20.
  13. Laura Werner (1999). "Getting Java ready for the world: A brief history of IBM and Sun's internationalization efforts". Archived from the original on 2021-11-17. Retrieved 2007-05-23.
  14. "ICU User Guide". userguide.icu-project.org.
  15. "ICU Project Management Committee". Archived from the original on 2021-08-28. Retrieved 2012-08-17.
  16. "ICU joins the Unicode Consortium". Unicode, Inc. 2016-05-16. Retrieved 2016-08-01.
  17. "Formatting Messages". ICU User Guide.
  18. "Boost.Locale: Using Localization Backends". www.boost.org. Retrieved 2022-05-24.
  19. "Boost.Locale: Design Rationale". www.boost.org. Retrieved 2022-05-24.
  20. "ICU vs Boost Locale in C++". Stack Overflow. Retrieved 2022-05-24.