Popularity of text encodings

A number of text encoding standards are used on the World Wide Web. The same encodings, and historically many more, are also used in local files and databases. The prevalence of each cannot be measured exactly, because local files are private and not web accessible, but fairly accurate estimates are available for public web sites, and those statistics may or may not accurately reflect use in local files. Attempts at measuring encoding popularity may count the number of (web) documents, or may weight those counts by the actual use or visibility of the documents.

The decision to use any one encoding may depend on the language of the document, the locale it originates from, or its purpose. Text may be ambiguous as to its encoding; for instance, pure ASCII text is simultaneously valid ASCII, ISO-8859-1, CP1252, and UTF-8. "Tags" may declare a document's encoding, but an incorrect declaration may be silently corrected by display software (for instance, the HTML spec says that the tag for ISO-8859-1 should be treated as CP1252), so counts of tags may not be accurate.
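The ambiguity described above can be illustrated with a minimal Python sketch. It relies only on the standard library's built-in codecs; the sample byte strings and variable names are made up for illustration and do not come from any of the surveys cited here.

```python
# Pure ASCII bytes decode to the same text under all four encodings,
# so the bytes alone cannot reveal which encoding the author intended.
ascii_bytes = b"Hello, world"
decoded = {
    enc: ascii_bytes.decode(enc)
    for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}
all_same = len(set(decoded.values())) == 1

# Bytes 0x93 and 0x94 are C1 control codes in ISO-8859-1 but curly
# quotation marks in CP1252, which is one reason display software treats
# a declared ISO-8859-1 label as CP1252.
quoted = b"\x93hi\x94"
as_latin1 = quoted.decode("iso-8859-1")  # two invisible control characters
as_cp1252 = quoted.decode("cp1252")      # left and right double quotes

print(all_same)
print(repr(as_latin1), repr(as_cp1252))
```

A page declared as ISO-8859-1 but written with CP1252 quotation marks would therefore be counted under one label while actually being displayed as the other.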

Popularity on the World Wide Web

Use of the main encodings on the web from 2001 to 2012 as recorded by Google, with UTF-8 overtaking all others in 2008 and exceeding 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that contain only ASCII characters, regardless of the declared header.

Declared character set for the 10 million most popular websites since 2010.

UTF-8 has been the most common encoding for the World Wide Web since 2008. [2] As of April 2024, UTF-8 is used by 98.2% of surveyed web sites (and by 99.1% of the top 100,000 pages and 98.6% of the top 1,000 highest-ranked web pages). The next most popular encoding, ISO-8859-1, is used by 1.3% (and by only 13 of the top 1,000). [3] Although many pages use only ASCII characters to display content, very few websites now declare their encoding to be ASCII rather than UTF-8. [4]

Virtually all countries, and over 97% of all tracked languages, have 95% or more use of UTF-8 encodings on the web. See below for the major alternative encodings:

The second-most popular encoding varies by locale, and is typically more efficient for the associated language. One such encoding is the Chinese GB 18030 standard, which is a full Unicode Transformation Format; even so, 95.6% of websites in China and its territories use UTF-8, [5] [6] [7] with GB 18030 (effectively [8]) the next most popular encoding. Big5 is another popular Chinese encoding (for traditional characters); it is the next-most popular encoding in Taiwan, after UTF-8 at 96.6%, and is also the second-most used in Hong Kong, although there, as elsewhere, UTF-8 is even more dominant. [9] The single-byte Windows-1251 is twice as efficient for the Cyrillic script, yet 94.9% of Russian websites use UTF-8 [10] (Greek and Hebrew encodings, for example, are also twice as efficient, and UTF-8 nonetheless has over 99% use for those languages). [11] [12] Japanese- and Korean-language websites also have relatively high non-UTF-8 use compared to most other countries: Japanese UTF-8 use is at 94.8%, followed by the legacy Shift JIS (actually decoded as its superset, the Windows-31J encoding) and then EUC-JP. [13] [14] South Korea has 95.4% UTF-8 use, with the remaining websites mainly using EUC-KR, which is more efficient for Korean text.
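The efficiency claims above can be checked by counting bytes. The following is a minimal sketch using Python's built-in codecs; the sample words and the helper name byte_sizes are illustrative assumptions, and real documents also contain ASCII markup and punctuation, which narrows the gap.

```python
# Illustrative sample words (assumed for this sketch), each paired with
# the legacy encoding discussed in the text.
samples = {
    "Russian": ("Привет", "cp1251"),   # Windows-1251, single-byte
    "Korean": ("한국어", "euc_kr"),     # EUC-KR, two bytes per syllable
    "Chinese": ("中文", "gb18030"),    # GB 18030, two bytes for common hanzi
}

def byte_sizes(text, legacy):
    """Return (legacy_bytes, utf8_bytes) for the same text."""
    return len(text.encode(legacy)), len(text.encode("utf-8"))

sizes = {lang: byte_sizes(text, enc) for lang, (text, enc) in samples.items()}

for lang, (legacy_len, utf8_len) in sizes.items():
    print(f"{lang}: legacy {legacy_len} bytes, UTF-8 {utf8_len} bytes")
```

For these samples the Cyrillic text takes half as many bytes in Windows-1251 as in UTF-8, while the Korean and Chinese samples take two thirds as many in their legacy encodings.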

Apart from GB 18030 (and UTF-16 and UTF-8 themselves), the other (legacy) encodings do not support all Unicode characters, since they were designed for specific languages.
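As a rough illustration of this coverage difference, the sketch below tries to encode one mixed-script string with several codecs. It assumes Python's built-in codec implementations, whose GB 18030 mapping may differ in a few code points from the current edition of the standard; the sample string and the helper can_encode are hypothetical.

```python
# A string mixing Cyrillic, Hangul, and an emoji outside the Basic
# Multilingual Plane (illustrative sample, not taken from any survey).
text = "Привет 한국어 😀"

def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

coverage = {
    enc: can_encode(text, enc)
    for enc in ("cp1251", "euc_kr", "big5", "gb18030", "utf-16", "utf-8")
}

for enc, ok in coverage.items():
    print(f"{enc}: {'ok' if ok else 'cannot represent this text'}")
```

Only the three encodings that cover all of Unicode accept the string; each language-specific encoding fails on at least one character.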

As of April 2024, Limburgan (Limburgish) has the lowest UTF-8 use on the web of any tracked language, at 82%. [15] Well over a third of the tracked languages have 100.0% use of UTF-8 on the web, such as Vietnamese, Marathi, Telugu, Tamil, Javanese, Pañjābī/Punjabi, Gujarati, Farsi/Persian, Hausa, Pashto, Kannada, Lao, Kurdish languages, Tagalog, Somali, Khmer/Cambodian, isiZulu/Zulu, Turkmen (written in Cyrillic, in a Latin-based script since the end of the Soviet Union, and sometimes in an Arabic-based script), and Tajik (which has its own Cyrillic-based script, a Hebrew script used by some, and two other scripts historically). The same is true of many of the languages with the fewest speakers (often with their own scripts), such as Armenian, Mongolian (which has a top-to-bottom script, [16] a Cyrillic-based script that is also used, and more historically), Maldivian (Thaana), and Greenlandic (Kalaallisut), and also of sign languages. [17]

Popularity for local text files

Local storage on computers makes considerably more use of "legacy" single-byte encodings than the web does. Attempts to update to UTF-8 have been blocked by editors that do not display or write UTF-8 unless the first character in a file is a byte order mark, which makes it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output. UTF-16 files are also fairly common on Windows, but not on other systems. [18] [19]
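The byte order mark handling described above (ignore it on input, add it on output) can be sketched with Python's standard "utf-8-sig" codec. This shows one way a program might accommodate such editors, not how any particular editor behaves; the sample text is arbitrary.

```python
import codecs

# A file's contents as an editor that requires a byte order mark would
# write them: the three-byte UTF-8 BOM followed by the UTF-8 text.
with_bom = codecs.BOM_UTF8 + "naïve".encode("utf-8")

# A plain UTF-8 decoder keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec ignores a BOM on input if one is present...
stripped = with_bom.decode("utf-8-sig")
no_bom = "naïve".encode("utf-8").decode("utf-8-sig")

# ...and adds one on output.
written = "naïve".encode("utf-8-sig")

print(repr(plain), repr(stripped), written[:3] == codecs.BOM_UTF8)
```

Software that decodes with the plain codec sees a stray U+FEFF at the start of the text, which is the incompatibility the paragraph above refers to.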

Popularity internally in software

In the memory of a computer program, use of UTF-16 is very common, particularly in Windows, but also in JavaScript, Qt, and many other cross-platform software libraries. Compatibility with the Windows API is a major reason for this.

At one time it was believed by many (and is still believed by some today) that fixed-size code units offer computational advantages, which led many systems, in particular Windows, to use the fixed-size UCS-2 with two bytes per character. This is false: strings are almost never randomly accessed, and sequential access is the same speed. In addition, even UCS-2 was not "fixed size" once combining characters are considered, and when Unicode exceeded 65,536 code points it had to be replaced with the variable-width UTF-16 anyway.
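The point that neither encoding gives one code unit per user-perceived character can be shown by counting code units. This is a minimal sketch; the helper names utf16_units and utf8_units and the chosen examples are assumptions made for illustration.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

def utf8_units(s):
    """Number of 8-bit code units (bytes) needed to store s in UTF-8."""
    return len(s.encode("utf-8"))

examples = {
    "ascii letter": "A",
    "precomposed e-acute": "\u00e9",
    "e plus combining acute": "e\u0301",   # one visible character, two code points
    "emoji beyond U+FFFF": "\U0001F600",   # needs a surrogate pair in UTF-16
}

counts = {
    name: (len(s), utf16_units(s), utf8_units(s))
    for name, s in examples.items()
}

for name, (points, u16, u8) in counts.items():
    print(f"{name}: {points} code point(s), {u16} UTF-16 unit(s), {u8} UTF-8 byte(s)")
```

The combining sequence already occupies two 16-bit units even within the UCS-2 range, and the emoji needs two units in UTF-16, so indexing by code unit does not index by character in either case.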

Recently it has become clear that the overhead of translating from and to UTF-8 on input and output, and of dealing with potential encoding errors in the input UTF-8, overwhelms any savings UTF-16 could offer, so newer software systems are starting to use UTF-8. International Components for Unicode (ICU) historically used only UTF-16, and still does so for Java; it now also supports UTF-8 (for C/C++, and indirectly for other languages), and is used that way by, for example, Microsoft. UTF-8 is supported as the "Default Charset", [20] including correct handling of "illegal UTF-8". [21] The default string primitives in newer programming languages, such as Go, [22] Julia, Rust and Swift 5, [23] assume UTF-8 encoding. PyPy also uses UTF-8 for its strings, [24] and Python is looking into storing all strings as UTF-8. [25] Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface. [26]
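To illustrate what dealing with encoding errors in input UTF-8 involves, the sketch below decodes a byte string that is not valid UTF-8. It shows the behaviour of Python's built-in decoder only and makes no claim about how ICU or any other library mentioned above handles the same input; the sample bytes are assumed for illustration.

```python
# "café au lait" saved as ISO-8859-1: the lone byte 0xE9 is not a valid
# UTF-8 sequence, because it is a lead byte with no continuation bytes.
raw = b"caf\xe9 au lait"

# Strict decoding rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER for the bad
# byte and continues with the rest of the text.
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok, repr(replaced))
```

A program that keeps text as UTF-8 internally still has to choose between these two policies, but it makes that choice once at the input boundary rather than on every conversion.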

References

  1. Davis, Mark (2012-02-03). "Unicode over 60 percent of the web". Official Google Blog. Archived from the original on 2018-08-09. Retrieved 2020-07-24.
  2. Davis, Mark (2008-05-05). "Moving to Unicode 5.1". Official Google Blog. Retrieved 2023-03-13.
  3. "Usage Survey of Character Encodings broken down by Ranking". W3Techs. Retrieved 2024-04-09.
  4. "Usage statistics and market share of ASCII for websites, January 2024". W3Techs. Retrieved 2024-01-13.
  5. "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2024-04-09.
  6. "Distribution of Character Encodings among websites that use .cn". w3techs.com. Retrieved 2021-11-01.
  7. "Distribution of Character Encodings among websites that use Chinese". w3techs.com. Retrieved 2021-11-01.
  8. The Chinese standard GB 2312 and its extension GBK, both of which web browsers interpret as GB 18030, which supports the same characters as UTF-8.
  9. "Distribution of Character Encodings among websites that use Taiwan". w3techs.com. Retrieved 2024-04-09.
  10. "Distribution of Character Encodings among websites that use .ru". w3techs.com. Retrieved 2024-04-09.
  11. "Distribution of Character Encodings among websites that use Greek". w3techs.com. Retrieved 2024-01-01.
  12. "Distribution of Character Encodings among websites that use Hebrew". w3techs.com. Retrieved 2024-02-02.
  13. "Historical trends in the usage of character encodings". Retrieved 2024-01-01.
  14. "UTF-8 Usage Statistics". BuiltWith. Retrieved 2011-03-28.
  15. "Usage Report of UTF-8 broken down by Content Languages". w3techs.com. Retrieved 2024-04-09.
  16. "ХҮМҮҮН БИЧИГ" [Human papers]. khumuunbichig.montsame.mn (in Mongolian). Montsame News Agency. Retrieved 2022-10-26.
  17. "Distribution of Character Encodings among websites that use Sign Languages". w3techs.com. Retrieved 2018-12-03.
  18. "Charset". Android Developers. Retrieved 2021-01-02. Android note: The Android platform default is always UTF-8.
  19. Galloway, Matt (9 October 2012). "Character encoding for iOS developers. Or UTF-8 what now?". www.galloway.me.uk. Retrieved 2021-01-02. in reality, you usually just assume UTF-8 since that is by far the most common encoding.
  20. "UTF-8 - ICU User Guide". userguide.icu-project.org. Retrieved 2018-04-03.
  21. "#13311 (change illegal-UTF-8 handling to Unicode "best practice")". bugs.icu-project.org. Retrieved 2018-04-03.
  22. "The Go Programming Language Specification". Retrieved 2021-02-10.
  23. Tsai, Michael J. "Michael Tsai - Blog - UTF-8 String in Swift 5". Retrieved 2021-03-15.
  24. Mattip (2019-03-24). "PyPy Status Blog: PyPy v7.1 released; now uses utf-8 internally for unicode strings". PyPy Status Blog. Retrieved 2020-11-21.
  25. "PEP 623 -- Remove wstr from Unicode". Python.org. Retrieved 2020-11-21. Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy.
  26. "Use the Windows UTF-8 code page". UWP applications. docs.microsoft.com. Retrieved 2020-06-06.