Wide character

Last updated

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

Contents

History

During the 1960s, mainframe and mini-computer manufacturers began to standardize around the 8-bit byte as their smallest datatype. The 7-bit ASCII character set became the industry standard method for encoding alphanumeric characters for teletype machines and computer terminals. The extra bit was used for parity, to ensure the integrity of data storage and transmission. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.

Later, computer manufacturers began to make use of the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit extensions such as IBM code page 37, PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many others. However, such extensions were still limited in that they were region specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.

In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required the use of a datatype larger than 8-bits to store the new character values in memory. Thus the term wide character was used to differentiate them from traditional 8-bit character datatypes.

Relation to UCS and Unicode

A wide character refers to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined using character sets, with UCS and Unicode simply being two common character sets that encode more characters than an 8-bit wide numeric value (255 total) would allow.

Relation to multibyte characters

Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol.

The C standard distinguishes between multibyte encodings of characters, which use a fixed or variable number of bytes to represent each character (primarily used in source code and external files), from wide characters, which are run-time representations of characters in single objects (typically, greater than 8 bits).

Size of a wide character

Early adoption of UCS-2 ("Unicode 1.0") led to common use of UTF-16 in a number of platforms, most notably Microsoft Windows, .NET and Java. In these systems, it is common to have a "wide character" (wchar_t in C/C++; char in Java) type of 16-bits. These types do not always map directly to one "character", as surrogate pairs are required to store the full range of Unicode (1996, Unicode 2.0). [1] [2] [3]

Unix-like generally use a 32-bit wchar_t to fit the 21-bit Unicode code point, as C90 prescribed. [4]

The size of a wide character type does not dictate what kind of text encodings a system can process, as conversions are available. (Old conversion code commonly overlook surrogates, however.) The historical circumstances of their adoption does also decide what types of encoding they prefer. A system influenced by Unicode 1.0, such as Windows, tends to mainly use "wide strings" made out of wide character units. Other systems such as the Unix-likes, however, tend to retain the 8-bit "narrow string" convention, using a multibyte encoding (almost universally UTF-8) to handle "wide" characters. [5]

Programming specifics

C/C++

The C and C++ standard libraries include a number of facilities for dealing with wide characters and strings composed of them. The wide characters are defined using datatype wchar_t, which in the original C90 standard was defined as

"an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales" (ISO 9899:1990 §4.1.5)

Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined. The ISO/IEC 10646:2003 Unicode standard 4.0 says that:

"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers." [6]

Python

According to Python 2.7's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. It depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system. [7] This distinction has been deprecated since Python 3.3, which introduced a flexibly-sized UCS1/2/4 storage for strings and formally aliased Py_UNICODE to wchar_t. [8] Since Python 3.12 use of wchar_t, i.e. the Py_UNICODE typedef, for Python strings (wstr in implementation) has been dropped and still as before an "UTF-8 representation is created on demand and cached in the Unicode object." [9]

Related Research Articles

<span class="mw-page-title-main">Character encoding</span> Using numbers to represent text characters

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as "code points" and collectively comprise a "code space", a "code page", or a "character map".

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

<span class="mw-page-title-main">Unicode</span> Character encoding standard

Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium designed to support the use of text written in all of the world's major writing systems. Version 15 of the standard defines 149186 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts. Many common characters, including numerals, punctuation, and other symbols, are unified within the standard and are not treated as specific to any given writing system. Unicode encodes thousands of emoji, with the continued development thereof conducted by the Consortium as a part of the standard. Moreover, the widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan. Unicode is ultimately capable of encoding more than 1.1 million characters.

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from UnicodeTransformation Format – 8-bit.

<span class="mw-page-title-main">UTF-16</span> Variable-width encoding of Unicode, using one or two 16-bit code units

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding, now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 216 (65,536) code points were needed.

<span class="mw-page-title-main">Character (computing)</span> Primitive data type

In computer and machine-based telecommunications terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language.

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode code points, needing actually only 21 bits). UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.

<span class="mw-page-title-main">Windows-1252</span> Character encoding

Windows-1252 or CP-1252 is a single-byte character encoding of the Latin alphabet that was used by default in Microsoft Windows for English and many Romance and Germanic languages including Spanish, Portuguese, French, and German. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. All modern operating systems, including Windows, now use Unicode code points and text encodings by default, which are portable across all of the world's major languages.

In computer science, primitive data types are a set of basic data types from which all other data types are constructed. Specifically it often refers to the limited set of data representations in use by a particular processor, which all compiled programs must use. Most processors support a similar set of primitive data types, although the specific representations vary. More generally, "primitive data types" may refer to the standard data types built into a programming language. Data types which are not primitive are referred to as derived or composite.

ISO/IEC 2022Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding. Originating in 1971, it was most recently revised in 1994.

The Standard Compression Scheme for Unicode (SCSU) is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the range 128–255 to offsets within particular blocks of 128 characters. The initial conditions of the encoder mean that existing strings in ASCII and ISO-8859-1 that do not contain C0 control codes other than NULL TAB CR and LF can be treated as SCSU strings. Since most alphabets do reside in blocks of contiguous Unicode codepoints, texts that use small alphabets and either ASCII punctuation or punctuation that fits within the window for the main alphabet can be encoded at one byte per character, most other punctuation can be encoded at 2 bytes per symbol through non-locking shifts. SCSU can also switch to UTF-16 internally to handle non-alphabetic languages.

A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set for representation, usually in a computer. Most common variable-width encodings are multibyte encodings, which use varying numbers of bytes (octets) to encode different characters. (Some authors, notably in Microsoft documentation, use the term multibyte character set, which is a misnomer, because representation size is an attribute of the encoding, not of the character set.)

UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using one to five one-byte (8-bit) code units. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

The C++ programming language has support for string handling, mostly implemented in its standard library. The language standard specifies several string types, some inherited from C, some designed to make use of the language's features, such as classes and RAII. The most-used of these is std::string.

<span class="mw-page-title-main">Universal Character Set characters</span> Complete list of the characters available on most computers

The Unicode Consortium and the ISO/IEC JTC 1/SC 2/WG 2 jointly collaborate on the list of the characters in the Universal Coded Character Set. The Universal Coded Character Set, most commonly called the Universal Character Set, is an international standard to map characters, discrete symbols used in natural language, mathematics, music, and other domains, to unique machine-readable data values. By creating this mapping, the UCS enables computer software vendors to interoperate, and transmit—interchange—UCS-encoded text strings from one to another. Because it is a universal map, it can be used to represent multiple languages at the same time. This avoids the confusion of using multiple legacy character encodings, which can result in the same sequence of codes having multiple interpretations depending on the character encoding in use, resulting in mojibake if the wrong one is chosen.

The Universal Coded Character Set is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS), which is the basis of many character encodings, improving as characters from previously unrepresented typing systems are added.

Microsoft was one of the first companies to implement Unicode in their products. Windows NT was the first operating system that used "wide characters" in system calls. Using the UCS-2 encoding scheme at first, it was upgraded to the variable-width encoding UTF-16 starting with Windows 2000, allowing a representation of additional planes with surrogate pairs. However Microsoft did not support UTF-8 in its API until May 2019.

Escape sequences are used in the programming languages C and C++, and their design was copied in many other languages such as Java, PHP, C#, etc. An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly.

The C programming language has a set of functions implementing operations on strings in its standard library. Various operations, such as copying, concatenation, tokenization and searching are supported. For character strings, the standard library uses the convention that strings are null-terminated: a string of n characters is represented as an array of n + 1 elements, the last of which is a "NUL character" with numeric value 0.

Exact measurements are not possible. Counts of numbers of documents are different than counts weighed by actual use or visibility of those documents. The encoding popularity varies depending on the language used for the documents, or the locale that is the source of the document, or the purpose of the document. Text may be ambiguous as to what encoding it is in, for instance pure ASCII text is valid ASCII or ISO-8859-1 or CP1252 or UTF-8. "Tags" may indicate a document encoding, but when this is incorrect this may be silently corrected by display software, so counts of tags may not be accurate.

References

  1. "Globalization Step-by-Step: Unicode Enabled". msdn.microsoft.com. Archived from the original on 1 January 2009.
  2. "String Class (System)". learn.microsoft.com.
  3. "Primitive Data Types (The Java™ Tutorials > Learning the Java Language > Language Basics)". docs.oracle.com.
  4. "Null-terminated wide strings <wctype.h> - cppreference.com". en.cppreference.com.
  5. "UTF-8 Everywhere". In the following years many systems have added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, such as the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).
  6. "5.2 ANSI/ISO C wchar_t". The Unicode standard. Aliprand, Joan., Unicode Consortium. (Version 4.0 ed.). Boston: Addison-Wesley. 2003. p. 109. ISBN   0-321-18578-1. OCLC   52257637.{{cite book}}: CS1 maint: others (link)
  7. "Unicode Objects and Codecs — Python 2.7 documentation". docs.python.org. Retrieved 19 December 2009.
  8. "Unicode Objects and Codecs — Python 3.10.10 documentation". docs.python.org. Retrieved 18 February 2023.
  9. "Unicode Objects and Codecs". Python documentation. Retrieved 9 September 2023.