Western Latin character sets (computing)

Several 8-bit character sets (encodings) were designed for the binary representation of common Western European languages (Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic). These languages use the Latin alphabet, a few additional letters, letters with precomposed diacritics, some punctuation, and various symbols (including some Greek letters). The same character sets also happen to support many other languages, such as Malay, Swahili, and Classical Latin.

This material is technically obsolete, having been functionally replaced by Unicode. However, it remains of historical interest.

Summary

The ISO/IEC 8859 series of 8-bit character sets covers the Latin-script alphabets used in Europe, although the same code points have different meanings in different character sets, which caused difficulties (including mojibake, or garbled characters, and communication problems). The arrival of Unicode, with a unique code point for every character, resolved these issues.
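The effect of one code point having several meanings can be sketched in a few lines of Python. This is only an illustration: the codec names are those of Python's standard library, and the helper `decode_all` and the sample word are assumptions introduced here.

```python
# One byte value, several meanings: decode 0xA4 under different 8-bit sets.
# The codec names are Python standard-library names (an assumption of this
# sketch); decode_all is a hypothetical helper.
CODECS = ["iso8859_1", "iso8859_15", "cp850", "mac_roman"]


def decode_all(data):
    """Decode the same bytes under every codec in CODECS."""
    return {name: data.decode(name) for name in CODECS}


print(decode_all(bytes([0xA4])))

# Mojibake: text written as code page 850 but read back as ISO-8859-1.
garbled = "señor".encode("cp850").decode("iso8859_1")
print(garbled)
```

The first call shows four different characters for one byte; the second shows how a word is garbled when the writer and the reader assume different character sets.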

History

The earlier seven-bit American Standard Code for Information Interchange (ASCII) encoding has enough characters to properly represent only a few languages, such as English, Latin, Malay and Swahili. It is missing some letters and letter-diacritic combinations used in other Latin-alphabet languages. However, since there was no other choice on most US-supplied computer platforms, use of ASCII was unavoidable except where there was a strong national computing industry. The ISO 646 group of encodings replaced some of the symbols in ASCII with local characters, but space was very limited, and some of the replaced symbols were quite common in contexts such as programming languages.

Most computers internally used eight-bit bytes, but communication (seen as inherently unreliable) used seven data bits plus one parity bit. In time, it became common to use all eight bits for data, creating space for another 128 characters. In the early days most of these were system-specific, but gradually the ISO/IEC 8859 standards emerged to provide some cross-platform similarity and so enable information interchange.

Towards the end of the 20th century, as storage and memory costs fell, the problems caused by a given eight-bit code having multiple meanings (there are seven ISO-Latin code sets alone) ceased to be justifiable. All major operating systems have moved to Unicode as their main internal representation. However, as Windows did not support the UTF-8 method of encoding Unicode (preferring UTF-16), many applications continued to be restricted to these legacy character sets.

The euro sign

The introduction of the euro and its associated euro sign (€) put significant pressure on computer systems developers to support this new symbol, and most 8-bit character sets had to be adapted in some way.

Different vendors placed the euro sign at different code points. Whilst these decisions had limited effect for documents that were only used within a single computer (or at least within a single vendor's "digital ecosystem"), they meant that documents containing a euro sign would fail to render as expected when interchanged between ecosystems.

All of these issues have been resolved as operating systems have been upgraded to support Unicode as standard, which encodes the euro sign at U+20AC (decimal 8364).
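As a rough illustration of the interchange problem, the following Python sketch asks several legacy codecs for the byte they use for the euro sign. The codec names are Python standard-library names and `euro_byte` is a hypothetical helper; Python's `cp850` is the original code page 850 without the euro sign, and `cp858` is its euro-enabled successor.

```python
# Where does the euro sign live in each legacy character set?
# Codec names come from Python's standard library (an assumption of this
# sketch); euro_byte is a hypothetical helper.
EURO = "\u20ac"  # U+20AC, decimal 8364


def euro_byte(codec):
    """Return the euro sign's byte in `codec` as hex, or None if absent."""
    try:
        return EURO.encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


for codec in ("iso8859_1", "iso8859_15", "cp1252", "cp437",
              "cp850", "cp858", "mac_roman"):
    print(codec, euro_byte(codec))
```

A byte written as the euro sign under one of these codecs is decoded as a different character, or not at all, under another.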

Comparison table

Code points U+0000 to U+007F are not shown in this table, as they are mapped directly in all of the character sets listed here. The ASCII standard defines the mapping of these first 128 characters (codes 0 to 127).
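The claim that the lower half is shared can be checked with a short Python sketch. It relies on Python's standard-library codecs as stand-ins for the IANA names used in the table, which is an assumption; `agrees_with_ascii` is a hypothetical helper.

```python
# Check that bytes 0x00-0x7F decode identically under every character set
# in the table. Codec names are Python standard-library names (assumed to
# correspond to the table's IANA names); agrees_with_ascii is hypothetical.
CODECS = ["iso8859_1", "iso8859_15", "cp1252", "cp437", "cp850", "mac_roman"]
ASCII_BYTES = bytes(range(0x80))


def agrees_with_ascii(codec):
    """True if the codec maps bytes 0x00-0x7F exactly as ASCII does."""
    return ASCII_BYTES.decode(codec) == ASCII_BYTES.decode("ascii")


print({codec: agrees_with_ascii(codec) for codec in CODECS})

# By contrast, a byte from the upper half decodes differently.
upper_half = {bytes([0xE9]).decode(codec) for codec in CODECS}
print(upper_half)
```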

The table is arranged by Unicode code point. Character sets are referred to here by their IANA names in upper case.

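| Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH |
|---|---|---|---|---|---|---|---|
| NBSP | U+00A0 | A0 | A0 | A0 | FF | FF | CA |
| ¡ | U+00A1 | A1 | A1 | A1 | AD | AD | C1 |
| ¢ | U+00A2 | A2 | A2 | A2 | 9B | BD | A2 |
| £ | U+00A3 | A3 | A3 | A3 | 9C | 9C | A3 |
| ¤ | U+00A4 | A4 |  | A4 |  | CF |  |
| ¥ | U+00A5 | A5 | A5 | A5 | 9D | BE | B4 |
| ¦ | U+00A6 | A6 |  | A6 |  | DD |  |
| § | U+00A7 | A7 | A7 | A7 |  | F5 | A4 |
| ¨ | U+00A8 | A8 |  | A8 |  | F9 | AC |
| © | U+00A9 | A9 | A9 | A9 |  | B8 | A9 |
| ª | U+00AA | AA | AA | AA | A6 | A6 | BB |
| « | U+00AB | AB | AB | AB | AE | AE | C7 |
| ¬ | U+00AC | AC | AC | AC | AA | AA | C2 |
| SHY | U+00AD | AD | AD | AD |  | F0 |  |
| ® | U+00AE | AE | AE | AE |  | A9 | A8 |
| ¯ | U+00AF | AF | AF | AF |  | EE | F8 |
| ° | U+00B0 | B0 | B0 | B0 | F8 | F8 | A1 |
| ± | U+00B1 | B1 | B1 | B1 | F1 | F1 | B1 |
| ² | U+00B2 | B2 | B2 | B2 | FD | FD |  |
| ³ | U+00B3 | B3 | B3 | B3 |  | FC |  |
| ´ | U+00B4 | B4 |  | B4 |  | EF | AB |
| µ | U+00B5 | B5 | B5 | B5 | E6 | E6 | B5 |
| ¶ | U+00B6 | B6 | B6 | B6 |  | F4 | A6 |
| · | U+00B7 | B7 | B7 | B7 | FA | FA | E1 |
| ¸ | U+00B8 | B8 |  | B8 |  | F7 | FC |
| ¹ | U+00B9 | B9 | B9 | B9 |  | FB |  |
| º | U+00BA | BA | BA | BA | A7 | A7 | BC |
| » | U+00BB | BB | BB | BB | AF | AF | C8 |
| ¼ | U+00BC | BC |  | BC | AC | AC |  |
| ½ | U+00BD | BD |  | BD | AB | AB |  |
| ¾ | U+00BE | BE |  | BE |  | F3 |  |
| ¿ | U+00BF | BF | BF | BF | A8 | A8 | C0 |
| À | U+00C0 | C0 | C0 | C0 |  | B7 | CB |
| Á | U+00C1 | C1 | C1 | C1 |  | B5 | E7 |
| Â | U+00C2 | C2 | C2 | C2 |  | B6 | E5 |
| Ã | U+00C3 | C3 | C3 | C3 |  | C7 | CC |
| Ä | U+00C4 | C4 | C4 | C4 | 8E | 8E | 80 |
| Å | U+00C5 | C5 | C5 | C5 | 8F | 8F | 81 |
| Æ | U+00C6 | C6 | C6 | C6 | 92 | 92 | AE |
| Ç | U+00C7 | C7 | C7 | C7 | 80 | 80 | 82 |
| È | U+00C8 | C8 | C8 | C8 |  | D4 | E9 |
| É | U+00C9 | C9 | C9 | C9 | 90 | 90 | 83 |
| Ê | U+00CA | CA | CA | CA |  | D2 | E6 |
| Ë | U+00CB | CB | CB | CB |  | D3 | E8 |
| Ì | U+00CC | CC | CC | CC |  | DE | ED |
| Í | U+00CD | CD | CD | CD |  | D6 | EA |
| Î | U+00CE | CE | CE | CE |  | D7 | EB |
| Ï | U+00CF | CF | CF | CF |  | D8 | EC |
| Ð | U+00D0 | D0 | D0 | D0 |  | D1 |  |
| Ñ | U+00D1 | D1 | D1 | D1 | A5 | A5 | 84 |
| Ò | U+00D2 | D2 | D2 | D2 |  | E3 | F1 |
| Ó | U+00D3 | D3 | D3 | D3 |  | E0 | EE |
| Ô | U+00D4 | D4 | D4 | D4 |  | E2 | EF |
| Õ | U+00D5 | D5 | D5 | D5 |  | E5 | CD |
| Ö | U+00D6 | D6 | D6 | D6 | 99 | 99 | 85 |
| × | U+00D7 | D7 | D7 | D7 |  | 9E |  |
| Ø | U+00D8 | D8 | D8 | D8 |  | 9D | AF |
| Ù | U+00D9 | D9 | D9 | D9 |  | EB | F4 |
| Ú | U+00DA | DA | DA | DA |  | E9 | F2 |
| Û | U+00DB | DB | DB | DB |  | EA | F3 |
| Ü | U+00DC | DC | DC | DC | 9A | 9A | 86 |
| Ý | U+00DD | DD | DD | DD |  | ED |  |
| Þ | U+00DE | DE | DE | DE |  | E8 |  |
| ß | U+00DF | DF | DF | DF | E1 | E1 | A7 |
| à | U+00E0 | E0 | E0 | E0 | 85 | 85 | 88 |
| á | U+00E1 | E1 | E1 | E1 | A0 | A0 | 87 |
| â | U+00E2 | E2 | E2 | E2 | 83 | 83 | 89 |
| ã | U+00E3 | E3 | E3 | E3 |  | C6 | 8B |
| ä | U+00E4 | E4 | E4 | E4 | 84 | 84 | 8A |
| å | U+00E5 | E5 | E5 | E5 | 86 | 86 | 8C |
| æ | U+00E6 | E6 | E6 | E6 | 91 | 91 | BE |
| ç | U+00E7 | E7 | E7 | E7 | 87 | 87 | 8D |
| è | U+00E8 | E8 | E8 | E8 | 8A | 8A | 8F |
| é | U+00E9 | E9 | E9 | E9 | 82 | 82 | 8E |
| ê | U+00EA | EA | EA | EA | 88 | 88 | 90 |
| ë | U+00EB | EB | EB | EB | 89 | 89 | 91 |
| ì | U+00EC | EC | EC | EC | 8D | 8D | 93 |
| í | U+00ED | ED | ED | ED | A1 | A1 | 92 |
| î | U+00EE | EE | EE | EE | 8C | 8C | 94 |
| ï | U+00EF | EF | EF | EF | 8B | 8B | 95 |
| ð | U+00F0 | F0 | F0 | F0 |  | D0 |  |
| ñ | U+00F1 | F1 | F1 | F1 | A4 | A4 | 96 |
| ò | U+00F2 | F2 | F2 | F2 | 95 | 95 | 98 |
| ó | U+00F3 | F3 | F3 | F3 | A2 | A2 | 97 |
| ô | U+00F4 | F4 | F4 | F4 | 93 | 93 | 99 |
| õ | U+00F5 | F5 | F5 | F5 |  | E4 | 9B |
| ö | U+00F6 | F6 | F6 | F6 | 94 | 94 | 9A |
| ÷ | U+00F7 | F7 | F7 | F7 | F6 | F6 | D6 |
| ø | U+00F8 | F8 | F8 | F8 |  | 9B | BF |
| ù | U+00F9 | F9 | F9 | F9 | 97 | 97 | 9D |
| ú | U+00FA | FA | FA | FA | A3 | A3 | 9C |
| û | U+00FB | FB | FB | FB | 96 | 96 | 9E |
| ü | U+00FC | FC | FC | FC | 81 | 81 | 9F |
| ý | U+00FD | FD | FD | FD |  | EC |  |
| þ | U+00FE | FE | FE | FE |  | E7 |  |
| ÿ | U+00FF | FF | FF | FF | 98 | 98 | D8 |
| ı | U+0131 |  |  |  |  | D5 | F5 |
| Œ | U+0152 |  | BC | 8C |  |  | CE |
| œ | U+0153 |  | BD | 9C |  |  | CF |
| Š | U+0160 |  | A6 | 8A |  |  |  |
| š | U+0161 |  | A8 | 9A |  |  |  |
| Ÿ | U+0178 |  | BE | 9F |  |  | D9 |
| Ž | U+017D |  | B4 | 8E |  |  |  |
| ž | U+017E |  | B8 | 9E |  |  |  |
| ƒ | U+0192 |  |  | 83 | 9F | 9F | C4 |
| ˆ | U+02C6 |  |  | 88 |  |  | F6 |
| ˇ | U+02C7 |  |  |  |  |  | FF |
| ˘ | U+02D8 |  |  |  |  |  | F9 |
| ˙ | U+02D9 |  |  |  |  |  | FA |
| ˚ | U+02DA |  |  |  |  |  | FB |
| ˛ | U+02DB |  |  |  |  |  | FE |
| ˜ | U+02DC |  |  | 98 |  |  | F7 |
| ˝ | U+02DD |  |  |  |  |  | FD |
| Γ | U+0393 |  |  |  | E2 |  |  |
| Θ | U+0398 |  |  |  | E9 |  |  |
| Σ | U+03A3 |  |  |  | E4 |  |  |
| Φ | U+03A6 |  |  |  | E8 |  |  |
| Ω | U+03A9 |  |  |  | EA |  | BD |
| α | U+03B1 |  |  |  | E0 |  |  |
| δ | U+03B4 |  |  |  | EB |  |  |
| ε | U+03B5 |  |  |  | EE |  |  |
| π | U+03C0 |  |  |  | E3 |  | B9 |
| σ | U+03C3 |  |  |  | E5 |  |  |
| τ | U+03C4 |  |  |  | E7 |  |  |
| φ | U+03C6 |  |  |  | ED |  |  |
| – | U+2013 |  |  | 96 |  |  | D0 |
| — | U+2014 |  |  | 97 |  |  | D1 |
| ‗ | U+2017 |  |  |  |  | F2 |  |
| ‘ | U+2018 |  |  | 91 |  |  | D4 |
| ’ | U+2019 |  |  | 92 |  |  | D5 |
| ‚ | U+201A |  |  | 82 |  |  | E2 |
| “ | U+201C |  |  | 93 |  |  | D2 |
| ” | U+201D |  |  | 94 |  |  | D3 |
| „ | U+201E |  |  | 84 |  |  | E3 |
| † | U+2020 |  |  | 86 |  |  | A0 |
| ‡ | U+2021 |  |  | 87 |  |  | E0 |
| • | U+2022 |  |  | 95 |  |  | A5 |
| … | U+2026 |  |  | 85 |  |  | C9 |
| ‰ | U+2030 |  |  | 89 |  |  | E4 |
| ‹ | U+2039 |  |  | 8B |  |  | DC |
| › | U+203A |  |  | 9B |  |  | DD |
| ⁄ | U+2044 |  |  |  |  |  | DA |
| ⁿ | U+207F |  |  |  | FC |  |  |
| ₧ | U+20A7 |  |  |  | 9E |  |  |
| € | U+20AC |  | A4 | 80 |  | (D5) [nb 1] [2] [3] | DB |
| ™ | U+2122 |  |  | 99 |  |  | AA |
| ∂ | U+2202 |  |  |  |  |  | B6 |
| ∆ | U+2206 |  |  |  |  |  | C6 |
| ∏ | U+220F |  |  |  |  |  | B8 |
| ∑ | U+2211 |  |  |  |  |  | B7 |
| ∙ | U+2219 |  |  |  | F9 |  |  |
| √ | U+221A |  |  |  | FB |  | C3 |
| ∞ | U+221E |  |  |  | EC |  | B0 |
| ∩ | U+2229 |  |  |  | EF |  |  |
| ∫ | U+222B |  |  |  |  |  | BA |
| ≈ | U+2248 |  |  |  | F7 |  | C5 |
| ≠ | U+2260 |  |  |  |  |  | AD |
| ≡ | U+2261 |  |  |  | F0 |  |  |
| ≤ | U+2264 |  |  |  | F3 |  | B2 |
| ≥ | U+2265 |  |  |  | F2 |  | B3 |
| ⌐ | U+2310 |  |  |  | A9 |  |  |
| ⌠ | U+2320 |  |  |  | F4 |  |  |
| ⌡ | U+2321 |  |  |  | F5 |  |  |
| ─ | U+2500 |  |  |  | C4 | C4 |  |
| │ | U+2502 |  |  |  | B3 | B3 |  |
| ┌ | U+250C |  |  |  | DA | DA |  |
| ┐ | U+2510 |  |  |  | BF | BF |  |
| └ | U+2514 |  |  |  | C0 | C0 |  |
| ┘ | U+2518 |  |  |  | D9 | D9 |  |
| ├ | U+251C |  |  |  | C3 | C3 |  |
| ┤ | U+2524 |  |  |  | B4 | B4 |  |
| ┬ | U+252C |  |  |  | C2 | C2 |  |
| ┴ | U+2534 |  |  |  | C1 | C1 |  |
| ┼ | U+253C |  |  |  | C5 | C5 |  |
| ═ | U+2550 |  |  |  | CD | CD |  |
| ║ | U+2551 |  |  |  | BA | BA |  |
| ╒ | U+2552 |  |  |  | D5 |  |  |
| ╓ | U+2553 |  |  |  | D6 |  |  |
| ╔ | U+2554 |  |  |  | C9 | C9 |  |
| ╕ | U+2555 |  |  |  | B8 |  |  |
| ╖ | U+2556 |  |  |  | B7 |  |  |
| ╗ | U+2557 |  |  |  | BB | BB |  |
| ╘ | U+2558 |  |  |  | D4 |  |  |
| ╙ | U+2559 |  |  |  | D3 |  |  |
| ╚ | U+255A |  |  |  | C8 | C8 |  |
| ╛ | U+255B |  |  |  | BE |  |  |
| ╜ | U+255C |  |  |  | BD |  |  |
| ╝ | U+255D |  |  |  | BC | BC |  |
| ╞ | U+255E |  |  |  | C6 |  |  |
| ╟ | U+255F |  |  |  | C7 |  |  |
| ╠ | U+2560 |  |  |  | CC | CC |  |
| ╡ | U+2561 |  |  |  | B5 |  |  |
| ╢ | U+2562 |  |  |  | B6 |  |  |
| ╣ | U+2563 |  |  |  | B9 | B9 |  |
| ╤ | U+2564 |  |  |  | D1 |  |  |
| ╥ | U+2565 |  |  |  | D2 |  |  |
| ╦ | U+2566 |  |  |  | CB | CB |  |
| ╧ | U+2567 |  |  |  | CF |  |  |
| ╨ | U+2568 |  |  |  | D0 |  |  |
| ╩ | U+2569 |  |  |  | CA | CA |  |
| ╪ | U+256A |  |  |  | D8 |  |  |
| ╫ | U+256B |  |  |  | D7 |  |  |
| ╬ | U+256C |  |  |  | CE | CE |  |
| ▀ | U+2580 |  |  |  | DF | DF |  |
| ▄ | U+2584 |  |  |  | DC | DC |  |
| █ | U+2588 |  |  |  | DB | DB |  |
| ▌ | U+258C |  |  |  | DD |  |  |
| ▐ | U+2590 |  |  |  | DE |  |  |
| ░ | U+2591 |  |  |  | B0 | B0 |  |
| ▒ | U+2592 |  |  |  | B1 | B1 |  |
| ▓ | U+2593 |  |  |  | B2 | B2 |  |
| ■ | U+25A0 |  |  |  | FE | FE |  |
| ◊ | U+25CA |  |  |  |  |  | D7 |
| ﬁ | U+FB01 |  |  |  |  |  | DE |
| ﬂ | U+FB02 |  |  |  |  |  | DF |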
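A row of the table can be approximated programmatically. The sketch below maps each column's IANA name to a Python standard-library codec and encodes a single character under each one. The mapping `COLUMNS` and the helper `table_row` are assumptions made for illustration, and Python's codecs may differ from the table in places (for example, Python's `cp850` has no euro sign).

```python
# Recompute one row of the comparison table from Python's built-in codecs.
# COLUMNS pairs the table's IANA names with assumed Python codec names;
# table_row is a hypothetical helper.
COLUMNS = {
    "ISO-8859-1": "iso8859_1",
    "ISO-8859-15": "iso8859_15",
    "WINDOWS-1252": "cp1252",
    "IBM437": "cp437",
    "IBM850": "cp850",
    "MACINTOSH": "mac_roman",
}


def table_row(char):
    """Return [code point, hex byte per column], with '' where unmapped."""
    cells = []
    for codec in COLUMNS.values():
        try:
            cells.append(char.encode(codec).hex().upper())
        except UnicodeEncodeError:
            cells.append("")  # character absent from this character set
    return [f"U+{ord(char):04X}", *cells]


print(table_row("é"))
print(table_row("Œ"))
```

An empty string plays the role of a blank cell: the character has no code point in that character set.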

Notes

  1. IBM's PC DOS 2000, released in 1998, redefined code page 850 as what IBM called "modified code page 850", which places the euro sign at code point 213 (0xD5), rather than adding support for the new code page 858. The reason may lie in restrictions in the codepage switching logic of MS-DOS/PC DOS, which limited .CPI files to 64 KB in size, or about six codepages. This limit was circumvented in some OEM versions of MS-DOS and in Windows NT, and does not exist in DR-DOS. Further, the parser in MS-DOS/PC DOS limits the number of country/codepage entries in COUNTRY.SYS files to a maximum of 146 or 438, a limitation that does not exist in DR-DOS. Adding support for codepage 858 might therefore have meant dropping another codepage (e.g. codepage 850) at the same time, which might not have been viable, given that some applications were hard-wired to use codepage 850.

References

  1. "00858". Code pages by CPGID. IBM. Archived from the original on 2016-06-06. Retrieved 2016-06-06.
  2. Paul, Matthias R. (2001-08-15). "Changing codepages in FreeDOS" (Technical design specification based on fd-dev post). Archived from the original on 2016-06-06. Retrieved 2016-06-06. The new official ID for the Multilingual "codepage 850 with EURO SIGN" is 858, not 850. IBM will switch to use 858 instead of their 850 variant with future issues of their products. [...] I can only guess why they didn't add 858 to their EGAx.CPI, COUNTRY.SYS, and KEYBOARD.SYS files in PC DOS 2000. Many third-party applications are designed to work with 850 and didn't know about 858 at the time PC DOS 2000 was released, so it's easier for everyone, but unfortunately it's not compatible. [...] As explained above, COUNTRY.SYS and KEYBOARD.SYS contain only two codepage entries for a given country in Western issues of DOS. (In Arabic and Hebrew issues there can be up to 8 codepages for one country, in theory there is no limit below the range of allowed codepages 1..65534). [...] The problem is that removing support for 850 might have caused compatibility problems with applications which are hard-wired to use 850. Adding 858 as a third choice to all the files would have increased the file and table sizes significantly. The COUNTRY.SYS file parser in MS-DOS/PC DOS IO.SYS/IBMBIO.COM sets aside a 6 Kb (for DOS 6) scratchpad to load all the info. This allows a maximum of 438 entries in a COUNTRY.SYS file to be accepted, otherwise you will get the message "COUNTRY.SYS too large.". The NLSFUNC parser does not have this limitation, and the file parsers in DR-DOS (kernel and NLSFUNC) also do not know of such a restriction. Older issues of MS-DOS/PC DOS even had a 2 Kb buffer for a maximum of 146 entries.
  3. Paul, Matthias R. (2001-08-27). "Changing codepages in FreeDOS (follow-up)". Archived from the original on 2014-10-01. Retrieved 2013-05-08. [...] one could also create custom .CPI files in the traditional FONT style without difficulties, but you could only store up to [...] six codepages in such a file if it should be useable by MS-DOS/PC DOS (some OEM issues and NT can handle files larger than 64 Kb, but MS-DOS/PC DOS can not).
  4. "IBM Conversion Mapping Tables". Unicode Consortium.