Cork encoding

The Cork (also known as T1 or EC) encoding is a character encoding used for encoding glyphs in fonts. [1] It is named after the city of Cork in Ireland, where a new encoding for LaTeX was introduced at a TeX Users Group (TUG) conference in 1990. [1] It contains 256 characters and supports most Western and Eastern European languages written in the Latin alphabet. [2]

Details

In 8-bit TeX engines the font encoding has to match the encoding of the hyphenation patterns, and it is for these patterns that the Cork encoding is most commonly used. [3] In LaTeX one can switch to this encoding with \usepackage[T1]{fontenc}, while in ConTeXt MkII it is already the default encoding. Modern engines such as XeTeX and LuaTeX fully support Unicode, so the 8-bit font encodings are obsolete there.

Character set

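Cork encoding

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0x | ` U+0060 | ´ U+00B4 | ˆ U+02C6 | ˜ U+02DC | ¨ U+00A8 | ˝ U+02DD | ˚ U+02DA | ˇ U+02C7 | ˘ U+02D8 | ¯ U+00AF | ˙ U+02D9 | ¸ U+00B8 | ˛ U+02DB | ‚ U+201A | ‹ U+2039 | › U+203A |
| 1x | “ U+201C | ” U+201D | „ U+201E | « U+00AB | » U+00BB | – U+2013 | — U+2014 | ZWSP | ₀ U+2080 [note 1] | ı U+0131 [note 2] | ȷ U+0237 [note 2] | ﬀ U+FB00 | ﬁ U+FB01 | ﬂ U+FB02 | ﬃ U+FB03 | ﬄ U+FB04 |
| 2x | SP | ! | " | # | $ | % | & | ’ U+2019 | ( | ) | * | + | , | - | . | / |
| 3x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4x | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5x | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 6x | ‘ U+2018 | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7x | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | SHY [note 3] |
| 8x | Ă U+0102 | Ą U+0104 | Ć U+0106 | Č U+010C | Ď U+010E | Ě U+011A | Ę U+0118 | Ğ U+011E | Ĺ U+0139 | Ľ U+013D | Ł U+0141 | Ń U+0143 | Ň U+0147 | Ŋ U+014A | Ő U+0150 | Ŕ U+0154 |
| 9x | Ř U+0158 | Ś U+015A | Š U+0160 | Ș U+0218 | Ť U+0164 | Ț U+021A | Ű U+0170 | Ů U+016E | Ÿ U+0178 | Ź U+0179 | Ž U+017D | Ż U+017B | Ĳ U+0132 | İ U+0130 | đ U+0111 | § U+00A7 |
| Ax | ă U+0103 | ą U+0105 | ć U+0107 | č U+010D | ď U+010F | ě U+011B | ę U+0119 | ğ U+011F | ĺ U+013A | ľ U+013E | ł U+0142 | ń U+0144 | ň U+0148 | ŋ U+014B | ő U+0151 | ŕ U+0155 |
| Bx | ř U+0159 | ś U+015B | š U+0161 | ș U+0219 | ť U+0165 | ț U+021B | ű U+0171 | ů U+016F | ÿ U+00FF | ź U+017A | ž U+017E | ż U+017C | ĳ U+0133 | ¡ U+00A1 | ¿ U+00BF | £ U+00A3 |
| Cx | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
| Dx | Ð [note 4] | Ñ | Ò | Ó | Ô | Õ | Ö | Œ U+0152 | Ø | Ù | Ú | Û | Ü | Ý | Þ | SS U+1E9E [note 5] |
| Ex | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
| Fx | ð | ñ | ò | ó | ô | õ | ö | œ U+0153 | ø | ù | ú | û | ü | ý | þ | ß U+00DF |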

Notes

  1. 0x18 is just a "trailing zero", used to compose ‰ or ‱ (or arbitrarily smaller quantities) out of the percent sign (%).
  2. Dotless i and dotless j may be used to compose accented variants such as i with macron (ī).
  3. 0x7F is the hyphenation character (not really a soft hyphen).
  4. 0xD0 is used both as Eth (Ð, U+00D0) and as D with stroke (Đ, U+0110), which can be a problem in some situations (such as copying text from a PDF or hyphenation).
  5. 0xDF contains SS (two letters S). It allows TeX to automatically convert the German lowercase ß into the uppercase form.
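
The table above can be turned into a small lookup for converting Cork-encoded bytes to Unicode text. The following is a minimal sketch, not a complete or authoritative implementation: the names `build_cork_table`, `decode_cork` and `CORK_TO_UNICODE` are hypothetical, the accent, ligature and special slots 0x00 to 0x1F and 0x7F are left out, 0xD0 is always read as Eth (see note 4), and replacing unmapped bytes with U+FFFD is an assumption made for illustration.

```python
# Partial Cork (T1) -> Unicode lookup, transcribed from the character set table.
# Slots 0x00-0x1F and 0x7F are deliberately not covered by this sketch.

def build_cork_table():
    table = {}

    # 0x20-0x7E follow ASCII, except for the two quote slots.
    for code in range(0x20, 0x7F):
        table[code] = chr(code)
    table[0x27] = "\u2019"  # right single quotation mark
    table[0x60] = "\u2018"  # left single quotation mark

    # 0x80-0xBF: accented letters, code points copied from the table.
    rows = {
        0x80: [0x0102, 0x0104, 0x0106, 0x010C, 0x010E, 0x011A, 0x0118, 0x011E,
               0x0139, 0x013D, 0x0141, 0x0143, 0x0147, 0x014A, 0x0150, 0x0154],
        0x90: [0x0158, 0x015A, 0x0160, 0x0218, 0x0164, 0x021A, 0x0170, 0x016E,
               0x0178, 0x0179, 0x017D, 0x017B, 0x0132, 0x0130, 0x0111, 0x00A7],
        0xA0: [0x0103, 0x0105, 0x0107, 0x010D, 0x010F, 0x011B, 0x0119, 0x011F,
               0x013A, 0x013E, 0x0142, 0x0144, 0x0148, 0x014B, 0x0151, 0x0155],
        0xB0: [0x0159, 0x015B, 0x0161, 0x0219, 0x0165, 0x021B, 0x0171, 0x016F,
               0x00FF, 0x017A, 0x017E, 0x017C, 0x0133, 0x00A1, 0x00BF, 0x00A3],
    }
    for base, code_points in rows.items():
        for offset, cp in enumerate(code_points):
            table[base + offset] = chr(cp)

    # 0xC0-0xFF match the same positions in Unicode, except for four slots.
    for code in range(0xC0, 0x100):
        table[code] = chr(code)
    table[0xD7] = "\u0152"  # OE ligature
    table[0xDF] = "\u1E9E"  # "SS" slot, mapped to U+1E9E as in the table
    table[0xF7] = "\u0153"  # oe ligature
    table[0xFF] = "\u00DF"  # sharp s
    return table


CORK_TO_UNICODE = build_cork_table()


def decode_cork(data: bytes) -> str:
    """Decode Cork bytes; bytes outside the sketch become U+FFFD (assumption)."""
    return "".join(CORK_TO_UNICODE.get(byte, "\ufffd") for byte in data)


print(decode_cork(bytes([0x8A, 0xF3, 0x64, 0xB9])))  # Łódź
```

Because the lookup is a plain dictionary, the omitted slots can be added one entry at a time without changing `decode_cork`.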

Supported languages

The encoding supports most European languages written in the Latin alphabet. Notable exceptions are:

Languages with slightly suboptimal support include:

References

  1. Petrlik, Lukas (1996-06-19). "The Czech and Slovak Character Encoding Mess Explained". cs-encodings-faq. 1.10. Archived from the original on 2016-06-21. Retrieved 2016-06-21.
  2. Ferguson, Michael (1990), "Report on Multilingual Activities" (PDF), TUGboat, 11 (4): 514–516
  3. TeX hyphenation patterns