Cork encoding

The Cork (also known as T1 or EC) encoding is a character encoding used for encoding glyphs in fonts. [1] It is named after the city of Cork in Ireland, where the new encoding was introduced for LaTeX at a TeX Users Group (TUG) conference in 1990. [1] It contains 256 characters and supports most western and eastern European languages written in the Latin alphabet. [2]

Details

In 8-bit TeX engines, the font encoding has to match the encoding of the hyphenation patterns, which is where this encoding is most commonly used. [3] In LaTeX, one can switch to this encoding with \usepackage[T1]{fontenc}, while in ConTeXt MkII it is already the default encoding. In modern engines such as XeTeX and LuaTeX, Unicode is fully supported and the 8-bit font encodings are obsolete.

Character set

Cork encoding
Each row lists the 16 code positions that share a high nibble, in column order 0 1 2 3 4 5 6 7 8 9 A B C D E F. A value such as U+0102 after a character is its Unicode code point.

0x  ` U+0060  ´ U+00B4  ˆ U+02C6  ˜ U+02DC  ¨ U+00A8  ˝ U+02DD  ˚ U+02DA  ˇ U+02C7  ˘ U+02D8  ¯ U+00AF  ˙ U+02D9  ¸ U+00B8  ˛ U+02DB  ‚ U+201A  ‹ U+2039  › U+203A
1x  “ U+201C  ” U+201D  „ U+201E  « U+00AB  » U+00BB  – U+2013  — U+2014  ZWSP [note 1] U+200B  ₀ [note 2] U+2080  ı [note 3] U+0131  ȷ [note 3] U+0237  ﬀ U+FB00  ﬁ U+FB01  ﬂ U+FB02  ﬃ U+FB03  ﬄ U+FB04
2x  SP  !  "  #  $  %  &  ’ U+2019  (  )  *  +  ,  -  .  /
3x  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
4x  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
5x  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
6x  ‘ U+2018  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
7x  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  SHY [note 4]
8x  Ă U+0102  Ą U+0104  Ć U+0106  Č U+010C  Ď U+010E  Ě U+011A  Ę U+0118  Ğ U+011E  Ĺ U+0139  Ľ U+013D  Ł U+0141  Ń U+0143  Ň U+0147  Ŋ U+014A  Ő U+0150  Ŕ U+0154
9x  Ř U+0158  Ś U+015A  Š U+0160  Ş U+015E  Ť U+0164  Ţ U+0162  Ű U+0170  Ů U+016E  Ÿ U+0178  Ź U+0179  Ž U+017D  Ż U+017B  Ĳ U+0132  İ U+0130  đ U+0111  § U+00A7
Ax  ă U+0103  ą U+0105  ć U+0107  č U+010D  ď U+010F  ě U+011B  ę U+0119  ğ U+011F  ĺ U+013A  ľ U+013E  ł U+0142  ń U+0144  ň U+0148  ŋ U+014B  ő U+0151  ŕ U+0155
Bx  ř U+0159  ś U+015B  š U+0161  ş U+015F  ť U+0165  ţ U+0163  ű U+0171  ů U+016F  ÿ U+00FF  ź U+017A  ž U+017E  ż U+017C  ĳ U+0133  ¡ U+00A1  ¿ U+00BF  £ U+00A3
Cx  À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï
Dx  Ð [note 5]  Ñ  Ò  Ó  Ô  Õ  Ö  Œ U+0152  Ø  Ù  Ú  Û  Ü  Ý  Þ  SS [note 6] U+1E9E
Ex  à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï
Fx  ð  ñ  ò  ó  ô  õ  ö  œ U+0153  ø  ù  ú  û  ü  ý  þ  ß U+00DF
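
The chart above can be turned into a lookup table. The following Python sketch is illustrative only: the names HIGH_ROWS, build_cork_table and decode_cork are hypothetical, and the mapping is transcribed from this chart rather than taken from any TeX distribution. It assumes that positions 0x20 to 0x7E follow ASCII and 0xC0 to 0xFF follow ISO 8859-1 except where the chart lists a different code point, and it maps 0x7F to U+00AD because the chart labels that slot SHY.

```python
# Code points for the rows that differ entirely from ASCII / ISO 8859-1,
# transcribed from the chart (assumption: the chart is accurate).
HIGH_ROWS = {
    0x00: "0060 00B4 02C6 02DC 00A8 02DD 02DA 02C7 02D8 00AF 02D9 00B8 02DB 201A 2039 203A",
    0x10: "201C 201D 201E 00AB 00BB 2013 2014 200B 2080 0131 0237 FB00 FB01 FB02 FB03 FB04",
    0x80: "0102 0104 0106 010C 010E 011A 0118 011E 0139 013D 0141 0143 0147 014A 0150 0154",
    0x90: "0158 015A 0160 015E 0164 0162 0170 016E 0178 0179 017D 017B 0132 0130 0111 00A7",
    0xA0: "0103 0105 0107 010D 010F 011B 0119 011F 013A 013E 0142 0144 0148 014B 0151 0155",
    0xB0: "0159 015B 0161 015F 0165 0163 0171 016F 00FF 017A 017E 017C 0133 00A1 00BF 00A3",
}

# Single slots in the ASCII and Latin-1 ranges where the chart gives another code point.
OVERRIDES = {
    0x27: "\u2019",  # right single quotation mark
    0x60: "\u2018",  # left single quotation mark
    0x7F: "\u00AD",  # labelled SHY in the chart; really the hyphenation character
    0xD7: "\u0152",  # OE ligature
    0xDF: "\u1E9E",  # SS, mapped to capital sharp s
    0xF7: "\u0153",  # oe ligature
    0xFF: "\u00DF",  # sharp s
}


def build_cork_table():
    """Return a dict mapping each Cork code position (0-255) to a Unicode string."""
    table = {code: chr(code) for code in range(0x20, 0x7F)}         # ASCII range
    table.update({code: chr(code) for code in range(0xC0, 0x100)})  # Latin-1 range
    for base, row in HIGH_ROWS.items():
        for offset, code_point in enumerate(row.split()):
            table[base + offset] = chr(int(code_point, 16))
    table.update(OVERRIDES)
    return table


CORK_TO_UNICODE = build_cork_table()


def decode_cork(data: bytes) -> str:
    """Decode Cork-encoded bytes into a Unicode string, one byte per character."""
    return "".join(CORK_TO_UNICODE[byte] for byte in data)
```

For example, decode_cork(bytes([0x8A, 0xF3, 0x64, 0xB9])) yields "Łódź".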

Notes

  1. 0x17 is dubbed a “compound word mark” (CWM) in the Cork encoding, and is an innovation of this standard. It is an invisible character that separates compounds in a complex word, for instance in German, in order to disallow esthetic ligatures at compound boundaries. [2] It is mapped to the Unicode “zero-width space” (ZWSP, U+200B), defined at about the same time, whose purpose is similar, if not identical.
  2. 0x18 is a “small o”, used to compose the per mille sign (‰) or the per ten thousand sign (‱) (or arbitrary smaller quantities) out of the percent sign (%). [2]
  3. Dotless i and dotless j may be used to compose accented variants such as i with macron (ī).
  4. 0x7F is the hyphenation character, not really a soft hyphen (SHY) as defined by Unicode.
  5. 0xD0 is used both as Eth (Ð, U+00D0) and as D with stroke (Đ, U+0110), which can cause problems in some situations, for example when copying text from a PDF or during hyphenation.
  6. 0xDF contains SS (two letters S). It allows TeX to automatically convert the German lowercase ß into the uppercase form.
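
The ambiguity described in note 5 can be sketched in Python. This is a hypothetical illustration, not code from any TeX tool: the two small tables below cover only the slots involved, and the names CORK_TO_UNICODE, UNICODE_TO_CORK and round_trip are made up for the example. It assumes that a decoder reads 0xD0 as Eth, as the chart does.

```python
# Only the slots relevant to note 5, transcribed from the chart.
CORK_TO_UNICODE = {
    0x9E: "\u0111",  # d with stroke (lowercase has its own slot)
    0xD0: "\u00D0",  # Eth; the same slot is also used for D with stroke
    0xF0: "\u00F0",  # eth
}

# Hypothetical reverse table: two capital letters land on the same code position.
UNICODE_TO_CORK = {
    "\u0111": 0x9E,  # d with stroke
    "\u00D0": 0xD0,  # Eth
    "\u0110": 0xD0,  # D with stroke shares 0xD0 with Eth
    "\u00F0": 0xF0,  # eth
}


def round_trip(text: str) -> str:
    """Encode text to Cork positions and decode it again."""
    encoded = bytes(UNICODE_TO_CORK[ch] for ch in text)
    return "".join(CORK_TO_UNICODE[byte] for byte in encoded)
```

Under these assumptions, round_trip("\u0110\u0111") returns "\u00D0\u0111": the capital D with stroke comes back as Eth, while the lowercase letter survives because it has its own slot at 0x9E.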

Supported languages

The encoding supports most European languages written in the Latin alphabet. Notable exceptions are:

Languages with slightly suboptimal support include:

References

  1. Petrlik, Lukas (1996-06-19). "The Czech and Slovak Character Encoding Mess Explained". cs-encodings-faq. 1.10. Archived from the original on 2016-06-21. Retrieved 2016-06-21.
  2. Ferguson, Michael (1990). "Report on Multilingual Activities" (PDF). TUGboat. 11 (4): 514–516.
  3. TeX hyphenation patterns