Latin script in Unicode

Over a thousand characters from the Latin script are encoded in the Unicode Standard, grouped in several basic and extended Latin blocks. The extended ranges contain mainly precomposed letters with diacritics, which can be equivalently encoded as a base letter followed by combining diacritics, as well as some ligatures and distinct letters used, for example, in the orthographies of various African languages (including click symbols in Latin Extended-B) and in the Vietnamese alphabet (Latin Extended Additional). Latin Extended-C contains additions for Uighur and the Claudian letters. Latin Extended-D comprises characters that are mostly of interest to medievalists. Latin Extended-E mostly comprises characters used for German dialectology (Teuthonista). [1] Latin Extended-F and -G contain characters for phonetic transcription.
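
The equivalence between a precomposed letter and a base letter with combining diacritics can be checked with Unicode normalization. The following is a minimal sketch using Python's standard unicodedata module; the choice of the Vietnamese letter U+1EBF from Latin Extended Additional is only an illustration.

```python
import unicodedata

# U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE (Latin Extended Additional)
precomposed = "\u1EBF"

# NFD splits the precomposed letter into a base letter plus combining marks.
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0302', 'U+0301']

# NFC recombines the sequence into the single precomposed code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)  # True
```

Both spellings are canonically equivalent, so software that compares text usually normalizes both sides to the same form first.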

Blocks

As of version 16.0 of the Unicode Standard, 1,487 characters in the following 19 blocks are classified as belonging to the Latin script. [2]
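
Counts like this are derived from the script assignments published in the Unicode Character Database file Scripts.txt (reference [2]). The sketch below assumes that file's line format of a code point or range, a semicolon, and a script name; the helper name count_script and the three-line SAMPLE excerpt are illustrative and not part of any library.

```python
import re

# Abridged excerpt in the format of Scripts.txt (illustration only).
SAMPLE = """\
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
"""

# A code point or "start..end" range, then ';', then the script name.
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")

def count_script(lines, script="Latin"):
    """Count the code points assigned to `script` in Scripts.txt-style lines."""
    total = 0
    for line in lines:
        m = LINE.match(line)
        if m and m.group(3) == script:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            total += end - start + 1
    return total

print(count_script(SAMPLE.splitlines()))  # 52
```

The excerpt covers only the Basic Latin letters, so it yields 52; running the same function over the complete file is what would produce the full total.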

In addition, a number of Latin-like characters are encoded in the Currency Symbols, Control Pictures, CJK Compatibility, Enclosed Alphanumerics, Enclosed CJK Letters and Months, Mathematical Alphanumeric Symbols, and Enclosed Alphanumeric Supplement blocks. Although these characters are graphically Latin letters, they have the script property Common and therefore do not belong to the Latin script in Unicode terms. The Lisu script also consists almost entirely of Latin-derived letterforms, but it has its own script property.
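
As a rough illustration of such Latin-like characters, the sketch below prints the names and NFKC compatibility foldings of three sample code points chosen for this example. Python's standard unicodedata module does not expose the Script property itself, so the script values mentioned in the comments come from the text above and are not computed by the code.

```python
import unicodedata

# Sample code points chosen for illustration:
#   U+24B6  Enclosed Alphanumerics (script property Common)
#   U+1D400 Mathematical Alphanumeric Symbols (script property Common)
#   U+A4D0  Lisu (its own script property)
samples = ["\u24B6", "\U0001D400", "\uA4D0"]

# NFKC applies compatibility decompositions; a character that folds to a
# plain ASCII letter is Latin in appearance even if its script is Common.
folded = {ch: unicodedata.normalize("NFKC", ch) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> NFKC {folded[ch]!r}")
```

The first two samples fold to the ASCII letter A, while the Lisu letter is left unchanged because it has no compatibility decomposition.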

Table of characters

In this table those characters with the Unicode script property of Latin are highlighted in colour, indicating the version of Unicode they were introduced in. Reserved code points (which may be assigned as characters at a future date) have a grey background. All characters that do not belong to the Latin script have a white background (and the version of Unicode they were introduced in is therefore not indicated).

Legend: Unicode version
Unicode 1.0, 1.1, 2.0, 3.0, 3.2, 4.0, 4.1, 5.0, 5.1, 5.2, 6.0, 6.1, 7.0, 8.0, 9.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0
Reserved
Not Latin script
U+ 0 1 2 3 4 5 6 7 8 9 A B C D E F Block #
0040@ A B C D E F G H I J K L M N O C0 Controls and Basic Latin
0000–007F
(identical to ASCII)
52
0050 P Q R S T U V W X Y Z [\]^_
0060` a b c d e f g h i j k l m n o
0070 p q r s t u v w x y z {|}~DEL
00A0 ¡¢£¤¥¦§¨© ª «¬®¯ C1 Controls and Latin-1 Supplement
0080–00FF
(identical to ISO/IEC 8859-1)
64
00B0°±²³´µ·¸¹ º »¼½¾¿
00C0 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
00D0 Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
00E0 à á â ã ä å æ ç è é ê ë ì í î ï
00F0 ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
0100 Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Latin Extended-A
0100–017F
128
0110 Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ
0120 Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į
0130 İ ı IJ ij Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ
0140 ŀ Ł ł Ń ń Ņ ņ Ň ň ʼn Ŋ ŋ Ō ō Ŏ ŏ
0150 Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş
0160 Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů
0170 Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ
0180 ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə Latin Extended-B
0180–024F
208
0190 Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ
01A0 Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ Ƭ ƭ Ʈ Ư
01B0 ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ ƻ Ƽ ƽ ƾ ƿ
01C0 ǀ ǁ ǂ ǃ DŽ Dž dž LJ Lj lj NJ Nj nj Ǎ ǎ Ǐ
01D0 ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ
01E0 Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ
01F0 ǰ DZ Dz dz Ǵ ǵ Ƕ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ
0200 Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ ȉ Ȋ ȋ Ȍ ȍ Ȏ ȏ
0210 Ȑ ȑ Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ
0220 Ƞ ȡ Ȣ ȣ Ȥ ȥ Ȧ ȧ Ȩ ȩ Ȫ ȫ Ȭ ȭ Ȯ ȯ
0230 Ȱ ȱ Ȳ ȳ ȴ ȵ ȶ ȷ ȸ ȹ Ⱥ Ȼ ȼ Ƚ Ⱦ ȿ
0240 ɀ Ɂ ɂ Ƀ Ʉ Ʌ Ɇ ɇ Ɉ ɉ Ɋ ɋ Ɍ ɍ Ɏ ɏ
0250 ɐ ɑ ɒ ɓ ɔ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ IPA Extensions
0250–02AF
96
0260 ɠ ɡ ɢ ɣ ɤ ɥ ɦ ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ
0270 ɰ ɱ ɲ ɳ ɴ ɵ ɶ ɷ ɸ ɹ ɺ ɻ ɼ ɽ ɾ ɿ
0280 ʀ ʁ ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ ʎ ʏ
0290 ʐ ʑ ʒ ʓ ʔ ʕ ʖ ʗ ʘ ʙ ʚ ʛ ʜ ʝ ʞ ʟ
02A0 ʠ ʡ ʢ ʣ ʤ ʥ ʦ ʧ ʨ ʩ ʪ ʫ ʬ ʭ ʮ ʯ
02B0 ʰ ʱ ʲ ʳ ʴ ʵ ʶ ʷ ʸ ʹʺʻʼʽʾʿ Spacing Modifier Letters
02B0–02FF
14
02E0 ˠ ˡ ˢ ˣ ˤ ˥˦˧˨˩˪˫ˬ˭ˮ˯
1D00 Phonetic Extensions
1D00–1D7F
111
1D10
1D20
1D30 ᴿ
1D40
1D50
1D60
1D70 ᵿ
1D80 Phonetic Extensions Supplement
1D80–1DBF
63
1D90
1DA0
1DB0 ᶿ
1E00 Latin Extended Additional
1E00–1EFF
256
1E10
1E20
1E30 ḿ
1E40
1E50
1E60
1E70 ṿ
1E80
1E90
1EA0
1EB0 ế
1EC0
1ED0
1EE0
1EF0 ỿ
2070    Superscripts and Subscripts
2070–209F
15
2090   
2120 Ω Letterlike symbols
2100–214F
4
2130
2140
2160 Number Forms
2150–218F
41
2170
2180     
2C60 Latin Extended-C
2C60–2C7F
32
2C70 Ɀ
A720 Latin Extended-D
A720–A7FF
194
A730
A740
A750
A760
A770
A780
A790
A7A0
A7B0
A7C0   
A7D0        
A7E0                
A7F0  
AB30 ꬿ Latin Extended-E
AB30–AB6F
56
AB40
AB50
AB60     
FB00           Alphabetic Presentation Forms 7
FF20 Halfwidth and Fullwidth Forms
(fullwidth Latin letters)
FF00–FFEF
52
FF30 _
FF40
FF50
10780𐞀𐞁𐞂𐞃𐞄𐞅 𐞇𐞈𐞉𐞊𐞋𐞌𐞍𐞎𐞏 Latin Extended-F
10780–107BF
57
10790𐞐𐞑𐞒𐞓𐞔𐞕𐞖𐞗𐞘𐞙𐞚𐞛𐞜𐞝𐞞𐞟
107A0𐞠𐞡𐞢𐞣𐞤𐞥𐞦𐞧𐞨𐞩𐞪𐞫𐞬𐞭𐞮𐞯
107B0𐞰 𐞲𐞳𐞴𐞵𐞶𐞷𐞸𐞹𐞺     
1DF00𝼀𝼁𝼂𝼃𝼄𝼅𝼆𝼇𝼈𝼉𝼊𝼋𝼌𝼍𝼎𝼏 Latin Extended-G
1DF00–1DFFF
37
1DF10𝼐𝼑𝼒𝼓𝼔𝼕𝼖𝼗𝼘𝼙𝼚𝼛𝼜𝼝𝼞 
1DF20     𝼥𝼦𝼧𝼨𝼩𝼪     
Total characters: 1,487
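
The total can be cross-checked against the per-block counts in the table. The sketch below simply transcribes the block names and counts from the table's last two columns; the variable names are illustrative.

```python
# (block name, number of Latin-script characters), transcribed from the table.
LATIN_COUNTS = [
    ("C0 Controls and Basic Latin", 52),
    ("C1 Controls and Latin-1 Supplement", 64),
    ("Latin Extended-A", 128),
    ("Latin Extended-B", 208),
    ("IPA Extensions", 96),
    ("Spacing Modifier Letters", 14),
    ("Phonetic Extensions", 111),
    ("Phonetic Extensions Supplement", 63),
    ("Latin Extended Additional", 256),
    ("Superscripts and Subscripts", 15),
    ("Letterlike Symbols", 4),
    ("Number Forms", 41),
    ("Latin Extended-C", 32),
    ("Latin Extended-D", 194),
    ("Latin Extended-E", 56),
    ("Alphabetic Presentation Forms", 7),
    ("Halfwidth and Fullwidth Forms", 52),
    ("Latin Extended-F", 57),
    ("Latin Extended-G", 37),
]

block_count = len(LATIN_COUNTS)
total = sum(count for _, count in LATIN_COUNTS)
print(block_count, total)  # 19 1487
```

The sum agrees with the figure of 1,487 characters in 19 blocks stated for Unicode 16.0.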

References

  1. Everson, Michael; Dicklberger, Alois; Pentzlin, Karl; Wandl-Vogt, Eveline (2011-06-02). "Revised proposal to encode 'Teuthonista' phonetic characters in the UCS" (PDF).
  2. "Scripts-16.0.0.txt". Unicode Consortium. 2024-04-30. Retrieved 2024-09-12.