Basic Latin (Unicode block)

Basic Latin or C0 Controls and Basic Latin
Range: U+0000..U+007F (128 code points)
Plane: BMP
Scripts: Latin (52 characters), Common (76 characters)
Major alphabets: English, French, German, Spanish, Vietnamese
Symbol sets: Arabic numerals, Punctuation
Assigned: 128 code points
33 Control or Format
Unused: 0 reserved code points
Source standards: ISO/IEC 8859, ISO 646
Unicode version history: 1.0.0 (1991): 128 (+128)
Unicode documentation: Code chart ∣ Web page
Note: [1] [2]

The Basic Latin Unicode block, [3] sometimes informally called C0 Controls and Basic Latin, [4] is the first block of the Unicode standard, and the only block whose code points are each encoded as a single byte in UTF-8. The block contains all the letters, digits, symbols and control codes of the ASCII encoding. It ranges from U+0000 to U+007F, contains 128 characters, and includes the C0 controls, ASCII punctuation and symbols, ASCII digits, the uppercase and lowercase letters of the English alphabet, and the Delete control character.
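The single-byte property can be checked directly. The following is a minimal Python sketch that relies only on the built-in `str.encode`; the helper name `utf8_length` is illustrative and not taken from any standard or library.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))

# Every code point in Basic Latin (U+0000..U+007F) takes exactly one byte,
# and that byte has the same numeric value as the code point itself.
basic_latin_lengths = {utf8_length(cp) for cp in range(0x00, 0x80)}
same_as_code_point = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x00, 0x80)
)

# The first code point of the following block (U+0080) already needs two bytes.
next_block_length = utf8_length(0x80)

print(basic_latin_lengths, same_as_code_point, next_block_length)
```

Because the encoded byte equals the code point, text restricted to this block is byte-for-byte identical in ASCII and UTF-8.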

The Basic Latin block has been included in its present form since version 1.0.0 of the Unicode Standard, with no additions or alterations to the character repertoire. [5] Its block name in Unicode 1.0 was ASCII. [6]

Table of characters

Code    Result  Description                     Acronym
C0 controls
U+0000          Null character                  NUL
U+0001          Start of Heading                SOH
U+0002          Start of Text                   STX
U+0003          End-of-text character           ETX
U+0004          End-of-transmission character   EOT
U+0005          Enquiry character               ENQ
U+0006          Acknowledge character           ACK
U+0007          Bell character                  BEL
U+0008          Backspace                       BS
U+0009          Horizontal tab                  HT
U+000A          Line feed                       LF
U+000B          Vertical tab                    VT
U+000C          Form feed                       FF
U+000D          Carriage return                 CR
U+000E          Shift Out                       SO
U+000F          Shift In                        SI
U+0010          Data Link Escape                DLE
U+0011          Device Control 1                DC1
U+0012          Device Control 2                DC2
U+0013          Device Control 3                DC3
U+0014          Device Control 4                DC4
U+0015          Negative-acknowledge character  NAK
U+0016          Synchronous Idle                SYN
U+0017          End of Transmission Block       ETB
U+0018          Cancel character                CAN
U+0019          End of Medium                   EM
U+001A          Substitute character            SUB
U+001B          Escape character                ESC
U+001C          File Separator                  FS
U+001D          Group Separator                 GS
U+001E          Record Separator                RS
U+001F          Unit Separator                  US
ASCII punctuation and symbols
U+0020          Space                           SP
U+0021  !       Exclamation mark                EXC
U+0022  "       Quotation mark                  QUO
U+0023  #       Number sign
U+0024  $       Dollar sign
U+0025  %       Percent sign
U+0026  &       Ampersand
U+0027  '       Apostrophe
U+0028  (       Left parenthesis
U+0029  )       Right parenthesis
U+002A  *       Asterisk
U+002B  +       Plus sign
U+002C  ,       Comma
U+002D  -       Hyphen-minus
U+002E  .       Full stop or period
U+002F  /       Solidus or Slash
ASCII digits
U+0030  0       Digit Zero
U+0031  1       Digit One
U+0032  2       Digit Two
U+0033  3       Digit Three
U+0034  4       Digit Four
U+0035  5       Digit Five
U+0036  6       Digit Six
U+0037  7       Digit Seven
U+0038  8       Digit Eight
U+0039  9       Digit Nine
ASCII punctuation and symbols
U+003A  :       Colon
U+003B  ;       Semicolon
U+003C  <       Less-than sign
U+003D  =       Equal sign
U+003E  >       Greater-than sign
U+003F  ?       Question mark
U+0040  @       At sign or Commercial at
Uppercase Latin alphabet
U+0041  A       Latin Capital Letter A
U+0042  B       Latin Capital Letter B
U+0043  C       Latin Capital Letter C
U+0044  D       Latin Capital Letter D
U+0045  E       Latin Capital Letter E
U+0046  F       Latin Capital Letter F
U+0047  G       Latin Capital Letter G
U+0048  H       Latin Capital Letter H
U+0049  I       Latin Capital Letter I
U+004A  J       Latin Capital Letter J
U+004B  K       Latin Capital Letter K
U+004C  L       Latin Capital Letter L
U+004D  M       Latin Capital Letter M
U+004E  N       Latin Capital Letter N
U+004F  O       Latin Capital Letter O
U+0050  P       Latin Capital Letter P
U+0051  Q       Latin Capital Letter Q
U+0052  R       Latin Capital Letter R
U+0053  S       Latin Capital Letter S
U+0054  T       Latin Capital Letter T
U+0055  U       Latin Capital Letter U
U+0056  V       Latin Capital Letter V
U+0057  W       Latin Capital Letter W
U+0058  X       Latin Capital Letter X
U+0059  Y       Latin Capital Letter Y
U+005A  Z       Latin Capital Letter Z
ASCII punctuation and symbols
U+005B  [       Left Square Bracket
U+005C  \       Backslash [A]
U+005D  ]       Right Square Bracket
U+005E  ^       Circumflex accent
U+005F  _       Low line
U+0060  `       Grave accent
Lowercase Latin alphabet
U+0061  a       Latin Small Letter A
U+0062  b       Latin Small Letter B
U+0063  c       Latin Small Letter C
U+0064  d       Latin Small Letter D
U+0065  e       Latin Small Letter E
U+0066  f       Latin Small Letter F
U+0067  g       Latin Small Letter G
U+0068  h       Latin Small Letter H
U+0069  i       Latin Small Letter I
U+006A  j       Latin Small Letter J
U+006B  k       Latin Small Letter K
U+006C  l       Latin Small Letter L
U+006D  m       Latin Small Letter M
U+006E  n       Latin Small Letter N
U+006F  o       Latin Small Letter O
U+0070  p       Latin Small Letter P
U+0071  q       Latin Small Letter Q
U+0072  r       Latin Small Letter R
U+0073  s       Latin Small Letter S
U+0074  t       Latin Small Letter T
U+0075  u       Latin Small Letter U
U+0076  v       Latin Small Letter V
U+0077  w       Latin Small Letter W
U+0078  x       Latin Small Letter X
U+0079  y       Latin Small Letter Y
U+007A  z       Latin Small Letter Z
ASCII punctuation and symbols
U+007B  {       Left Curly Bracket
U+007C  |       Vertical bar
U+007D  }       Right Curly Bracket
U+007E  ~       Tilde
Control character
U+007F          Delete                          DEL
A The character U+005C (\) may be displayed as a yen sign (¥) or won sign (₩) by Japanese or Korean fonts that treat Unicode text (especially UTF-8) as a legacy character set in which the backslash was replaced by these signs. [7]

Subheadings

The C0 Controls and Basic Latin block contains six subheadings. [8]

C0 controls

The C0 controls, referred to as C0 ASCII control codes in version 1.0, are inherited from ASCII and other 7-bit and 8-bit encoding schemes. The alias names for the C0 controls are taken from the ISO/IEC 6429:1992 standard. [8]

ASCII punctuation and symbols

This subheading refers to standard punctuation characters, simple mathematical operators, and symbols like the dollar sign, percent, ampersand, underscore, and pipe. [8]

ASCII digits

The ASCII digits subheading contains the ten standard European digits, 0 through 9. [8]

Uppercase Latin alphabet

The Uppercase Latin alphabet subheading contains the standard 26-letter unaccented Latin alphabet in majuscule (uppercase) form. [8]

Lowercase Latin alphabet

The Lowercase Latin alphabet subheading contains the standard 26-letter unaccented Latin alphabet in minuscule (lowercase) form. [8]

Control character

The Control Character subheading contains the "Delete" character. [8]

Number of symbols, letters and control codes

The table below shows the number of letters, symbols and control codes in each of the subheadings in the C0 Controls and Basic Latin block.

Subheading                     Number of symbols                            Range of characters
C0 controls                    32 control codes                             U+0000 to U+001F
ASCII punctuation and symbols  33 punctuation marks and symbols             U+0020 to U+002F, U+003A to U+0040, U+005B to U+0060 and U+007B to U+007E
ASCII digits                   10 digits                                    U+0030 to U+0039
Uppercase Latin alphabet       26 unaccented Latin letters (majuscule)      U+0041 to U+005A
Lowercase Latin alphabet       26 unaccented Latin letters (minuscule)      U+0061 to U+007A
Control character              1 control code (the "Delete" character)      U+007F
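The counts in the table can be reproduced from the listed ranges. The sketch below is illustrative only: the dictionary name `SUBHEADINGS` is an assumption made for this example, and the cross-check uses the General Category values reported by Python's standard `unicodedata` module.

```python
import unicodedata

# Inclusive code point ranges, copied from the table of subheadings.
SUBHEADINGS = {
    "C0 controls": [(0x00, 0x1F)],
    "ASCII punctuation and symbols": [
        (0x20, 0x2F), (0x3A, 0x40), (0x5B, 0x60), (0x7B, 0x7E),
    ],
    "ASCII digits": [(0x30, 0x39)],
    "Uppercase Latin alphabet": [(0x41, 0x5A)],
    "Lowercase Latin alphabet": [(0x61, 0x7A)],
    "Control character": [(0x7F, 0x7F)],
}

# Size of each subheading: sum of the lengths of its ranges.
counts = {
    name: sum(high - low + 1 for low, high in ranges)
    for name, ranges in SUBHEADINGS.items()
}
total = sum(counts.values())

# Cross-check: General Category "Cc" marks control codes (C0 controls plus Delete).
control_count = sum(
    unicodedata.category(chr(cp)) == "Cc" for cp in range(0x00, 0x80)
)

for name, count in counts.items():
    print(f"{name}: {count}")
print("total:", total, "| controls (Cc):", control_count)
```

The six subheadings add up to the block's 128 code points, and the 33 control codes correspond to the 32 C0 controls plus Delete.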

Chart

C0 Controls and Basic Latin [lower-alpha 1]
Official Unicode Consortium code chart (PDF)
        0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
U+000x  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
U+001x  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
U+002x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
U+003x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
U+004x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
U+005x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
U+006x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
U+007x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
  1. As of Unicode version 15.1
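In the chart, the row label carries the upper digits of the code point and the column header its last hexadecimal digit. A small sketch of that mapping follows; the function name `chart_cell` is hypothetical and chosen only for illustration.

```python
def chart_cell(code_point: int) -> tuple:
    """Return the (row label, column label) of a Basic Latin code point."""
    if not 0x00 <= code_point <= 0x7F:
        raise ValueError("code point is outside the Basic Latin block")
    row = f"U+{code_point >> 4:03X}x"   # all but the last hex digit select the row
    column = f"{code_point & 0xF:X}"    # the last hex digit selects the column
    return row, column

print(chart_cell(ord("A")))  # ('U+004x', '1')
print(chart_cell(0x7F))      # ('U+007x', 'F')
```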

Variants

Several of the characters are defined to render as a standardized variant when followed by a variation selector.

A variant is defined for a zero with a short diagonal stroke: U+0030 DIGIT ZERO, U+FE00 VS1 (0). [9] [10]

Twelve characters (#, *, and the digits) can be followed by U+FE0E VS15 or U+FE0F VS16 to create emoji variants. [11] [12] [13] [14] They are keycap base characters, for example #️⃣ (U+0023 NUMBER SIGN U+FE0F VS16 U+20E3 COMBINING ENCLOSING KEYCAP). The VS15 version is "text presentation" while the VS16 version is "emoji-style". [10]

Emoji variation sequences
U+                0023  002A  0030  0031  0032  0033  0034  0035  0036  0037  0038  0039
base              #     *     0     1     2     3     4     5     6     7     8     9
base+VS15+keycap  #︎⃣    *︎⃣    0︎⃣    1︎⃣    2︎⃣    3︎⃣    4︎⃣    5︎⃣    6︎⃣    7︎⃣    8︎⃣    9︎⃣
base+VS16+keycap  #️⃣    *️⃣    0️⃣    1️⃣    2️⃣    3️⃣    4️⃣    5️⃣    6️⃣    7️⃣    8️⃣    9️⃣
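The sequences described in this section are plain concatenations of code points. The following Python sketch builds them explicitly; the constant and function names (`KEYCAP_BASES`, `keycap`) are assumptions made for this example, and how a sequence is actually drawn depends on the font and renderer.

```python
VS1 = "\uFE00"     # VARIATION SELECTOR-1
VS15 = "\uFE0E"    # VARIATION SELECTOR-15 (text presentation)
VS16 = "\uFE0F"    # VARIATION SELECTOR-16 (emoji presentation)
KEYCAP = "\u20E3"  # COMBINING ENCLOSING KEYCAP

KEYCAP_BASES = "#*0123456789"  # the twelve keycap base characters

def keycap(base: str, emoji: bool = True) -> str:
    """Build base + VS16 (or VS15) + keycap for one of the twelve base characters."""
    if len(base) != 1 or base not in KEYCAP_BASES:
        raise ValueError("not a keycap base character")
    return base + (VS16 if emoji else VS15) + KEYCAP

# Slashed-zero variant: U+0030 DIGIT ZERO followed by VS1.
slashed_zero = "0" + VS1

for base in KEYCAP_BASES:
    sequence = keycap(base)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in sequence)
    print(code_points, sequence)
```

Each keycap sequence is three code points long, so `len(keycap("#"))` is 3 even though it is normally displayed as a single glyph.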

History

The following Unicode-related documents record the purpose and process of defining specific characters in the Basic Latin block:

Version  Final code points [lower-alpha 1]  Count  UTC ID  L2 ID  WG2 ID  Document
1.0.0    U+0000..007F                       128                           (to be determined)
UTC/1999-013 Karlsson, Kent (1999-05-27), Tildes and micro sign decompositions
L2/99-176R Moore, Lisa (1999-11-04), "Micro Sign Case Mappings", Minutes from the joint UTC/L2 meeting in Seattle, June 8-10, 1999
L2/04-145 Starner, David (2004-04-30), C with stroke character examples from BAE report 1884 (Dorsey)
L2/04-202 Anderson, Deborah (2004-06-07), Slashed C Feedback
N3046 Suignard, Michel (2006-02-22), Improving formal definition for control characters
N3103 Umamaheswaran, V. S. (2006-08-25), "M48.33", Unconfirmed minutes of WG 2 meeting 48, Mountain View, CA, USA; 2006-04-24/27
L2/11-043 Freytag, Asmus; Karlsson, Kent (2011-02-02), Proposal to correct mistakes and inconsistencies in certain property assignments for super and subscripted letters
L2/11-160 PRI #181 Changing General Category of Twelve Characters, 2011-05-02
L2/11-261R2 Moore, Lisa (2011-08-16), "Consensus 128-C3", UTC #128 / L2 #225 Minutes, Accept Ken Whistler's recommendations in L2/11-281 on name aliases for control characters with the addition of the abbreviations BEL and NUL.
L2/11-438 [lower-alpha 2] [lower-alpha 3] N4182 Edberg, Peter (2011-12-22), Emoji Variation Sequences (Revision of L2/11-429)
L2/15-107 Moore, Lisa (2015-05-12), "Consensus 143-C5", UTC #143 Minutes, Add the 12 keycap sequences in emoji-data.txt as provisional named sequences in Unicode 8.0.
L2/15-268 Beeton, Barbara; Freytag, Asmus; Iancu, Laurențiu; Sargent, Murray (2015-10-30), Proposal to Represent the Slashed Zero Variant of Empty Set
L2/15-301 [lower-alpha 4] [lower-alpha 3] Pournader, Roozbeh (2015-11-01), A proposal for 278 standardized variation sequences for emoji
L2/15-254 Moore, Lisa (2015-11-16), "B.12.1.2 Proposal to Represent the Slashed Zero Variant of Empty Set", UTC #145 Minutes
L2/17-294 N4914 Lunde, Ken (2017-08-14), Proposal to add standardized variation sequence for U+FF10 FULLWIDTH DIGIT ZERO
L2/22-019 Scherer, Markus; et al. (2022-01-19), "F.2 F4: U+0019 in ISO vs. NameAliases.txt vs. chart/NamesList.txt", UTC #170 properties feedback & recommendations
L2/22-016 Constable, Peter (2022-04-21), "Consensus 170-C24", UTC #170 Minutes, For U+0019, add a Name alias "EM" of type abbreviation, for Unicode version 15.0.
  1. Proposed code points and characters names may differ from final code points and names
  2. See also L2/10-458, L2/11-414, L2/11-415, and L2/11-429
  3. Refer to the history section of the Miscellaneous Symbols and Pictographs block for additional emoji-related documents
  4. See also L2/15-198 and L2/15-275

References

  1. "Unicode character database". The Unicode Standard. Retrieved 2023-07-26.
  2. "Enumerated Versions of The Unicode Standard". The Unicode Standard. Retrieved 2023-07-26.
  3. "Blocks.txt". The Unicode Consortium. https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt. Retrieved 2023-03-23.
  4. "C0 Controls and Basic Latin" (PDF). The Unicode Standard, Version 15.0. Unicode, Inc. 2022. Retrieved 2023-03-22.
  5. The Unicode Standard Version 1.0, Volume 1. Addison-Wesley Publishing Company, Inc. 1990. ISBN 0-201-56788-1.
  6. "3.8: Block-by-Block Charts" (PDF). The Unicode Standard. version 1.0. Unicode Consortium.
  7. Michael S. Kaplan (2005-09-17). "When is a backslash not a backslash?". Sorting it all Out. Microsoft. Archived from the original on 2010-06-12. Also available at: http://archives.miloush.net/michkap/archive/2005/09/17/469941.html
  8. "Unicode 6.2 code charts" (PDF). The Unicode Standard. Retrieved 2013-04-01.
  9. Beeton, Barbara; Freytag, Asmus; Iancu, Laurențiu; Sargent, Murray (2015-10-30). "L2/15-268: Proposal to Represent the Slashed Zero Variant of Empty Set" (PDF).
  10. "UTS #51 Emoji Variation Sequences". The Unicode Consortium.
  11. Edberg, Peter (2011-12-22). "L2/11-438: Emoji Variation Sequences (Revision of L2/11-429)" (PDF).
  12. Pournader, Roozbeh (2015-11-01). "L2/15-301: A proposal for 278 standardized variation sequences for emoji" (PDF).
  13. "UTR #51: Unicode Emoji". Unicode Consortium. 2023-09-05.
  14. "UCD: Emoji Data for UTR #51". Unicode Consortium. 2023-02-01.