Tags (Unicode block)

Tags
Range: U+E0000..U+E007F (128 code points)
Plane: SSP
Scripts: Common
Assigned: 97 code points
Unused: 31 reserved code points
Deprecated: 1
Unicode version history: 3.1 (2001): 97 (+97)
Unicode documentation: Code chart ∣ Web page
Note: [1] [2]

Tags is a Unicode block containing formatting tag characters. The block is designed to mirror ASCII: each tag character in U+E0020–U+E007E sits exactly 0xE0000 above the ASCII character it corresponds to. It was originally intended for language tags, but has since been repurposed for emoji modifiers, specifically for region flags.

Legacy use

U+E0001 and U+E0020–U+E007F were originally intended for invisibly tagging text by language, [3] but that use is no longer recommended. [4] All of those characters were deprecated in Unicode 5.1.

With the release of Unicode 8.0, U+E0020–U+E007E are no longer deprecated characters. The change was made "to clear the way for the potential future use of tag characters for a purpose other than to represent language tags". [5] Unicode states that "the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text". [5]

Current use

With the release of Unicode 9.0, U+E007F is no longer a deprecated character. (U+E0001 LANGUAGE TAG remains deprecated.) Emoji 5.0, released in May 2017, [6] classifies these characters as emoji for use as modifiers in special sequences.

The only usage specified is for representing the flags of regions, alongside the use of Regional Indicator Symbols for national flags. [7] These sequences consist of U+1F3F4 🏴 WAVING BLACK FLAG followed by a sequence of tags corresponding to the region as coded in the CLDR, then U+E007F CANCEL TAG. For example, using the tags for "gbeng" (🏴󠁧󠁢󠁥󠁮󠁧󠁿) will cause some systems to display the flag of England, those for "gbsct" (🏴󠁧󠁢󠁳󠁣󠁴󠁿) the flag of Scotland, and those for "gbwls" (🏴󠁧󠁢󠁷󠁬󠁳󠁿) the flag of Wales. [7]
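The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation of the emoji tag sequence rules: the function names `subdivision_flag` and `tag_spec` are hypothetical, and the sketch assumes the region code consists only of ASCII letters and digits.

```python
TAG_BASE = 0xE0000          # tag code point = ASCII code point + 0xE0000
BLACK_FLAG = "\U0001F3F4"   # U+1F3F4 WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"   # U+E007F CANCEL TAG


def subdivision_flag(code: str) -> str:
    """Build a flag tag sequence for a region code such as "gbeng"."""
    code = code.lower()
    # Assumption for this sketch: only ASCII letters and digits are accepted.
    if not code or not all(c.isascii() and c.isalnum() for c in code):
        raise ValueError(f"expected ASCII letters or digits, got {code!r}")
    tags = "".join(chr(TAG_BASE + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG


def tag_spec(sequence: str) -> str:
    """Recover the ASCII region code from a sequence built above."""
    if not (sequence.startswith(BLACK_FLAG) and sequence.endswith(CANCEL_TAG)):
        raise ValueError("not a flag tag sequence")
    # Strip the base flag and the terminator, then shift back down to ASCII.
    return "".join(chr(ord(c) - TAG_BASE) for c in sequence[1:-1])


england = subdivision_flag("gbeng")
print([f"U+{ord(c):04X}" for c in england])
print(tag_spec(england))
```

Whether the result is drawn as a flag depends on the font and platform; a system without support typically shows only the black flag, since the tag characters themselves are invisible.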

The tag sequences are derived from ISO 3166-2, but sequences representing other subnational flags (for example US states) are also possible using this mechanism. However, as of Unicode version 12.0 only the three flag sequences listed above are "Recommended for General Interchange" by the Unicode Consortium, meaning they are "most likely to be widely supported across multiple platforms". [8]

Unicode block

Tags [1] [2] [3]
Official Unicode Consortium code chart (PDF)
          0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
U+E000x        BEGIN
U+E001x
U+E002x   SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
U+E003x   0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
U+E004x   @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
U+E005x   P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
U+E006x   `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
U+E007x   p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    END
1. ^ As of Unicode version 15.0
2. ^ Empty cells indicate non-assigned code points
3. ^ Unicode code points U+E0001 and U+E0020 through U+E007F were deprecated with Unicode version 5.1; however, as of Unicode version 9.0 only U+E0001 remains deprecated
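The counts given for this block (128 code points, 97 assigned, 31 reserved) can be checked with a short Python sketch. The variable names are illustrative. The character names come from Python's bundled `unicodedata` module, so they reflect whichever Unicode version that module ships with.

```python
import unicodedata

BLOCK_START, BLOCK_END = 0xE0000, 0xE007F

# Assigned code points: U+E0001 plus the ASCII-mirroring range U+E0020..U+E007F.
assigned = {0xE0001} | set(range(0xE0020, 0xE007F + 1))
block = range(BLOCK_START, BLOCK_END + 1)
reserved = [cp for cp in block if cp not in assigned]

print(len(block), len(assigned), len(reserved))  # 128 97 31

# Cross-check against the character database: assigned code points have
# names, reserved ones do not (the default None is returned instead).
for cp in (0xE0000, 0xE0001, 0xE0041, 0xE007F):
    print(f"U+{cp:05X}", unicodedata.name(chr(cp), None))
```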

History

The following Unicode-related documents record the purpose and process of defining specific characters in the Tags block:

Version 3.1

Final code point [a]: U+E0001 (count: 1)
Documents (L2 ID, WG2 ID where given, then document):
L2/97-203: Whistler, Ken; Adams, Glenn (1997-08-05), Plane 14 characters for generic tags
L2/97-171R2: Whistler, Ken (1997-09-18), Plane 14 Characters for Generic Tags
L2/97-256: Allouche, Mati (1997-10-20), Comments on Plane 14 Position Paper
L2/97-255R: Aliprand, Joan (1997-12-03), "3.B. Lightweight language tagging", Approved Minutes - UTC #73 & L2 #170 joint meeting, Palo Alto, CA - August 4-5, 1997
L2/98-027, N1670: Plane 14 characters for language tags, 1997-12-12
L2/98-039: Aliprand, Joan; Winkler, Arnold (1998-02-24), "2.C REVISED PROPOSALS", Preliminary Minutes - UTC #74 & L2 #171, Mountain View, CA - December 5, 1997
L2/98-286, N1703: Umamaheswaran, V. S.; Ksar, Mike (1998-07-02), "7.4", Unconfirmed Meeting Minutes, WG 2 Meeting #34, Redmond, WA, USA; 1998-03-16--20
L2/98-281R (pdf, html): Aliprand, Joan (1998-07-31), "IETF and W3C Issues (VI)", Unconfirmed Minutes - UTC #77 & NCITS Subgroup L2 # 174 JOINT MEETING, Redmond, WA -- July 29-31, 1998
L2/00-010, N2103: Umamaheswaran, V. S. (2000-01-05), "9.1", Minutes of WG 2 meeting 37, Copenhagen, Denmark: 1999-09-13--16
L2/01-301: Whistler, Ken (2001-08-01), "Tag Characters", Analysis of Character Deprecation in the Unicode Standard
L2/02-166R2: Moore, Lisa (2002-08-09), "Character Deprecation", UTC #91 Minutes

Final code points [a]: U+E0020..E007F (count: 96)
Documents (L2 ID, then document):
L2/16-042: Fonts, Agustin; Pournader, Roozbeh (2015-01-26), Clarifications Requested for "Full Emoji Data" and Emoji Flags
L2/15-145R: Edberg, Peter (2015-05-07), Proposal for additional regional indicator symbols
L2/15-107: Moore, Lisa (2015-05-12), "E.1.3 Proposal for additional regional indicator symbols", UTC #143 Minutes
L2/15-190: Edberg, Peter (2015-06-29), PRI #299 Background: Representing Additional Types of Flags
L2/15-206: Davis, Mark (2015-07-25), Region / Subdivision validity for flags
L2/16-180R: Burge, Jeremy; Williams, Owen (2016-07-07), Proposal to include Emoji Flags for England, Scotland and Wales
L2/17-016: Moore, Lisa (2017-02-08), "Action item 150-A59", UTC #150 Minutes, Add the three sequences for flags documented in L2/16-180R to emoji-sequences.txt for emoji 5.0.
L2/17-048: Pournader, Roozbeh (2017-01-24), Feedback on PRI 343 (Unicode Emoji 5.0)
L2/17-086: Burge, Jeremy; et al. (2017-03-27), Add ZWJ, VS-16, Keycaps & Tags to Emoji_Component
L2/17-103: Moore, Lisa (2017-05-18), "E.1.7 Add ZWJ, VS-16, Keycaps & Tags to Emoji_Component", UTC #151 Minutes

a. ^ Proposed code points and character names may differ from final code points and names

References

  1. "Unicode character database". The Unicode Standard. Retrieved 2023-07-26.
  2. "Enumerated Versions of The Unicode Standard". The Unicode Standard. Retrieved 2023-07-26.
  3. Whistler, K.; Adams, G. (January 1999). "RFC2482: Language Tagging in Unicode Plain Text". Network Working Group. doi:10.17487/RFC2482.
  4. Whistler, K.; Adams, G.; Duerst, M.; Klensin, J. (November 2010). Presuhn, R. (ed.). "RFC6082: Deprecating Unicode Language Tag Characters: RFC 2482 is Historic". Internet Engineering Task Force (IETF). doi:10.17487/RFC6082.
  5. "Unicode 8.0.0, Implications for Migration". Unicode Consortium.
  6. "Emoji Version 5.0 List". Emojipedia. Retrieved 24 July 2021.
  7. "UTR #51: Unicode Emoji". Unicode Consortium. 2017-05-18.
  8. "emoji-sequences.txt". Unicode Consortium. 2023-06-05. Retrieved 5 March 2019.