Unicode input

The KCharSelect character mapping tool shown displaying a subset of the Unicode Mathematical Operators
The Unicode logo

Unicode input is the insertion of a specific Unicode character on a computer by a user; it is a common way to input characters not directly supported by a physical keyboard. Unicode characters can be produced either by selecting them from a display or by typing a certain sequence of keys on a physical keyboard. In addition, a character produced by one of these methods in one web page or document can be copied into another. In contrast to ASCII's 128-character set (which it contains), Unicode encodes well over a hundred thousand characters from almost all of the world's written languages, together with many other signs and symbols. [1]

A Unicode input system must provide for a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout which defines keys and their combinations only for a limited number of characters appropriate for a certain locale.

Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), which contains modern scripts (including many Chinese and Japanese characters) and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, emojis, playing cards and many CJK characters), have 5-digit codes.
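As a small illustration of this notation, the following Python sketch formats a character's code point in the conventional "U+" form and reports whether it lies in the BMP. The function names `u_plus` and `in_bmp` are invented for this example.

```python
def u_plus(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    # Pad to at least four upper-case hex digits; longer code points
    # (five or six digits) are printed in full.
    return f"U+{ord(ch):04X}"


def in_bmp(ch: str) -> bool:
    """True if the character's code point lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


print(u_plus("\u00AE"), in_bmp("\u00AE"))
print(u_plus("\U0001D310"), in_bmp("\U0001D310"))
```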

Glyph availability

An application can display a character only if it can access a font which contains a glyph for the character. [2] Very few fonts have full Unicode coverage; most only contain the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. Which fonts are used for fallback and the thoroughness of Unicode coverage varies by software and operating system; some software will search for a suitable glyph in all of the installed fonts, others only search within certain fonts.

If an application does not have access to a glyph, the character will usually be shown as the font's ".notdef" glyph 􏿮 [3] which often appears as an empty box, ☐ (nicknamed "tofu" based on the shape), a box with an X in it, ☒, or a box with a question mark in it, ⍰.

Techniques

Extended keyboard mapping

Most operating systems support extended keyboard mapping: the facility to increase the repertoire of characters available using techniques such as Alternate graphic ("AltGr"), which gives a third and fourth meaning to every key; the Compose key (sometimes called the multi key), which indicates that the following (usually two or more) keystrokes trigger the insertion of an alternate character, typically a precomposed character or a symbol; [4] dead keys, typically used to attach a specific diacritic to a base letter; [5] or combinations of these.

These techniques facilitate entry of character sets beyond the basic set provided as standard with the computer.
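The effect of a dead key can be imitated in Python by appending a combining mark to the base letter and normalizing the result to its precomposed form. This is only an analogy: real keyboard drivers typically use lookup tables rather than Unicode normalization, and the helper name `apply_dead_key` is invented for this sketch.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"      # combining acute accent
COMBINING_DIAERESIS = "\u0308"  # combining diaeresis


def apply_dead_key(diacritic: str, base: str) -> str:
    """Attach a combining diacritic to a base letter and precompose it (NFC)."""
    return unicodedata.normalize("NFC", base + diacritic)


print(apply_dead_key(COMBINING_ACUTE, "e"))
print(apply_dead_key(COMBINING_DIAERESIS, "u"))
```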

Selection from a screen

GNOME Character Map

Many systems provide a way to select Unicode characters visually. ISO/IEC 14755 refers to this as a screen-selection entry method. [6]

Microsoft Windows has provided a Unicode version of the Character Map program, appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block. [7] More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters). On most Linux desktop environments, equivalent tools – such as gucharmap (GNOME) or kcharselect (KDE) – are available. [8]
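Searching by character name within one block, as these tools do, can be approximated with Python's standard `unicodedata` module. The sketch below is not how any of the named programs is implemented; `search_block` is a hypothetical helper, and the range U+2200 to U+22FF is the Mathematical Operators block.

```python
import unicodedata


def search_block(first: int, last: int, needle: str) -> list[tuple[str, str, str]]:
    """List (code point, character, name) for names containing `needle`."""
    results = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        # Unassigned code points have no name; fall back to an empty string.
        name = unicodedata.name(ch, "")
        if needle.upper() in name:
            results.append((f"U+{cp:04X}", ch, name))
    return results


for row in search_block(0x2200, 0x22FF, "element of"):
    print(*row)
```

The names available depend on the Unicode version bundled with the Python interpreter.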

Generally these tools let the user copy the selected characters to the clipboard and then paste them into the document, rather than emulating direct typing.

It is often practical to just find the desired character on the web or in another document, and copy and paste it from there.

Decimal input (Alt codes)

Some programs running in Microsoft Windows, including recent versions of Word and Notepad, can produce characters from their Unicode code points expressed in decimal and entered on the numeric keypad with the Alt key held down. For example, the Euro sign has 20AC as its hexadecimal code point, which is 8364 in decimal, so Alt+8364 will produce the symbol. Similarly, Alt+120132 produces the double-struck (blackboard bold) character 𝕄.
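The decimal number to type is simply the code point written in base 10. A minimal Python sketch of the conversion in both directions follows; the function names are invented for illustration.

```python
def alt_code_for(ch: str) -> int:
    """Decimal number to type on the numeric keypad for a character."""
    return ord(ch)


def char_for_alt_code(number: int) -> str:
    """Character whose Unicode code point equals the decimal number."""
    return chr(number)


euro = "\u20AC"  # Euro sign, hexadecimal code point 20AC
print(alt_code_for(euro))
print(char_for_alt_code(120132))
```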

Decimal code points in the range 160–255 must be entered with a leading zero (so that the Windows code page is chosen), and the Windows code page must match Unicode in that range (CP1252 must be used [note 1]). For example, Alt+0247 yields ÷, corresponding to its code point, but the character produced by Alt+247 depends on the OEM code page, such as Code page 437, and may yield ≈. Also, Alt+0128 through Alt+0159 yield the characters assigned in rows 8 and 9 of the CP1252 layout, rather than the C1 control codes that are assigned to those numbers in Unicode.
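The difference between the two code pages can be reproduced with Python's built-in codecs. The sketch assumes CP1252 as the Windows code page and Code page 437 as the OEM code page; other system configurations would give other results, and the function names are invented here.

```python
import unicodedata


def with_leading_zero(number: int) -> str:
    """Interpret a value 0-255 in the Windows code page (CP1252 assumed)."""
    return bytes([number]).decode("cp1252")


def without_leading_zero(number: int, oem_code_page: str = "cp437") -> str:
    """Interpret a value 0-255 in the OEM code page (CP437 assumed)."""
    return bytes([number]).decode(oem_code_page)


print(with_leading_zero(247))     # Alt+0247
print(without_leading_zero(247))  # Alt+247
# Alt+0128 gives a CP1252 character, whereas U+0080 is a C1 control code.
print(with_leading_zero(128), unicodedata.category(chr(128)))
```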

In programs which were not designed to handle Alt codes over 255, the character retrieved usually corresponds to the remainder when the number is divided by 256.
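The remainder arithmetic can be checked with a one-line Python function. The name `legacy_alt_value` is made up, and which character a given program then shows for the reduced value is not modelled here.

```python
def legacy_alt_value(number: int) -> int:
    """Value effectively seen by a program that only handles codes 0-255."""
    return number % 256


print(legacy_alt_value(8364))
```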

The text editor Vim allows characters to be specified by two-character mnemonics referred to as digraphs. The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, as decimal 9881 is equal to hexadecimal 2699, dig Gr 9881 associates "Gr" with U+2699 GEAR.
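Since Vim expects the code point in decimal, a short Python helper can build the command text from the character itself. This is a convenience sketch; `vim_digraph_command` is a hypothetical name, and the output is meant to be typed at Vim's command line or placed in a configuration file.

```python
def vim_digraph_command(mnemonic: str, ch: str) -> str:
    """Build a Vim 'dig' command mapping a two-character mnemonic to `ch`."""
    if len(mnemonic) != 2:
        raise ValueError("a digraph mnemonic must be exactly two characters")
    # ord() gives the code point as an integer, printed in decimal.
    return f"dig {mnemonic} {ord(ch)}"


print(vim_digraph_command("Gr", "\u2699"))
```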

See below for use of decimal code points in HTML.

Hexadecimal input

Clause 5.1 of ISO/IEC 14755 describes a Basic method whereby a beginning sequence is followed by the hex number representation of the code point and the ending sequence. Most modern systems have some method to emulate this, sometimes limited to four digits (thus only the Basic Multilingual Plane).
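The shape of this Basic method can be sketched as a small Python decoder: a beginning sequence, some hexadecimal digits, and an ending sequence. The tokens "BEGIN" and "END" are placeholders invented for the example; the actual key sequences differ from system to system, as the following sections describe.

```python
import string


def decode_basic_method(keys: list[str], begin: str = "BEGIN", end: str = "END") -> str:
    """Turn [begin, hex digit, ..., end] into the corresponding character."""
    if len(keys) < 3 or keys[0] != begin or keys[-1] != end:
        raise ValueError("expected begin sequence, hex digits, end sequence")
    digits = "".join(keys[1:-1])
    if not all(d in string.hexdigits for d in digits):
        raise ValueError("only hexadecimal digits may appear between the markers")
    code_point = int(digits, 16)
    if code_point > 0x10FFFF:
        raise ValueError("beyond the last Unicode code point, U+10FFFF")
    return chr(code_point)


print(decode_basic_method(["BEGIN", "2", "6", "9", "9", "END"]))
```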

In Microsoft Windows

Hexadecimal Unicode input can be enabled by adding a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assigning the value data 1 to it. Users will need to log off and back in after editing the registry for this input method to start working. (In versions earlier than Vista, users needed to reboot for it to start working.)
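One way to make this change without editing the registry by hand is to import a .reg file. The Python sketch below only builds the text of such a file, in which a quoted value denotes a string (REG_SZ); it does not touch the registry itself. The function name is invented, and the text would still have to be saved and imported by the user.

```python
def hex_numpad_reg_file() -> str:
    """Text of a .reg file that sets EnableHexNumpad to the string "1"."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_CURRENT_USER\Control Panel\Input Method]",
        '"EnableHexNumpad"="1"',
        "",
    ]
    # .reg files conventionally use Windows (CRLF) line endings.
    return "\r\n".join(lines)


print(hex_numpad_reg_file())
```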

Unicode characters can then be entered by holding down Alt, and typing + on the numeric keypad, followed by the hexadecimal code, and then releasing Alt. [2] This may not work for 5-digit hexadecimal codes like U+1F937. Some versions of Windows may require the digits 0–9 to be typed on the numeric keypad or require NumLock to be on.

In some applications (Word, Notepad and LibreOffice programs) Alt+X will replace the hexadecimal number to the left of the cursor with the matching Unicode character. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or letters a–f as they may be treated as part of the code to be converted. For example, entering af1 followed by Alt+X (or Alt+C if using a French version) will produce '૱' (U+0AF1), but entering a0000f1 followed by Alt+X will produce 'añ' ('a' followed by character U+00F1).
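The two examples above are consistent with a simple model in which Alt+X consumes at most six hexadecimal digits immediately to the left of the cursor. The Python sketch below implements that model only; it is a simplification, not the actual logic of Word, Notepad or LibreOffice, and `alt_x` is a made-up name.

```python
import string

MAX_HEX_DIGITS = 6  # assumed upper limit on digits consumed


def alt_x(text: str) -> str:
    """Replace the trailing run of hex digits (up to six) with its character."""
    n = 0
    while n < len(text) and n < MAX_HEX_DIGITS and text[-1 - n] in string.hexdigits:
        n += 1
    if n == 0:
        return text  # nothing to convert
    code_point = int(text[-n:], 16)
    if code_point > 0x10FFFF:
        return text  # not a valid code point; leave the text unchanged
    return text[:-n] + chr(code_point)


print(alt_x("af1"))
print(alt_x("a0000f1"))
```

In the second call the leading "a" is the seventh character from the right, so it is left in place and only "0000f1" is converted.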

This facility enables Unicode characters to be entered in other applications: one can create a desired character in Notepad, for example, and then cut and paste it wherever desired.

In macOS

Hex input of Unicode must be enabled. In Mac OS 8.5 and later, one can choose the Unicode Hex Input keyboard layout; in OS X (10.10) Yosemite, this can be added in Keyboard → Input Sources.

Holding down ⌥ Option, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the ⌥ Option key. [9] Characters outside of the BMP (the Basic Multilingual Plane) exceed the four-digit limit of the Unicode hex input mechanism but can be entered by using surrogate pairs: holding down the ⌥ Option key while entering the first surrogate, the +, the second surrogate, then releasing the Option key.
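The two surrogates for a character outside the BMP follow from the standard UTF-16 arithmetic, sketched here in Python. The function name `surrogate_pair` is invented for the example; the code points used are the ones mentioned earlier in this article.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F937)
print(f"{high:04X} {low:04X}")
```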

In X11 (Linux and other Unix variants including ChromeOS)

In many applications one or both of the following methods work to directly input Unicode characters:

  • Holding Ctrl+⇧ Shift and typing u followed by the hex digits, then releasing Ctrl+⇧ Shift.
  • Entering Ctrl+⇧ Shift+u, releasing, then typing the hex digits and pressing ↵ Enter (or Space or even, on some systems, pressing and releasing ⇧ Shift or Ctrl). [10]

This is supported by GTK and Qt applications, and possibly others. In ChromeOS, this is an operating system function. [10]

In platform-independent applications

HTML

In HTML and XML, character codes to be rendered as characters are prefixed by ampersand and number sign (&#), and are followed by a semicolon (;). The code point can be either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may be represented by a named entity.

Example: In HTML/XML, the copyright sign © (U+00A9) may be coded as:

  • &#169; (decimal)
  • &#xA9; (hexadecimal)
  • &copy; (named entity)

This works in many pieces of software that accept HTML markup, such as Thunderbird and Wikipedia editing.
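Python's standard `html` module can be used to check how such references decode, and a numeric reference is easy to build from the code point. In this sketch `numeric_references` is a hypothetical helper; `html.unescape` is the standard-library function.

```python
import html


def numeric_references(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for `ch`."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


decimal_ref, hex_ref = numeric_references("\u00A9")
print(decimal_ref, hex_ref)
# All three forms decode to the same character.
print(html.unescape(decimal_ref), html.unescape(hex_ref), html.unescape("&copy;"))
```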

Notes

  1. CP1252 is the default in North and South America including the Caribbean islands, Western Europe, Central and Southern Africa, Australia, New Zealand, and the (former) European colonies and possessions in Oceania.

References

  1. Lafontaine, Sylvain (February 17, 2012). "Unicode vs ASCII difference and benefits". MSDN. Retrieved 28 February 2014.
  2. Andrew Marcuse, "How to enter Unicode characters in Microsoft Windows". Access date: September 13, 2012.
  3. This is a private-use code point U+10FFEE which is unlikely to have a glyph assigned so it should display as a replacement.
  4. "Linux Keyboard Text Symbols: Compose-Key Shortcuts". FSymbols. 2013-07-24. Retrieved 2015-07-07.
  5. "Dead Key | Definition of Dead Key by Merriam-Webster". Merriam-webster.com. Retrieved 2017-05-01.
  6. "ISO/IEC 14755:1997 Information technology -- Input methods to enter characters from the repertoire of ISO/IEC 10646 with a keyboard or other input device". ISO. Retrieved 2017-10-14.
  7. "How to Use Special Characters in Windows Documents". support.microsoft.com. Jul 31, 2019. Retrieved 2020-10-17.
  8. Peck, Akkana (2009-11-25). "Mastering Characters Sets in Linux (Weird Characters, part 2)". LinuxPlanet. Archived from the original on 2010-11-26. Retrieved 2018-12-05.
  9. Typing special and accented characters Archived 2008-03-09 at the Wayback Machine
  10. Jack Busch (April 20, 2018). "Type Special Characters with a Chromebook (Accents, Symbols, Em Dashes)". groovypost.com. Retrieved February 28, 2020.
  11. Vim documentation: gui_w32