Numeric character reference

A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character. Since WebSGML, XML and HTML 4, the code points of the Universal Character Set (UCS) of Unicode are used. NCRs are typically used to represent characters that are not directly encodable in a particular document (for example, because they are international characters that do not fit in the 8-bit character set being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

Examples

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma:

Numerical character reference of U+03A3 Σ GREEK CAPITAL LETTER SIGMA
(Note that 3A3 in hexadecimal equals 931 in decimal)

| Unicode character | Numerical base | Numerical reference in markup | Effect |
|---|---|---|---|
| U+03A3 | Decimal | &#931; | Σ |
| U+03A3 | Decimal | &#0931; | Σ |
| U+03A3 | Hexadecimal | &#x3A3; | Σ |
| U+03A3 | Hexadecimal | &#x03A3; | Σ |
| U+03A3 | Hexadecimal | &#x3a3; | Σ |
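
As a quick check of this equivalence, the following minimal sketch uses Python's standard `html.unescape` function, which follows HTML5 parsing rules. The five spellings in `references` are illustrative assumptions (leading zeros and letter case are varied); they are not taken from any specification.

```python
import html

# Illustrative spellings of one reference: decimal and hexadecimal,
# with and without leading zeros, and with lowercase hex digits.
references = ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;", "&#x3a3;"]

# html.unescape resolves each reference to the character it denotes.
decoded = [html.unescape(ref) for ref in references]

# Every spelling yields the single character U+03A3.
all_sigma = all(ch == "\u03a3" for ch in decoded)

# The hexadecimal and decimal numbers name the same code point.
same_code_point = int("3A3", 16) == 931
```

Both `all_sigma` and `same_code_point` evaluate to `True`.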

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE:

Numerical character reference of U+00C6 Æ LATIN CAPITAL LETTER AE

| Unicode character | Numerical base | Numerical reference in markup | Effect |
|---|---|---|---|
| U+00C6 | Decimal | &#198; | Æ |
| U+00C6 | Hexadecimal | &#xC6; | Æ |

In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s (ß):

Numerical character reference of U+00DF ß LATIN SMALL LETTER SHARP S

| Unicode character | Numerical base | Numerical reference in markup | Effect |
|---|---|---|---|
| U+00DF | Decimal | &#223; | ß |
| U+00DF | Hexadecimal | &#xDF; | ß |

List of numeric character references for the printable ASCII characters:

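| Unicode character | Character reference (decimal) | Character reference (hexadecimal) | Effect |
|---|---|---|---|
| U+0020 | &#32; | &#x20; | (space) |
| U+0021 | &#33; | &#x21; | ! |
| U+0022 | &#34; | &#x22; | " |
| U+0023 | &#35; | &#x23; | # |
| U+0024 | &#36; | &#x24; | $ |
| U+0025 | &#37; | &#x25; | % |
| U+0026 | &#38; | &#x26; | & |
| U+0027 | &#39; | &#x27; | ' |
| U+0028 | &#40; | &#x28; | ( |
| U+0029 | &#41; | &#x29; | ) |
| U+002A | &#42; | &#x2A; | * |
| U+002B | &#43; | &#x2B; | + |
| U+002C | &#44; | &#x2C; | , |
| U+002D | &#45; | &#x2D; | - |
| U+002E | &#46; | &#x2E; | . |
| U+002F | &#47; | &#x2F; | / |
| U+0030 | &#48; | &#x30; | 0 |
| U+0031 | &#49; | &#x31; | 1 |
| U+0032 | &#50; | &#x32; | 2 |
| U+0033 | &#51; | &#x33; | 3 |
| U+0034 | &#52; | &#x34; | 4 |
| U+0035 | &#53; | &#x35; | 5 |
| U+0036 | &#54; | &#x36; | 6 |
| U+0037 | &#55; | &#x37; | 7 |
| U+0038 | &#56; | &#x38; | 8 |
| U+0039 | &#57; | &#x39; | 9 |
| U+003A | &#58; | &#x3A; | : |
| U+003B | &#59; | &#x3B; | ; |
| U+003C | &#60; | &#x3C; | < |
| U+003D | &#61; | &#x3D; | = |
| U+003E | &#62; | &#x3E; | > |
| U+003F | &#63; | &#x3F; | ? |
| U+0040 | &#64; | &#x40; | @ |
| U+0041 | &#65; | &#x41; | A |
| U+0042 | &#66; | &#x42; | B |
| U+0043 | &#67; | &#x43; | C |
| U+0044 | &#68; | &#x44; | D |
| U+0045 | &#69; | &#x45; | E |
| U+0046 | &#70; | &#x46; | F |
| U+0047 | &#71; | &#x47; | G |
| U+0048 | &#72; | &#x48; | H |
| U+0049 | &#73; | &#x49; | I |
| U+004A | &#74; | &#x4A; | J |
| U+004B | &#75; | &#x4B; | K |
| U+004C | &#76; | &#x4C; | L |
| U+004D | &#77; | &#x4D; | M |
| U+004E | &#78; | &#x4E; | N |
| U+004F | &#79; | &#x4F; | O |
| U+0050 | &#80; | &#x50; | P |
| U+0051 | &#81; | &#x51; | Q |
| U+0052 | &#82; | &#x52; | R |
| U+0053 | &#83; | &#x53; | S |
| U+0054 | &#84; | &#x54; | T |
| U+0055 | &#85; | &#x55; | U |
| U+0056 | &#86; | &#x56; | V |
| U+0057 | &#87; | &#x57; | W |
| U+0058 | &#88; | &#x58; | X |
| U+0059 | &#89; | &#x59; | Y |
| U+005A | &#90; | &#x5A; | Z |
| U+005B | &#91; | &#x5B; | [ |
| U+005C | &#92; | &#x5C; | \ |
| U+005D | &#93; | &#x5D; | ] |
| U+005E | &#94; | &#x5E; | ^ |
| U+005F | &#95; | &#x5F; | _ |
| U+0060 | &#96; | &#x60; | ` |
| U+0061 | &#97; | &#x61; | a |
| U+0062 | &#98; | &#x62; | b |
| U+0063 | &#99; | &#x63; | c |
| U+0064 | &#100; | &#x64; | d |
| U+0065 | &#101; | &#x65; | e |
| U+0066 | &#102; | &#x66; | f |
| U+0067 | &#103; | &#x67; | g |
| U+0068 | &#104; | &#x68; | h |
| U+0069 | &#105; | &#x69; | i |
| U+006A | &#106; | &#x6A; | j |
| U+006B | &#107; | &#x6B; | k |
| U+006C | &#108; | &#x6C; | l |
| U+006D | &#109; | &#x6D; | m |
| U+006E | &#110; | &#x6E; | n |
| U+006F | &#111; | &#x6F; | o |
| U+0070 | &#112; | &#x70; | p |
| U+0071 | &#113; | &#x71; | q |
| U+0072 | &#114; | &#x72; | r |
| U+0073 | &#115; | &#x73; | s |
| U+0074 | &#116; | &#x74; | t |
| U+0075 | &#117; | &#x75; | u |
| U+0076 | &#118; | &#x76; | v |
| U+0077 | &#119; | &#x77; | w |
| U+0078 | &#120; | &#x78; | x |
| U+0079 | &#121; | &#x79; | y |
| U+007A | &#122; | &#x7A; | z |
| U+007B | &#123; | &#x7B; | { |
| U+007C | &#124; | &#x7C; | \| |
| U+007D | &#125; | &#x7D; | } |
| U+007E | &#126; | &#x7E; | ~ |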
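
A listing like this one is mechanical to produce. The sketch below shows one way to generate it; the function name `ascii_reference_rows` and the four-column tuple layout are assumptions made for illustration.

```python
def ascii_reference_rows():
    """Build (code point label, decimal NCR, hexadecimal NCR, character)
    tuples for the printable ASCII range U+0020 through U+007E."""
    rows = []
    for code_point in range(0x20, 0x7F):
        rows.append((
            f"U+{code_point:04X}",   # e.g. U+0041
            f"&#{code_point};",      # decimal reference, e.g. &#65;
            f"&#x{code_point:X};",   # hexadecimal reference, e.g. &#x41;
            chr(code_point),         # the character itself
        ))
    return rows


rows = ascii_reference_rows()
```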

Discussion

Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.

Ideally, when the characters of a document utilizing a markup language are encoded for storage or transmission over a network as a sequence of bits, the encoding that is used will be one that supports representing each and every character in the document, if not in the whole of Unicode, directly as a particular bit sequence.

Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each.

Documents are rarely, in practice, ever allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.

The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
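
As an illustration of this escaping mechanism, Python's built-in `xmlcharrefreplace` error handler replaces any character that the target encoding cannot represent with a decimal numeric character reference. The sample string is an assumption chosen for illustration, and this is only a sketch of the idea, not a complete serializer (it does not escape markup-significant characters such as `&` or `<`).

```python
# Sample text: "ß" fits in ISO 8859-1, but "Σ" and "€" do not.
text = "Straße Σ €"

# Characters outside the target encoding become decimal NCRs.
encoded_latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# With plain ASCII as the target, "ß" must be referenced as well.
encoded_ascii = text.encode("ascii", errors="xmlcharrefreplace")
```

Here `encoded_latin1` is `b'Stra\xdfe &#931; &#8364;'` and `encoded_ascii` is `b'Stra&#223;e &#931; &#8364;'`.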

Character references that are based on the referenced character's UCS or Unicode code point are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows:

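Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices: one or more decimal digits, zero (U+0030) through nine (U+0039); or character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066); all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.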
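
The syntax described above can be sketched as a regular expression. The names `NCR_PATTERN` and `decode_ncrs` are hypothetical. This minimal decoder accepts only a lowercase "x" and a mandatory semicolon, and it does not apply any of the per-language restrictions on which code points may be referenced.

```python
import re

# "&#", then decimal digits OR "x" plus hexadecimal digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def decode_ncrs(text: str) -> str:
    """Replace each numeric character reference with its character."""
    def replace(match: re.Match) -> str:
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        # Leave references beyond the Unicode code space untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


decoded = decode_ncrs("&#931;&#x3A3;&#x3a3; &#60;b&#62;")
```

With this input, `decoded` is `"ΣΣΣ <b>"`. A string such as `"&#931"` (no semicolon) is returned unchanged by this sketch.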

The characters that comprise a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.

There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
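
To make the relationship between names and numbers concrete, the sketch below uses the `name2codepoint` table from Python's standard `html.entities` module, which lists the HTML 4.01 entity names. The helper `ncr_for_entity` is a hypothetical name introduced only for this illustration.

```python
from html.entities import name2codepoint


def ncr_for_entity(name: str) -> str:
    """Return the decimal NCR equivalent of an HTML 4.01 entity name."""
    return f"&#{name2codepoint[name]};"


sigma = ncr_for_entity("Sigma")   # "&#931;"
ae = ncr_for_entity("AElig")      # "&#198;"
sharp_s = ncr_for_entity("szlig") # "&#223;"
```

A name that is not in the table raises `KeyError`, which mirrors the point that only a limited set of characters have entity names.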

Restrictions

The Universal Character Set defined by ISO 10646 is the "document character set" of SGML and HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.

While the syntax of SGML does not prohibit references to invalid or unassigned code points, such as &#xFFFF;, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters.

Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference. [1] As another example, &#128;, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers, some of which interpret it as a reference to the character represented by code value 128 in the Windows-1252 encoding for compatibility reasons. This character, "€", has to be represented as &#8364; in standards-compliant HTML. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like &#65536; (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
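
A restriction of this kind can be expressed as a simple range check. The sketch below encodes the `Char` production as it appears in current editions of the XML 1.0 specification; the function names `is_xml10_char` and `check_reference` are hypothetical, and `check_reference` assumes its argument is already a syntactically well-formed reference.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def check_reference(reference: str) -> bool:
    """Extract the code point from a reference such as '&#12;' or
    '&#x10000;' and test it against the Char production."""
    body = reference[2:-1]  # strip the leading "&#" and trailing ";"
    code_point = int(body[1:], 16) if body.startswith("x") else int(body)
    return is_xml10_char(code_point)


form_feed_ok = check_reference("&#12;")          # False: form feed
noncharacter_ok = check_reference("&#xFFFF;")    # False: U+FFFF
supplementary_ok = check_reference("&#65536;")   # True: U+10000
```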

Markup languages also place restrictions on where character references can occur.

Compatibility issues

In the initial versions of SGML and HTML, numeric character references were interpreted relative to the document character encoding rather than Unicode. For Latin-script documents, numeric character references to characters between x80 and x9F in those documents will not be correct against Unicode and must be recoded. HTML standards prior to HTML 4 supported only Western Latin-script documents; the treatment of character references above #7F may vary between applications and national conventions.

For example, as mentioned above, the correct numeric character reference for the Euro sign "€" U+20AC when using Unicode is decimal &#8364; and hexadecimal &#x20AC;. However, if using tools supporting obsolete implementations of HTML, the reference &#128; (Euro sign in the CP-1252 code page) or &#164; (Euro sign in ISO/IEC 8859-15) may work.

As another example, if some text was originally created using the MacRoman character set, the left double quotation mark “ will be represented with code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or CP-1252; in the latter two, this code point is occupied by the letter Ò. The correct numeric character reference for “ in HTML 4 and newer is &#x201C;, because U+201C is its UCS code point. In some systems, the named character reference &ldquo; may also be available.
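
These legacy mappings can be checked with Python's standard codecs. The sketch below decodes single byte values under the encodings named above, and also shows that `html.unescape`, which follows HTML5 parsing rules, applies the Windows-1252 compatibility mapping to &#128;. The variable names are chosen for illustration only.

```python
import html

EURO = "\u20ac"         # euro sign
LEFT_QUOTE = "\u201c"   # left double quotation mark

# Byte 128 is the euro sign in CP-1252; byte 164 is the euro sign
# in ISO/IEC 8859-15.
euro_cp1252 = bytes([128]).decode("cp1252")
euro_8859_15 = bytes([164]).decode("iso8859_15")

# HTML5 parsers remap the non-conforming reference &#128; to U+20AC.
euro_compat = html.unescape("&#128;")

# Byte 0xD2 is a left double quotation mark in MacRoman,
# but the letter Ò in ISO 8859-1 and CP-1252.
quote_macroman = bytes([0xD2]).decode("mac_roman")
o_grave_latin1 = bytes([0xD2]).decode("latin-1")
o_grave_cp1252 = bytes([0xD2]).decode("cp1252")

# The Unicode-based references are unambiguous.
quote_numeric = html.unescape("&#x201C;")
quote_named = html.unescape("&ldquo;")
```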


References

  1. "HTML 5.2: 8. The HTML syntax". www.w3.org.