Zero-width space

The zero-width space (ZWSP) is a non-printing character used in computerized typesetting to indicate word boundaries to text-processing systems in scripts that do not use explicit spacing, or to mark a permissible line break after a character that is not followed by a visible space.

Purpose

The zero-width space marks a potential line break without hyphenation; for hyphenated line breaks, a soft hyphen is used. The zero-width space can be used to mark word breaks in languages without visible space between words, such as Thai, Myanmar, Khmer, and Japanese. [1] [2]
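
As a rough illustration of a break without hyphenation, the following Python sketch wraps text greedily and breaks only where a zero-width space occurs. The function name `wrap_at_zwsp` is hypothetical, and measuring width in code points is a simplifying assumption; real layout engines use far more elaborate line-breaking rules.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def wrap_at_zwsp(text: str, width: int) -> list[str]:
    """Greedy line breaking that breaks only at zero-width spaces.

    No hyphen is added at a break, and the ZWSP itself occupies no column.
    A segment longer than `width` is left on its own line unbroken.
    """
    lines: list[str] = []
    current = ""
    for segment in text.split(ZWSP):
        if current and len(current) + len(segment) > width:
            lines.append(current)  # line is full: break at the ZWSP
            current = segment
        else:
            current += segment  # ZWSP contributes no visible width
    if current:
        lines.append(current)
    return lines


sample = ZWSP.join(["All", "Human", "Beings", "Are", "Born", "Free"])
for line in wrap_at_zwsp(sample, 10):
    print(line)
# AllHuman
# BeingsAre
# BornFree
```

Text with no zero-width space offers no break opportunity in this sketch, so it stays on one line however narrow the width.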

Unlike fixed-width spaces, the zero-width space does not affect letter spacing: in justified text that increases the spacing between letters, the characters adjacent to a zero-width space are spaced as if it were not present. [2]

Example

To show the effect of the zero-width space, the following words have been separated with zero-width spaces:

AllHumanBeingsAreBornFreeAndEqualInDignityAndRightsTheyAreEndowedWithReasonAndConscienceAndShouldActTowardsOneAnotherInASpiritOfBrotherhood

And the following words are not separated with these spaces:

AllHumanBeingsAreBornFreeAndEqualInDignityAndRightsTheyAreEndowedWithReasonAndConscienceAndShouldActTowardsOneAnotherInASpiritOfBrotherhood

On browsers supporting zero-width spaces, resizing the window will re-break the first text only at word boundaries, while the second text will not be broken at all.
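
A test string like the first one can be produced programmatically. The sketch below is one possible approach, not necessarily how the example above was prepared: it assumes that every capital letter starts a new word, and the helper name `mark_word_boundaries` is made up for illustration.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def mark_word_boundaries(run_together: str) -> str:
    """Insert a ZWSP before every capital letter except the first character."""
    return re.sub(r"(?<!^)(?=[A-Z])", ZWSP, run_together)


plain = "AllHumanBeingsAreBornFree"
marked = mark_word_boundaries(plain)

# The two strings look identical when printed but differ in length.
print(len(plain), len(marked))           # 25 30
print(marked.replace(ZWSP, "") == plain)  # True
```

Because the inserted characters are invisible, comparing lengths or stripping U+200B is a simple way to check which of two identical-looking strings carries the break opportunities.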

Usage

HTML

In HTML pages, the HTML element <wbr> functions as a zero-width space. In Internet Explorer 6, the zero-width space was not supported in some fonts. [3]
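
When generating HTML from text that already contains zero-width spaces, one option is to emit `<wbr>` in their place. The following is a minimal sketch under that assumption; `zwsp_to_wbr` is a hypothetical helper, and whether to substitute `<wbr>` or keep the character itself is a design choice left to the author.

```python
import html

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def zwsp_to_wbr(text: str) -> str:
    """Escape text for HTML and replace each zero-width space with <wbr>."""
    return "<wbr>".join(html.escape(part) for part in text.split(ZWSP))


print(zwsp_to_wbr(f"Tom&Jerry{ZWSP}Are{ZWSP}Friends"))
# Tom&amp;Jerry<wbr>Are<wbr>Friends
```

Escaping each part before joining keeps the inserted `<wbr>` tags from being escaped along with the text.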

Prohibition in domain names

ICANN rules prohibit domain names from containing non-displayed characters, including the zero-width space, and most browsers prohibit their use within domain names because they can be used to create a homograph attack, where a malicious URL is visually indistinguishable from a legitimate one. [4] [5]
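
The risk is easy to reproduce: two host names can print identically yet compare as different strings. The sketch below flags invisible characters in a host name by testing for the Unicode general category Cf (format), which includes U+200B. This category test is an assumption made for illustration; it is not the actual ICANN rule or any browser's blocklist, and `invisible_characters` is a hypothetical name.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def invisible_characters(hostname: str) -> list[tuple[int, str]]:
    """Return (index, character name) for each format character (category Cf)."""
    return [
        (index, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for index, ch in enumerate(hostname)
        if unicodedata.category(ch) == "Cf"
    ]


legitimate = "example.com"
spoofed = f"exam{ZWSP}ple.com"  # renders the same as the legitimate name

print(spoofed == legitimate)          # False
print(invisible_characters(spoofed))  # [(4, 'ZERO WIDTH SPACE')]
print(invisible_characters(legitimate))  # []
```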

Encoding

The zero-width space character is encoded in Unicode as U+200B ZERO WIDTH SPACE [6] and input in HTML as &ZeroWidthSpace;, &#8203; or &#x200B;. Contrary to what their names suggest, the character entities &NegativeThickSpace;, &NegativeMediumSpace;, &NegativeThinSpace;, and &NegativeVeryThinSpace; also refer to the zero-width space. [7]
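
These equivalences can be checked with Python's standard library, whose `html.unescape` follows the HTML5 named character references. The snippet is only a quick verification sketch; the names `REFERENCES` and `decoded` are chosen here for illustration.

```python
import html
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Code point, official name, and UTF-8 byte sequence.
print(f"U+{ord(ZWSP):04X}", unicodedata.name(ZWSP))  # U+200B ZERO WIDTH SPACE
print(ZWSP.encode("utf-8"))                          # b'\xe2\x80\x8b'

# HTML character references mentioned above.
REFERENCES = [
    "&ZeroWidthSpace;",
    "&#8203;",
    "&#x200B;",
    "&NegativeThickSpace;",
    "&NegativeMediumSpace;",
    "&NegativeThinSpace;",
    "&NegativeVeryThinSpace;",
]
decoded = {ref: html.unescape(ref) for ref in REFERENCES}
print(all(value == ZWSP for value in decoded.values()))  # True
```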

The TeX representation is \hskip0pt; the LaTeX representation is \hspace{0pt}; [8] and the groff representation is \:. [9]

Its semantics and HTML implementation are similar to those of the soft hyphen, except that a soft hyphen displays a hyphen character at the point where the line is broken.

References

Citations

  1. "Zones spéciales et caractères de formatage" [Special areas and formatting characters] (PDF). Hapax Quebec (in French). p. 3. Archived from the original (PDF) on 27 December 2005. Retrieved 31 July 2019. Les espaces sans chasse sont conçues pour les langues qui ne séparent pas les mots à l'aide d'espaces visibles, comme le thaï ou le japonais.
  2. The Unicode® Standard Version 15.0 – Core Specification (PDF). The Unicode Consortium. September 2022. p. 918. ISBN 978-1-936213-32-0.
  3. Dunae, Alex. "Better Web Typography with Spaces and Hyphens". dunae.ca. Archived from the original on December 14, 2010. Retrieved December 3, 2009.
  4. "Network.IDN.blacklist_chars". mozillaZine. Retrieved 2018-02-07.
  5. "Unicode Character 'Zero Width Space'". FileFormat.Info. Retrieved 2018-02-07.
  6. "General Punctuation – Unicode" (PDF). Retrieved 2013-07-20.
  7. Entities/ZeroWidthSpace in MathML Version 2.0
  8. "The LaTeX Companion. Chapter 3: Basic Formatting Tools" (PDF). Retrieved 2019-07-16.
  9. "groff(7) – Linux manual page". Retrieved 2014-02-08.

Sources

  • Mair, Victor H.; Liu, Yongquan (1991), Characters and computers, IOS Press