Zero-width space

Last updated

The zero-width space (ZWSP) is a non-printing character used in computerized typesetting to indicate where the word boundaries are, without actually displaying a visible space in the rendered text. This enables text-processing systems for scripts that do not use explicit spacing to recognize where word boundaries are for the purpose of handling line breaks appropriately. Zero-width space is unicode character U+200B, and is located in the unicode General Punctuation block, and can be represented by numeric character references ​ or ​.

Contents

Purpose

The zero-width space marks a potential line break without hyphenation. Its semantics and HTML implementation are similar to the soft hyphen, but soft hyphens display a hyphen character at the point where the line is broken.

The zero-width space can be used to mark word breaks in languages without visible space between words, such as Thai, Myanmar, Khmer, and Japanese. [1]

In justified text, the rendering engine may add inter-character spacing, also known as letter spacing, between letters separated by a zero-width space, unlike around fixed-width spaces. [1]

Example

To show the effect of the zero-width space in text, the following words have been separated with zero-width spaces:

LoremIpsumDolorSitAmetConsecteturAdipiscingElitSedDoEiusmodTemporIncididuntUtLaboreEtDoloreMagnaAliquaUtEnimAdMinimVeniamQuisNostrudExercitationUllamcoLaborisNisiUtAliquipExEaCommodoConsequatDuisAuteIrureDolorInReprehenderitInVoluptateVelitEsseCillumDoloreEuFugiatNullaPariaturExcepteurSintOccaecatCupidatatNonProidentSuntInCulpaQuiOfficiaDeseruntMollitAnimIdEstLaborum

And the following words have not been separated with these spaces:

LoremIpsumDolorSitAmetConsecteturAdipiscingElitSedDoEiusmodTemporIncididuntUtLaboreEtDoloreMagnaAliquaUtEnimAdMinimVeniamQuisNostrudExercitationUllamcoLaborisNisiUtAliquipExEaCommodoConsequatDuisAuteIrureDolorInReprehenderitInVoluptateVelitEsseCillumDoloreEuFugiatNullaPariaturExcepteurSintOccaecatCupidatatNonProidentSuntInCulpaQuiOfficiaDeseruntMollitAnimIdEstLaborum

The first text only breaks at word boundaries, while the second text will not be broken at all. Resizing the browser window will re-break the text accordingly.

Usage

HTML

In HTML pages, the HTML element <wbr> functions as a zero-width space. In Internet Explorer 6, the zero-width space was not supported in some fonts. [2]

Prohibition in domain names

ICANN rules prohibit domain names from containing non-displayed characters, including the zero-width space, and most browsers prohibit their use within domain names because they can be used to create a homograph attack, where a malicious URL is visually indistinguishable from a legitimate one. [3] [4]

Encoding

The zero-width space character is encoded in Unicode as U+200BZERO WIDTH SPACE, [5] and input in HTML as &ZeroWidthSpace;, &#8203; or &#x200B;. Contrary to what their names suggest, the character entities &NegativeThickSpace;, &NegativeMediumSpace;, &NegativeThinSpace;, and &NegativeVeryThinSpace; also refer to the zero-width space. [6]

The TeX representation is \hskip0pt; the LaTeX representation is \hspace{0pt}; [7] and the groff representation is \:. [8]

See also

Related Research Articles

The hyphen is a punctuation mark used to join words and to separate syllables of a single word. The use of hyphens is called hyphenation.

<span title="Latin-language text"><i lang="la">Scriptio continua</i></span> Style of writing without spaces between words

Scriptio continua, also known as scriptura continua or scripta continua, is a style of writing without spaces or other marks between the words or sentences. The form also lacks punctuation, diacritics, or distinguished letter case. In the West, the oldest Greek and Latin inscriptions used word dividers to separate words in sentences; however, Classical Greek and late Classical Latin both employed scriptio continua as the norm.

The byte-order mark (BOM) is a particular usage of the special Unicode character code, U+FEFFZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

In punctuation, a word divider is a form of glyph which separates written words. In languages which use the Latin, Cyrillic, and Arabic alphabets, as well as other scripts of Europe and West Asia, the word divider is a blank space, or whitespace. This convention is spreading, along with other aspects of European punctuation, to Asia and Africa, where words are usually written without word separation.

In writing, a space is a blank area that separates words, sentences, syllables and other written or printed glyphs (characters). Conventions for spacing vary among languages, and in some languages the spacing rules are complex. Inter-word spaces ease the reader's task of identifying words, and avoid outright ambiguities such as "now here" vs. "nowhere". They also provide convenient guides for where a human or program may start new lines.

<i>Lorem ipsum</i> Placeholder text used in publishing and graphic design

In publishing and graphic design, Lorem ipsum is a placeholder text commonly used to demonstrate the visual form of a document or a typeface without relying on meaningful content. Lorem ipsum may be used as a placeholder before the final copy is available. It is also used to temporarily replace text in a process called greeking, which allows designers to consider the form of a webpage or publication, without the meaning of the text influencing the design.

In the written form of many languages, indentation describes empty space, a.k.a. white space, used around text to signify an important aspect of the text such as:

<span class="mw-page-title-main">Soft hyphen</span> Unicode character

In computing and typesetting, a soft hyphen or syllable hyphen, is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they fall on the line end but remain invisible within the line.

In word processing and digital typesetting, a non-breaking space, also called NBSP, required space, hard space, or fixed space, is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space. Non-breaking space characters with other widths also exist.

Line breaking, also known as word wrapping, is breaking a section of text into lines so that it will fit into the available width of a page, window or other display area. In text display, line wrap is continuing on a new line when a line is full, so that each line fits into the viewable window, allowing text to be read from top to bottom without any horizontal scrolling. Word wrap is the additional feature of most text editors, word processors, and web browsers, of breaking lines between words rather than within words, where possible. Word wrap makes it unnecessary to hard-code newline delimiters within paragraphs, and allows the display of text to adapt flexibly and dynamically to displays of varying sizes.

<span class="mw-page-title-main">Zero-width joiner</span> Non-printing character used in computerized typesetting

The zero-width joiner is a non-printing character used in the computerized typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes, such as the Arabic script or any Indic script. Sometimes the Roman script is to be counted as complex, e.g. when using a Fraktur typeface. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

A whitespace character is a character data element that represents white space when text is rendered for display by a computer.

<span class="mw-page-title-main">Initial</span> Oversized first letter in a text block

In a written or published work, an initial is a letter at the beginning of a word, a chapter, or a paragraph that is larger than the rest of the text. The word is ultimately derived from the Latin initiālis, which means of the beginning. An initial is often several lines in height, and, in older books or manuscripts, may take the form of an inhabited or historiated initial. Certain important initials, such as the Beatus initial, or B, of Beatus vir... at the opening of Psalm 1 at the start of a vulgate Latin. These specific initials in an illuminated manuscript were also called initia.

The fmt command in Unix, Plan 9, Inferno, and Unix-like operating systems is used to format natural language text for humans to read.

<span class="mw-page-title-main">Universal Character Set characters</span> Complete list of the characters available on most computers

The Unicode Consortium and the ISO/IEC JTC 1/SC 2/WG 2 jointly collaborate on the list of the characters in the Universal Coded Character Set. The Universal Coded Character Set, most commonly called the Universal Character Set, is an international standard to map characters, discrete symbols used in natural language, mathematics, music, and other domains, to unique machine-readable data values. By creating this mapping, the UCS enables computer software vendors to interoperate, and transmit—interchange—UCS-encoded text strings from one to another. Because it is a universal map, it can be used to represent multiple languages at the same time. This avoids the confusion of using multiple legacy character encodings, which can result in the same sequence of codes having multiple interpretations depending on the character encoding in use, resulting in mojibake if the wrong one is chosen.

The word joiner (WJ) is a Unicode format character which is used to indicate that line breaking should not occur at its position. It does not affect the formation of ligatures or cursive joining and is ignored for the purpose of text segmentation. It is encoded since Unicode version 3.2 as U+2060WORD JOINER.

The tie is a symbol in the shape of an arc similar to a large breve, used in Greek, phonetic alphabets, and Z notation. It can be used between two characters with spacing as punctuation, non-spacing as a diacritic, or (underneath) as a proofreading mark. It can be above or below, and reversed. Its forms are called tie, double breve, enotikon or papyrological hyphen, ligature tie, and undertie.

fold is a Unix command used for making a file with long lines more readable on a limited width computer terminal by performing a line wrap.

A figure space or numeric space is a typographic unit equal to the size of a single numerical digit. Its size can fluctuate somewhat depending on which font is being used. This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.

General Punctuation is a Unicode block containing punctuation, spacing, and formatting characters for use with all scripts and writing systems. Included are the defined-width spaces, joining formats, directional formats, smart quotes, archaic and novel punctuation such as the interrobang, and invisible mathematical operators.

References

Citations

  1. 1 2 "23.2 Layout Controls". The Unicode® Standard Version 15.0 – Core Specification (PDF). The Unicode Consortium. September 2022. p. 918. ISBN   978-1-936213-32-0.
  2. Dunae, Alex. "Better Web Typography with Spaces and Hyphens". dunae.ca. Archived from the original on December 14, 2010. Retrieved December 3, 2009.
  3. "Network.IDN.blacklist_chars". mozillaZine. Retrieved 2018-02-07.
  4. "Unicode Character 'Zero Width Space'". FileFormat.Info. Retrieved 2018-02-07.
  5. "General Punctuation – Unicode" (PDF). Retrieved 2013-07-20.
  6. Entities/ZeroWidthSpace in MathML Version 2.0
  7. "The LaTeX Companion. Chapter 3: Basic Formatting Tools" (PDF). Retrieved 2019-07-16.
  8. "groff(7) – Linux manual page" . Retrieved 2014-02-08.

Sources

  • Mair, Victor H.; Liu, Yongquan (1991), Characters and computers, IOS Press