Line wrap and word wrap

Last updated November 09, 2024

With word wrap

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Contents
Soft and hard returns
Unicode
Word boundaries, hyphenation, and hard spaces
Word wrapping in text containing Chinese, Japanese, and Korean
Algorithm
Minimum number of lines
Minimum raggedness
History
See also
References
External links
Knuth's algorithm
Other word-wrap links

Without word wrap

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Hard coded newlines

Lorem ipsum dolor sit amet, consectetur adipiscing elit,sed do eiusmod tempor incididunt ut labore et dolore
magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat
non proident, sunt in culpa qui officia deserunt mollit
anim id est laborum.

Line breaking, also known as word wrapping, is breaking a section of text into lines so that it will fit into the available width of a page, window or other display area. In text display, line wrap is continuing on a new line when a line is full, so that each line fits into the viewable window, allowing text to be read from top to bottom without any horizontal scrolling. Word wrap is the additional feature of most text editors, word processors, and web browsers, of breaking lines between words rather than within words, where possible. Word wrap makes it unnecessary to hard-code newline delimiters within paragraphs, and allows the display of text to adapt flexibly and dynamically to displays of varying sizes.

Soft and hard returns

A soft return or soft wrap is the break resulting from line wrap or word wrap (whether automatic or manual), whereas a hard return or hard wrap is an intentional break, creating a new paragraph. With a hard return, paragraph-break formatting can (and should) be applied (either indenting or vertical whitespace). Soft wrapping allows line lengths to adjust automatically with adjustments to the width of the user's window or margin settings, and is a standard feature of all modern text editors, word processors, and email clients. Manual soft breaks are unnecessary when word wrap is done automatically, so hitting the "Enter" key usually produces a hard return.

Alternatively, "soft return" can mean an intentional, stored line break that is not a paragraph break. For example, it is common to print postal addresses in a multiple-line format, but the several lines are understood to be a single paragraph. Line breaks are needed to divide the words of the address into lines of the appropriate length.

In the contemporary graphical word processors Microsoft Word and Libreoffice Writer, users are expected to type a carriage return (↵ Enter ) between each paragraph. Formatting settings, such as first-line indentation or spacing between paragraphs, take effect where the carriage return marks the break. A non-paragraph line break, which is a soft return, is inserted using ⇧ Shift +↵ Enter or via the menus, and is provided for cases when the text should start on a new line but none of the other side effects of starting a new paragraph are desired.

In text-oriented markup languages, a soft return is typically offered as a markup tag. For example, in HTML there is a <br> tag that has the same purpose as the soft return in word processors described above.

Unicode

The Unicode Line Breaking Algorithm determines a set of positions, known as break opportunities, that are appropriate places in which to begin a new line. The actual line break positions are picked from among the break opportunities by the higher level software that calls the algorithm, not by the algorithm itself, because only the higher level software knows about the width of the display the text is displayed on and the width of the glyphs that make up the displayed text.^[1]

The Unicode character set provides a line separator character as well as a paragraph separator to represent the semantics of the soft return and hard return.

0x2028 LINE SEPARATOR

* may be used to represent this semantic unambiguously

0x2029 PARAGRAPH SEPARATOR

* may be used to represent this semantic unambiguously

Word boundaries, hyphenation, and hard spaces

The soft returns are usually placed after the ends of complete words, or after the punctuation that follows complete words. However, word wrap may also occur following a hyphen inside of a word. This is sometimes not desired, and can be blocked by using a non-breaking hyphen, or hard hyphen, instead of a regular hyphen.

A word without hyphens can be made wrappable by having soft hyphens in it. When the word isn't wrapped (i.e., isn't broken across lines), the soft hyphen isn't visible. But if the word is wrapped across lines, this is done at the soft hyphen, at which point it is shown as a visible hyphen on the top line where the word is broken. (In the rare case of a word that is meant to be wrappable by breaking it across lines but without making a hyphen ever appear, a zero-width space is put at the permitted breaking point(s) in the word.)

Sometimes word wrap is undesirable between adjacent words. In such cases, word wrap can usually be blocked by using a hard space or non-breaking space between the words, instead of regular spaces.

Word wrapping in text containing Chinese, Japanese, and Korean

In Chinese, Japanese, and Korean, word wrapping can usually occur before and after any Han character, but certain punctuation characters are not allowed to begin a new line.^[2] Japanese kana are treated the same way as Han Characters (Kanji) by extension, meaning words can, and tend to be, broken without any explicit indication that a word continues on the next line.

Under certain circumstances, however, word wrapping is not desired. For instance,

word wrapping might not be desired within personal names, and
word wrapping might not be desired within any compound words (when the text is flush left but only in some styles).

Most existing word processors and typesetting software cannot handle either of the above scenarios.

CJK punctuation may or may not follow rules similar to the above-mentioned special circumstances. It is up to line breaking rules in CJK.

Algorithm

Word wrapping is an optimization problem. Depending on what needs to be optimized for, different algorithms are used.

Minimum number of lines

A simple way to do word wrapping is to use a greedy algorithm that puts as many words on a line as possible, then moving on to the next line to do the same until there are no more words left to place. This method is used by many modern word processors, such as Libreoffice Writer and Microsoft Word.^{[ citation needed ]} This algorithm always uses the minimum possible number of lines but may lead to lines of widely varying lengths. The following pseudocode implements this algorithm:

SpaceLeft := LineWidth for each Word in Text     if (Width(Word) + SpaceWidth) > SpaceLeft         insert line break before Word in Text         SpaceLeft := LineWidth - Width(Word)     else         SpaceLeft := SpaceLeft - (Width(Word) + SpaceWidth)

Where LineWidth is the width of a line, SpaceLeft is the remaining width of space on the line to fill, SpaceWidth is the width of a single space character, Text is the input text to iterate over and Word is a word in this text.

Minimum raggedness

A different algorithm, used in TeX, minimizes the sum of the squares of the lengths of the spaces at the end of lines to produce a more aesthetically pleasing result than the greedy algorithm, which does not always minimize squared space.

History

A primitive line-breaking feature was used in 1955 in a "page printer control unit" developed by Western Union. This system used relays rather than programmable digital computers, and therefore needed a simple algorithm that could be implemented without data buffers. In the Western Union system, each line was broken at the first space character to appear after the 58th character, or at the 70th character if no space character was found.^[3]

The greedy algorithm for line-breaking predates the dynamic programming method outlined by Donald Knuth in an unpublished 1977 memo describing his TeX typesetting system^[4] and later published in more detail by Knuth & Plass (1981).^[5]

Related Research Articles

TeX, stylized within the system as $T e X$ , is a typesetting program which was designed and written by computer scientist and Stanford University professor Donald Knuth and first released in 1978. The term now refers to the system of extensions – which includes software programs called TeX engines, sets of TeX macros, and packages which provide extra typesetting functionality – built around the original TeX language. TeX is a popular means of typesetting complex mathematical formulae; it has been noted as one of the most sophisticated digital typographical systems.

The hyphen‐ is a punctuation mark used to join words and to separate syllables of a single word. The use of hyphens is called hyphenation.

Lorem ipsum is a dummy or placeholder text commonly used in graphic design, publishing, and web development to fill empty spaces in a layout that do not yet have content.

A newline is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one.

In the written form of many languages, indentation describes empty space, a.k.a. white space, used around text to signify an important aspect of the text such as:

In computing and typesetting, a soft hyphen or syllable hyphen, is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they fall on the line end but remain invisible within the line.

In word processing and digital typesetting, a non-breaking space, also called NBSP, required space, hard space, or fixed space, is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space. Non-breaking space characters with other widths also exist.

Syllabification or syllabication, also known as hyphenation, is the separation of a word into syllables, whether spoken, written or signed.

The hyphen-minus symbol - is the form of hyphen most commonly used in digital documents. On most keyboards, it is the only character that resembles a minus sign or a dash so it is also used for these. The name hyphen-minus derives from the original ASCII standard, where it was called hyphen (minus). The character is referred to as a hyphen, a minus sign, or a dash according to the context where it is being used.

In typesetting and page layout, alignment or range is the setting of text flow or image placement relative to a page, column (measure), table cell, or tab.

A whitespace character is a character data element that represents white space when text is rendered for display by a computer.

<span class="mw-page-title-main">Initial</span> Oversized first letter in a text block

In a written or published work, an initial is a letter at the beginning of a word, a chapter, or a paragraph that is larger than the rest of the text. The word is ultimately derived from the Latin initiālis, which means of the beginning. An initial is often several lines in height, and, in older books or manuscripts, may take the form of an inhabited or historiated initial. Certain important initials, such as the Beatus initial, or B, of Beatus vir... at the opening of Psalm 1 at the start of a vulgate Latin. These specific initials in an illuminated manuscript were also called initia.

The fmt command in Unix, Plan 9, Inferno, and Unix-like operating systems is used to format natural language text for humans to read.

The Unicode Consortium and the ISO/IEC JTC 1/SC 2/WG 2 jointly collaborate on the list of the characters in the Universal Coded Character Set. The Universal Coded Character Set, most commonly called the Universal Character Set, is an international standard to map characters, discrete symbols used in natural language, mathematics, music, and other domains, to unique machine-readable data values. By creating this mapping, the UCS enables computer software vendors to interoperate, and transmit—interchange—UCS-encoded text strings from one to another. Because it is a universal map, it can be used to represent multiple languages at the same time. This avoids the confusion of using multiple legacy character encodings, which can result in the same sequence of codes having multiple interpretations depending on the character encoding in use, resulting in mojibake if the wrong one is chosen.

The dash is a punctuation mark consisting of a long horizontal line. It is similar in appearance to the hyphen but is longer and sometimes higher from the baseline. The most common versions are the en dash–, generally longer than the hyphen but shorter than the minus sign; the em dash—, longer than either the en dash or the minus sign; and the horizontal bar―, whose length varies across typefaces but tends to be between those of the en and em dashes.

The zero-width space (), abbreviated ZWSP, is a non-printing character used in computerized typesetting to indicate where the word boundaries are, without actually displaying a visible space in the rendered text. This enables text-processing systems for scripts that do not use explicit spacing to recognize where word boundaries are for the purpose of handling line breaks appropriately. Zero-width space is unicode character U+200B, and is located in the unicode General Punctuation block, and can be represented by numeric character references  or .

fold is a Unix command used for making a file with long lines more readable on a limited width computer terminal by performing a line wrap.

The Unicode Standard assigns various properties to each Unicode character and code point.

In typography, a thin space is a space character whose width is usually 1⁄5 or 1⁄6 of an em. It is used to add a narrow space, such as between nested quotation marks or to separate glyphs that interfere with one another. It is not as narrow as the hair space. It is also used in the International System of Units and in many countries as a thousands separator when writing numbers in groups of three digits, in order to facilitate reading. It also avoids the ambiguity of the comma, used as a thousands separator in many countries but as a decimal point in Europe.

The Knuth–Plass algorithm is a line-breaking algorithm designed for use in Donald Knuth's typesetting program TeX. It integrates the problems of text justification and hyphenation into a single algorithm by using a discrete dynamic programming method to minimize a loss function that attempts to quantify the aesthetic qualities desired in the finished output.

References

↑ Heninger, Andy, ed. (2013-01-25). "Unicode Line Breaking Algorithm" (PDF). Technical Reports. Annex #14 (Proposed Update Unicode Standard): 2. Retrieved 10 March 2015. WORD JOINER should be used if the intent is to merely prevent a line break
↑ Lunde, Ken (1999), CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing, O'Reilly Media, Inc., p. 352, ISBN 9781565922242 .
↑ Harris, Robert W. (January 1956), "Keyboard standardization", Western Union Technical Review, 10 (1): 37–42, archived from the original on 2015-08-03, retrieved 2013-04-07.
↑ Knuth, Donald (1977), TEXDR.AFT , retrieved 2013-04-07. Reprinted in Knuth, Donald (1999), Digital Typography , CSLI Lecture Notes, vol. 78, Stanford, California: Center for the Study of Language and Information, ISBN 1-57586-010-4 .
↑ Knuth, Donald Ervin; Plass, Michael F (1981), "Breaking Paragraphs into Lines", Software: Practice and Experience, 11 (11): 1119–84, doi:10.1002/spe.4380111102, S2CID 206508107

External links

Unicode Line Breaking Algorithm

Knuth's algorithm

"Knuth & Plass line-breaking Revisited"
"tex_wrap": "Implements TeX's algorithm for breaking paragraphs into lines." Reference: "Breaking Paragraphs into Lines", D.E. Knuth and M.F. Plass, chapter 3 of _Digital Typography_, CSLI Lecture Notes #78.
Text::Reflow - Perl module for reflowing text files using Knuth's paragraphing algorithm. "The reflow algorithm tries to keep the lines the same length but also tries to break at punctuation, and avoid breaking within a proper name or after certain connectives ("a", "the", etc.). The result is a file with a more "ragged" right margin than is produced by fmt or Text::Wrap but it is easier to read since fewer phrases are broken across line breaks."
adjusting the Knuth algorithm to recognize the "soft hyphen".
Knuth's breaking algorithm. "The detailed description of the model and the algorithm can be found on the paper "Breaking Paragraphs into Lines" by Donald E. Knuth, published in the book "Digital Typography" (Stanford, California: Center for the Study of Language and Information, 1999), (CSLI Lecture Notes, no. 78.)"; part of Google Summer Of Code 2006
"Bridging the Algorithm Gap: A Linear-time Functional Program for Paragraph Formatting" by Oege de Moor, Jeremy Gibbons, 1997