Line breaking, also known as word wrapping, is breaking a section of text into lines so that it will fit into the available width of a page, window or other display area. In text display, line wrap is continuing on a new line when a line is full, so that each line fits into the viewable window, allowing text to be read from top to bottom without any horizontal scrolling. Word wrap is the additional feature of most text editors, word processors, and web browsers of breaking lines between words rather than within words, where possible. Word wrap makes it unnecessary to hard-code newline delimiters within paragraphs, and allows the display of text to adapt flexibly and dynamically to displays of varying sizes.


Soft and hard returns

A soft return or soft wrap is the break resulting from line wrap or word wrap (whether automatic or manual), whereas a hard return or hard wrap is an intentional break, creating a new paragraph. With a hard return, paragraph-break formatting can (and should) be applied (either indenting or vertical whitespace). Soft wrapping allows line lengths to adjust automatically with adjustments to the width of the user's window or margin settings, and is a standard feature of all modern text editors, word processors, and email clients. Manual soft breaks are unnecessary when word wrap is done automatically, so hitting the "Enter" key usually produces a hard return.

Alternatively, "soft return" can mean an intentional, stored line break that is not a paragraph break. For example, it is common to print postal addresses in a multiple-line format, but the several lines are understood to be a single paragraph. Line breaks are needed to divide the words of the address into lines of the appropriate length.

In the contemporary graphical word processors Microsoft Word and OpenOffice.org, users are expected to type a carriage return (Enter) between paragraphs. Formatting settings, such as first-line indentation or spacing between paragraphs, take effect where the carriage return marks the break. A non-paragraph line break, which is a soft return, is inserted using Shift+Enter or via the menus, and is provided for cases when the text should start on a new line but none of the other side effects of starting a new paragraph are desired.

In text-oriented markup languages, a soft return is typically offered as a markup tag. For example, in HTML there is a <br> tag that has the same purpose as the soft return in word processors described above.
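The distinction between stored hard returns and computed soft returns can be illustrated with a short Python sketch. It assumes that hard returns are stored as newline characters and uses the standard library's textwrap module to compute the soft breaks; the function name soft_wrap is illustrative only.

```python
import textwrap


def soft_wrap(stored_text, width):
    """Return one list of display lines per paragraph.

    Hard returns (assumed here to be stored as "\n") are kept as
    paragraph boundaries. Soft returns are never stored: they are
    recomputed for whatever width is requested.
    """
    return [textwrap.wrap(paragraph, width=width)
            for paragraph in stored_text.split("\n")]


stored = "The quick brown fox jumps.\nSecond paragraph."
narrow = soft_wrap(stored, 12)   # many soft breaks
wide = soft_wrap(stored, 40)     # no soft breaks needed
```

In both results the number of paragraphs is the same; only the number of lines inside each paragraph changes with the width.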


Unicode

The Unicode Line Breaking Algorithm determines a set of positions, known as break opportunities, that are appropriate places in which to begin a new line. The actual line break positions are picked from among the break opportunities by the higher level software that calls the algorithm, not by the algorithm itself, because only the higher level software knows about the width of the display the text is displayed on and the width of the glyphs that make up the displayed text.

The Unicode character set provides a line separator character as well as a paragraph separator to represent the semantics of the soft return and hard return, respectively.

* 0x2028 LINE SEPARATOR: may be used to represent this semantic (soft return) unambiguously
* 0x2029 PARAGRAPH SEPARATOR: may be used to represent this semantic (hard return) unambiguously
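As a small illustration, the following Python sketch stores a multi-line postal address as a single paragraph by using the two separator characters. The helper name split_paragraphs and the sample text are assumptions made for this example.

```python
import unicodedata

LINE_SEPARATOR = "\u2028"       # stored soft return
PARAGRAPH_SEPARATOR = "\u2029"  # stored hard return


def split_paragraphs(text):
    """Split text into paragraphs, and each paragraph into its lines."""
    return [paragraph.split(LINE_SEPARATOR)
            for paragraph in text.split(PARAGRAPH_SEPARATOR)]


address = LINE_SEPARATOR.join(["Jane Doe", "1 Main Street", "Springfield"])
document = address + PARAGRAPH_SEPARATOR + "A second paragraph."
paragraphs = split_paragraphs(document)

# The character names can be checked against the Unicode database.
names = [unicodedata.name(LINE_SEPARATOR), unicodedata.name(PARAGRAPH_SEPARATOR)]
```

Here the address remains one paragraph of three lines, and the text after the paragraph separator forms a second paragraph.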


Word boundaries, hyphenation, and hard spaces

Soft returns are usually placed after the ends of complete words, or after the punctuation that follows complete words. However, word wrap may also occur following a hyphen inside a word. This is sometimes not desired, and can be blocked by using a non-breaking hyphen, or hard hyphen, instead of a regular hyphen.

A word without hyphens can be made wrappable by having soft hyphens in it. When the word isn't wrapped (i.e., isn't broken across lines), the soft hyphen isn't visible. But if the word is wrapped across lines, this is done at the soft hyphen, at which point it is shown as a visible hyphen on the top line where the word is broken. (In the rare case of a word that is meant to be wrappable by breaking it across lines but without making a hyphen ever appear, a zero-width space is put at the permitted breaking point(s) in the word.)

Sometimes word wrap is undesirable between adjacent words. In such cases, word wrap can usually be blocked by using a hard space or non-breaking space between the words, instead of regular spaces.
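The rules above can be summarized in a deliberately simplified Python sketch. It is not the Unicode Line Breaking Algorithm: it only assumes that a break is allowed after a regular space, a regular hyphen, a soft hyphen, or a zero-width space, and nowhere else. The function names are illustrative.

```python
SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_SPACE = "\u200b"
NO_BREAK_SPACE = "\u00a0"
NON_BREAKING_HYPHEN = "\u2011"

# Characters after which a line may end (simplifying assumption).
BREAK_AFTER = {" ", "-", SOFT_HYPHEN, ZERO_WIDTH_SPACE}


def break_opportunities(text):
    """Return every index at which a new line may begin."""
    return [i + 1 for i, ch in enumerate(text[:-1]) if ch in BREAK_AFTER]


def visible(line):
    """Render one line: a soft hyphen shows only where the line was broken."""
    broken_at_soft_hyphen = line.endswith(SOFT_HYPHEN)
    shown = line.replace(SOFT_HYPHEN, "").replace(ZERO_WIDTH_SPACE, "")
    return shown + "-" if broken_at_soft_hyphen else shown
```

Under these assumptions, the non-breaking space and the non-breaking hyphen produce no break opportunity simply because they are absent from BREAK_AFTER.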


Word wrapping in text containing Chinese, Japanese, and Korean

In Chinese, Japanese, and Korean, word wrapping can usually occur before and after any Han character, but certain punctuation characters are not allowed to begin a new line. Japanese kana, the syllabic characters of Japanese, are treated the same way as Han characters (kanji) by extension, meaning words can be, and tend to be, broken without any hyphen or other indication that this has happened.

Under certain circumstances, however, word wrapping is not desired. For instance,

* word wrapping might not be desired within personal names, and
* word wrapping might not be desired within any compound words (when the text is flush left but only in some styles).

Most existing word processors and typesetting software cannot handle either of the above scenarios.

CJK punctuation may or may not follow rules similar to the above-mentioned special circumstances; this is determined by the line breaking rules in CJK. One special case of those rules, however, always applies: line wrap must never occur inside the CJK dash and ellipsis. Even though each of these punctuation marks must be represented by two characters due to a limitation of all existing character encodings, each is intrinsically a single punctuation mark that is two ems wide, not two one-em-wide punctuation marks.
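The two rules stated above (no prohibited punctuation at the start of a line, and no break inside a doubled dash or ellipsis) can be sketched in Python as follows. The sketch assumes the input consists only of CJK characters and punctuation, and the set NO_LINE_START is a small illustrative subset, not a complete list of the punctuation that real line breaking rules cover.

```python
# Illustrative subset of punctuation that must not begin a line.
NO_LINE_START = set("\u3001\u3002\uff0c\uff0e\uff1a\uff1b\uff1f\uff01\uff09\u300d\u300f")

# The CJK dash and ellipsis, each stored as two characters.
UNBREAKABLE_PAIRS = {"\u2014\u2014", "\u2026\u2026"}


def cjk_break_opportunities(text):
    """Return every index at which a new line may begin."""
    breaks = []
    for i in range(1, len(text)):
        if text[i] in NO_LINE_START:
            continue  # this punctuation may not start a line
        if text[i - 1:i + 1] in UNBREAKABLE_PAIRS:
            continue  # never split a two-character dash or ellipsis
        breaks.append(i)
    return breaks
```

Personal names and compound words are not handled here; recognizing them would require a dictionary or markup that this sketch does not assume.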


Algorithm

Word wrapping is an optimization problem. Depending on what needs to be optimized for, different algorithms are used.


Minimum number of lines

A simple way to do word wrapping is to use a greedy algorithm that puts as many words on a line as possible, then moves on to the next line to do the same until there are no more words left to place. This method is used by many modern word processors, such as OpenOffice.org Writer and Microsoft Word. This algorithm always uses the minimum possible number of lines but may lead to lines of widely varying lengths. The following pseudocode implements this algorithm:

    SpaceLeft := LineWidth
    for each Word in Text
        if Word is the first word in Text
            SpaceLeft := SpaceLeft - Width(Word)
        else if (SpaceWidth + Width(Word)) > SpaceLeft
            insert line break before Word in Text
            SpaceLeft := LineWidth - Width(Word)
        else
            SpaceLeft := SpaceLeft - (SpaceWidth + Width(Word))

Here LineWidth is the width of a line, SpaceLeft is the remaining width of space on the line to fill, SpaceWidth is the width of a single space character, Text is the input text to iterate over, and Word is a word in this text. A separating space is counted only before a word that is not the first on its line, so a word that exactly fills the rest of a line is still placed on it.
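A minimal Python sketch of the greedy strategy described above, assuming monospaced text, so that the width of a word is its character count and the width of a space is 1. The function name greedy_wrap is illustrative, and the sketch charges a word for its separating space only when it is not the first word on its line.

```python
def greedy_wrap(text, line_width):
    """Wrap text greedily: put as many words on each line as will fit."""
    lines = []
    current = []              # words on the line being filled
    space_left = line_width
    for word in text.split():
        if not current:
            # First word on a line: no separating space is needed.
            current.append(word)
            space_left = line_width - len(word)
        elif 1 + len(word) > space_left:
            # Word (plus its separating space) does not fit: break here.
            lines.append(" ".join(current))
            current = [word]
            space_left = line_width - len(word)
        else:
            current.append(word)
            space_left -= 1 + len(word)
    if current:
        lines.append(" ".join(current))
    return lines


wrapped = greedy_wrap("AAA BB CC DDDDD", 6)
```

A word longer than line_width is placed alone on a line and overflows it; splitting or hyphenating such words is outside the scope of this sketch.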


Minimum raggedness

A different algorithm, used in TeX, minimizes the sum of the squares of the lengths of the spaces at the end of lines to produce a more aesthetically pleasing result. The following example compares this method with the greedy algorithm, which does not always minimize squared space.

For the input text

    AAA BB CC DDDDD

with line width 6, the greedy algorithm would produce:

    ------    Line width: 6
    AAA BB    Remaining space: 0
    CC        Remaining space: 4
    DDDDD     Remaining space: 1

The sum of squared space left over by this method is 0^2 + 4^2 + 1^2 = 17. However, the optimal solution achieves the smaller sum 3^2 + 1^2 + 1^2 = 11:

    ------    Line width: 6
    AAA       Remaining space: 3
    BB CC     Remaining space: 1
    DDDDD     Remaining space: 1

The difference here is that the first line is broken before BB instead of after it, yielding a better right margin and a lower cost of 11.

By using a dynamic programming algorithm to choose the positions at which to break the line, instead of choosing breaks greedily, the solution with minimum raggedness may be found in time O(n^2), where n is the number of words in the input text. Typically, the cost function for this technique should be modified so that it does not count the space left on the final line of a paragraph; this modification allows a paragraph to end in the middle of a line without penalty. It is also possible to apply the same dynamic programming technique to minimize more complex cost functions that combine other factors such as the number of lines or costs for hyphenating long words. Faster but more complicated linear time algorithms based on the SMAWK algorithm are also known for the minimum raggedness problem, and for some other cost functions that have similar properties.
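The quadratic-time dynamic programming approach can be sketched in Python as follows. This is a simplified illustration rather than TeX's actual algorithm: it assumes monospaced text with a space width of 1, requires every word to fit on a line, and counts the space left on the final line, as the worked example above does. The function name is illustrative.

```python
def min_raggedness_wrap(text, line_width):
    """Return (lines, cost) minimizing the sum of squared trailing space."""
    words = text.split()
    if any(len(word) > line_width for word in words):
        raise ValueError("every word must fit within line_width")
    n = len(words)
    # best[i] is the minimum cost of laying out words[i:].
    best = [float("inf")] * (n + 1)
    next_break = [n] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space charged before the first word
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > line_width:
                break
            cost = (line_width - length) ** 2 + best[j + 1]
            if cost < best[i]:
                best[i] = cost
                next_break[i] = j + 1
    # Follow the recorded break positions to rebuild the lines.
    lines = []
    i = 0
    while i < n:
        j = next_break[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines, best[0]


lines, cost = min_raggedness_wrap("AAA BB CC DDDDD", 6)
```

To stop penalizing the final line, as the text suggests, the cost of a line that ends at the last word could be set to zero instead of the squared remaining space.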


History

A primitive line-breaking feature was used in 1955 in a "page printer control unit" developed by Western Union. This system used relays rather than programmable digital computers, and therefore needed a simple algorithm that could be implemented without data buffers. In the Western Union system, each line was broken at the first space character to appear after the 58th character, or at the 70th character if no space character was found.

The greedy algorithm for line-breaking predates the dynamic programming method outlined by Donald Knuth in an unpublished 1977 memo describing his TeX typesetting system and later published in more detail by Knuth and Plass in "Breaking Paragraphs into Lines".


See also

* Non-breaking space
* Typographic alignment
* Zero-width space
* Word divider
* Word joiner

External links

* Unicode Line Breaking Algorithm


Knuth's algorithm

* "tex_wrap": "Implements TeX's algorithm for breaking paragraphs into lines." Reference: "Breaking Paragraphs into Lines", D.E. Knuth and M.F. Plass, chapter 3 of Digital Typography, CSLI Lecture Notes #78.
* Text::Reflow: Perl module for reflowing text files using Knuth's paragraphing algorithm. "The reflow algorithm tries to keep the lines the same length but also tries to break at punctuation, and avoid breaking within a proper name or after certain connectives ("a", "the", etc.). The result is a file with a more "ragged" right margin than is produced by fmt or Text::Wrap but it is easier to read since fewer phrases are broken across line breaks."
* Knuth's breaking algorithm. "The detailed description of the model and the algorithm can be found on the paper "Breaking Paragraphs into Lines" by Donald E. Knuth, published in the book "Digital Typography" (Stanford, California: Center for the Study of Language and Information, 1999), (CSLI Lecture Notes, no. 78.)"; part of Google Summer Of Code 2006
* "Bridging the Algorithm Gap: A Linear-time Functional Program for Paragraph Formatting" by Oege de Moor, Jeremy Gibbons, 1997


Other word-wrap links

* Simon Pepping, 2006. Extends the Knuth model to handle a few enhancements.
* "... The *really* interesting thing is how Adobe's algorithm differs from the Knuth-Plass algorithm. It must differ, since Adobe has managed to patent its algorithm (6,510,441)"
* "Line breaking": compares the algorithms of various time complexities.