Word Break

In punctuation, a word divider is a glyph that separates written words. In languages which use the Latin, Cyrillic, and Arabic alphabets, as well as other scripts of Europe and West Asia, the word divider is a blank space, or ''whitespace''. This convention is spreading, along with other aspects of European punctuation, to Asia and Africa, where words are usually written without word separation. In computing, the word ''delimiter'' is used to refer to a character that separates two words. In character encoding, word segmentation depends on which characters are defined as word dividers.
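
The dependence of segmentation on the chosen divider set can be illustrated with a minimal sketch. The function name `split_words` and the sample string are hypothetical, chosen only for illustration; the same text yields different "words" depending on which characters are treated as dividers.

```python
import re


def split_words(text, dividers=" "):
    """Split text on any run of the given word-divider characters."""
    if not dividers:
        # With no dividers defined, the whole text is a single unit.
        return [text] if text else []
    # re.escape keeps characters such as ']' or '-' literal inside the class.
    pattern = "[" + re.escape(dividers) + "]+"
    return [word for word in re.split(pattern, text) if word]


sample = "ARMA·VIRVMQVE·CANO"          # illustrative text using interpuncts
print(split_words(sample))             # only the space counts as a divider
print(split_words(sample, " ·"))       # space and interpunct both count
```

With the space alone as divider the sample is one unbroken unit; once the interpunct is added to the divider set it falls into three words.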


History

In Ancient Egyptian, determinatives may have been used as much to demarcate word boundaries as to disambiguate the semantics of words. Rarely in Assyrian cuneiform, but commonly in the later cuneiform Ugaritic alphabet, a vertical stroke 𒑰 was used to separate words. In Old Persian cuneiform, a diagonally sloping wedge 𐏐 was used. As the alphabet spread throughout the ancient world, words were often run together without division, and this practice remains or remained until recently in much of South and Southeast Asia. However, not infrequently in inscriptions a vertical line, and in manuscripts a single (·), double (:), or triple (⫶) interpunct (dot) was used to divide words. This practice was found in Phoenician, Aramaic, Hebrew, Greek, and Latin, and continues today with Ethiopic, though there whitespace is gaining ground.


Scriptio continua

The early alphabetic writing systems, such as the Phoenician alphabet, had only signs for consonants (although some signs for consonants could also stand for a vowel, so-called ''matres lectionis''). Without some form of visible word dividers, parsing a text into its separate words would have been a puzzle. With the introduction of letters representing vowels in the Greek alphabet, the need for inter-word separation lessened. The earliest Greek inscriptions used interpuncts, as was common in the writing systems which preceded it, but soon the practice of ''scriptio continua'', continuous writing in which all words ran together without separation, became common.
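
Why undivided text can be a puzzle to parse may be seen in a small sketch that lists every way a run of letters can be split into known words. The function name `segmentations`, the vocabulary, and the English sample are assumptions made for illustration only; they do not model any ancient script.

```python
from functools import lru_cache


def segmentations(text, vocabulary):
    """Return every way to split text into words drawn from vocabulary."""
    vocab = frozenset(vocabulary)

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to split the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


vocab = {"god", "is", "now", "here", "nowhere"}
for words in segmentations("godisnowhere", vocab):
    print(" ".join(words))
# prints:
# god is now here
# god is nowhere
```

The same letters support two readings with different meanings, which is the kind of ambiguity a visible word divider removes. The number of candidate splits can grow very quickly with text length, so this exhaustive listing is suitable only for short examples.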


Types


None

Alphabetic writing without inter-word separation, known as ''scriptio continua'', was used in Ancient Egyptian. It appeared in Post-classical Latin after several centuries of the use of the interpunct. Traditionally, ''scriptio continua'' was used for the Indic alphabets of South and Southeast Asia and hangul of Korea, but spacing is now used with hangul and increasingly with the Indic alphabets. Today Chinese and Japanese are the most widely used scripts consistently written without punctuation to separate words, though other scripts such as Thai and Lao also follow this writing convention. In Classical Chinese, a word and a character were almost the same thing, so that word dividers would have been superfluous. Although Modern Mandarin has numerous polysyllabic words, and each syllable is written with a distinct character, the conceptual link between character and word or at least morpheme remains strong, and no need is felt for word separation apart from what characters already provide. This link is also found in the Vietnamese language; however, in the Vietnamese alphabet, virtually all syllables are separated by spaces, whether or not they form word boundaries.


Space

Space is the most common word divider, especially in Latin script.


Vertical lines

Ancient inscribed and cuneiform scripts such as Anatolian hieroglyphs frequently used short vertical lines to separate words, as did Linear B. In manuscripts, vertical lines were more commonly used for larger breaks, equivalent to the Latin comma and period. This was the case for Biblical Hebrew (the paseq) and continues with many Indic scripts today (the danda).


Interpunct, multiple dots, and hypodiastole

As noted above, the single and double interpunct were used in manuscripts (on paper) throughout the ancient world. For example, Ethiopic inscriptions used a vertical line, whereas manuscripts used double dots (፡) resembling a colon. The latter practice continues today, though the space is making inroads. Classical Latin used the interpunct in both paper manuscripts and stone inscriptions (Wingo 1972:16). Ancient Greek orthography used between two and five dots as word separators, as well as the hypodiastole.


Different letter forms

In the modern Hebrew and Arabic alphabets, some letters have distinct forms at the ends and/or beginnings of words. This demarcation is used in addition to spacing.


Vertical arrangement

The Nastaʿlīq form of Islamic calligraphy uses vertical arrangement to separate words. The beginning of each word is written higher than the end of the preceding word, so that a line of text takes on a sawtooth appearance. Nastaliq spread from Persia and today is used for Persian, Uyghur, Pashto, and Urdu.


Pause

In finger spelling and in Morse code, words are separated by a pause.


Unicode

For use with computers, these marks have codepoints in Unicode.


See also

* Whitespace
* Sentence spacing
* Speech segmentation
* Zero-width non-joiner
* Zero-width space
* Substitute blank
* Underscore

