Word separator
   HOME

TheInfoList



OR:

In
punctuation Punctuation (or sometimes interpunction) is the use of spacing, conventional signs (called punctuation marks), and certain typographical devices as aids to the understanding and correct reading of written text, whether read silently or aloud. An ...
, a word divider is a glyph that separates written words. In languages which use the
Latin Latin (, or , ) is a classical language belonging to the Italic branch of the Indo-European languages. Latin was originally a dialect spoken in the lower Tiber area (then known as Latium) around present-day Rome, but through the power of the ...
, Cyrillic, and Arabic alphabets, as well as other scripts of Europe and West Asia, the word divider is a blank
space Space is the boundless three-dimensional extent in which objects and events have relative position and direction. In classical physics, physical space is often conceived in three linear dimensions, although modern physicists usually cons ...
, or ''whitespace''. This convention is spreading, along with other aspects of European punctuation, to Asia and Africa, where words are usually written without word separation. In computing, the word delimiter is used to refer to a character that separates two words. In
character encoding Character encoding is the process of assigning numbers to Graphics, graphical character (computing), characters, especially the written characters of Language, human language, allowing them to be Data storage, stored, Data communication, transmi ...
,
word segmentation Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in compu ...
depends on which
characters Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
are defined as word dividers.


History

In Ancient Egyptian,
determinative A determinative, also known as a taxogram or semagram, is an ideogram used to mark semantic categories of words in logographic scripts which helps to disambiguate interpretation. They have no direct counterpart in spoken language, though they may ...
s may have been used as much to demarcate word boundaries as to disambiguate the semantics of words. Rarely in
Assyrian cuneiform Cuneiform is a logo- syllabic script that was used to write several languages of the Ancient Middle East. The script was in active use from the early Bronze Age until the beginning of the Common Era. It is named for the characteristic wedge-s ...
, but commonly in the later cuneiform Ugaritic alphabet, a vertical stroke 𒑰 was used to separate words. In Old Persian cuneiform, a diagonally sloping wedge 𐏐 was used. As the alphabet spread throughout the ancient world, words were often run together without division, and this practice remains or remained until recently in much of South and Southeast Asia. However, not infrequently in inscriptions a vertical line, and in manuscripts a single (·), double (:), or triple (⫶) interpunct (dot) was used to divide words. This practice was found in Phoenician,
Aramaic The Aramaic languages, short Aramaic ( syc, ܐܪܡܝܐ, Arāmāyā; oar, 𐤀𐤓𐤌𐤉𐤀; arc, 𐡀𐡓𐡌𐡉𐡀; tmr, אֲרָמִית), are a language family containing many varieties (languages and dialects) that originated in ...
,
Hebrew Hebrew (; ; ) is a Northwest Semitic language of the Afroasiatic language family. Historically, it is one of the spoken languages of the Israelites and their longest-surviving descendants, the Jews and Samaritans. It was largely preserved ...
,
Greek Greek may refer to: Greece Anything of, from, or related to Greece, a country in Southern Europe: *Greeks, an ethnic group. *Greek language, a branch of the Indo-European language family. **Proto-Greek language, the assumed last common ancestor ...
, and
Latin Latin (, or , ) is a classical language belonging to the Italic branch of the Indo-European languages. Latin was originally a dialect spoken in the lower Tiber area (then known as Latium) around present-day Rome, but through the power of the ...
, and continues today with Ethiopic, though there whitespace is gaining ground.


Scriptio continua

The early
alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syllab ...
ic writing systems, such as the
Phoenician alphabet The Phoenician alphabet is an alphabet (more specifically, an abjad) known in modern times from the Canaanite and Aramaic inscriptions found across the Mediterranean region. The name comes from the Phoenician civilization. The Phoenician a ...
, had only signs for
consonant In articulatory phonetics, a consonant is a speech sound that is articulated with complete or partial closure of the vocal tract. Examples are and pronounced with the lips; and pronounced with the front of the tongue; and pronounced wi ...
s (although some signs for consonants could also stand for a
vowel A vowel is a syllabic speech sound pronounced without any stricture in the vocal tract. Vowels are one of the two principal classes of speech sounds, the other being the consonant. Vowels vary in quality, in loudness and also in quantity (leng ...
, so-called '' matres lectionis''). Without some form of visible word dividers, parsing a text into its separate words would have been a puzzle. With the introduction of letters representing vowels in the
Greek alphabet The Greek alphabet has been used to write the Greek language since the late 9th or early 8th century BCE. It is derived from the earlier Phoenician alphabet, and was the earliest known alphabetic script to have distinct letters for vowels as w ...
, the need for inter-word separation lessened. The earliest Greek inscriptions used interpuncts, as was common in the writing systems which preceded it, but soon the practice of '' scriptio continua'', continuous writing in which all words ran together without separation became common.


Types


None

Alphabetic writing without inter-word separation, known as '' scriptio continua'', was used in Ancient Egyptian. It appeared in Post-classical Latin after several centuries of the use of the interpunct. Traditionally, ''scriptio continua'' was used for the Indic alphabets of South and Southeast Asia and
hangul The Korean alphabet, known as Hangul, . Hangul may also be written as following South Korea's standard Romanization. ( ) in South Korea and Chosŏn'gŭl in North Korea, is the modern official writing system for the Korean language. The le ...
of Korea, but spacing is now used with hangul and increasingly with the Indic alphabets. Today Chinese and
Japanese Japanese may refer to: * Something from or related to Japan, an island country in East Asia * Japanese language, spoken mainly in Japan * Japanese people, the ethnic group that identifies with Japan through ancestry or culture ** Japanese diaspor ...
are the most widely-used scripts consistently written without punctuation to separate words, though other scripts such as Thai and Lao also follow this writing convention. In Classical Chinese, a word and a character were almost the same thing, so that word dividers would have been superfluous. Although Modern Mandarin has numerous polysyllabic words, and each syllable is written with a distinct character, the conceptual link between character and word or at least
morpheme A morpheme is the smallest meaningful constituent of a linguistic expression. The field of linguistic study dedicated to morphemes is called morphology. In English, morphemes are often but not necessarily words. Morphemes that stand alone are ...
remains strong, and no need is felt for word separation apart from what characters already provide. This link is also found in the
Vietnamese language Vietnamese ( vi, tiếng Việt, links=no) is an Austroasiatic languages, Austroasiatic language originating from Vietnam where it is the national language, national and official language. Vietnamese is spoken natively by over 70 million people, ...
; however, in the
Vietnamese alphabet The Vietnamese alphabet ( vi, chữ Quốc ngữ, lit=script of the National language) is the modern Latin writing script or writing system for Vietnamese. It uses the Latin script based on Romance languages originally developed by Portuguese m ...
, virtually all syllables are separated by spaces, whether or not they form word boundaries.


Space

Space is the most common word divider, especially in
Latin script The Latin script, also known as Roman script, is an alphabetic writing system based on the letters of the classical Latin alphabet, derived from a form of the Greek alphabet which was in use in the ancient Greek city of Cumae, in southern I ...
.


Vertical lines

Ancient inscribed and cuneiform scripts such as
Anatolian hieroglyphs Anatolian hieroglyphs are an indigenous logographic script native to central Anatolia, consisting of some 500 signs. They were once commonly known as Hittite hieroglyphs, but the language they encode proved to be Luwian, not Hittite, and the ter ...
frequently used short vertical lines to separate words, as did Linear B. In manuscripts, vertical lines were more commonly used for larger breaks, equivalent to the Latin comma and period. This was the case for Biblical Hebrew (the
paseq Hebrew punctuation is similar to that of English and other Western languages, Modern Hebrew having imported additional punctuation marks from these languages in order to avoid the ambiguities sometimes occasioned by the relative paucity of such ...
) and continues with many Indic scripts today (the
danda In Indic scripts, the daṇḍa (Sanskrit: दण्ड ' "stick") is a punctuation mark. The glyph consists of a single vertical stroke. Use The daṇḍa marks the end of a sentence or line, comparable to a full stop (period) as commonly u ...
).


Interpunct, multiple dots, and hypodiastole

As noted above, the single and double interpunct were used in manuscripts (on paper) throughout the ancient world. For example, Ethiopic inscriptions used a vertical line, whereas manuscripts used double dots (፡) resembling a colon. The latter practice continues today, though the space is making inroads. Classical Latin used the interpunct in both paper manuscripts and stone inscriptions.(Wingo 1972:16) Ancient Greek orthography used between two and five dots as word separators, as well as the hypodiastole.


Different letter forms

In the modern
Hebrew Hebrew (; ; ) is a Northwest Semitic language of the Afroasiatic language family. Historically, it is one of the spoken languages of the Israelites and their longest-surviving descendants, the Jews and Samaritans. It was largely preserved ...
and Arabic alphabets, some letters have distinct forms at the ends and/or beginnings of words. This demarcation is used in addition to spacing.


Vertical arrangement

The Nastaʿlīq form of
Islamic calligraphy Islamic calligraphy is the artistic practice of handwriting and calligraphy, in the languages which use Arabic alphabet or the alphabets derived from it. It includes Arabic, Persian, Ottoman, and Urdu calligraphy.Chapman, Caroline (2012). ...
uses vertical arrangement to separate words. The beginning of each word is written higher than the end of the preceding word, so that a line of text takes on a sawtooth appearance. Nastaliq spread from Persia and today is used for
Persian Persian may refer to: * People and things from Iran, historically called ''Persia'' in the English language ** Persians, the majority ethnic group in Iran, not to be conflated with the Iranic peoples ** Persian language, an Iranian language of the ...
, Uyghur,
Pashto Pashto (,; , ) is an Eastern Iranian language in the Indo-European language family. It is known in historical Persian literature as Afghani (). Spoken as a native language mostly by ethnic Pashtuns, it is one of the two official langua ...
, and
Urdu Urdu (;"Urdu"
'' finger spelling and in Morse code, words are separated by a pause.


Unicode

For use with computers, these marks have
codepoint In character encoding terminology, a code point, codepoint or code position is a numerical value that maps to a specific character. Code points usually represent a single grapheme—usually a letter, digit, punctuation mark, or whitespace—but ...
s in
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, wh ...
: * * *


See also

* Whitespace *
Sentence spacing Sentence spacing concerns how spaces are inserted between sentences in typeset text and is a matter of typographical convention. Since the introduction of movable-type printing in Europe, various sentence spacing conventions have been used in ...
*
Speech segmentation Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language proces ...
*
Zero-width non-joiner The zero-width non-joiner (ZWNJ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When placed between two characters that would otherwise be connected into a ligature, a ZWNJ causes them to b ...
*
Zero-width space The zero-width space , abbreviated ZWSP, is a non-printing character used in computerized typesetting to indicate word boundaries to text-processing systems in scripts that do not use explicit spacing, or after characters (such as the slash) that a ...
* Substitute blank *
Underscore An underscore, ; also called an underline, low line, or low dash; is a line drawn under a segment of text. In proofreading, underscoring is a convention that says "set this text in italic type", traditionally used on manuscript or typescript a ...


References


Further reading

* * * * * {{navbox punctuation Punctuation