Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang. Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.


Strategies

The standard 'vanilla' approach to locating the end of a sentence:
(a) If it's a period, it ends a sentence.
(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. The remaining 5% often involve shortened names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation or non-standard usage of punctuation.

Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
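The three vanilla rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, reads the rules as a cascade in which each later rule overrides the earlier ones, and uses a small hypothetical abbreviation list (`ABBREVIATIONS`) standing in for a real hand-compiled one. The function name `split_sentences` is likewise made up for this example.

```python
# Hypothetical hand-compiled abbreviation list (lower-case, without the final period).
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "e.g", "i.e", "etc", "vs", "inc"}


def split_sentences(text):
    """Split text with the three 'vanilla' rules, applied as a cascade."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # only a period is considered as a boundary candidate

        ends = True  # rule (a): a period ends a sentence
        if token[:-1].lower() in ABBREVIATIONS:
            ends = False  # rule (b): a known abbreviation does not
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is not None and next_token[0].isupper():
            ends = True  # rule (c): a capitalized next token does

        if ends:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that had no final period
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    sample = "The price rose to 3.5 dollars, e.g. in May. It fell later. He met D. H. Lawrence."
    for sentence in split_sentences(sample):
        print(sentence)
```

On the sample text, the decimal point and "e.g." are handled correctly, but the initials in "D. H. Lawrence" are each treated as a sentence end, which is the kind of case the text places in the remaining 5%.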


Software

Examples of use of Perl-compatible regular expressions (PCRE):
* ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
* $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)

Online use, libraries, and APIs:
* sent_detector (Java)
* Lingua-EN-Sentence (Perl)
* Sentence.pm (Perl)
* SATZ, an Adaptive Sentence Segmentation System by David D. Palmer (C)

Toolkits that include sentence detection:
* Apache OpenNLP
* Freeling (software)
* Natural Language Toolkit
* Stanford NLP
* GExp
* CogComp-NLP
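
A regular-expression splitter in the spirit of the PCRE examples in this section can be tried with Python's standard `re` module. The sketch below is an adaptation rather than a copy: the pattern is one reading of the first PCRE example (a lower-case letter or digit, then ".", "?" or "!", optionally a closing double quote, then whitespace, then an optional opening quote and a capital letter), and it uses non-capturing groups so that `re.split` does not return the separators. The names `BOUNDARY` and `split_on_boundaries` are made up for this example.

```python
import re

# Whitespace that follows [a-z0-9][.?!] (optionally plus a closing quote)
# and precedes an optional opening quote and a capital letter.
# Each lookbehind is fixed-width, which Python's re module requires.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))'  # what comes before the gap
    r'(?:\r\n|\s)'                                  # the gap itself
    r'(?="?[A-Z])'                                  # what comes after the gap
)


def split_on_boundaries(text):
    """Split text at every position matched by BOUNDARY."""
    return BOUNDARY.split(text)


if __name__ == "__main__":
    sample = 'It rained at 3.5 p.m. today. "Go home." She left! Why not?'
    for sentence in split_on_boundaries(sample):
        print(sentence)
```

The pattern leaves "3.5" and "p.m. today" intact because no whitespace-plus-capital follows those periods, but it has no abbreviation list, so an input such as "Mr. Smith left." is split after "Mr.".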


See also

* Sentence spacing
* Word divider
* Syllabification
* Punctuation
* Text segmentation
* Speech segmentation
* Sentence extraction
* Translation memory
* Multiword expression


External links


* pySBD - Python Sentence Boundary Disambiguation