Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.
Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.
Some languages including Japanese and Chinese have unambiguous sentence-ending markers.
Strategies
The standard 'vanilla' approach to locating the end of a sentence:
:(a) If it's a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. Things such as shortened names, e.g. "D. H. Lawrence" (with whitespaces between the individual words that form the full name), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN") and usage of non-standard punctuation (or non-standard usage of punctuation) in a text often fall under the remaining 5%.
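The three-rule heuristic above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, assumes the rules are applied in order with later rules overriding earlier ones, and uses a tiny hypothetical abbreviation list in place of a real hand-compiled one.

```python
# Hypothetical, deliberately tiny abbreviation list (stored without the
# trailing period); a real list would be hand-compiled and much larger.
ABBREVIATIONS = {"e.g", "i.e", "etc", "vs", "dr", "mr", "mrs"}


def split_sentences(text):
    """Split text into sentences using rules (a), (b) and (c) in order."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        ends = True                                   # (a) a period ends a sentence
        if token[:-1].lower() in ABBREVIATIONS:
            ends = False                              # (b) unless it follows an abbreviation
        if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
            ends = True                               # (c) a capitalized next token ends it
        if ends:
            sentences.append(" ".join(current))
            current = []
    if current:                                       # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("I like fruit, e.g. apples. They are sweet."))
print(split_sentences("He met Dr. Smith."))
```

Under these assumptions the first call yields two sentences, while the second is wrongly split after "Dr." because rule (c) overrides rule (b). Errors of this kind are what keeps the heuristic from reaching full accuracy.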
Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
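The idea of learning from pre-marked text can be illustrated with a much simpler stand-in than the maximum entropy or neural-network models mentioned above. The sketch below only counts how often a period-terminated token occurs inside a marked sentence versus at its end, and treats tokens seen mostly inside sentences as abbreviations. The function name `learn_abbreviations` and the toy training sentences are hypothetical.

```python
from collections import Counter


def learn_abbreviations(marked_sentences):
    """Infer likely abbreviations from text whose sentence breaks are pre-marked.

    marked_sentences is a list of strings, one string per sentence.
    """
    inside, final = Counter(), Counter()
    for sentence in marked_sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if not token.endswith("."):
                continue
            key = token[:-1].lower()
            if i == len(tokens) - 1:
                final[key] += 1    # this period coincides with a marked boundary
            else:
                inside[key] += 1   # this period occurs mid-sentence
    # Keep tokens whose period is more often mid-sentence than sentence-final.
    return {word for word in inside if inside[word] > final[word]}


# Toy, made-up training data for illustration only.
training = ["Mr. Brown arrived.", "He saw Mr. Green.", "They left at noon."]
print(learn_abbreviations(training))
```

A learned set like this could replace the hand-compiled abbreviation list in a rule-based splitter. The statistical models named above weigh many more contextual features than this single count.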
Software
;Examples of use of Perl-compatible regular expressions ("PCRE")
:* ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
:* $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
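A lookbehind/lookahead pattern of the kind listed above can be tried in Python's standard `re` module. The following is a sketch under stated assumptions rather than a port of either expression: it splits on whitespace that follows a lowercase letter or digit plus terminal punctuation (optionally a closing double quote) and precedes a capital letter (optionally an opening double quote). The non-capturing group is a choice made here so that `re.split` does not return the delimiters.

```python
import re

# Whitespace preceded by [a-z0-9] + one of . ? ! (optionally followed by a
# closing double quote) and followed by an optional quote and a capital letter.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])'
)


def split_sentences(text):
    """Split text at whitespace that matches the boundary pattern."""
    return BOUNDARY.split(text)


print(split_sentences('He left. She stayed! "Why?" Nobody knew.'))
# ['He left.', 'She stayed!', '"Why?"', 'Nobody knew.']
```

Decimal points such as "3.50" are not split because no whitespace follows the period. Like the rule-based heuristic, though, the pattern still splits after an abbreviation that precedes a capitalized word.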
;Online use, libraries, and APIs
:* sent_detector (Java)
:* Lingua-EN-Sentence (Perl)
:* Sentence.pm (Perl)
:* SATZ, An Adaptive Sentence Segmentation System by David D. Palmer (C)
;Toolkits that include sentence detection
:* Apache OpenNLP
:* Freeling (software)
:* Natural Language Toolkit
:* Stanford NLP
:* GExp
:* CogComp-NLP
See also
* Sentence spacing
* Word divider
* Syllabification
* Punctuation
* Text segmentation
* Speech segmentation
* Sentence extraction
* Translation memory
* Multiword expression
External links
* pySBD - Python Sentence Boundary Disambiguation