Sentence extraction

Sentence extraction is a technique used for automatic summarization of a text. In this shallow approach, statistical heuristics are used to identify the most salient sentences of a text. Sentence extraction is a low-cost approach compared to more knowledge-intensive, deeper approaches, which require additional knowledge bases such as ontologies or linguistic knowledge. In short, sentence extraction works as a filter that allows only important sentences to pass. The major downside of applying sentence-extraction techniques to the task of summarization is the loss of coherence in the resulting summary. Nevertheless, sentence-extraction summaries can give valuable clues to the main points of a document and are frequently intelligible enough for human readers.


Procedure

Usually, a combination of heuristics is used to determine the most important sentences within the document. Each heuristic assigns a (positive or negative) score to a sentence, and the individual heuristics are weighted according to their importance. After all heuristics have been applied, the highest-scoring sentences are included in the summary.
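In outline, this amounts to a weighted sum of per-sentence heuristic scores, followed by a top-n selection. The following Python sketch illustrates the idea; the two heuristics (sentence position and a short-sentence penalty) and their weights are assumptions made for the example, not values prescribed by the literature:

    def position_score(index, total):
        # Sentences near the beginning of the document score higher
        # (cf. Luhn's position heuristic, discussed below).
        return 1.0 - index / total

    def length_penalty(sentence):
        # Very short sentences receive a negative score.
        return -1.0 if len(sentence.split()) < 4 else 0.0

    def extract_summary(sentences, weights=(0.7, 0.3), top_n=3):
        # Score every sentence with a weighted combination of heuristics.
        scored = []
        for i, sent in enumerate(sentences):
            score = (weights[0] * position_score(i, len(sentences))
                     + weights[1] * length_penalty(sent))
            scored.append((score, i, sent))
        # Keep the top-scoring sentences, then restore document order.
        best = sorted(sorted(scored, reverse=True)[:top_n], key=lambda t: t[1])
        return [sent for _, _, sent in best]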


Early approaches and some sample heuristics

Seminal papers that laid the foundations for many techniques used today were published by Hans Peter Luhn in 1958 and H. P. Edmundson in 1969. Luhn proposed assigning more weight to sentences at the beginning of the document or of a paragraph. Edmundson stressed the importance of title words for summarization and was the first to employ stop-lists in order to filter out uninformative words of low semantic content (e.g. most grammatical words, such as "of", "the", "a"). He also distinguished between ''bonus words'' and ''stigma words'', i.e. words that tend to occur together with important information (e.g. the word "significant") or with unimportant information, respectively. His idea of using keywords, i.e. words that occur significantly frequently in the document, is still one of the core heuristics of today's summarizers.
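A rough Python sketch of these early heuristics might look as follows; the stop-list, bonus words, stigma words, and weights are toy assumptions chosen for illustration, not Edmundson's original lists:

    STOP_WORDS = {"of", "the", "a", "an", "in", "is", "to"}
    BONUS_WORDS = {"significant", "important", "conclude"}    # cue words hinting at salient content
    STIGMA_WORDS = {"hardly", "impossible"}                   # cue words hinting at unimportant content

    def edmundson_score(sentence, title):
        # Tokenize crudely and apply the stop-list filter.
        words = [w.lower().strip(".,;:") for w in sentence.split()]
        content = [w for w in words if w not in STOP_WORDS]
        title_words = {w.lower() for w in title.split()} - STOP_WORDS
        score = 0.0
        score += 2.0 * sum(w in title_words for w in content)   # title-word heuristic
        score += 1.0 * sum(w in BONUS_WORDS for w in content)   # bonus words
        score -= 1.0 * sum(w in STIGMA_WORDS for w in content)  # stigma words
        return score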
With large linguistic corpora available today, the tf–idf value, which originated in information retrieval, can be applied to identify the keywords of a text: if, for example, the word "cat" occurs significantly more often in the text to be summarized (TF stands for "term frequency") than in the corpus (IDF stands for "inverse document frequency"; here each text of the corpus counts as one "document"), then "cat" is likely to be an important word of the text; the text may in fact be a text about cats.
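This keyword weighting can be sketched as follows; the whitespace tokenization and the IDF smoothing constant are assumptions made for the example:

    import math
    from collections import Counter

    def tf_idf_keywords(text, corpus, top_n=5):
        words = [w.lower().strip(".,;:") for w in text.split()]
        tf = Counter(words)  # term frequency within the text to be summarized
        scores = {}
        for word, freq in tf.items():
            # Document frequency: number of corpus documents containing the word.
            df = sum(word in doc.lower().split() for doc in corpus)
            idf = math.log((len(corpus) + 1) / (df + 1))  # smoothed inverse document frequency
            scores[word] = freq * idf
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

A word such as "cat" that is frequent in the text but rare in the corpus receives a high tf–idf score and is therefore picked up as a keyword.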


See also

* Text segmentation
* Sentence boundary disambiguation


References

* Luhn, Hans Peter (1958). "The Automatic Creation of Literature Abstracts". ''IBM Journal of Research and Development'', 2(2): 159–165.
* Edmundson, H. P. (1969). "New Methods in Automatic Extracting". ''Journal of the ACM'', 16(2): 264–285.