
Multi-document summarization is an automatic procedure for extracting information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents. In this way, multi-document summarization systems complement news aggregators, taking the next step toward coping with information overload.


Key benefits and difficulties

Multi-document summarization creates information reports that are both concise and comprehensive. Because different opinions are brought together and outlined, each topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and save time by pointing to the most relevant source documents, a comprehensive multi-document summary should in theory itself contain the required information, limiting the need to access the original files to cases where refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without editorial touch or subjective human intervention, which is intended to keep them free of editorial bias. In practice, however, it is hard to summarize multiple documents with conflicting views and biases: a clear extractive summary of such documents is almost impossible to achieve, and abstractive summarization is the preferred avenue in this case.
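As a rough illustration of the extractive approach mentioned above, the following sketch pools sentences from all source documents and keeps those whose content words occur most often across the whole set. The frequency-based scoring, the small stopword list, the naive sentence splitter, and all function names are assumptions made for this example; they are not taken from any system described in this article.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is",
             "are", "was", "were", "it", "that", "this", "for", "with",
             "as", "by", "at"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def extractive_summary(documents, max_sentences=2):
    """Return the highest-scoring whole sentences, in their original order."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    # Word frequencies are counted over the entire document set.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore source order
    return [sentences[i] for i in chosen]
```

Because sentences are copied verbatim, a scorer of this kind tends to select several sentences that repeat the same dominant theme and has no way to reconcile contradictory statements, which is one reason conflicting sources are difficult for purely extractive methods.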


Technological challenges

The multi-document summarization task is more complex than summarizing a single document, even a long one. The difficulty arises from thematic diversity within a large set of documents. A good summarization technology aims to combine the main themes with completeness, readability, and concision. The Document Understanding Conferences, conducted annually by NIST, have developed sophisticated evaluation criteria for techniques accepting the multi-document summarization challenge.

An ideal multi-document summarization system not only shortens the source texts, but also presents information organized around the key aspects to represent diverse views. When it succeeds, the result is an overview of a given topic. Such text compilations should also meet the basic requirements for an overview text compiled by a human. The multi-document summary quality criteria are as follows:

* clear structure, including an outline of the main content, from which it is easy to navigate to the full text sections
* text within sections is divided into meaningful paragraphs
* gradual transition from more general to more specific thematic aspects
* good readability

The last point deserves an additional note. Care is taken to ensure that the automatic overview shows:

* no unrelated "information noise" from the respective documents (e.g., web pages)
* no dangling references to what is not mentioned or explained in the overview
* no text breaks across a sentence
* no semantic redundancy
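The "no semantic redundancy" criterion can be approximated, very roughly, by discarding candidate sentences that overlap too much with sentences already selected. The sketch below uses Jaccard similarity over word sets as a stand-in for semantic similarity; the measure, the 0.5 threshold, and the function names are assumptions chosen for illustration, not part of any evaluation procedure described here.

```python
import re


def word_set(sentence):
    """Set of lowercased word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_redundant(sentences, threshold=0.5):
    """Keep a sentence only if it is not too similar to any kept earlier.

    Sentences are processed in the given order, so callers should pass
    them ranked from most to least important.
    """
    kept, kept_sets = [], []
    for sentence in sentences:
        words = word_set(sentence)
        if all(jaccard(words, other) < threshold for other in kept_sets):
            kept.append(sentence)
            kept_sets.append(words)
    return kept
```

Word overlap catches only near-verbatim repetition; two sentences that state the same fact in different words would pass this filter, so it should be read as a lower bound on what a real redundancy check needs to do. Operating on whole sentences also keeps the output free of text breaks across a sentence.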


Real-life systems

Multi-document summarization technology is now coming of age, a view supported by the range of advanced web-based systems that are currently available.

* ReviewChomp presents summaries of customer reviews for any given product or service. Some products have thousands of online reviews, far more than a person can read in a reasonable time. The website itself performs the search for the product or service.
* Ultimate Research Assistant performs text mining on Internet search results to help summarize and organize them and make it easier for the user to perform online research. Specific text mining techniques used by the tool include concept extraction, text summarization, hierarchical concept clustering (e.g., automated taxonomy generation), and various visualization techniques, including tag clouds and mind maps.
* iResearch Reporter is a commercial text extraction and text summarization system. Its free demo site accepts a user-entered query, passes it on to the Google search engine, retrieves multiple relevant documents, and produces categorized, easily readable natural language summary reports covering the documents in the retrieved set, with all extracts linked to the original documents on the Web. Its tool set covers post-processing, entity extraction, event and relationship extraction, text extraction, extract clustering, linguistic analysis, categorization rules, and text summary construction.
* Newsblaster is a system that helps users find the news that is of most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis, and it provides users with an interface to browse the results.
* NewsInEssence may be used to retrieve and summarize a cluster of articles from the web. It can start from a URL and retrieve documents that are similar, or it can retrieve documents that match a given set of keywords. NewsInEssence also downloads news articles daily and produces news clusters from them.
* NewsFeed Researcher is a news portal performing continuous automatic summarization of documents initially clustered by news aggregators (e.g., Google News). NewsFeed Researcher is backed by a free online engine covering major events related to business, technology, U.S. and international news. This tool is also available in on-demand mode, allowing a user to build summaries on selected topics.
* Scrape This is like a search engine, but instead of providing links to the most relevant websites based on a query, it scrapes the pertinent information off the relevant websites and provides the user with a consolidated multi-document summary, along with dictionary definitions, images, and videos.
* JistWeb is a query-specific multi-document summarizer.

As auto-generated multi-document summaries increasingly resemble overviews written by a human, their use of extracted text snippets may one day raise copyright issues in relation to the fair use doctrine.
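Several of the systems listed above, such as Newsblaster and NewsInEssence, group related articles into clusters before summarizing them. How those systems cluster is not described here; the sketch below only illustrates the general idea with a simple single-pass scheme that assigns each document to the most similar existing cluster by cosine similarity of raw term counts, or starts a new cluster when nothing is similar enough. The similarity measure, the 0.3 threshold, and all names are assumptions for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_documents(documents, threshold=0.3):
    """Single-pass clustering; returns lists of document indices."""
    clusters = []  # each entry: {"centroid": Counter, "members": [int]}
    for index, doc in enumerate(documents):
        vec = term_vector(doc)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [index]})
        else:
            # The centroid is kept as summed term counts; cosine similarity
            # ignores vector length, so no averaging is needed.
            best["centroid"].update(vec)
            best["members"].append(index)
    return [cluster["members"] for cluster in clusters]
```

Each resulting cluster can then be passed to a summarizer as one document set. A single-pass scheme depends on the order in which documents arrive, which is a known weakness of this simple design.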


See also

* Automatic summarization
* Text mining
* News aggregators


External links


* Document Understanding Conferences
* NewsInEssence: Web-based News Summarization
* ReviewChomp