In machine learning and information retrieval, the cluster hypothesis is an assumption about the nature of the data handled in those fields, which takes various forms. In information retrieval, it states that documents that are clustered together "behave similarly with respect to relevance to information needs". In terms of classification, it states that if points are in the same cluster, they are likely to be of the same class. There may be multiple clusters forming a single class.

Information retrieval

The cluster hypothesis was first formulated by van Rijsbergen: "closely associated documents tend to be relevant to the same requests". Thus, theoretically, a search engine could try to locate only the appropriate cluster for a query, and then allow users to browse through this cluster. Although experiments showed that the cluster hypothesis as such holds, exploiting it for retrieval did not lead to satisfactory results.
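The idea of locating the appropriate cluster for a query can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy clusters, the bag-of-words representation, the use of cosine similarity against a summed cluster centroid, and the helper names `cosine`, `centroid`, and `best_cluster`. It is not a description of how any particular search engine works.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Summed term counts of a cluster; cosine ignores the overall scale."""
    total = Counter()
    for d in docs:
        total.update(d)
    return total

def best_cluster(query, clusters):
    """Return the name of the cluster whose centroid is most similar to the query."""
    q = Counter(query.lower().split())
    return max(clusters, key=lambda name: cosine(q, centroid(clusters[name])))

# Hypothetical, hand-made clusters of tiny "documents".
clusters = {
    "astronomy": [Counter("star galaxy telescope".split()),
                  Counter("planet orbit star".split())],
    "cooking": [Counter("recipe oven bread".split()),
                Counter("sauce pasta recipe".split())],
}

chosen = best_cluster("bread recipe", clusters)
```

In this sketch the user would then browse the documents of the chosen cluster rather than a ranked list drawn from the whole collection.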

Machine learning

Many machine learning algorithms rely on the cluster assumption, among them the k-nearest neighbor classification algorithm and the k-means clustering algorithm. Because the definition uses the word "likely", there is no sharp boundary between data for which the assumption holds and data for which it does not. The degree to which data adhere to the assumption can, however, be measured quantitatively.
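As one possible way to quantify adherence, the sketch below computes the fraction of points whose nearest other point carries the same class label. This particular score, the function name `nearest_neighbor_agreement`, and the toy data are assumptions chosen for illustration; the source does not specify which measure is meant.

```python
import math

def nearest_neighbor_agreement(points, labels):
    """Fraction of points whose nearest other point has the same label."""
    if len(points) != len(labels) or len(points) < 2:
        raise ValueError("need at least two labelled points")
    agree = 0
    for i, p in enumerate(points):
        # Leave-one-out: search for the closest point other than p itself.
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: math.dist(p, points[k]))
        agree += labels[j] == labels[i]
    return agree / len(points)

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10), (10, 12)]

# Labels that follow the groups: the assumption holds strongly.
clustered_labels = ["a", "a", "a", "b", "b", "b"]
# Labels that alternate within each group: the assumption is largely violated.
mixed_labels = ["a", "b", "a", "b", "a", "b"]

score_clustered = nearest_neighbor_agreement(points, clustered_labels)
score_mixed = nearest_neighbor_agreement(points, mixed_labels)
```

A score near 1 suggests the labels respect the cluster structure, while a lower score suggests they cut across it; where to draw the line remains a judgment call, in keeping with the "likely" in the definition.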

Properties

The cluster assumption is equivalent to the low-density separation assumption, which states that the decision boundary should lie in a low-density region. To see this, suppose the decision boundary crosses one of the clusters. Then that cluster contains points from two different classes, so the cluster assumption is violated on that cluster.
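Low-density separation can be illustrated in one dimension with a minimal sketch that estimates the data density and places a threshold where the estimate is lowest. The Gaussian kernel estimate, the bandwidth of 0.5, the grid search, the toy samples, and the names `kde` and `lowest_density_threshold` are all assumptions made for illustration.

```python
import math

def kde(x, samples, bandwidth=0.5):
    """Gaussian kernel density estimate at x for 1-D samples."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

def lowest_density_threshold(samples, steps=200, bandwidth=0.5):
    """Grid-search the point between min and max where the density is lowest."""
    lo, hi = min(samples), max(samples)
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda x: kde(x, samples, bandwidth))

# Hypothetical unlabeled data forming two clusters with a gap between them.
samples = [0.0, 0.5, 1.0, 1.5, 8.0, 8.5, 9.0, 9.5]

threshold = lowest_density_threshold(samples)

# A boundary placed inside a cluster sits in a much denser region.
density_at_threshold = kde(threshold, samples)
density_inside_cluster = kde(1.0, samples)
```

Here the chosen threshold falls in the gap between the two groups, whereas a boundary at 1.0 would split the first cluster and, by the argument above, violate the cluster assumption.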
