Overcategorization
   HOME

TheInfoList



OR:

{{unreferenced, date=November 2011 Overcategorization, overcategorisation or category clutter is the process of assigning too many categories, classes or index terms to a given
document A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ''Documentum'', which denotes a "teaching" o ...
. It is related to the Library and Information Science (LIS) concepts of
document classification Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...
and
subject indexing Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are ''about'', to summarize their contents or to increase findability. In other words, i ...
. In LIS, the ideal number of terms that should be assigned to classify an item are measured by the variables precision and recall. Assigning few category labels that are most closely related to the content of the item being classified will result in searches that have high precision, I.e., where a high proportion of the results are closely related to the query. Assigning more category labels to each item will reduce the precision of each search, but increase the recall, retrieving more relevant results. Related LIS concepts include exhaustivity of indexing and information overload.


Basic principles

If too many categories are assigned to a given document, the implications for users depend on how informative the links are. If the user is able to distinguish between useful and not useful links, the damage is limited: The user only wastes time selecting links. In many cases, however, the user cannot judge whether or not a given link will turn out to be fruitful. In that case he or she has to follow the link and to read or skim another document. The worst case scenario is, of course, that even after reading the new document the user is unable to decide whether or not it might be useful if its subject matter is not thoroughly investigated. Overcategorization also has another unpleasant implication: It makes the system (for example in Wikipedia) difficult to maintain in a
consistent In classical deductive logic, a consistent theory is one that does not lead to a logical contradiction. The lack of contradiction can be defined in either semantic or syntactic terms. The semantic definition states that a theory is consistent ...
way. If the system is inconsistent, it means that when the user considers the links in a given category, he or she will not find all documents relevant to that category. Basically, the problem of overcategorization should be understood from the perspective of
relevance Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the second topic when considering the first. The concept of relevance is studied in many different fields, including cognitive sci ...
and the traditional measures of recall and
precision Precision, precise or precisely may refer to: Science, and technology, and mathematics Mathematics and computing (general) * Accuracy and precision, measurement deviation from true value and its scatter * Significant figures, the number of digit ...
. If too few ''relevant'' categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or
relevant Relevant is something directly related, connected or pertinent to a topic; it may also mean something that is current. Relevant may also refer to: * Relevant operator, a concept in physics, see renormalization group * Relevant, Ain, a commune ...
for future use of the document.


See also

* Exhaustivity * Information overload * Information pollution *
Relevance Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the second topic when considering the first. The concept of relevance is studied in many different fields, including cognitive sci ...
* Subject (documents) *
Subject indexing Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are ''about'', to summarize their contents or to increase findability. In other words, i ...
* Overfitting Library science Information science