{{unreferenced, date=November 2011
Overcategorization, overcategorisation or category clutter is the process of assigning too many categories, classes or
index term In information retrieval, an index term (also known as subject term, subject heading, descriptor, or keyword) is a term that captures the essence of the topic of a document. Index terms make up a controlled vocabulary for use in bibliographic reco ...
s to a given
document
A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ''Documentum'', which denotes a "teaching" o ...
document classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...
and
subject indexing
Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are ''about'', to summarize their contents or to increase findability. In other words, i ...
.
In LIS, the ideal number of terms that should be assigned to classify an item are measured by the variables
precision and recall
In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.
Precision (also called ...
. Assigning few category labels that are most closely related to the content of the item being classified will result in searches that have high precision, I.e., where a high proportion of the results are closely related to the query. Assigning more category labels to each item will reduce the precision of each search, but increase the recall, retrieving more relevant results. Related LIS concepts include exhaustivity of indexing and
information overload
Information overload (also known as infobesity, infoxication, information anxiety, and information explosion) is the difficulty in understanding an issue and effectively making decisions when one has too much information (TMI) about that issue, ...
.
Basic principles
If too many categories are assigned to a given document, the implications for users depend on how
informative
Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...
the links are. If the user is able to distinguish between useful and not useful links, the damage is limited: The user only wastes time selecting links. In many cases, however, the user cannot judge whether or not a given link will turn out to be fruitful. In that case he or she has to follow the link and to read or skim another document. The worst case scenario is, of course, that even after reading the new document the user is unable to decide whether or not it might be useful if its subject matter is not thoroughly investigated.
Overcategorization also has another unpleasant implication: It makes the system (for example in Wikipedia) difficult to maintain in a
consistent
In classical deductive logic, a consistent theory is one that does not lead to a logical contradiction. The lack of contradiction can be defined in either semantic or syntactic terms. The semantic definition states that a theory is consisten ...
way. If the system is inconsistent, it means that when the user considers the links in a given category, he or she will not find all documents relevant to that category.
Basically, the problem of overcategorization should be understood from the perspective of
relevance
Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the second topic when considering the first. The concept of relevance is studied in many different fields, including cognitive sc ...
and the traditional measures of
recall
Recall may refer to:
* Recall (bugle call), a signal to stop
* Recall (information retrieval), a statistical measure
* ''ReCALL'' (journal), an academic journal about computer-assisted language learning
* Recall (memory)
* ''Recall'' (Overwat ...
and precision. If too few ''relevant'' categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or
relevant
Relevant is something directly related, connected or pertinent to a topic; it may also mean something that is current.
Relevant may also refer to:
* Relevant operator, a concept in physics, see renormalization group
* Relevant, Ain, a commune ...
Information overload
Information overload (also known as infobesity, infoxication, information anxiety, and information explosion) is the difficulty in understanding an issue and effectively making decisions when one has too much information (TMI) about that issue, ...
*
Information pollution
Information pollution (also referred to as info pollution) is the contamination of information supply with irrelevant, redundant, unsolicited, hampering and low-value information. Examples include misinformation, junk e-mail and media violence.
...
*
Relevance
Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the second topic when considering the first. The concept of relevance is studied in many different fields, including cognitive sc ...
*
Subject (documents)
In library and information science documents (such as books, articles and pictures) are classified and searched by subject – as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this ...
*
Subject indexing
Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are ''about'', to summarize their contents or to increase findability. In other words, i ...
*
Overfitting
mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...