HOME

TheInfoList



OR:

Document classification or document categorization is a problem in
library science Library and information science (LIS)Library and Information Sciences is the name used in the Dewey Decimal Classification for class 20 from the 18th edition (1971) to the 22nd edition (2003). are two interconnected disciplines that deal with info ...
, information science and
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
. The task is to assign a
document A document is a writing, written, drawing, drawn, presented, or memorialized representation of thought, often the manifestation of nonfiction, non-fictional, as well as fictional, content. The word originates from the Latin ', which denotes ...
to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.


"Content-based" versus "request-based" classification

Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries, that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appears in a document. Request-oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230). Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as ''policy-based classification'': The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.


Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("
subject indexing Subject indexing is the act of describing or classifying a document A document is a writing, written, drawing, drawn, presented, or memorialized representation of thought, often the manifestation of nonfiction, non-fictional, as well as ...
") but as
Frederick Wilfrid Lancaster Frederick Wilfrid ("Wilf") Lancaster (September 4, 1933 – August 25, 2013) was a British-American information scientist. He immigrated to the US in 1959 and worked as information specialist for the National Library of Medicine in Bethesda, Mar ...
has argued, this distinction is not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a
thesaurus A thesaurus (: thesauri or thesauruses), sometimes called a synonym dictionary or dictionary of synonyms, is a reference work which arranges words by their meanings (or in simpler terms, a book where one can find different words with similar me ...
and vice versa (cf., Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, the act of labeling a document (say by assigning a term from a
controlled vocabulary A controlled vocabulary provides a way to organize knowledge for subsequent retrieval. Controlled vocabularies are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled v ...
to a document) is at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents). In other words, labeling a document is the same as assigning it to the class of documents indexed under that label.


Automatic document classification (ADC)

Automatic document classification tasks can be divided into three sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information, and semi-supervised document classification, where parts of the documents are labeled by the external mechanism. There are several software products under various license models available.


Techniques

Automatic document classification techniques include: *
Artificial neural network In machine learning, a neural network (also artificial neural network or neural net, abbreviated ANN or NN) is a computational model inspired by the structure and functions of biological neural networks. A neural network consists of connected ...
* Concept Mining * Decision trees such as ID3 or C4.5 * Expectation maximization (EM) * Instantaneously trained neural networks *
Latent semantic indexing Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the d ...
* Multiple-instance learning * Naive Bayes classifier *
Natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
approaches * Rough set-based classifier * Soft set-based classifier *
Support vector machines In machine learning, support vector machines (SVMs, also support vector networks) are supervised max-margin models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laborato ...
(SVM) * K-nearest neighbour algorithms *
tf–idf In information retrieval, tf–idf (term frequency–inverse document frequency, TF*IDF, TFIDF, TF–IDF, or Tf–idf) is a measure of importance of a word to a document in a collection or Text corpus, corpus, adjusted for the fact that some words ...


Applications

Classification techniques have been applied to *
spam filter Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly ap ...
ing, a process which tries to discern
E-mail spam Email spam, also referred to as junk email, spam mail, or simply spam, refers to unsolicited messages sent in bulk via email. The term originates from a Monty Python sketch, where the name of a canned meat product, "Spam," is used repetitively, m ...
messages from legitimate emails * email
routing Routing is the process of selecting a path for traffic in a Network theory, network or between or across multiple networks. Broadly, routing is performed in many types of networks, including circuit-switched networks, such as the public switched ...
, sending an email sent to a general address to a specific address or mailbox depending on topic * language identification, automatically determining the language of a text * genre classification, automatically determining the genre of a text * readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger
text simplification Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so its grammar and structure is greatly simplified while the underlying meaning an ...
system *
sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subje ...
, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. * health-related classification using social media in public health surveillance X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7. * article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology


See also

*
Categorization Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identi ...
* Classification (disambiguation) * Compound term processing *
Concept-based image indexing Concept-based image indexing, also variably named as "description-based" or "text-based" image indexing/retrieval, refers to retrieval from text-based indexing of images that may employ keywords, subject headings, captions, or natural language tex ...
*
Content-based image retrieval Content-based image retrieval, also known as query by image content ( QBIC) and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching ...
* Decimal section numbering *
Document A document is a writing, written, drawing, drawn, presented, or memorialized representation of thought, often the manifestation of nonfiction, non-fictional, as well as fictional, content. The word originates from the Latin ', which denotes ...
*
Document retrieval Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly natural language, unstructured text, such as newspaper articles, real estate records or paragraphs ...
* Document clustering *
Information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
* Knowledge organization * Knowledge Organization System *
Library classification A library classification is a system used within a library to organize materials, including books, sound and video recordings, electronic materials, etc., both on shelves and in catalogs and indexes. Each item is typically assigned a call number ...
*
Machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...
* Native Language Identification * String metrics *
Subject (documents) In library and information science documents (such as books, articles and pictures) are classified and searched by subject – as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this ...
*
Subject indexing Subject indexing is the act of describing or classifying a document A document is a writing, written, drawing, drawn, presented, or memorialized representation of thought, often the manifestation of nonfiction, non-fictional, as well as ...
*
Supervised learning In machine learning, supervised learning (SL) is a paradigm where a Statistical model, model is trained using input objects (e.g. a vector of predictor variables) and desired output values (also known as a ''supervisory signal''), which are often ...
,
unsupervised learning Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, wh ...
*
Text mining Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...
, web mining, concept mining


References


Further reading

* Fabrizio Sebastiani
Machine learning in automated text categorization
ACM Computing Surveys, 34(1):1–47, 2002. * Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack
Information Retrieval: Implementing and Evaluating Search Engines
. MIT Press, 2010.


External links








Text Classification
analysis page

(available online) * ttp://techtc.cs.technion.ac.il TechTC - Technion Repository of Text Categorization Datasets
David D. Lewis's Datasets
* {{Natural Language Processing Data mining Information science Knowledge representation Machine learning Natural language processing