HOME

TheInfoList



OR:

Document classification or document categorization is a problem in
library science Library science (often termed library studies, bibliothecography, and library economy) is an interdisciplinary or multidisciplinary field that applies the practices, perspectives, and tools of management, information technology, education, an ...
,
information science Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of informatio ...
and
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includi ...
. The task is to assign a
document A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ''Documentum'', which denotes a "teaching" o ...
to one or more classes or
categories Category, plural categories, may refer to: Philosophy and general uses *Categorization, categories in cognitive science, information science and generally *Category of being * ''Categories'' (Aristotle) *Category (Kant) * Categories (Peirce) * ...
. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.


"Content-based" versus "request-based" classification

Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries, that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appears in a document. Request-oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230). Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as ''policy-based classification'': The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.


Classification versus indexing

Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("
subject indexing Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are ''about'', to summarize their contents or to increase findability. In other words, i ...
") but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a
thesaurus A thesaurus (plural ''thesauri'' or ''thesauruses'') or synonym dictionary is a reference work for finding synonyms and sometimes antonyms of words. They are often used by writers to help find the best word to express an idea: Synonym dictionar ...
and vice versa (cf., Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, the act of labeling a document (say by assigning a term from a
controlled vocabulary Control may refer to: Basic meanings Economics and business * Control (management), an element of management * Control, an element of management accounting * Comptroller (or controller), a senior financial officer in an organization * Controllin ...
to a document) is at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents). In other words, labeling a document is the same as assigning it to the class of documents indexed under that label.


Automatic document classification (ADC)

Automatic document classification tasks can be divided into three sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, unsupervised document classification (also known as
document clustering Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. Overview Document cluster ...
), where the classification must be done entirely without reference to external information, and semi-supervised document classification, where parts of the documents are labeled by the external mechanism. There are several software products under various license models available.


Techniques

Automatic document classification techniques include: *
Expectation maximization Expectation or Expectations may refer to: Science * Expectation (epistemic) * Expected value, in mathematical probability theory * Expectation value (quantum mechanics) * Expectation–maximization algorithm, in statistics Music * ''Expectati ...
(EM) * Naive Bayes classifier *
tf–idf In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or c ...
*
Instantaneously trained neural networks Instantaneously trained neural networks are feedforward artificial neural networks that create a new hidden neuron node for each novel training sample. The weights to this hidden neuron separate out not only this training sample but others that are ...
*
Latent semantic indexing Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the do ...
* Support vector machines (SVM) *
Artificial neural network Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected unit ...
* K-nearest neighbour algorithms * Decision trees such as ID3 or C4.5 * Concept Mining * Rough set-based classifier *
Soft set Soft set theory is a generalization of fuzzy set theory, that was proposed by Molodtsov in 1999 to deal with uncertainty in a parametric manner. A soft set is a parameterised family of sets - intuitively, this is "soft" because the boundary of the ...
-based classifier * Multiple-instance learning * Natural language processing approaches


Applications

Classification techniques have been applied to *
spam filter Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly appl ...
ing, a process which tries to discern
E-mail spam Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoida ...
messages from legitimate emails * email routing, sending an email sent to a general address to a specific address or mailbox depending on topic *
language identification In natural language processing, language identification or language guessing is the problem of determining which natural language given content is in. Computational approaches to this problem view it as a special case of text categorization, solv ...
, automatically determining the language of a text * genre classification, automatically determining the genre of a text * readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger
text simplification Text simplification is an operation used in natural language processing to change, enhance, classify, or otherwise process an existing body of human-readable text so its grammar and structure is greatly simplified while the underlying meaning and ...
system *
sentiment analysis Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjec ...
, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. * health-related classification using social media in public health surveillance X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7. * article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology


See also

* Categorization * Classification (disambiguation) *
Compound term processing Compound-term processing, in information-retrieval, is search result matching on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single word term, but "triple heart bypass" is ...
*
Concept-based image indexing Concept-based image indexing, also variably named as "description-based" or "text-based" image indexing/retrieval, refers to retrieval from text-based indexing of images that may employ keywords, subject headings, captions, or natural language text ...
*
Content-based image retrieval Content-based image retrieval, also known as query by image content ( QBIC) and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching ...
*
Decimal section numbering A paragraph () is a self-contained unit of discourse in writing dealing with a particular point or idea. Though not required by the orthographic conventions of any language with a writing system, paragraphs are a conventional means of organizing e ...
*
Document A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ''Documentum'', which denotes a "teaching" o ...
*
Document retrieval Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. Use ...
*
Document clustering Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering. Overview Document cluster ...
* Information retrieval *
Knowledge organization Knowledge organization (KO), organization of knowledge, organization of information, or information organization is an intellectual discipline concerned with activities such as document description, indexing, and classification that serve to ...
*
Knowledge Organization System Knowledge Organization Systems (KOS), concept system or concept scheme is a generic term used in knowledge organization for authority files, classification schemes, thesauri, topic maps, ontologies and similar works. Despite their differences in typ ...
*
Library classification A library classification is a system of organization of knowledge by which library resources are arranged and ordered systematically. Library classifications are a notational system that represents the order of topics in the classification and al ...
*
Machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
*
Native Language Identification Native-language identification (NLI) is the task of determining an author's native language (L1) based only on their writings in a second language (L2). NLI works through identifying language-usage patterns that are common to specific L1 groups and ...
*
String metrics In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comp ...
* Subject (documents) *
Subject indexing Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are ''about'', to summarize their contents or to increase findability. In other words, i ...
*
Supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
,
unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...
*
Text mining Text mining, also referred to as ''text data mining'', similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extract ...
, web mining, concept mining


Further reading

* Fabrizio Sebastiani
Machine learning in automated text categorization
ACM Computing Surveys, 34(1):1–47, 2002. * Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack
Information Retrieval: Implementing and Evaluating Search Engines
. MIT Press, 2010.


References


External links








Text Classification
analysis page

(available online) * ttp://techtc.cs.technion.ac.il TechTC - Technion Repository of Text Categorization Datasets
David D. Lewis's Datasets

BioCreative III ACT (article classification task) dataset
{{Natural Language Processing Information science Natural language processing Knowledge representation Data mining Machine learning