Controlled vocabularies provide a way to organize knowledge for subsequent retrieval. They are used in subject indexing schemes, subject headings, thesauri, taxonomies and other knowledge organization systems. Controlled vocabulary schemes mandate the use of predefined, authorised terms that have been preselected by the designers of the schemes, in contrast to natural language vocabularies, which have no such restriction.


In library and information science

In library and information science, a controlled vocabulary is a carefully selected list of words and phrases that are used to tag units of information (documents or works) so that they may be more easily retrieved by a search. Controlled vocabularies solve the problems of homographs, synonyms and polysemes by a bijection between concepts and authorized terms. In short, controlled vocabularies ensure consistency and reduce the ambiguity inherent in normal human languages, where the same concept can be given different names.

For example, in the Library of Congress Subject Headings (a subject heading system that uses a controlled vocabulary), authorized terms, which are subject headings in this case, have to be chosen to handle choices between variant spellings of the same word (American versus British), choices between popular and scientific terms (''cockroach'' versus ''Periplaneta americana''), and choices between synonyms (''automobile'' versus ''car''), among other difficult issues. Choices of authorized terms are based on the principles of ''user warrant'' (what terms users are likely to use), ''literary warrant'' (what terms are generally used in the literature and documents), and ''structural warrant'' (terms chosen by considering the structure and scope of the controlled vocabulary).

Controlled vocabularies also typically handle the problem of homographs with qualifiers. For example, the term ''pool'' has to be qualified to refer to either ''swimming pool'' or the game ''pool'' to ensure that each authorized term or heading refers to only one concept.
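
The mapping from variant terms to authorized terms can be sketched in a few lines of Python. This is a minimal illustration, not how any real subject heading system is implemented: the dictionaries `ENTRY_TERMS` and `QUALIFIED`, the function `authorize`, and the choice of which variant counts as "authorized" are all assumptions made for the example.

```python
# Hypothetical mini authority file: every entry term (variant spelling,
# synonym, or scientific name) points to exactly one authorized heading.
# Which variant is "authorized" here is an arbitrary choice for illustration.
ENTRY_TERMS = {
    "car": "automobile",
    "automobile": "automobile",
    "colour": "color",
    "color": "color",
    "periplaneta americana": "cockroach",
    "cockroach": "cockroach",
}

# Homographs are split into qualified headings, one per concept.
QUALIFIED = {
    "pool": ["pool (game)", "pool (swimming)"],
}


def authorize(term):
    """Return the list of authorized headings for a term (empty if unknown)."""
    key = term.strip().lower()
    if key in QUALIFIED:
        # Ambiguous term: the caller has to pick one qualified heading.
        return QUALIFIED[key]
    if key in ENTRY_TERMS:
        return [ENTRY_TERMS[key]]
    return []  # the term is not part of the vocabulary


print(authorize("Car"))
print(authorize("pool"))
```

A synonym resolves to a single heading, while a homograph returns several qualified headings, so that each heading still refers to only one concept.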


Types used in libraries

There are two main kinds of controlled vocabulary tools used in libraries: subject headings and thesauri. While the differences between the two are diminishing, some minor differences remain. Historically, subject headings were designed for catalogers to describe books in library catalogs, while thesauri were used by indexers to apply index terms to documents and articles. Subject headings tend to be broader in scope, describing whole books, while thesauri tend to be more specialized, covering very specific disciplines. Because of the card catalog system, subject headings also tend to have terms in indirect order (though with the rise of automated systems this is being removed), while thesaurus terms are always in direct order. Subject headings also tend to use more pre-coordination of terms, in which the designer of the controlled vocabulary combines various concepts to form one authorized subject heading (e.g., children and terrorism), while thesauri tend to use singular direct terms. Lastly, thesauri list not only equivalent terms but also narrower, broader and related terms among various authorized and non-authorized terms, while historically most subject headings did not. For example, the Library of Congress Subject Headings did not have much syndetic structure until 1943, and it was not until 1985 that it began to adopt the thesaurus-type terms "Broader term" and "Narrower term".

The terms are chosen and organized by trained professionals (including librarians and information scientists) who possess expertise in the subject area. Controlled vocabulary terms can accurately describe what a given document is actually about, even if the terms themselves do not occur within the document's text. Well known subject heading systems include the Library of Congress system, MeSH, and Sears. Well known thesauri include the Art and Architecture Thesaurus and the ERIC Thesaurus.

Choosing authorized terms is a tricky business. Besides the areas already considered above, the designer has to consider the specificity of the term chosen, whether to use direct entry, inter consistency and stability of the language. Lastly, the amount of pre-coordination (in which case the degree of enumeration versus synthesis becomes an issue) and post-coordination in the system is another important issue.

Controlled vocabulary elements (terms or phrases) employed as tags to aid in the content identification process of documents or other information system entities (e.g. DBMS, Web Services) qualify as metadata.
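
The syndetic structure of a thesaurus (broader, narrower, related and equivalent terms) can be modelled as a small data structure. The following Python sketch is illustrative only: the `Term` class, the helper functions `build` and `ancestors`, and the sample terms are assumptions for this example and are not taken from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One authorized thesaurus term and its relationships."""
    label: str
    broader: list = field(default_factory=list)    # BT: broader terms
    narrower: list = field(default_factory=list)   # NT: narrower terms
    related: list = field(default_factory=list)    # RT: related terms
    used_for: list = field(default_factory=list)   # UF: non-authorized equivalents


def build(terms):
    """Index terms by label and derive the reciprocal NT links from BT links."""
    by_label = {t.label: t for t in terms}
    for t in terms:
        for parent in t.broader:
            if t.label not in by_label[parent].narrower:
                by_label[parent].narrower.append(t.label)
    return by_label


def ancestors(by_label, label):
    """Walk BT links upward and return every broader term of `label`."""
    found = []
    stack = list(by_label[label].broader)
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(by_label[current].broader)
    return found


# Hypothetical sample data.
thesaurus = build([
    Term("sports"),
    Term("team sports", broader=["sports"]),
    Term("association football", broader=["team sports"],
         used_for=["soccer"], related=["rugby football"]),
    Term("rugby football", broader=["team sports"],
         related=["association football"]),
])

print(thesaurus["team sports"].narrower)
print(ancestors(thesaurus, "association football"))
```

Storing only the broader-term links and deriving the narrower-term links keeps the two directions consistent, which is one simple way to maintain reciprocal relationships.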


Indexing languages

There are three main types of indexing languages.
* Controlled indexing language: only approved terms can be used by the indexer to describe the document
* Natural language indexing language: any term from the document in question can be used to describe the document
* Free indexing language: any term (not only from the document) can be used to describe the document

When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example, with low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general, the higher the indexing exhaustivity, the more terms are indexed for each document.

In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with indexing exhaustivity set to maximum (every word in the text is ''indexed''). Many studies have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well chosen controlled vocabulary descriptors.
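
The contrast between maximum exhaustivity and a few assigned descriptors can be made concrete with two small indexes. This is a sketch under stated assumptions: the documents in `DOCS`, the descriptors in `DESCRIPTORS`, and both function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical two-document collection.
DOCS = {
    "d1": "The club built a new stadium. Football attendance rose.",
    "d2": "A history of rugby football in England.",
}

# Low exhaustivity: an indexer assigns descriptors for the main topic only.
DESCRIPTORS = {
    "d1": ["stadiums"],
    "d2": ["rugby football"],
}


def full_text_index(docs):
    """Maximum exhaustivity: every word of every document is an index term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def descriptor_index(descriptors):
    """Controlled indexing: only the assigned descriptors are searchable."""
    index = defaultdict(set)
    for doc_id, terms in descriptors.items():
        for term in terms:
            index[term].add(doc_id)
    return index


free = full_text_index(DOCS)
controlled = descriptor_index(DESCRIPTORS)

print(sorted(free["football"]))              # every document using the word
print(sorted(controlled["rugby football"]))  # only documents tagged with it
```

The full-text index finds the word ''football'' wherever it occurs, including in a document that mentions it only in passing, while the descriptor index returns only what the indexer judged the document to be about.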


Advantages

Controlled vocabularies are often claimed to improve the accuracy of free text searching, for example by reducing the number of irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word ''football'' for example. ''Football'' is the name given to a number of different team sports. Worldwide the most popular of these team sports is association football, which also happens to be called ''soccer'' in several countries. The word ''football'' is also applied to rugby football (rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for ''football'' will therefore retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.

Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic). In some cases controlled vocabulary can enhance recall as well because, unlike with natural language schemes, once the correct authorized term is searched there is no need to search for other terms that might be synonyms of that term.
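
Precision and recall are simple ratios over sets of documents, as the following Python sketch shows. The document identifiers and the two result sets are hypothetical and chosen only to illustrate the ''football'' example; they are not measurements from any real system.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical collection: the searcher wants association football only.
relevant = {"a1", "a2", "a3", "a4"}

# A free text search for "football" also returns rugby and gridiron documents.
free_text_hits = {"a1", "a2", "a3", "a4", "r1", "r2", "g1", "g2"}

# A search on the authorized term returns only documents tagged with it;
# "a4" is assumed to be untagged because football is a minor topic in it.
controlled_hits = {"a1", "a2", "a3"}

print(precision(free_text_hits, relevant), recall(free_text_hits, relevant))
print(precision(controlled_hits, relevant), recall(controlled_hits, relevant))
```

With these invented numbers the free text search has precision 0.5 and recall 1.0, while the controlled search has precision 1.0 and recall 0.75.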


Problems

A controlled vocabulary search may lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question. This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area that the indexer might have tagged the document with a different term, even though the searcher would consider the two the same. Essentially, this can be avoided only by an experienced user of the controlled vocabulary whose understanding of the vocabulary coincides with that of the indexer.

Another possibility is that the article is simply not tagged by the indexer because indexing exhaustivity is low. For example, an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. If that article turns out to be relevant to the searcher, recall fails. A free text search would automatically pick up that article regardless. Free text searches have high exhaustivity (every word is searched), so although they have much lower precision, they have the potential for high recall as long as the searcher overcomes the problem of synonyms by entering every combination.

Controlled vocabularies may become outdated rapidly in fast-developing fields of knowledge unless the authorized terms are updated regularly. Even in an ideal scenario, a controlled vocabulary is often less specific than the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, whereas this problem does not arise in a free text search, which uses the author's own words.

The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make the best use of the system. But as already mentioned, the control of synonyms and homographs can help increase precision.

Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.
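
As a rough illustration of faceted classification, each record below is described along several independent facets and can be retrieved through any combination of them. The facet names, the values in `RECORDS`, and the function `select` are assumptions made for this sketch.

```python
# Hypothetical records, each described along three independent facets.
RECORDS = {
    "rec1": {"sport": "association football", "region": "Europe", "form": "article"},
    "rec2": {"sport": "rugby football", "region": "Europe", "form": "book"},
    "rec3": {"sport": "association football", "region": "South America", "form": "book"},
}


def select(records, **wanted):
    """Return the ids of records whose facets match every requested value."""
    return sorted(
        rec_id
        for rec_id, facets in records.items()
        if all(facets.get(name) == value for name, value in wanted.items())
    )


print(select(RECORDS, sport="association football"))
print(select(RECORDS, region="Europe", form="book"))
```

The same record can be reached by sport, by region, or by form, which is the sense in which a faceted scheme describes a document in multiple ways.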


Applications

Controlled vocabularies, such as the Library of Congress Subject Headings, are an essential component of bibliography, the study and classification of books. They were initially developed in library and information science. In the 1950s, government agencies began to develop controlled vocabularies for the burgeoning journal literature in specialized fields; an example is the Medical Subject Headings (MeSH) developed by the U.S. National Library of Medicine. Subsequently, for-profit firms (called abstracting and indexing services) emerged to index the fast-growing literature in every field of knowledge. In the 1960s, an online bibliographic database industry developed based on dialup X.25 networking. These services were seldom made available to the public because they were difficult to use; specialist librarians called search intermediaries handled the searching job. In the 1980s, the first full text databases appeared; these databases contain the full text of the indexed articles as well as the bibliographic information. Online bibliographic databases have migrated to the Internet and are now publicly available; however, most are proprietary and can be expensive to use. Students enrolled in colleges and universities may be able to access some of these services without charge, and some may also be accessible without charge at a public library.


Technical communication

In large organizations, controlled vocabularies may be introduced to improve technical communication. The use of controlled vocabulary ensures that everyone is using the same word to mean the same thing. This consistency of terms is one of the most important concepts in technical writing and knowledge management, where effort is expended to use the same word throughout a document or organization instead of slightly different ones to refer to the same thing.
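
One way such consistency might be checked automatically is to scan a text for non-preferred variants. The sketch below is hypothetical: the house vocabulary in `PREFERRED` and the function `find_variants` are invented for illustration and do not describe any particular organization's tooling.

```python
import re

# Hypothetical house vocabulary: non-preferred variant -> preferred term.
PREFERRED = {
    "log in": "sign in",
    "login": "sign in",
    "e-mail": "email",
}


def find_variants(text):
    """Return (offset, found text, preferred term) for each non-preferred variant."""
    findings = []
    for variant, preferred in PREFERRED.items():
        # Word boundaries keep the match from firing inside longer words.
        pattern = r"\b" + re.escape(variant) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            findings.append((match.start(), match.group(0), preferred))
    return sorted(findings)


sample = "Login with your e-mail, then log in again."
for offset, found, preferred in find_variants(sample):
    print(f"offset {offset}: replace '{found}' with '{preferred}'")
```

A check like this only flags candidates; whether a given occurrence should actually be changed remains an editorial decision.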


Semantic web and structured data

Web searching could be dramatically improved by the development of a controlled vocabulary for describing Web pages; the use of such a vocabulary could culminate in a Semantic Web, in which the content of Web pages is described using a machine-readable metadata scheme. One of the first proposals for such a scheme is the Dublin Core Initiative. An example of a controlled vocabulary which is usable for indexing web pages is PSH.

It is unlikely that a single metadata scheme will ever succeed in describing the content of the entire Web. To create a Semantic Web, it may be necessary to draw from two or more metadata systems to describe a Web page's contents. The eXchangeable Faceted Metadata Language (XFML) is designed to enable controlled vocabulary creators to publish and share metadata systems. XFML is designed on faceted classification principles.

Controlled vocabularies of the Semantic Web define the concepts and relationships (terms) used to describe a field of interest or area of concern. For instance, to declare a person in a machine-readable format, a vocabulary is needed that has the formal definition of "Person", such as the Friend of a Friend (FOAF) vocabulary, which has a Person class that defines typical properties of a person including, but not limited to, name, honorific prefix, affiliation, email address, and homepage, or the Person vocabulary of Schema.org. Similarly, a book can be described using the Book vocabulary of Schema.org and general publication terms from the Dublin Core vocabulary, an event with the Event vocabulary of Schema.org, and so on.

To use machine-readable terms from any controlled vocabulary, web designers can choose from a variety of annotation formats, including RDFa, HTML5 Microdata, or JSON-LD in the markup, or RDF serializations (RDF/XML, Turtle, N3, TriG, TriX) in external files.
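
As an illustration of the JSON-LD option, the following Python sketch builds a description of a person using the Schema.org Person type and wraps it in the script element commonly used to embed JSON-LD in a page. The person, organization, addresses, and the helper `to_jsonld_script` are made up for the example; only the Schema.org type and property names are taken from that vocabulary.

```python
import json

# Hypothetical person described with Schema.org terms.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "honorificPrefix": "Dr.",
    "name": "Jane Example",
    "affiliation": {"@type": "Organization", "name": "Example University"},
    "email": "mailto:jane@example.org",
    "url": "https://example.org/~jane",
}


def to_jsonld_script(data):
    """Serialize a dict as JSON-LD inside an HTML script element."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(to_jsonld_script(person))
```

Because every key comes from a shared vocabulary, a consumer that understands Schema.org can interpret `name` or `affiliation` without knowing anything else about the page.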


See also

* Authority control
* Controlled natural language
* Defining vocabulary
* IMS Vocabulary Definition Exchange
* Named-entity recognition
* Nomenclature
* Ontology (computer science)
* Terminology
* Universal Data Element Framework
* Vocabulary-based transformation


External Links


Directory of Linked Open Vocabularies (LOV)