Computational lexicology is a branch of computational linguistics concerned with the use of computers in the study of the lexicon. It has been more narrowly described by some scholars (Amsler, 1980) as the use of computers in the study of ''machine-readable dictionaries''. It is distinguished from ''computational lexicography'', which more properly would be the use of computers in the construction of dictionaries, though some researchers have used computational lexicography as a synonym.

History

Computational lexicology emerged as a separate discipline within computational linguistics with the appearance of machine-readable dictionaries, starting with the creation of the machine-readable tapes of the ''Merriam-Webster Seventh Collegiate Dictionary'' and the ''Merriam-Webster New Pocket Dictionary'' in the 1960s by John Olney et al. at System Development Corporation. Today, computational lexicology is best known through the creation and applications of WordNet. As the computing power available to researchers increased over time, computational lexicology came to be applied widely in text analysis. In 1987, Byrd, Calzolari, Chodorow and others developed computational tools for text analysis. In particular, their model was designed for coordinating the associations involving the senses of polysemous words. (Byrd, Roy J., Nicoletta Calzolari, Martin S. Chodorow, Judith L. Klavans, Mary S. Neff, and Omneya A. Rizk. "Tools and methods for computational lexicology." ''Computational Linguistics'' 13, no. 3-4 (1987): 219-240.)

Study of lexicon

Computational lexicology has contributed to the understanding of the content and limitations of print dictionaries for computational purposes (i.e. it clarified that the previous work of lexicography was not sufficient for the needs of computational linguistics). Through the work of computational lexicologists, almost every portion of a print dictionary entry has been studied:

# what constitutes a headword - used to generate spelling correction lists;
# what variants and inflections the headword forms - used to empirically understand morphology;
# how the headword is delimited into syllables;
# how the headword is pronounced - used in speech generation systems;
# the parts of speech the headword takes on - used for POS taggers;
# any special subject or usage codes assigned to the headword - used to identify text document subject matter;
# the headword's definitions and their syntax - used as an aid to disambiguation of words in context;
# the etymology of the headword - used to characterize text vocabulary by its languages of origin;
# the example sentences;
# the run-ons (additional words and multi-word expressions that are formed from the headword); and
# related words such as synonyms and antonyms.

Many computational linguists were disenchanted with print dictionaries as a resource for computational linguistics because they lacked sufficient syntactic and semantic information for computer programs. The work on computational lexicology quickly led to efforts in two additional directions.
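As an illustration of the entry components studied in this section, a machine-readable dictionary entry can be pictured as a record with one field per component. The sketch below is only an assumption about how such a record might look: the class name, field names, helper functions and sample data are hypothetical and do not reproduce the format of any actual machine-readable dictionary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """One machine-readable dictionary entry (hypothetical field names)."""
    headword: str
    inflections: List[str] = field(default_factory=list)
    syllables: List[str] = field(default_factory=list)
    pronunciation: str = ""
    parts_of_speech: List[str] = field(default_factory=list)
    subject_codes: List[str] = field(default_factory=list)
    definitions: List[str] = field(default_factory=list)
    etymology: str = ""
    examples: List[str] = field(default_factory=list)
    run_ons: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)


def known_forms(entries):
    """Collect headwords, inflections and run-ons, e.g. for a spelling list."""
    forms = set()
    for entry in entries:
        forms.add(entry.headword)
        forms.update(entry.inflections)
        forms.update(entry.run_ons)
    return forms


def pos_lookup(entries):
    """Map each headword to the parts of speech its entries list."""
    table = {}
    for entry in entries:
        table.setdefault(entry.headword, set()).update(entry.parts_of_speech)
    return table


# Invented sample data, not taken from any real dictionary.
entries = [
    DictionaryEntry(
        headword="break",
        inflections=["breaks", "broke", "broken", "breaking"],
        parts_of_speech=["verb", "noun"],
        definitions=["to separate into pieces", "a pause"],
        run_ons=["breakable"],
    ),
]

print(sorted(known_forms(entries)))
print(pos_lookup(entries)["break"])
```

The two helpers mirror two of the uses named in the list: a word-form list of the kind a spelling corrector could start from, and a headword-to-POS table of the kind a tagger could consult.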

Successors to Computational Lexicology

First, collaborative activities between computational linguists and lexicographers led to an understanding of the role that corpora played in creating dictionaries. Most computational lexicologists moved on to build large corpora to gather the basic data that lexicographers had used to create dictionaries. The ACL/DCI (Data Collection Initiative) and the LDC (Linguistic Data Consortium) went down this path. The advent of markup languages led to the creation of tagged corpora that could be more easily analyzed to create computational linguistic systems. Part-of-speech tagged corpora and semantically tagged corpora were created in order to test and develop POS taggers and word semantic disambiguation technology.

The second direction was toward the creation of Lexical Knowledge Bases (LKBs). A Lexical Knowledge Base was deemed to be what a dictionary should be for computational linguistic purposes, especially for computational lexical semantic purposes. It was to have the same information as a print dictionary, but fully explicated as to the meanings of the words and the appropriate links between senses. Many researchers began creating the resources they wished dictionaries had been, had they been created for use in computational analysis. WordNet can be considered to be such a development, as can the newer efforts at describing syntactic and semantic information such as the FrameNet work of Fillmore. Outside of computational linguistics, the ontology work of artificial intelligence can be seen as an evolutionary effort to build a lexical knowledge base for AI applications.
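The idea of a lexical knowledge base, with word senses made explicit and linked to one another, can be sketched as a small graph of sense records. The following is a toy illustration under stated assumptions: the class, its methods, the sense identifiers and the glosses are all invented here, and it is not the data model of WordNet, FrameNet or any other existing resource.

```python
from collections import defaultdict


class LexicalKnowledgeBase:
    """Toy store of word senses and typed links between them."""

    def __init__(self):
        self.glosses = {}                   # sense id -> definition text
        self.senses_of = defaultdict(list)  # word -> list of sense ids
        self.links = defaultdict(set)       # (sense id, relation) -> sense ids

    def add_sense(self, sense_id, word, gloss):
        self.glosses[sense_id] = gloss
        self.senses_of[word].append(sense_id)

    def add_link(self, source, relation, target):
        self.links[(source, relation)].add(target)

    def related(self, sense_id, relation):
        # .get avoids creating empty entries in the defaultdict on lookup.
        return sorted(self.links.get((sense_id, relation), ()))

    def ancestors(self, sense_id, relation="hypernym"):
        """Follow one relation transitively, e.g. up a hypernym chain."""
        seen, stack = [], [sense_id]
        while stack:
            for parent in self.related(stack.pop(), relation):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


# Invented senses and links, for illustration only.
kb = LexicalKnowledgeBase()
kb.add_sense("bank.1", "bank", "a financial institution")
kb.add_sense("bank.2", "bank", "sloping land beside a river")
kb.add_sense("institution.1", "institution", "an established organization")
kb.add_sense("organization.1", "organization", "a structured group of people")
kb.add_sense("slope.1", "slope", "an inclined surface")
kb.add_link("bank.1", "hypernym", "institution.1")
kb.add_link("institution.1", "hypernym", "organization.1")
kb.add_link("bank.2", "hypernym", "slope.1")

print(kb.senses_of["bank"])
print(kb.ancestors("bank.1"))
```

Keeping links between sense identifiers rather than between word strings is what lets the two senses of a polysemous word such as "bank" point to different parts of the graph.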

Standardization

Optimizing the production, maintenance and extension of computational lexicons is one of the crucial aspects impacting NLP. The main problem is interoperability: various lexicons are frequently incompatible. The most frequent situation is the need to merge two lexicons, or fragments of lexicons. A secondary problem is that a lexicon is usually tailored to a specific NLP program and is difficult to use within other NLP programs or applications. In this respect, the various data models of computational lexicons have been studied by ISO/TC37 since 2003 within the Lexical Markup Framework project, leading to an ISO standard in 2008.
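The merging problem described above can be made concrete with a small sketch. It assumes two invented lexicons that record parts of speech under different label sets, and shows that they can only be merged after both are mapped into a shared scheme. The function names, tag mappings and data are hypothetical, and the shared scheme here is an ad hoc one, not the Lexical Markup Framework data model.

```python
def normalize(lexicon, pos_map):
    """Rewrite one lexicon's POS labels into a shared tag set."""
    return {
        word: {pos_map[tag] for tag in tags}
        for word, tags in lexicon.items()
    }


def merge_lexicons(first, second):
    """Union two normalized lexicons, word by word."""
    merged = {word: set(tags) for word, tags in first.items()}
    for word, tags in second.items():
        merged.setdefault(word, set()).update(tags)
    return merged


# Two invented lexicons that encode the same kind of information differently.
lexicon_a = {"run": ["V", "N"], "quick": ["ADJ"]}
lexicon_b = {"run": ["verb"], "quickly": ["adverb"]}

# Hypothetical mappings from each lexicon's labels into a shared tag set.
map_a = {"V": "verb", "N": "noun", "ADJ": "adjective"}
map_b = {"verb": "verb", "adverb": "adverb"}

merged = merge_lexicons(normalize(lexicon_a, map_a), normalize(lexicon_b, map_b))
print(merged)
```

Without the normalization step, "V" and "verb" would be kept as two distinct labels for "run", which is the kind of incompatibility a common data model is meant to remove.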

References

Amsler, Robert A. 1980. Ph.D. Dissertation, "The Structure of the Merriam-Webster Pocket Dictionary". The University of Texas at Austin.

External links

* Computational lexicology issue in ACL Wiki
* Association for Computational Linguistics, official page
* Lexical Markup Framework (LMF)