In natural language processing, semantic compression is a process of compacting a lexicon used to build a textual document (or a set of documents) by reducing language heterogeneity, while maintaining text semantics. As a result, the same ideas can be represented using a smaller set of words. In most applications, semantic compression is lossy: increased prolixity does not compensate for the lexical compression, and the original document cannot be reconstructed by a reverse process.

By generalization

Semantic compression by generalization is achieved in two steps, using frequency dictionaries and a semantic network:

1. determining cumulated term frequencies to identify the target lexicon,
2. replacing less frequent terms with their hypernyms (generalization) from the target lexicon.

Step 1 requires assembling word frequencies and information on semantic relationships, specifically hyponymy. Moving upwards in the word hierarchy, a cumulative concept frequency is calculated by adding the sum of the hyponyms' frequencies to the frequency of their hypernym:

$$\operatorname{cum} f(k_i) = f(k_i) + \sum_{j} \operatorname{cum} f(k_j)$$

where $k_i$ is a hypernym of $k_j$. Then, a desired number of words with the top cumulated frequencies are chosen to build the target lexicon. In the second step, compression mapping rules are defined for the remaining words, so that every occurrence of a less frequent hyponym is handled as its hypernym in the output text.

Example

The fragment of text below has been processed by semantic compression, which replaced a number of its words with their hypernyms.
They are both nest building social insects, but paper wasps and honey bees organize their colonies in very different ways. In a new study, researchers report that despite their differences, these insects rely on the same network of genes to guide their social behavior. The study appears in the Proceedings of the Royal Society B: Biological Sciences. Honey bees and paper wasps are separated by more than 100 million years of evolution, and there are striking differences in how they divvy up the work of maintaining a colony.
The procedure outputs the following text:
They are both facility building insect, but insects and honey insects arrange their biological groups in very different structure. In a new study, researchers report that despite their difference of opinions, these insects act the same network of genes to steer their party demeanor. The study appears in the proceeding of the institution bacteria Biological Sciences. Honey insects and insect are separated by more than hundred million years of organic processes, and there are impinging differences of opinions in how they divvy up the work of affirming a biological group.
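
The two steps described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used to produce the example above: the toy hierarchy `HYPERNYM`, the function names and the `lexicon_size` parameter are all assumptions made for the illustration, and each term is assumed to have at most one hypernym (a tree), whereas a real semantic network may allow several.

```python
from collections import Counter

# Hypothetical toy hierarchy (assumption): hyponym -> its single hypernym.
HYPERNYM = {
    "wasp": "insect",
    "bee": "insect",
    "insect": "animal",
    "dog": "animal",
}


def cumulative_frequencies(freq, hypernym):
    """cum f(k_i) = f(k_i) + sum of cum f(k_j) over the direct hyponyms k_j."""
    children = {}
    for child, parent in hypernym.items():
        children.setdefault(parent, []).append(child)

    cache = {}

    def cum(term):
        if term not in cache:
            own = freq.get(term, 0)
            cache[term] = own + sum(cum(c) for c in children.get(term, []))
        return cache[term]

    terms = set(freq) | set(hypernym) | set(hypernym.values())
    return {term: cum(term) for term in terms}


def build_target_lexicon(cum, lexicon_size):
    """Step 1: keep the terms with the highest cumulated frequencies."""
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(cum, key=lambda term: (-cum[term], term))
    return set(ranked[:lexicon_size])


def build_mapping(terms, lexicon, hypernym):
    """Step 2: map each term to its nearest ancestor inside the lexicon."""
    mapping = {}
    for term in terms:
        target = term
        while target not in lexicon and target in hypernym:
            target = hypernym[target]
        # A term with no ancestor in the lexicon is left unchanged.
        mapping[term] = target if target in lexicon else term
    return mapping


def compress(tokens, mapping):
    """Replace every token by its mapped form; unknown tokens pass through."""
    return [mapping.get(tok, tok) for tok in tokens]


tokens = "bee wasp bee dog insect bee".split()
cum = cumulative_frequencies(Counter(tokens), HYPERNYM)
lexicon = build_target_lexicon(cum, lexicon_size=3)
mapping = build_mapping(cum, lexicon, HYPERNYM)
print(compress(tokens, mapping))
```

Because a hypernym's cumulated frequency is never lower than that of any of its hyponyms, general terms tend to enter the target lexicon first, which is why the less frequent words are pushed upwards in the hierarchy rather than dropped.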

Implicit semantic compression

A natural tendency to keep natural language expressions concise can be perceived as a form of implicit semantic compression, achieved by omitting unmeaningful words or redundant meaningful words (especially to avoid pleonasms).

Applications and advantages

In the vector space model, compacting a lexicon leads to a reduction of dimensionality, which results in lower computational complexity and a positive influence on efficiency. Semantic compression is advantageous in information retrieval tasks, improving their effectiveness in terms of both precision and recall [D. Ceglarek, K. Haniewicz, W. Rutkowski (2010). "Quality of semantic compression in classification". Proceedings of the 2nd International Conference on Computational Collective Intelligence: Technologies and Applications, vol. 1. Springer, pp. 162–171. ISBN 978-3-642-16692-1. https://dl.acm.org/doi/10.5555/1947662.1947683]. This is due to more precise descriptors (a reduced effect of language diversity, limited language redundancy, and a step towards a controlled dictionary). As in the example above, it is possible to display the output as natural text (re-applying inflection, adding stop words).
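
The effect on dimensionality can be illustrated with a small bag-of-words sketch. The `MAPPING` table below is a hypothetical compression mapping that loosely mirrors the example above (wasps and bees generalized to insects); the function names and the two toy documents are likewise assumptions made for the illustration.

```python
from collections import Counter

# Hypothetical compression mapping (assumption): hyponym -> hypernym.
MAPPING = {"wasp": "insect", "bee": "insect", "colony": "group"}


def compress(doc, mapping):
    """Replace each token by its hypernym when a mapping rule exists."""
    return [mapping.get(tok, tok) for tok in doc]


def vectorize(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


docs = [["wasp", "colony", "gene"], ["bee", "colony", "gene"]]

vocab_raw, vectors_raw = vectorize(docs)
vocab_cmp, vectors_cmp = vectorize([compress(d, MAPPING) for d in docs])

print(len(vocab_raw), "dimensions before compression")
print(len(vocab_cmp), "dimensions after compression")
```

In this toy case the two documents end up with identical vectors after compression. That makes them easier to match against each other, and it also shows why the compression is lossy: the distinction between wasps and bees is no longer recoverable.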

See also

* Controlled natural language
* Information theory
* Lexical substitution
* Quantities of information
* Text simplification


External links

Semantic compression on Project SENECA (Semantic Networks and Categorization) website