Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained by comparing the information that supports their meaning or describes their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".

Computationally, semantic similarity can be estimated by defining a topological similarity, using ontologies to define the distance between terms/concepts. For example, a naive metric for the comparison of concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy) would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus.

Proposed semantic similarity / relatedness measures are evaluated in two main ways. The first is based on the use of datasets designed by experts and composed of word pairs with semantic similarity / relatedness degree estimation. The second is based on the integration of the measures inside specific applications such as information retrieval, recommender systems, natural language processing, etc.
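As a minimal sketch of the naive shortest-path metric described above, the following Python snippet counts the edges between two concepts in a tiny taxonomy. The taxonomy `IS_A` and the function names are hypothetical and chosen only for illustration; "is a" links are treated as undirected when searching for the path.

```python
from collections import deque

# Hypothetical "is a" taxonomy: each concept maps to its direct parents.
IS_A = {
    "car": ["motor vehicle"],
    "bus": ["motor vehicle"],
    "motor vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": [],
}


def build_undirected(is_a):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {node: set() for node in is_a}
    for child, parents in is_a.items():
        for parent in parents:
            graph.setdefault(parent, set()).add(child)
            graph[child].add(parent)
    return graph


def path_length(is_a, source, target):
    """Return the number of edges on the shortest path, or None if unreachable."""
    graph = build_undirected(is_a)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None
```

In this toy taxonomy, "car" and "bus" are two edges apart (via "motor vehicle"), while "car" and "bicycle" are three edges apart, so the shorter path reflects the closer "is a" relationship.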

Terminology

The concept of semantic similarity is more specific than semantic relatedness, as the latter includes concepts such as antonymy and meronymy, while similarity does not. However, much of the literature uses these terms interchangeably, along with terms like semantic distance. In essence, semantic similarity, semantic distance, and semantic relatedness all mean, "How much does term A have to do with term B?" The answer to this question is usually a number between -1 and 1, or between 0 and 1, where 1 signifies extremely high similarity.

Visualization

An intuitive way of visualizing the semantic similarity of terms is by grouping together terms which are closely related and spacing wider apart the ones which are distantly related. This is also common in practice for mind maps and concept maps.

A more direct way of visualizing the semantic similarity of two linguistic items can be seen with the Semantic Folding approach. In this approach a linguistic item such as a term or a text can be represented by generating a pixel for each of its active semantic features in e.g. a 128 x 128 grid. This allows for a direct visual comparison of the semantics of two items by comparing image representations of their respective feature sets.
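The grid representation can be illustrated with a small Python sketch. The feature positions below are invented for illustration, and counting shared active cells is only one simple way to compare two such bitmaps; it is not claimed to be the comparison used by any particular Semantic Folding implementation.

```python
GRID_SIZE = 128  # grid width and height, as in the 128 x 128 example


def to_bitmap(active_cells, size=GRID_SIZE):
    """Render a set of (row, col) active features as a size x size 0/1 grid."""
    bitmap = [[0] * size for _ in range(size)]
    for row, col in active_cells:
        bitmap[row][col] = 1
    return bitmap


def shared_cells(bitmap_a, bitmap_b):
    """Count grid cells that are active in both bitmaps."""
    return sum(
        a & b
        for row_a, row_b in zip(bitmap_a, bitmap_b)
        for a, b in zip(row_a, row_b)
    )


# Hypothetical active features for two terms (positions are made up).
car = to_bitmap({(0, 0), (0, 1), (5, 7), (100, 20)})
bus = to_bitmap({(0, 1), (5, 7), (64, 64)})
```

With these made-up features, `shared_cells(car, bus)` counts the two cells that are active in both grids, which is what a reader would see as overlapping pixels when the two images are laid on top of each other.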

Applications

In biomedical informatics

Semantic similarity measures have been applied and developed in biomedical ontologies. They are mainly used to compare genes and proteins based on the similarity of their functions rather than on their sequence similarity, but they are also being extended to other bioentities, such as diseases. These comparisons can be done using tools freely available on the web:

* ProteInOn can be used to find interacting proteins, find assigned GO terms and calculate the functional semantic similarity of UniProt proteins, and to get the information content and calculate the functional semantic similarity of GO terms.
* CMPSim provides a functional similarity measure between chemical compounds and metabolic pathways using ChEBI-based semantic similarity measures.
* CESSM provides a tool for the automated evaluation of GO-based semantic similarity measures.

In geoinformatics

Similarity is also applied in geoinformatics to find similar geographic features or feature types:

* SIM-DL similarity server can be used to compute similarities between concepts stored in geographic feature type ontologies.
* Similarity Calculator can be used to compute how well related two geographic concepts are in the Geo-Net-PT ontology.
* The OSM semantic network can be used to compute the semantic similarity of tags in OpenStreetMap.

In computational linguistics

Several metrics use WordNet, a manually constructed lexical database of English words. Despite the advantages of having human supervision in constructing the database, the words are not automatically learned, so the database cannot measure relatedness between multi-word terms and its vocabulary is non-incremental.

In natural language processing

Natural language processing (NLP) is a field of computer science and linguistics. Sentiment analysis, natural language understanding and machine translation (automatically translating text from one human language to another) are a few of the major areas where it is being used. For example, knowing one information resource on the internet, it is often of immediate interest to find similar resources. The Semantic Web provides semantic extensions to find similar data by content and not just by arbitrary descriptors.

Deep learning methods have become an accurate way to gauge semantic similarity between two text passages, in which each passage is first embedded into a continuous vector representation.
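Once two passages have been embedded as vectors, a common way to compare them is cosine similarity, which yields a score between -1 and 1. The sketch below assumes the embeddings are already available as plain lists of floats; the three-dimensional vectors are invented for illustration and do not come from any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Hypothetical passage embeddings (real embeddings have many more dimensions).
passage_a = [0.2, 0.7, 0.1]
passage_b = [0.25, 0.6, 0.05]
passage_c = [0.9, -0.3, 0.4]

score_ab = cosine_similarity(passage_a, passage_b)
score_ac = cosine_similarity(passage_a, passage_c)
```

Here `score_ab` comes out higher than `score_ac` because the first two vectors point in nearly the same direction.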

Measures

Topological similarity

There are essentially two types of approaches that calculate topological similarity between ontological concepts:

* Edge-based: which use the edges and their types as the data source;
* Node-based: in which the main data sources are the nodes and their properties.

Other measures calculate the similarity between ontological instances:

* Pairwise: measure functional similarity between two instances by combining the semantic similarities of the concepts they represent
* Groupwise: calculate the similarity directly, without combining the semantic similarities of the concepts they represent

Some examples:

Edge-based

* Pekar et al.
* Cheng and Cline
* Wu et al.
* Del Pozo et al.
* IntelliGO: Benabderrahmane et al.

Node-based

* Resnik
** based on the notion of information content. The information content of a concept (term or word) is the negative logarithm of the probability of finding the concept in a given corpus.
** only considers the information content of the lowest common subsumer (lcs). A lowest common subsumer is a concept in a lexical taxonomy (e.g. WordNet) which has the shortest distance from the two concepts compared. For example, animal and mammal are both subsumers of cat and dog, but mammal is a lower subsumer than animal for them.
* Lin
** based on Resnik's similarity.
** considers the information content of the lowest common subsumer (lcs) and the two compared concepts.
* Maguitman, Menczer, Roinestad and Vespignani
** generalizes Lin's similarity to arbitrary ontologies (graphs).
* Jiang and Conrath
** based on Resnik's similarity.
** considers the information content of the lowest common subsumer (lcs) and the two compared concepts to calculate the distance between the two concepts. The distance is later used in computing the similarity measure.
* Align, Disambiguate, and Walk
** random walks on semantic networks.
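The information-content measures listed above can be sketched on a toy taxonomy. Everything below is an illustrative assumption: the taxonomy `PARENT`, the concept probabilities `PROB` and the function names are made up, and the lowest common subsumer is taken to be the shared ancestor with the highest information content. The Lin and Jiang–Conrath formulas used are the commonly cited forms, 2·IC(lcs) / (IC(a) + IC(b)) and IC(a) + IC(b) − 2·IC(lcs).

```python
import math

# Hypothetical taxonomy (child -> parent) and corpus probabilities.
PARENT = {
    "entity": None,
    "animal": "entity",
    "mammal": "animal",
    "bird": "animal",
    "cat": "mammal",
    "dog": "mammal",
}
PROB = {"entity": 1.0, "animal": 0.5, "mammal": 0.25,
        "bird": 0.1, "cat": 0.05, "dog": 0.1}


def information_content(concept):
    """IC(c) = -log p(c): rarer concepts carry more information."""
    return -math.log(PROB[concept])


def subsumers(concept):
    """The concept itself plus all of its ancestors."""
    found = set()
    while concept is not None:
        found.add(concept)
        concept = PARENT[concept]
    return found


def lowest_common_subsumer(a, b):
    """Shared subsumer with the highest information content."""
    return max(subsumers(a) & subsumers(b), key=information_content)


def resnik(a, b):
    return information_content(lowest_common_subsumer(a, b))


def lin(a, b):
    total = information_content(a) + information_content(b)
    if total == 0:
        return 1.0  # both concepts have probability 1; treated as identical here
    return 2 * resnik(a, b) / total


def jiang_conrath_distance(a, b):
    return information_content(a) + information_content(b) - 2 * resnik(a, b)
```

With these made-up probabilities, the lowest common subsumer of "cat" and "dog" is "mammal", so their Resnik similarity is higher than that of "cat" and "bird", whose only informative shared subsumer is "animal".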

Node-and-Relation-Content-based

* applicable to ontology
* consider properties (content) of nodes
* consider types (content) of relations
* based on eTVSM
* based on Resnik's similarity

Pairwise

* maximum of the pairwise similarities
* composite average in which only the best-matching pairs are considered (best-match average)
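Both pairwise strategies can be sketched as follows. The concept names and the similarity table `SIM` are hypothetical, and the best-match average shown is one common symmetric formulation (averaging the best match of each concept in both directions), not necessarily the one used by a specific tool.

```python
from statistics import mean

# Hypothetical concept-to-concept similarities between the annotations
# of instance X (A1, A2) and instance Y (B1, B2, B3).
SIM = {
    ("A1", "B1"): 0.9, ("A1", "B2"): 0.2, ("A1", "B3"): 0.1,
    ("A2", "B1"): 0.3, ("A2", "B2"): 0.5, ("A2", "B3"): 0.4,
}


def concept_sim(a, b):
    """Look up the (assumed precomputed) similarity of two concepts."""
    return SIM[(a, b)]


def pairwise_max(concepts_x, concepts_y, sim):
    """Maximum similarity over all concept pairs."""
    return max(sim(a, b) for a in concepts_x for b in concepts_y)


def best_match_average(concepts_x, concepts_y, sim):
    """Average the best match of every concept, in both directions."""
    best_for_x = mean([max(sim(a, b) for b in concepts_y) for a in concepts_x])
    best_for_y = mean([max(sim(a, b) for a in concepts_x) for b in concepts_y])
    return (best_for_x + best_for_y) / 2


x_concepts = ["A1", "A2"]
y_concepts = ["B1", "B2", "B3"]
```

For this table the maximum is driven entirely by the single strongest pair (A1, B1), whereas the best-match average also reflects how well the remaining concepts are matched, giving a lower score of about 0.65.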

Groupwise

* Jaccard index
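A groupwise comparison with the Jaccard index treats each instance as the set of concepts it is annotated with and divides the size of the intersection by the size of the union. The annotation sets below are hypothetical.

```python
def jaccard_index(set_a, set_b):
    """|A ∩ B| / |A ∪ B| for two collections of concept identifiers."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Hypothetical annotation sets for two instances.
instance_x = {"binding", "transport", "signalling"}
instance_y = {"transport", "signalling", "catalysis"}
```

Here the two instances share two of four distinct concepts, so `jaccard_index(instance_x, instance_y)` is 0.5. Unlike the pairwise measures, no concept-to-concept similarity is needed.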

Statistical similarity

Statistical similarity approaches can be learned from data, or predefined. Similarity learning can often outperform predefined similarity measures. Broadly speaking, these approaches build a statistical model of documents, and use it to estimate similarity.

* LSA (Latent semantic analysis): (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
* PMI (Pointwise mutual information): (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents
* SOC-PMI (Second-order co-occurrence pointwise mutual information): (+) sorts lists of important neighbor words from a large corpus; (−) cannot measure relatedness between whole sentences or documents
* GLSA (Generalized Latent Semantic Analysis): (+) vector-based, adds vectors to measure multi-word terms; (−) non-incremental vocabulary, long pre-processing times
* ICAN (Incremental Construction of an Associative Network): (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (−) cannot measure relatedness between multi-word terms, long pre-processing times
* NGD (Normalized Google distance): (+) large vocab, because it uses any search engine (like Google); (−) can measure relatedness between whole sentences or documents, but the larger the sentence or document, the more ingenuity is required (Cilibrasi & Vitanyi, 2007)
* TSS (Twitter Semantic Similarity): large vocab, because it uses online tweets from Twitter to compute the similarity; its high temporal resolution allows it to capture high-frequency events; open source
* NCD (Normalized Compression Distance)
* ESA (Explicit Semantic Analysis): based on Wikipedia and the ODP
* SSA (Salient Semantic Analysis): indexes terms using salient concepts found in their immediate context
* n° of Wikipedia (noW): inspired by the game Six Degrees of Wikipedia, a distance metric based on the hierarchical structure of Wikipedia. A directed acyclic graph is first constructed, and Dijkstra's shortest path algorithm is then employed to determine the noW value between two terms as the geodesic distance between the corresponding topics (i.e. nodes) in the graph.
* VGEM (Vector Generation of an Explicitly-defined Multidimensional Semantic Space): (+) incremental vocab, can compare multi-word terms; (−) performance depends on choosing specific dimensions
* SimRank
* NASARI: sparse vector representations constructed by applying the hypergeometric distribution over the Wikipedia corpus in combination with the BabelNet taxonomy. Cross-lingual similarity is currently also possible thanks to the multilingual and unified extension.
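Three of the measures above reduce to short formulas, sketched here under stated assumptions. PMI and NGD take occurrence counts (for example, search-engine hit counts) as inputs, and any numbers passed in are the caller's own data. NCD is approximated with `zlib` as the compressor, which is one possible choice among many. The function names are illustrative.

```python
import math
import zlib


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    return math.log2((count_xy * total) / (count_x * count_y))


def ngd(hits_x, hits_y, hits_xy, index_size):
    """Normalized Google distance from hit counts and the index size."""
    log_x, log_y = math.log(hits_x), math.log(hits_y)
    log_xy = math.log(hits_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(index_size) - min(log_x, log_y))


def compressed_size(data):
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    size_x, size_y = compressed_size(x), compressed_size(y)
    size_xy = compressed_size(x + y)
    return (size_xy - min(size_x, size_y)) / max(size_x, size_y)


# Made-up texts for the compression-based comparison.
pets = b"the cat sat on the mat and looked at the dog. " * 20
finance = b"stock markets fell sharply after the central bank raised rates. " * 20
```

A PMI of 0 means the two terms co-occur exactly as often as chance would predict, and an NGD of 0 means they always occur together. For NCD, comparing `pets` with itself gives a much smaller value than comparing `pets` with `finance`, because the compressor can reuse almost everything from the first copy.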

Semantics-based similarity

* Marker Passing: combining lexical decomposition for automated ontology creation with marker passing, the approach of Fähndrich et al. introduces a new type of semantic similarity measure. Markers carrying an amount of activation are passed from the two target concepts. This activation may increase or decrease depending on the weight of the relations by which the concepts are connected. The approach combines edge-based and node-based approaches and includes connectionist reasoning with symbolic information.
* Good Common Subsumer (GCS)-based semantic similarity measure

Gold standards

Researchers have collected datasets with similarity judgements on pairs of words, which are used to evaluate the cognitive plausibility of computational measures. The gold standard to date is an old list of 65 word pairs for which humans have judged the word similarity. For a list of datasets and an overview of the state of the art, see https://www.aclweb.org/

* RG65
* MC30
* WordSim353

References

Sources

* Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.
* Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. In B. G. Bara, L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1254–1259). Austin, Tx: The Cognitive Science Society, Inc.
* Lemaire, B., & Denhiére, G. (2004). Incremental construction of an associative network from a corpus. In K. D. Forbus, D. Gentner & T. Regier (Eds.), 26th Annual Meeting of the Cognitive Science Society, CogSci2004. Hillsdale, NJ: Lawrence Erlbaum Publisher.
* Navigli, R., Lapata, M. (2010). An Experimental Study of Graph Connectivity for Unsupervised Word Sense Disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(4), IEEE Press, pp. 678–692.
* Wong, W., Liu, W. & Bennamoun, M. (2008). Featureless Data Clustering. In: M. Song and Y. Wu; Handbook of Research on Text and Web Mining Technologies; IGI Global. (the use of NGD and noW for term and URI clustering)

External links

List of related literature

Survey articles

* Conference article: C. d'Amato, S. Staab, N. Fanizzi. 2008. On the Influence of Description Logics Ontologies on Conceptual Similarity. In Proceedings of the 16th international conference on Knowledge Engineering: Practice and Patterns, pages 48–63. Acitrezza, Italy, Springer-Verlag.
* Journal article on the more general topic of relatedness, also including similarity: Z. Zhang, A. Gentile, F. Ciravegna. 2013. Recent advances in methods of lexical semantic relatedness - a survey. Natural Language Engineering 19 (4), 411–479, Cambridge University Press.
* Book: S. Harispe, S. Ranwez, S. Janaqi, J. Montmain. 2015. Semantic Similarity from Natural Language and Ontology Analysis. Morgan & Claypool Publishers.