SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model.
SimRank is applicable in any domain with object-to-object relationships; it measures the similarity of the structural context in which objects occur, based on their relationships with other objects.
Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores under the influence of various factors. Several remedies have been proposed, such as introducing an evidence weight factor,[I. Antonellis, H. Garcia-Molina and C.-C. Chang. Simrank++: Query Rewriting through Link Analysis of the Click Graph. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 408--421] inserting additional terms that are neglected by SimRank, or using PageRank-based alternatives.[H. Chen and C. L. Giles. "ASCOS++: An Asymmetric Similarity Measure for Weighted Networks to Address the Problem of SimRank." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.2 201]
Introduction
Many applications require a measure of "similarity" between objects.
One obvious example is the "find-similar-document" query, on traditional text corpora or the World Wide Web.
More generally, a similarity measure can be used to cluster objects, such as for collaborative filtering in a recommender system, in which “similar” users and items are grouped based on the users’ preferences.
Various aspects of objects can be used to determine similarity, usually depending on the domain and the appropriate definition of similarity for that domain.
In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences.
SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest.
On the Web, for example, two pages are related if there are hyperlinks between them.
A similar approach can be applied to scientific papers and their citations, or to any other document corpus with cross-reference information.
In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item.
Such domains are naturally modeled as graphs, with nodes representing objects and edges representing relationships.
The intuition behind the SimRank algorithm is that, in many domains, similar objects are referenced by similar objects.
More precisely, objects $a$ and $b$ are considered to be similar if they are pointed to by objects $c$ and $d$, respectively, and $c$ and $d$ are themselves similar.
The base case is that objects are maximally similar to themselves.[G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In KDD'02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538-543. ACM Press, 2002.]
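The recursive intuition above can be sketched as a naive fixed-point iteration. This is only an illustrative sketch, not a reference implementation: the function name `simrank`, the decay constant `decay=0.8`, the fixed iteration count, and the toy graph are all assumptions that the text above does not specify. The update averages the similarity over all pairs of in-neighbors and scales it by the decay constant, following the formulation in the cited Jeh and Widom paper.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors maps every node to the list of nodes pointing to it.
    decay and iterations are illustrative defaults (assumptions).
    """
    nodes = list(in_neighbors)
    # Base case: every object is maximally similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0
           for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # Nothing points to one of the nodes, so there is no evidence.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(c, d)] for c in ia for d in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical toy graph: "hub" points to both "a" and "b"; "c" has no in-links.
toy_graph = {"hub": [], "a": ["hub"], "b": ["hub"], "c": []}
scores = simrank(toy_graph)
```

In this toy graph, `a` and `b` are referenced by the same object, so they receive a positive score, while `c` has no in-neighbors and therefore scores zero against every other node. The pairwise loop is quadratic in the number of nodes and is meant only to make the recursion concrete.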
It is important to note that SimRank is a general algorithm that determines only the similarity of structural context.
SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships.
Similarity of other domain-specific aspects is important as well; these can, and should, be combined with relational structural-context similarity for an overall similarity measure.
For example, for Web pages, SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora.
For recommendation systems, there may be built-in known similarities between items (e.g., both computers, both clothing, etc.), as well as similarities between users (e.g., same gender, same spending level).
Again, these similarities can be combined with the similarity scores that are computed based on preference patterns, in order to produce an overall similarity measure.
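The text does not prescribe how the scores should be combined. One simple possibility, shown here purely as an assumption, is a weighted average of the structural-context score and a domain-specific score. The function name `combined_similarity` and the weight `alpha` are hypothetical.

```python
def combined_similarity(structural, domain, alpha=0.5):
    """Blend two similarity scores in [0, 1] with a weighted average.

    alpha is a hypothetical tuning weight: 1.0 uses only the
    structural-context score, 0.0 uses only the domain-specific score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * structural + (1.0 - alpha) * domain


# Example: a structural score of 0.8 blended equally with a textual score of 0.4.
overall = combined_similarity(structural=0.8, domain=0.4, alpha=0.5)
```

Because both inputs lie in [0, 1] and the weights sum to one, the blended score stays in the same range, which keeps it comparable with either input.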
Basic SimRank equation
For a node $v$ in a directed graph, we denote by $I(v)$ and $O(v)$ the set of in-neighbors and out-neighbors of $v$, respectively.
Individual in-neighbors are denoted as $I_i(v)$, for $1 \le i \le |I(v)|$, and individual out-neighbors are denoted as $O_i(v)$, for $1 \le i \le |O(v)|$.
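As a concrete reading of this notation, the in-neighbor and out-neighbor sets can be built from a list of directed edges. The sketch below is illustrative only; the helper name `neighbor_sets` and the example edges are assumptions.

```python
from collections import defaultdict


def neighbor_sets(edges):
    """Build in-neighbor and out-neighbor lists from (source, target) edges."""
    in_nb = defaultdict(list)   # in_nb[v] lists the nodes pointing to v
    out_nb = defaultdict(list)  # out_nb[v] lists the nodes v points to
    for src, dst in edges:
        out_nb[src].append(dst)
        in_nb[dst].append(src)
    return in_nb, out_nb


# Hypothetical directed graph: u -> v, w -> v, v -> x.
edges = [("u", "v"), ("w", "v"), ("v", "x")]
in_nb, out_nb = neighbor_sets(edges)
```

Here `in_nb["v"]` holds the two nodes that point to `v`, so its length is 2, and `out_nb["v"]` holds the single node that `v` points to. List positions are zero-based in Python, whereas the individual neighbors in the text are indexed from 1.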
Let us denote the similarity between objects $a$ and $b$ by $s(a, b) \in [0, 1]$.