SimRank is a general

similarity measure In statistics and related fields, a similarity measure or similarity function or similarity metric is a real-valued function that quantifies the similarity between two objects. Although no single definition of a similarity exists, usually such mea ...

, based on a simple and intuitive graph-theoretic model. SimRank is applicable in any

domain A domain is a geographic area controlled by a single person or organization. Domain may also refer to: Law and human geography * Demesne, in English common law and other Medieval European contexts, lands directly managed by their holder rather ...

with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores which are influenced by different factors, and can be solved in several ways, such as introducing an evidence weight factor,I. Antonellis, H. Garcia-Molina and C.-C. Chang. Simrank++: Query Rewriting through Link Analysis of the Click Graph. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 408--421

/ref> inserting additional terms that are neglected by SimRank or using PageRank-based alternatives.H. Chen, and C. L. Giles. "ASCOS++: An Asymmetric Similarity Measure for Weighted Networks to Address the Problem of SimRank." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.2 201

/ref>

Introduction

Many Application software, applications require a measure of "similarity" between objects. One obvious example is the "find-similar-document" query, on traditional text corpora or the

World-Wide Web The World Wide Web (WWW or simply the Web) is an information system that enables Content (media), content sharing over the Internet through user-friendly ways meant to appeal to users beyond Information technology, IT specialists and hobbyis ...

. More generally, a

can be used to cluster objects, such as for

collaborative filtering Collaborative filtering (CF) is, besides content-based filtering, one of two major techniques used by recommender systems.Francesco Ricci and Lior Rokach and Bracha ShapiraIntroduction to Recommender Systems Handbook, Recommender Systems Handbo ...

in a

recommender system A recommender system (RecSys), or a recommendation system (sometimes replacing ''system'' with terms such as ''platform'', ''engine'', or ''algorithm'') and sometimes only called "the algorithm" or "algorithm", is a subclass of information fi ...

, in which “similar” users and items are grouped based on the users’ preferences. Various aspects of objects can be used to determine similarity, usually depending on the domain and the appropriate definition of similarity for that domain. In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences. SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest. On the

Web Web most often refers to: * Spider web, a silken structure created by the animal * World Wide Web or the Web, an Internet-based hypertext system Web, WEB, or the Web may also refer to: Computing * WEB, a literate programming system created by ...

, for example, two pages are related if there are

hyperlink In computing, a hyperlink, or simply a link, is a digital reference providing direct access to Data (computing), data by a user (computing), user's point and click, clicking or touchscreen, tapping. A hyperlink points to a whole document or to ...

s between them. A similar approach can be applied to scientific papers and their citations, or to any other document corpus with

cross-reference The term cross-reference (abbreviation: xref) can refer to either: * An instance within a document which refers to related information elsewhere in the same document. In both printed and online dictionaries cross-references are important because ...

information. In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item. Such domains are naturally modeled as

graphs Graph may refer to: Mathematics *Graph (discrete mathematics), a structure made of vertices and edges **Graph theory, the study of such graphs and their properties * Graph (topology), a topological space resembling a graph in the sense of discre ...

, with nodes representing objects and edges representing relationships. The intuition behind the SimRank algorithm is that, in many domains, similar objects are referenced by similar objects. More precisely, objects

a

and

b

are considered to be similar if they are pointed from objects

c

and

d

, respectively, and

c

and

d

are themselves similar. The base case is that objects are maximally similar to themselves .G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In KDD'02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538-543. ACM Press, 2002. It is important to note that SimRank is a general algorithm that determines only the similarity of structural context. SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships. Obviously, similarity of other domain-specific aspects are important as well; these can — and should be combined with relational structural-context similarity for an overall similarity measure. For example, for

Web page A web page (or webpage) is a World Wide Web, Web document that is accessed in a web browser. A website typically consists of many web pages hyperlink, linked together under a common domain name. The term "web page" is therefore a metaphor of pap ...

s SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora. For recommendation systems, there may be built-in known similarities between items (e.g., both computers, both clothing, etc.), as well as similarities between users (e.g., same gender, same spending level). Again, these similarities can be combined with the similarity scores that are computed based on preference patterns, in order to produce an overall similarity measure.

Basic SimRank equation

For a node

v

in a directed graph, we denote by

I(v)

and

O(v)

the set of in-neighbors and out-neighbors of

v

, respectively. Individual in-neighbors are denoted as

I_i(v)

, for

1 \le i \le \left, I(v)\

, and individual out-neighbors are denoted as

O_i(v)

, for

1 \le i \le \left, O(v)\

. Let us denote the similarity between objects

a

and

b

s(a, b) \in

, 1 The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...

/math>. Following the earlier motivation, a recursive equation is written for

s(a, b)

. If

a = b

then

s(a, b)

is defined to be

1

. Otherwise, :

s(a, b) = \frac
 \sum_^\sum_^
 s(I_i(a), I_j(b))

where

C

is a constant between

0

and

1

. A slight technicality here is that either

a

b

may not have any in-neighbors. Since there is no way to infer any similarity between

a

and

b

in this case, similarity is set to

s(a, b) = 0

, so the summation in the above equation is defined to be

0

when

I(a) = \emptyset

I(b) = \emptyset

Matrix representation of SimRank

Given an arbitrary constant

C

between

0

and

1

, let

\mathbf

be the similarity matrix whose entry

mathbf

denotes the similarity score

s(a,b)

, and

\mathbf

be the column normalized

adjacency matrix In graph theory and computer science, an adjacency matrix is a square matrix used to represent a finite graph (discrete mathematics), graph. The elements of the matrix (mathematics), matrix indicate whether pairs of Vertex (graph theory), vertices ...

whose entry

=\tfrac

if there is an edge from

a

b

, and 0 otherwise. Then, in matrix notations, SimRank can be formulated as :

= \max\,

where

\mathbf

is an identity matrix.

Computing SimRank

A solution to the SimRank equations for a graph

G

can be reached by

iteration Iteration is the repetition of a process in order to generate a (possibly unbounded) sequence of outcomes. Each repetition of the process is a single iteration, and the outcome of each iteration is then the starting point of the next iteration. ...

to a fixed-point. Let

n

be the number of nodes in

G

. For each iteration

k

, we can keep

n^2

entries

s_k(*, *)

, where

s_k(a, b)

gives the score between

a

and

b

on iteration

k

. We successively compute

s_(*, *)

based on

s_k(*, *)

. We start with

s_0(*, *)

where each

s_0(a, b)

is a lower bound on the actual SimRank score

s(a, b)

: :

s_0(a, b) =
 \begin
  1 \mbox , \mbox \mbox a = b  \mbox , \\
  0 \mbox , \mbox \mbox a \neq b \mbox .
 \end

To compute

s_(a, b)

from

s_k(*, *)

, we use the basic SimRank equation to get: :

s_(a, b) = 
 \frac
 \sum_^\sum_^
  s_k(I_i(a), I_j(b))

for

a \ne b

, and

s_(a, b) = 1

for

a = b

. That is, on each iteration

k + 1

, we update the similarity of

(a, b)

using the similarity scores of the neighbours of

(a, b)

from the previous iteration

k

according to the basic SimRank equation. The values

s_k(*, *)

are nondecreasing as

k

increases. It was shown in that the values

converge Converge may refer to: * Converge (band), American hardcore punk band * Converge (Baptist denomination), American national evangelical Baptist body * Limit (mathematics) In mathematics, a limit is the value that a function (or sequence) app ...

limits Limit or Limits may refer to: Arts and media * ''Limit'' (manga), a manga by Keiko Suenobu * ''Limit'' (film), a South Korean film * Limit (music), a way to characterize harmony * "Limit" (song), a 2016 single by Luna Sea * "Limits", a 2009 ...

satisfying the basic SimRank equation, the SimRank scores

s(*, *)

, i.e., for all

a, b \in V

\lim_ s_k(a, b) = s(a, b)

. The original SimRank proposal suggested choosing the decay factor

C = 0.8

and a fixed number

K = 5

of iterations to perform. However, the recent research D. Lizorkin, P. Velikhov, M. Grinev and D. Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 422--433. showed that the given values for

C

and

K

generally imply relatively low

accuracy Accuracy and precision are two measures of ''observational error''. ''Accuracy'' is how close a given set of measurements (observations or readings) are to their ''true value''. ''Precision'' is how close the measurements are to each other. The ...

of iteratively computed SimRank scores. For guaranteeing more accurate computation results, the latter paper suggests either using a smaller decay factor (in particular,

C = 0.6

) or taking more iterations.

CoSimRank

CoSimRank is a variant of SimRank with the advantage of also having a local formulation, i.e. CoSimRank can be computed for a single node pair.S. Rothe and H. Schütze. CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure. In ACL '14: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1392-1402

/ref> Let

\mathbf

be the similarity matrix whose entry

mathbf

denotes the similarity score

s(a,b)

, and

\mathbf

be the column normalized adjacency matrix. Then, in matrix notations, CoSimRank can be formulated as: :

= C\cdot (\mathbf^ \cdot \cdot  ) + ,

where

\mathbf

is an identity matrix. To compute the similarity score of only a single node pair, let

p^(i) = e_i

, with

e_i

being a vector of the standard basis, i.e., the

i

-th entry is 1 and all other entries are 0. Then, CoSimRank can be computed in two steps: #

p^ = A p^

s(i,j) = \sum_^ C^k \langle p^(i), p^(j) \rangle

Step one can be seen a simplified version of Personalized

PageRank PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. Accordin ...

. Step two sums up the vector similarity of each iteration. Both, matrix and local representation, compute the same similarity score. CoSimRank can also be used to compute the similarity of sets of nodes, by modifying

p^(i)

Further research on SimRank

* Fogaras and Racz D. Fogaras and B. Racz. Scaling link-based similarity search. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 641--650, New York, NY, USA, 2005. ACM

/ref> suggested speeding up SimRank computation through Probability theory, probabilistic computation using the

Monte Carlo method Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be ...

. * Antonellis et al.Antonellis, Ioannis, Hector Garcia Molina, and Chi Chao Chang. "Simrank++: query rewriting through link analysis of the click graph." Proceedings of the VLDB Endowment 1.1 (2008): 408-421. extended SimRank equations to take into consideration (i) evidence factor for incident nodes and (ii) link weights. * Yu et al.W. Yu, X. Lin, W. Zhang. Towards Efficient SimRank Computation on Large Networks. In ICDE '13: Proceedings of the 29th IEEE International Conference on Data Engineering, pages 601--612. further improved SimRank computation via a fine-grained

memoization In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls to pure functions and returning the cached result when the same inputs occur ag ...

method to share small common parts among different partial sums. * Chen and Giles discussed the limitations and proper use cases of SimRank.

Partial Sums Memoization

Lizorkin et al. proposed three optimization techniques for speeding up the computation of SimRank: # Essential nodes selection may eliminate the computation of a fraction of node pairs with a-priori zero scores. # Partial sums memoization can effectively reduce repeated calculations of the similarity among different node pairs by caching part of similarity summations for later reuse. # A threshold setting on the similarity enables a further reduction in the number of node pairs to be computed. In particular, the second observation of partial sums memoization plays a paramount role in greatly speeding up the computation of SimRank from

\mathcal(Kd^2n^2)

\mathcal(Kdn^2)

, where

K

is the number of iterations,

d

is average degree of a graph, and

n

is the number of nodes in a graph. The central idea of partial sums memoization consists of two steps: First, the partial sums over

I(a)

are memoized as :

\text_^(j)=\sum_s_(i,j), \qquad (\forall j \in I(b))

and then

s_ (a,b)

is iteratively computed from

\text_^(j)

as :

s_( a,b )=\frac\sum_ \text_^(j).

Consequently, the results of

\text_^(j)

\forall j \in I(b)

, can be reused later when we compute the similarities

s_(a,*)

for a given vertex

a

as the first argument.

Citations

Sources

* {{Cite book, last1=Cai, first1=Y., last2=Cong, first2=G., last3=Jia, first3=X., last4=Liu, first4=H., last5=He, first5=J., last6=Lu, first6=J., last7=Du, first7=X., title=2009 Ninth IEEE International Conference on Data Mining , chapter=Efficient Algorithm for Computing Link-Based Similarity in Real World Networks , date=2009-12-01, pages=734–739, doi=10.1109/ICDM.2009.136, isbn=978-1-4244-5242-2, s2cid=9799597 , chapter-url=https://zenodo.org/record/890821 Cluster analysis algorithms Similarity measures

Introduction

Basic SimRank equation

Matrix representation of SimRank

Computing SimRank

CoSimRank

Further research on SimRank

Partial Sums Memoization

See also

Citations

Sources