FaRS: A High-Performance Automorphism-Aware Algorithm for Graph Similarity Matching

Fan Wang, Weiren Yu, Hai H. Wang and Victor Chang

Aston University, Birmingham B4 7ET, U.K.

University of Warwick CV4 7AL, U.K.

Keywords:

Web Search, Similarity Search, Link Analysis.

Abstract:

Role-based similarity search, predicated on the topological structure of graphs, is a highly effective and widely

applicable technique for various real-world information extraction applications. Although the prominent role-

based similarity algorithm, RoleSim, successfully provides the automorphic (role) equivalence of similarity

between pairs of nodes, it does not effectively differentiate nodes that exhibit exact automorphic equivalence

but differ in terms of structural equivalence within a given graph. This limitation arises from disregarding most

adjacency similarity information between pairs of nodes during the RoleSim computation. To address this

research gap, we propose a novel single-source role similarity search algorithm, named FaRS, which employs

the top Γ maximum similarity matching technique to capture more information from the classes of neighboring

nodes, ensuring both automorphic equivalence and structural equivalence of role similarity. Furthermore, we

establish the convergence of FaRS and demonstrate its adherence to various axioms, including uniqueness,

symmetry, boundedness, and triangular inequality. Additionally, we introduce the Opt FaRS algorithm, which

optimizes the computation of FaRS through two acceleration components: path extraction tracking and pre-

computation (P-speedup and Out-speedup approach). Experimental results on real datasets demonstrate that

FaRS and Opt FaRS outperform baseline algorithms in terms of both accuracy and efﬁciency.

1 INTRODUCTION

In the era of information technology, the research con-

cerning the extraction of valuable information from

topological structures has witnessed a substantial in-

crease (Rao et al., 2009; Shahabi et al., 2001; Yang,

2022; Wang et al., 2018; Li et al., 2015). Role simi-

larity analysis (Everett, 1985) is one of the signiﬁcant

techniques for analyzing complex graph structures,

particularly social networks. It enables the accurate

identiﬁcation of role equivalence, also known as auto-

morphic equivalence, of pairwise nodes within a net-

work. The basic principle of role similarity search is

that two nodes have a similar role only if they inter-

act with similar objects. To demonstrate the practical

application and importance of role and role similarity

detection based on graph topology in our daily life,

we present an example below. Figure 1 illustrates a

social network representing a project team, where the

nodes correspond to the employees involved in the

project, and the edges denote their interactions and

connections. The team is organized into three groups based on their assigned tasks, and each group further comprises three job levels: project managers (level 1), senior employees (level 2), and junior staff (level 3). An employee’s role within the project is determined by his/her position.

Figure 1: A Social Network of A Project Team.

The primary objective of a role similarity detec-

tion algorithm is to efﬁciently and accurately iden-

tify nodes in a social network that fulﬁll similar roles

as a given query node. This algorithm takes a spe-

ciﬁc node as the query and computes role similar-

ity scores between that node and all other nodes

in the graph, which is commonly referred to as a

single-source role similarity search algorithm. For instance, if node J3 is selected as the query node, nodes [J4, J5, J6, J7, J1, J2] are expected to exhibit higher role similarity scores with J3 compared to other nodes like S1 or M1, as they share similar roles within the project, specifically as junior staff. Role similarity detection finds practical applications in diverse domains, including co-authorship networks (Lee, 2012) and e-commerce website recommendation systems (Diao et al., 2019).

Wang, F., Yu, W., Wang, H. and Chang, V.

FaRS: A High-Performance Automorphism-Aware Algorithm for Graph Similarity Matching.

DOI: 10.5220/0012724000003708

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 9th International Conference on Complexity, Future Information Systems and Risk (COMPLEXIS 2024), pages 17-29

ISBN: 978-989-758-698-9; ISSN: 2184-5034

The RoleSim algorithm, developed by Jin (2014?), computes the role similarity score of a node pair by averaging the maximum weighted matching between the in-neighbors of the two nodes (Rothe and Schütze, 2014). However, despite successfully addressing pairwise automorphic equivalence, Jin’s RoleSim algorithm has several limitations, including limited accuracy and poor performance on large networks (Section 4).

To overcome the limitations of RoleSim, we

present a novel model, namely FaRS, for conducting

single-source role similarity searches based on graph

topology. FaRS exhibits the capability to efﬁciently

generate more accurate role similarity scores, even for

large graphs. Additionally, we introduce Opt FaRS,

an optimization algorithm for FaRS that signiﬁcantly

reduces computational costs further by minimizing

the number of calls to the maximum matching algo-

rithm. In summary, FaRS and Opt FaRS have the fol-

lowing distinctive characteristics:

• Accuracy: They can provide more precise role

classiﬁcation results compared to the best-known

existing competitors.

• Efﬁciency: FaRS is capable of efﬁciently com-

puting single-source role similarity search results

over extensive graphs. The performance is further

enhanced by Opt FaRS.

• Index-free: No index has to be precomputed or stored, so no additional disk space or memory is consumed by indexing results.

This paper is organized as follows. Section 2 reviews work derived from RoleSim, and Section 3 reviews the RoleSim algorithm itself (Rothe and Schütze, 2014). Section 4 presents an in-depth analysis of the limitations associated with RoleSim-like algorithms when applied to graphs. In Section 5, we introduce a novel role similarity search algorithm, FaRS, and provide proofs of its convergence, uniqueness, symmetry, boundedness, and triangle inequality. Furthermore, Section 6 presents the Opt FaRS algorithm, which optimizes the performance of FaRS. Finally, in Section 7, we conduct experiments using real datasets to demonstrate the superior performance of our FaRS and Opt FaRS algorithms in terms of both accuracy and efficiency when compared to state-of-the-art similarity search algorithms.

2 RELATED WORK

Several studies build on RoleSim (Shao et al., 2019; Chen et al., 2021; Chen et al., 2020). Shao et al. (2019) proposed a seedless de-anonymization method called RoleMatch,

which consists of two parts: the novel role similar-

ity detection algorithm RoleSim++, and the Neigh-

borMatch matching algorithm based on the calculated

role similarity score. RoleSim++ calculates the role

similarity value of a node pair based on the maxi-

mum matching value extracted from the in-neighbors’

and out-neighbors’ role similarity value matrix of the

node pair, which is different from the RoleSim al-

gorithm. To improve the computational efﬁciency

of RoleSim++, they proposed the α-RoleSim++ al-

gorithm, which extracts information only from node

pairs’ role similarity scores greater than the threshold,

and ignores other pairs.

Currently, StructSim (Chen et al., 2020) is the state-of-the-art algorithm for role similarity search. StructSim calculates the role similarity scores through the maximum matching value of the horizontal similarity between each k-neighborhood subgraph. To improve the computational efficiency of the StructSim algorithm, the maximum matching in the algorithm is replaced with the BinCount matching. In the BinCount matching algorithm, the index of the nodes of each layer needs to be recorded. To create this per-layer node index more efficiently, the Flajolet-Martin sketch is used.

3 PRELIMINARY

Here, we provide an overview of the formulae and

properties of the RoleSim algorithm. In the context

of network similarity analysis, there are four types

of equivalence that are considered: structural equiv-

alence, automorphic equivalence, exact coloration,

and regular equivalence (Rothe and Schütze, 2014). Among these, automorphic equivalence is the most fundamental type of equivalence for role similarity. The RoleSim algorithm (Rothe and Schütze, 2014) focuses on exploring the role similarity between pairs of

nodes in networks (graphs) and provides a real-valued

measure of role similarity that veriﬁes automorphic

equivalence. The algorithm is based on the recur-

sive principle that “two nodes share the same role if

they interact with equivalent sets of neighbors”. By

evaluating the role similarity scores and performing

role classiﬁcation based on how node pairs interact

with other nodes, the algorithm effectively captures

the role similarities. Before delving into the details

of RoleSim, we present the fundamental intuition and

various notations used throughout this paper.

In a directed graph $G = (V, E)$, $V$ and $E$ denote the vertices and edges in $G$, respectively. A node $u$ is an in-neighbor of node $v$ if $(u, v) \in E$. Similarly, $u$ is an out-neighbor of $v$ if $(v, u) \in E$. The sets of in-neighbors and out-neighbors of a node $v$ in the graph are denoted by $I(v)$ and $O(v)$, respectively. The in-degree and out-degree of a node $v$ are the numbers of in-neighbors and out-neighbors of $v$ and are denoted by $\deg^{-}_{v}$ and $\deg^{+}_{v}$, respectively. Furthermore, $\mathrm{mindeg}^{-}(u, v)$ is the smaller in-degree between node $u$ and node $v$, which can be expressed mathematically as $\mathrm{mindeg}^{-}(u, v) = \min(\deg^{-}_{u}, \deg^{-}_{v})$. Similarly, $\mathrm{maxdeg}^{-}(u, v)$ is the larger in-degree between node $u$ and node $v$, i.e., $\mathrm{maxdeg}^{-}(u, v) = \max(\deg^{-}_{u}, \deg^{-}_{v})$. For example, consider the node pair (S1, J1) in Figure 1. The in-degree of node S1 is 2 and that of node J1 is 1, so $\mathrm{mindeg}^{-}(S1, J1) = \min(2, 1) = 1$ and $\mathrm{maxdeg}^{-}(S1, J1) = \max(2, 1) = 2$.
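The degree notation can be made concrete with a short Python sketch. It is only an illustration: the edge list below is a hypothetical toy graph (it is not the edge set of Figure 1), and the helper names `in_neighbors`, `mindeg_in` and `maxdeg_in` are introduced here for convenience.

```python
from collections import defaultdict


def in_neighbors(edges):
    """Map every node v to I(v), the set of nodes u with an edge (u, v)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[v].add(u)
        nbrs[u]  # touch u so that nodes without in-edges also get an entry
    return dict(nbrs)


def mindeg_in(I, u, v):
    """Smaller in-degree of the node pair (u, v)."""
    return min(len(I[u]), len(I[v]))


def maxdeg_in(I, u, v):
    """Larger in-degree of the node pair (u, v)."""
    return max(len(I[u]), len(I[v]))


# Hypothetical toy edges chosen so that S1 has in-degree 2 and J1 has in-degree 1.
edges = [("M1", "S1"), ("M2", "S1"), ("S1", "J1")]
I = in_neighbors(edges)
pair_min = mindeg_in(I, "S1", "J1")
pair_max = maxdeg_in(I, "S1", "J1")
```

Storing in-neighbors as sets keeps the degree lookups constant-time and makes the later matching steps easy to express.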

The RoleSim algorithm is founded upon the con-

cept of maximal matching of neighbors’ similarity,

which recursively establishes the similarity between

nodes as the mean similarity of the maximum weight

matching among their neighbors. Maximum Weighted

Matching (MWM) is a well-known problem in graph

theory where the objective is to ﬁnd, in a weighted

graph, a matching that has the highest possible sum

of weights. The RoleSim algorithm calculates the role

similarity rs(u, v) between nodes u and v using the

following formula. The complete matrix of pairwise

similarity values between all nodes is referred to as R:

$$
rs(u, v) = (1 - C) \max_{MA(u,v)} \frac{\sum_{(x,y) \in MA(u,v)} rs(x, y)}{\deg^{-}_{u} + \deg^{-}_{v} - \mathrm{mindeg}^{-}(u, v)} + C \tag{1}
$$

Here, $x \in I(u)$, $y \in I(v)$, $MA(u, v)$ denotes a matching between $I(u)$ and $I(v)$, $C$ signifies the decay factor ($0 < C < 1$), and $\deg^{-}_{u} + \deg^{-}_{v} - \mathrm{mindeg}^{-}(u, v)$ is equivalent to $\mathrm{maxdeg}^{-}(u, v)$.

A weighted bipartite matching $MA(u, v)$ can be defined using the $rs(x, y)$ scores as the weights. The weight of the matching is given by the sum of the $rs(x, y)$ scores for all $(x, y)$ pairs in $MA(u, v)$, denoted by $w(MA(u, v))$. Mathematically, $w(MA(u, v)) = \sum_{(x,y) \in MA(u,v)} rs(x, y)$. A matching $MA(u, v)$ is said to be maximal if its weight is the maximum among all possible matchings. Such a matching is denoted as $\mathcal{M}(u, v)$, and the weight of $\mathcal{M}(u, v)$ is denoted by $M(u, v)$, i.e., $M(u, v) = w(\mathcal{M}(u, v))$.

Using the notation of $\mathcal{M}(u, v)$ and $M(u, v)$, the definition of $rs(u, v)$ from Equation 1 can also be expressed as follows (Rothe and Schütze, 2014):

$$
rs(u, v) = (1 - C) \frac{M(u, v)}{\mathrm{maxdeg}^{-}(u, v)} + C \tag{2}
$$

The matching selection process used by RoleSim

is explained using the following example.

Example 3.1. Consider a directed graph G = (V, E),

where (u, v) ∈ V are two nodes. The set of in-

neighbors of node u is denoted as I(u) = {a, b, c},

while the set of in-neighbors of node v is denoted as

I(v) = {d, e, f , g, h} in G. A subset of the RoleSim

matrix of values (R) is presented in Figure 2, where

each value represents the similarity of the pairings of

neighbors between these two vertices. Assume that

these values have the following ordering: rs(a, d) =

max(rs(a, :)), rs(b, f ) = max(rs(b, :)), and rs(c, e) =

max(rs(c, :)).

In RoleSim, a matching involves selecting at most one cell from each row and each column. When the number of rows is different from the number of columns, the size of the matching is limited to $\mathrm{mindeg}^{-}(u, v)$. In this example, the matching size is restricted to 3. A maximal matching is a matching where the sum of the selected cells is maximized. As depicted in Figure 2, following the principle of maximum weighted matching, the maximal matching results of the in-neighbor similarity matrix are enclosed by a solid square and can be expressed as $M(u, v) = rs(a, d) + rs(b, f) + rs(c, e)$. In the subsequent sections of this paper, $M^{1}(u, v)$ will be used to refer to this maximal weighted matching result of the in-neighbor similarity matrix for the node pair $(u, v)$ generated by the RoleSim algorithm, and it is referred to as the first-order maximal weighted matching result. This distinction is made to differentiate it from the higher $\Gamma$-th order maximal weighted matching used in the proposed FaRS algorithm (Section 5). For instance, $M^{2}(u, v)$ denotes the second-largest weighted matching result.

Figure 2: In-Neighbour Similarity Matrix of Node-Pair (u, v).
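The matching selection of Example 3.1 can be reproduced with a small brute-force sketch. The numeric values in `R` are hypothetical (the entries of Figure 2 are not given in the text) and are chosen only so that rs(a, d), rs(b, f) and rs(c, e) are the row maxima. The exhaustive search over column choices is a stand-in for a proper maximum weighted matching solver and is practical only for tiny matrices.

```python
from itertools import permutations


def max_weight_matching(W):
    """Return (weight, pairs) of a maximum weighted matching of a non-empty matrix W.

    W[i][j] is the similarity between row item i and column item j. The
    matching has size min(rows, cols). Brute force, for small inputs only.
    """
    rows, cols = len(W), len(W[0])
    if rows > cols:
        # Transpose so that rows <= cols, then swap the index pairs back.
        Wt = [[W[i][j] for i in range(rows)] for j in range(cols)]
        weight, pairs = max_weight_matching(Wt)
        return weight, [(i, j) for j, i in pairs]
    best_weight, best_pairs = float("-inf"), []
    for chosen in permutations(range(cols), rows):
        weight = sum(W[i][j] for i, j in enumerate(chosen))
        if weight > best_weight:
            best_weight, best_pairs = weight, list(enumerate(chosen))
    return best_weight, best_pairs


# Hypothetical similarity values: rows are I(u) = {a, b, c},
# columns are I(v) = {d, e, f, g, h}.
R = [
    [0.9, 0.3, 0.2, 0.1, 0.4],  # a
    [0.2, 0.5, 0.8, 0.3, 0.1],  # b
    [0.3, 0.7, 0.4, 0.2, 0.6],  # c
]
weight, pairs = max_weight_matching(R)

# Equation 2 with an assumed decay factor C = 0.2; maxdeg is the larger dimension.
C = 0.2
score = (1 - C) * weight / max(len(R), len(R[0])) + C
```

With these values the selected pairs are (a, d), (b, f) and (c, e), and the two surplus columns g and h do not contribute to the score.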

The RoleSim algorithm, which follows an iterative process to calculate the role similarity score between node pairs $(u, v)$ with $u, v \in V$, consists of two phases. First, the role similarity score matrix $R$ is initialized. In the second phase, during the $k$-th iteration, the role similarity score between the node pair $(u, v)$ is computed based on the role similarity scores from the previous $(k-1)$-th iteration. This computation is performed using the following equation:

$$
rs_{k}(u, v) = (1 - C) \frac{M^{1}_{k-1}(u, v)}{\mathrm{maxdeg}^{-}(u, v)} + C \tag{3}
$$

The task of retrieving similarity relationships between a given node $q$ and all other nodes of the graph is known as single-source similarity search and is denoted as $rs(:, q)$, which is a vector of similarity scores between $q$ and the remaining nodes. It can be calculated as:

$$
rs_{k}(:, q) = (1 - C)\, M^{1}_{k-1}(:, q) \oslash \mathrm{MAXDEG}^{-}(:, q) + C_{n \times 1} \tag{4}
$$

Here $\oslash$ is the element-wise division of the corresponding positions of two vectors, and $n$ is the number of nodes in the graph. The vector $M^{1}_{k-1}(:, q)$ contains the first-order maximum weighted matching of each node pair $(i, q)$, $i \in V$, while the vector $\mathrm{MAXDEG}^{-}(:, q)$ contains the maximum in-degree of each node pair $(i, q)$, $i \in V$.
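A minimal Python sketch of the iterative computation in Equations 3 and 4 is given next. It makes several assumptions that are not part of the original text: the toy graph `toy_I` is hypothetical, the matching is found by brute force, and pairs in which one node has no in-neighbors are set to C (borrowing the special case that the paper states later for FaRS).

```python
from itertools import permutations


def best_matching_weight(W):
    """Weight of a maximum weighted matching of a non-empty matrix (brute force)."""
    rows, cols = len(W), len(W[0])
    if rows > cols:
        W = [list(col) for col in zip(*W)]  # transpose so that rows <= cols
        rows, cols = cols, rows
    return max(
        sum(W[i][j] for i, j in enumerate(chosen))
        for chosen in permutations(range(cols), rows)
    )


def rolesim_single_source(I, q, C=0.2, iterations=5):
    """Return {node: rs(node, q)} after a fixed number of RoleSim iterations.

    I maps every node to its set of in-neighbors. All pairs are updated in
    each iteration because a pair depends on the scores of its neighbor pairs.
    """
    nodes = sorted(I)
    rs = {(u, v): 1.0 for u in nodes for v in nodes}  # initialisation to ones
    for _ in range(iterations):
        new_rs = {}
        for u in nodes:
            for v in nodes:
                if not I[u] or not I[v]:
                    new_rs[(u, v)] = C  # assumed special case for in-degree zero
                    continue
                W = [[rs[(x, y)] for y in sorted(I[v])] for x in sorted(I[u])]
                maxdeg = max(len(I[u]), len(I[v]))
                new_rs[(u, v)] = (1 - C) * best_matching_weight(W) / maxdeg + C
        rs = new_rs
    return {u: rs[(u, q)] for u in nodes}


# Hypothetical toy graph: a -> b, a -> c, b -> d, c -> d.
toy_I = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
scores = rolesim_single_source(toy_I, "d")
```

In this toy graph the nodes b and c have identical in-neighborhoods, so they receive the same score with respect to the query d.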

The RoleSim algorithm satisfies the following properties (Rothe and Schütze, 2014). In Section 5.2, we will demonstrate that these properties also hold for our algorithms FaRS and Opt FaRS.

1. Boundedness: The similarity score $rs(*, *)$ always exists and is unique, and $C \le rs(*, *) \le 1$.

2. Monotone Convergence: The value of $rs_{k}(*, *)$ is an upper bound of $rs_{k+1}(*, *)$, that is, $rs_{k}(*, *) \ge rs_{k+1}(*, *)$.

3. Convergence: The result of $rs_{k}(*, *)$ converges to $rs(*, *)$, that is, $\lim_{k \to \infty} rs_{k}(*, *) = rs(*, *)$.

4. Triangle inequality: The RoleSim algorithm satisfies the triangle inequality.

4 LIMITATIONS OF RoleSim

Despite its usefulness, the RoleSim algorithm suffers

from two main limitations, namely inaccuracy and

computational inefﬁciency. We provide a detailed

analysis of the causes of these limitations here.

4.1 Limitation 1 (Inaccuracy)

The role similarity score in RoleSim is determined by

computing the maximum matching from the neigh-

bor similarity score matrix (Gabow et al., 2001). As

mentioned earlier, the maximum matching refers to

selecting the highest sum value of one cell from each

row and column in the in-neighbor similarity ma-

trix. If the number of rows differs from the num-

ber of columns, the matching size is constrained to

be the minimum of the two. It means that RoleSim

excludes the in-neighbor nodes that represent the sur-

plus rows/columns in the similarity matrix from the

matching selection process. Moreover, even when

an in-neighbor row/column is included in the match-

ing selection, most similarity scores associated with

it are disregarded. This is because RoleSim only

considers one value from each row and column of

the in-neighbor score matrix (the ﬁrst-order maximal

weighted matching). Consequently, these two factors

can compromise the accuracy of RoleSim and result

in ambiguous role classification. To demonstrate the limitations of the RoleSim algorithm, we employ Example 3.1 once again and refer to Figure 2.

Figure 2 demonstrates a limitation of the RoleSim

algorithm, where it only captures information from a

speciﬁc column (green area) while disregarding an-

other column (red area). This is due to the algo-

rithm’s focus on maximizing the matching based on

the smaller in-degree between the compared nodes.

As a result, the algorithm neglects the in-neighbors’ information that corresponds to the difference in in-degrees between the nodes. Additionally, beyond the first-order maximal matching (the three solid square cells), all remaining information in columns d, e, and f is ignored. These two factors contribute to the inaccuracy of RoleSim.
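The effect can be illustrated with two hypothetical in-neighbor similarity matrices (the values are invented for this sketch and do not come from Figure 2). Both have the same first-order maximum matching weight, so RoleSim assigns the same score, yet their second-largest matching weights differ. The sketch ranks all matchings by brute force, and it is an assumption of the sketch that matchings with equal weight are ranked as separate entries.

```python
from itertools import permutations


def ranked_matching_weights(W):
    """Weights of all matchings of size min(rows, cols), largest first (brute force)."""
    rows, cols = len(W), len(W[0])
    if rows > cols:
        W = [list(col) for col in zip(*W)]  # transpose so that rows <= cols
        rows, cols = cols, rows
    weights = [
        sum(W[i][j] for i, j in enumerate(chosen))
        for chosen in permutations(range(cols), rows)
    ]
    return sorted(weights, reverse=True)


# Two hypothetical similarity matrices with the same best matching (0.9 + 0.8).
A = [[0.9, 0.1, 0.1],
     [0.1, 0.8, 0.1]]
B = [[0.9, 0.7, 0.1],
     [0.6, 0.8, 0.8]]

ranked_A = ranked_matching_weights(A)
ranked_B = ranked_matching_weights(B)
same_first_order = abs(ranked_A[0] - ranked_B[0]) < 1e-12
second_order_gap = ranked_B[1] - ranked_A[1]
```

Because only the first entry of each ranking enters RoleSim, the much richer neighbor similarity in `B` goes unnoticed. A measure that also looks at lower-ranked matchings can separate the two cases.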

For the network in Figure 1, we compare the role similarity search results of RoleSim with those of our proposed FaRS algorithm (with J3 as the query). The right table in Figure 3 shows

the results obtained from RoleSim, where all nodes

[M6, J3, J5, J7, J1, J2, J4, J6] have the same role simi-

larity scores with node J3. It suggests that these nodes

belong to the same role classiﬁcation. However, in

reality, [J3, J5, J7, J1, J2, J4, J6] are part of the junior

staff classiﬁcation (level 3), whereas M6 belongs to

the manager classiﬁcation (level 1). Thus, the results

of RoleSim may contain errors that require correction.

The left table in Figure 3 displays the results pro-

duced by our proposed algorithm, FaRS, which can

correctly reﬂect this fact.

Moreover, the RoleSim algorithm suffers from an

accuracy problem in recognizing the structural equiv-

alence (group classiﬁcation in Figure 1). In con-

trast, the scores produced by FaRS in the left table

are different from those of RoleSim, and thus, the

similarity levels can be ordered and ranked. For in-

stance, the role similarity scores between node J3

and nodes [J4, J5] are 0.4, and both [J4, J5] belong

to group 2. Similarly, we can identify [J6, J7] and [J3, J1, J2] as belonging to group 3 and group 1, respectively. Note that although J3 and [J1, J2] have different in-neighbors (the in-neighbor of node J3 is node S3, and the in-neighbor of [J1, J2] is node S1), both nodes S3 and S1 belong to group 1. Thus, J3 and [J1, J2] are more structurally related. FaRS recognizes this fact, which is evident in the results of FaRS (FaRS(J1, J3) = FaRS(J2, J3) = FaRS(J3, J3)).

Figure 3: Role similarity search results of RoleSim & FaRS.

This example illustrates that FaRS not only corrects the errors in RoleSim and improves the accuracy of automorphic equivalence, but also captures structural equivalence, which RoleSim entirely overlooks.

4.2 Limitation 2 (Computational Inefficiency)

The

single-source similarity search in RoleSim is compu-

tationally expensive due to redundant computations

and repeated use of the maximum matching algo-

rithm. In contrast, FaRS improves computational

efﬁciency by minimizing the number of maximum

matching calls and focusing on shared information,

resulting in faster processing times.

5 PROPOSED SCHEMA

This section introduces FaRS, a new role-based sim-

ilarity search algorithm that surpasses existing algo-

rithms such as RoleSim in identifying and categoriz-

ing nodes in graphs.

5.1 FaRS

Most existing role-based similarity algorithms, in-

cluding RoleSim, have limitations (as discussed in

Section 4) when applied to real-life applications. To

overcome these limitations and improve role-based

similarity search, we propose the FaRS algorithm.

FaRS selects the top $\Gamma$ best matching pairs in the maximum matching process, providing more accurate role similarity scores. We denote the role similarity score vector between a single query $q$ and a node set $set_{1}$ as $RS(set_{1}, q)$, the role similarity score matrix between node sets $set_{1}$ and $set_{2}$ as $RS(set_{1}, set_{2})$, and the role similarity score between a node pair $(u, v)$ as $RS(u, v)$. The $\Gamma$-th order maximum weighted matching is defined as the maximum weighted matching of the matrix excluding the top $(\Gamma - 1)$ largest weighted matchings (Murty, 1968). In Section 7, we show that setting $\Gamma$ equal to 3 can achieve the best balance between accuracy and efficiency of FaRS.

Definition 1 (FaRS). Given a directed graph $G = (V, E)$, a query $q \in V$, and $|V| = n$, the single-source FaRS role similarity scores w.r.t. $q$, denoted by $RS(:, q)$, are defined as follows:

$$
\begin{aligned}
RS(:, q) = {} & (1 - C) \Big( \max_{MA^{1}(:,q)} \sum_{(x,y) \in MA^{1}(:,q)} RS(x, y) + \lambda \cdot \max_{MA^{2}(:,q)} \sum_{(x,y) \in MA^{2}(:,q)} RS(x, y) \\
& + \cdots + \lambda^{(\Gamma-1)} \cdot \max_{MA^{\Gamma}(:,q)} \sum_{(x,y) \in MA^{\Gamma}(:,q)} RS(x, y) \Big) \\
& \oslash (1 + \lambda + \cdots + \lambda^{(\Gamma-1)}) \big( \mathrm{DEG}^{-}_{i=1:n} + [\deg^{-}_{q}]_{n \times 1} - \mathrm{MINDEG}^{-}(:, q) \big) + C_{n \times 1}
\end{aligned}
\tag{5}
$$

where $MA^{\Gamma}(:, q)$ is the top $\Gamma$-th order matching of the in-neighbour similarity matrix of nodes ($i = 1 : n$) and query $q$, and the corresponding maximum weighted matching values can be represented as $M^{\Gamma}(:, q)$. The normalization coefficient for the match of each order is denoted by $\lambda^{(\gamma-1)}$ ($0 \le \lambda \le 1$ and $1 \le \gamma \le \Gamma$). $\oslash$ denotes the element-wise division of the corresponding positions of two vectors. The vector $\mathrm{DEG}^{-}_{i=1:n}$ contains the in-degrees of all the nodes $i$ ($i \in V$). $[\deg^{-}_{q}]_{n \times 1}$ is a vector whose values are all equal to the in-degree of node $q$. $\mathrm{MINDEG}^{-}(:, q)$ is a vector whose values are the minimum in-degree of each node pair, $\min(\deg^{-}_{i}, \deg^{-}_{q})$ for $i \in V$. $C$ is the decay factor ($0 < C < 1$).

Definition 1 presents the FaRS algorithm for role similarity, which captures the top $\Gamma$ maximum weighted matching values from a node pair’s in-neighbour similarity matrix. In order to prevent division by zero in the denominator of Eq. 5, the following special cases are incorporated into the equation.

$$
\begin{cases}
RS(u, q) = C & \deg^{-}_{u} = 0 \text{ or } \deg^{-}_{q} = 0 \\
RS(:, q) = C_{n \times 1} & \deg^{-}_{q} = 0
\end{cases}
$$

Lemma 5.1. Given the in-neighbor similarity matrix of a node pair $(u, v)$ in $G$, the value of $\mathrm{mindeg}^{-}(u, v)$ is the upper bound for the various orders of maximal weighted matching $M^{\gamma}(u, v)$ (where $1 \le \gamma \le \Gamma$). These matching values are arranged in descending order, i.e., $\mathrm{mindeg}^{-}(u, v) \ge M^{1}(u, v) \ge \cdots \ge M^{\Gamma}(u, v) \ge 1$.

The proof of this lemma is omitted from this paper due to space limitations and can be found in the technical report [removed].

Computation of FaRS. Given a graph $G = (V, E)$, the computation of the single-source FaRS role similarity scores with respect to the query $q$, denoted as $RS(:, q)$, follows an iterative process until convergence. The total number of iterations in the algorithm is denoted as $K$, with each iteration represented by $k$ ($K = \max(k)$). In the first phase, the matrix $RS_{0}(V, V)$ is initialized as $\mathrm{ones}_{n \times n}$. In the second phase, the role similarity scores $RS_{k}(:, q)$ at iteration $k$ are computed using the following equation. The second phase is repeated until convergence is reached.

$$
\begin{aligned}
RS_{k}(:, q) = {} & (1 - C) \big( M^{1}_{k-1}(:, q) + \lambda \cdot M^{2}_{k-1}(:, q) + \cdots + \lambda^{(\Gamma-1)} \cdot M^{\Gamma}_{k-1}(:, q) \big) \\
& \oslash (1 + \lambda + \cdots + \lambda^{(\Gamma-1)})\, \mathrm{MAXDEG}^{-}(:, q) + C_{n \times 1}
\end{aligned}
\tag{6}
$$
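To make the update rule concrete, the following Python sketch applies Eq. 6 to a single node pair. It is a simplified illustration rather than the authors' implementation. Two choices are assumptions of the sketch: matchings are ranked by brute force instead of a ranking algorithm such as the one cited from Murty (1968), and when fewer than Γ matchings exist the normalization uses only the orders that were found. The function names and default parameter values are hypothetical.

```python
from itertools import permutations


def ranked_matching_weights(W):
    """Weights of all matchings of size min(rows, cols), largest first (brute force)."""
    rows, cols = len(W), len(W[0])
    if rows > cols:
        W = [list(col) for col in zip(*W)]  # transpose so that rows <= cols
        rows, cols = cols, rows
    weights = [
        sum(W[i][j] for i, j in enumerate(chosen))
        for chosen in permutations(range(cols), rows)
    ]
    return sorted(weights, reverse=True)


def fars_pair_update(W, gamma=2, lam=0.5, C=0.2):
    """One application of Eq. 6 to a single node pair.

    W holds the previous-iteration similarities between the in-neighbors of
    the two nodes (rows: first node, columns: second node).
    """
    if not W or not W[0]:
        return C  # special case: one of the nodes has in-degree zero
    maxdeg = max(len(W), len(W[0]))
    top = ranked_matching_weights(W)[:gamma]
    # Assumption: if fewer than gamma matchings exist, normalise over those found.
    coeffs = [lam ** g for g in range(len(top))]
    numerator = sum(c * m for c, m in zip(coeffs, top))
    return (1 - C) * numerator / (sum(coeffs) * maxdeg) + C


# First iteration: all neighbor similarities are still 1 (initialisation to ones).
ones_2x3 = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
first_update = fars_pair_update(ones_2x3)

# Hypothetical later-iteration matrix in which the second-best matching is much weaker.
A = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.1]]
rolesim_like = fars_pair_update(A, gamma=1)
fars_like = fars_pair_update(A, gamma=2)
```

With `gamma=1` the update coincides with the RoleSim rule of Eq. 3, so the two calls on `A` show how the second-order matching changes the score of the same pair.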

Theorem 5.2. Convergence: For a directed graph $G = (V, E)$ and any query $q \in V$, the FaRS role similarity search algorithm converges under the initialisation $RS_{k=0} = \mathrm{Ones}_{n \times n}$, and the iterative computation of the FaRS algorithm w.r.t. query $q$ at iteration $k$ satisfies $\lim_{k \to \infty} RS_{k}(:, q) = RS(:, q)$.

Proof. To save space, we set $\Gamma = 2$ without loss of generality. We aim to demonstrate the convergence of the role similarity scores $RS_{k}(:, q)$ to $RS(:, q)$ as $k$ approaches infinity. This requires proving the convergence of node-pair role similarity search scores. Specifically, we need to show that $\lim_{k \to \infty} RS_{k}(i, q) = RS(i, q)$ holds for an arbitrarily selected node $i$ from $V$. When $k = 0$, according to the initialisation $RS_{0} = \mathrm{Ones}_{n \times n}$ and the definition of FaRS (Eq. 5), $RS_{1}(:, q)$ can be calculated as follows:

$$
\begin{aligned}
RS_{1}(i, q) &= (1 - C) \frac{\sum_{(x,y) \in \mathcal{M}^{1}_{0}(i,q)} RS_{0}(x, y) + \lambda \sum_{(x',y') \in \mathcal{M}^{2}_{0}(i,q)} RS_{0}(x', y')}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(i, q)} + C \\
&= (1 - C) \underbrace{\frac{(1 + \lambda)\, \mathrm{mindeg}^{-}(i, q)}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(i, q)}}_{\le 1} + C \le 1 = RS_{0}(i, q)
\end{aligned}
$$

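Next, we assume that $RS_{k}(*, *) \le RS_{k-1}(*, *)$ holds for any node pair in the graph, and we prove that $RS_{k+1}(*, *) \le RS_{k}(*, *)$ holds as well. Referring to the computation of FaRS (Eq. 6), $RS_{k+1}(:, q)$ can be generated as follows:

$$
\begin{aligned}
RS_{k+1}(i, q) &= (1 - C) \frac{M^{1}_{k}(i, q) + \lambda M^{2}_{k}(i, q)}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(i, q)} + C \\
&= (1 - C) \frac{\sum_{(x,y) \in \mathcal{M}^{1}_{k}(i,q)} RS_{k}(x, y) + \lambda \sum_{(x',y') \in \mathcal{M}^{2}_{k}(i,q)} RS_{k}(x', y')}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(i, q)} + C \\
&\le (1 - C) \frac{\sum_{(x,y) \in \mathcal{M}^{1}_{k-1}(i,q)} RS_{k-1}(x, y) + \lambda \sum_{(x',y') \in \mathcal{M}^{2}_{k-1}(i,q)} RS_{k-1}(x', y')}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(i, q)} + C && \text{(hypothesis } RS_{k}(*, *) \le RS_{k-1}(*, *)\text{)} \\
&= (1 - C) \frac{M^{1}_{k-1}(i, q) + \lambda M^{2}_{k-1}(i, q)}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(i, q)} + C && \text{(refer to Eq. 6)} \\
&= RS_{k}(i, q)
\end{aligned}
$$

Thus we conclude that $RS_{k+1}(*, *) \le RS_{k}(*, *)$ holds for any node pair in the graph when the assumption $RS_{k}(*, *) \le RS_{k-1}(*, *)$ holds. The sequence $RS_{k}(i, q)$ is therefore non-increasing in $k$, and since it is also bounded below by $C$ (Theorem 5.5), it converges. We have demonstrated that $\lim_{k \to \infty} RS_{k}(i, q) = RS(i, q)$ for any arbitrarily chosen node $i$ in $V$, so the result holds for all nodes in $V$. Consequently, as the number of iterations $k$ increases towards infinity, the role similarity scores of FaRS eventually converge. Specifically, we have $\lim_{k \to \infty} RS_{k}(:, q) = RS(:, q)$.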

5.2 The Axiomatic Properties of FaRS

In this subsection, we establish that FaRS preserves

the crucial axiomatic properties of RoleSim. Note

that, for brevity, without loss of generality, we assume

that Γ in the FaRS algorithm is ﬁxed at 2 in all proofs.

In the following, let G = (V, E) be a graph, q be a ran-

domly selected query, and k be an iteration number.

Theorem 5.3. Symmetry: The role similarity scores generated by Eq. 6 satisfy $RS_{k}(:, q) = RS_{k}(q, :)$, where $RS_{k}(q, :)$ denotes the $q$-th row of the role similarity matrix.

Theorem 5.4. Monotone Convergence: The role similarity scores generated by Eq. 6 satisfy $RS_{k}(:, q) \le RS_{k-1}(:, q)$.

For the detailed proof of Theorem 5.3 and Theo-

rem 5.4, please refer to the technical report [removed]

due to limited space.

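Theorem 5.5. Boundedness: The role similarity scores generated by Eq. 6 satisfy $C_{n \times 1} \le RS_{k}(:, q) \le 1_{n \times 1}$. Here $C_{n \times 1}$ is a vector whose values are all $C$, and $1_{n \times 1} = \mathrm{ones}(n, 1)$.

Proof. We prove Theorem 5.5 by showing that for any node $u \in V$, $C \le RS_{k}(u, q) \le 1$. We begin by initializing $RS_{0} = \mathrm{ones}(n, n)$, where $n$ is the number of nodes of the graph. Eq. 6 is as follows:

$$
\begin{aligned}
RS_{k}(u, q) &= (1 - C) \frac{M^{1}_{k-1}(u, q) + \lambda M^{2}_{k-1}(u, q)}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(u, q)} + C \\
&\le (1 - C) \underbrace{\frac{(1 + \lambda)\, \mathrm{mindeg}^{-}(u, q)}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(u, q)}}_{0 \le * \le 1} + C \le 1
\end{aligned}
$$

Moreover, $\frac{M^{1}_{k-1}(u, q) + \lambda M^{2}_{k-1}(u, q)}{(1 + \lambda)\, \mathrm{maxdeg}^{-}(u, q)}$ is a non-negative number, so $RS_{k}(u, q) \ge C$. If the in-degree of node pair $(u, q)$ equals zero, then $RS_{k}(u, q) = C$. Therefore, $C \le RS_{k}(u, q) \le 1$ is satisfied. Since $u \in V$ is an arbitrary node, we can infer that $C_{n \times 1} \le RS_{k}(:, q) \le 1_{n \times 1}$.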

Theorem 5.6. Triangle inequality: For any nodes $a, b \in V$, the role similarity scores satisfy the following inequality: $d_{k}(a, b) \le d_{k}(a, q) + d_{k}(b, q)$, where $d_{k}(a, q) = 1 - RS_{k}(a, q)$.

Proof. The proof of Theorem 5.6 employs mathematical induction. Since $d_{k}(a, q) = 1 - RS_{k}(a, q)$, the triangle inequality can be rewritten as follows:

$$
\begin{aligned}
& d_{k}(a, b) \le d_{k}(a, q) + d_{k}(b, q) \\
\Updownarrow\; & \quad d_{k}(a, q) = 1 - RS_{k}(a, q) \\
& 1 - RS_{k}(a, b) - \big(1 - RS_{k}(a, q)\big) - \big(1 - RS_{k}(b, q)\big) \le 0 \\
\Updownarrow\; & \\
& RS_{k}(a, q) + RS_{k}(b, q) - RS_{k}(a, b) \le 1
\end{aligned}
\tag{7}
$$

To ensure that Eq. 7 is satisfied, we first initialise the role similarity scores at iteration $k = 0$ as $RS_{0} = \mathrm{Ones}_{n \times n}$, where $n$ denotes the number of nodes in the graph. At iteration $k = 0$, Eq. 7 can be written as: $RS_{0}(a, q) + RS_{0}(b, q) - RS_{0}(a, b) = 1 + 1 - 1 \le 1$. Thus, at iteration $k = 0$, Eq. 7 holds. Next, assuming that Eq. 7 is satisfied at iteration $k$, we need to prove that it also holds at iteration $k + 1$.

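$$
\begin{aligned}
& RS_{k+1}(a, q) + RS_{k+1}(b, q) - RS_{k+1}(a, b) \\
&= \frac{(1 - C)}{(1 + \lambda)} \Bigg[ \underbrace{\frac{\sum_{(x,y) \in \mathcal{M}^{1}_{k}(a,q)} RS_{k}(x, y)}{\mathrm{maxdeg}^{-}(a, q)} + \frac{\sum_{(y,z) \in \mathcal{M}^{1}_{k}(b,q)} RS_{k}(y, z)}{\mathrm{maxdeg}^{-}(b, q)} - \frac{\sum_{(x,z) \in \mathcal{M}^{1}_{k}(a,b)} RS_{k}(x, z)}{\mathrm{maxdeg}^{-}(a, b)}}_{\beta_{1}} \\
&\quad + \lambda \underbrace{\Bigg( \frac{\sum_{(x',y') \in \mathcal{M}^{2}_{k}(a,q)} RS_{k}(x', y')}{\mathrm{maxdeg}^{-}(a, q)} + \frac{\sum_{(y',z') \in \mathcal{M}^{2}_{k}(b,q)} RS_{k}(y', z')}{\mathrm{maxdeg}^{-}(b, q)} - \frac{\sum_{(x',z') \in \mathcal{M}^{2}_{k}(a,b)} RS_{k}(x', z')}{\mathrm{maxdeg}^{-}(a, b)} \Bigg)}_{\beta_{2}} \Bigg] + C
\end{aligned}
$$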

For general purposes, we assume deg

−

≤ deg

−

≤

deg

−

. The equation β1 can be generated as follows:

β1 = (

deg

−

deg

−

)

∑

(x,y)∈M

(a,q)

(x, y)

deg

−

(

∑

(x,y)∈M

(a,q)

(x, y) +

∑

(y,z)∈M

(b,q)

(y, z) −

∑

(x,z)∈M

(a,b)

(x, z))

Then we define a matching $MA(b, q) = \{(y, z) \mid (x, y) \in \mathcal{M}^{1}_{k}(a, q) \wedge (x, z) \in \mathcal{M}^{1}_{k}(a, b)\}$. $I(q)$ can be divided into two parts $I(q1)$ and $I(q2)$, where $I(q1) = \{y \mid y \in \mathcal{M}^{1}_{k}(a, q)\}$ and $I(q2) = I(q) - I(q1)$. So $MA(b, q) = MA_{I}(b, q) + MA_{II}(b, q)$, where $MA_{I}(b, q) = \{(y, z) \mid y \in I(q1), z \in I(b)\}$, and $MA_{II}(b, q) = \{(y, z) \mid y \in I(q2), z \in I(b)\}$. According to Lemma 5.1, we have:

β1 ≤ (

deg

−

deg

−

)deg

−

deg

−

(

∑

(x,y)∈M

(a,q)

(x, y) +

∑

(y,z)∈M A I

(b,q)

(y, z)

∑

(y,z)∈M A II

(b,q)

(y, z) −

∑

(x,z)∈M

(a,b)

(x, z))

≤ (

deg

−

deg

−

)deg

−

deg

−

(deg

−

+ deg

−

+ deg

−

− deg

−

− deg

−

) ≤ 1

$\beta_{2}$ can be bounded in the same way as $\beta_{1}$, which gives $\beta_{2} \le 1$. Therefore, $RS_{k+1}(a, q) + RS_{k+1}(b, q) - RS_{k+1}(a, b) \le \frac{(1 - C)}{(1 + \lambda)} (1 + \lambda) + C = 1$. The proof shows that $RS_{k+1}(a, q) + RS_{k+1}(b, q) - RS_{k+1}(a, b) \le 1$, which implies that $RS_{k}(a, q) + RS_{k}(b, q) - RS_{k}(a, b) \le 1$ holds for any choice of $k$. By Eq. 7, this in turn implies $d_{k}(a, b) \le d_{k}(a, q) + d_{k}(b, q)$.

6 COMPUTATION OPTIMIZATION

This section introduces two techniques aimed at ac-

celerating the computation of FaRS. These tech-

niques effectively reduce the number of calls made

to the maximum matching algorithm and exploit

“shared” information to minimize repetitive opera-

tions. The resulting accelerated algorithm is referred

to as Opt FaRS.

6.1 Pruning Approach

The methodology of Opt FaRS comprises two stages:

the pre-processing phase and the iterative computa-

tion phase. In the pre-processing phase, the algorithm

involves extracting the tracking path and comput-

ing the candidate pool.

Definition 2 (Multi-Hop Backward Tracking Path). Given a connected graph $G = (V, E)$, a query $q \in V$, and the number of total iterations $K$ defined in FaRS, the tracking path $P$ with respect to query $q$ is denoted as $P(q) = \langle p_{1}, p_{2}, \ldots, p_{L} \rangle$, where $p_{i}$ represents the set of $i$-th hop backward tracking nodes with respect to query $q$, and $L$ is the actual number of iterations performed by the FaRS algorithm before convergence, which is also known as the level of the tracking path. It satisfies the condition $1 \le L \le K$. $P$ is iteratively defined as follows, and any repeated nodes in $p_{l}$ are removed from the set to ensure uniqueness:

$$
\begin{cases}
p_{1} = \{q\} \\
p_{l} = I(x_{1}) \cup I(x_{2}) \cup \cdots \cup I(x_{|p_{l-1}|}), & \text{where } x_{1}, x_{2}, \ldots, x_{|p_{l-1}|} \in p_{l-1}
\end{cases}
\tag{8}
$$

The tracking path is determined by the query node

and the structure of the graph. This calculation can be

illustrated using the following example.

Example 6.1. Consider a graph G with ﬁve nodes,

a query q = d, and the number of iterations K = 6 in

FaRS, as shown in the left side of Figure 4. The track-

ing paths generated according to Eq. (8) are depicted

on the right side of Figure 4.

Figure 4: Left side: Example of Graph G. Right side:

Multi-Hop Backward Tracking Path of the Graph G.

In Figure 4, the tracking path represents a traversal starting from the query node d. As per Definition 2, the second element in the tracking path represents the in-neighbour nodes of the query node d; thus, we obtain $p_2 = \{b, c\}$. To generate $p_3$, we apply Eq. (8) and obtain $p_3 = I(p_2) = I(b) \cup I(c) = \{a, a, b, c\}$. After eliminating the repeated elements, the final result is $p_3 = \{a, b, c\}$. The remaining elements can be calculated in a similar way. It is worth mentioning that the tracking path terminates at an element equal to {a}, where a has an in-degree of zero, ensuring convergence.
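
The construction in Eq. (8) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper name `tracking_path`, the dictionary-of-in-neighbours representation, and the small graph are hypothetical (the graph is not the one in Figure 4), and the expansion is assumed to stop once a level has no in-neighbours or K levels have been produced.

```python
def tracking_path(in_nbrs, q, K):
    """Return the backward tracking path [p_1, p_2, ...] for query q.

    in_nbrs maps each node to an iterable of its in-neighbours (assumed format).
    Each level is stored as a set, so repeated nodes are removed automatically.
    """
    path = [{q}]                          # p_1 = {q}
    while len(path) < K:
        nxt = set()
        for x in path[-1]:                # union of I(x) over the previous level
            nxt.update(in_nbrs.get(x, ()))
        if not nxt:                       # all nodes in the last level have in-degree zero
            break
        path.append(nxt)
    return path


# Hypothetical graph with edges a->b, a->c, b->d, c->d, d->e
in_nbrs = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}
path = tracking_path(in_nbrs, "d", K=6)
print(len(path), path)
```

For this hypothetical graph the path has three levels, {d}, {b, c}, and {a}, so L = 3 is smaller than K = 6 and the expansion stops early because a has no in-neighbours.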

The objectives of extracting the tracking path P of a graph G in relation to q are twofold: firstly, it can significantly reduce the computation of redundant information; secondly, it can decrease the number of iterations required for FaRS. When the level number L of the tracking path is less than the given iteration number K, it suffices to perform only L iterations. This is because, based on the structure of the graph, the role similarity scores converge after L iterations, implying that $\mathrm{FaRS}_L = \mathrm{FaRS}_k$ ($L \le k \le K$).

Given the deﬁnition of the graph’s tracking path,

we can explain how to generate candidate pools (CP)

by using the tracking path elements as indices.

Definition 3 (Candidate pool). Given $P = \langle p_1, p_2, \ldots, p_L \rangle$ as a tracking path of a connected graph G, a candidate pool is defined for each $p_l$ in P. Each candidate pool is a subset of the similarity matrix, consisting of a fixed number of rows that include all the nodes in G with out-neighbors. These nodes are denoted as out, defined as $out = \{x \in V \mid O(x) \neq \emptyset\}$. The columns of each candidate pool are determined by the corresponding tracking path element $p_l$. During the kth iteration of the FaRS algorithm, the candidate pool $CP_k$ can be represented as $CP_k = RS_k(out, p_l)$ ($l = K - k + 1$, $1 < k \le L$). Here, $RS_k$ denotes the role similarity score matrix of all node pairs in the graph generated by the FaRS algorithm, and $CP_k$ is the candidate pool for the kth iteration.

It can be observed that during the computation of $RS_k$, only the similarity values in $CP_k$ are updated at each iteration. The size of $CP_k$ is typically much smaller than that of $RS_k$, because the number of nodes with out-neighbors (i.e., the number of candidate pool rows) is smaller than or equal to the total number of nodes in the graph, and the size of each element in the tracking path (i.e., the number of candidate pool columns) is much smaller than the total number of nodes in the graph (i.e., $|out| \le n$ and $|p_l| \ll n$, where n is the total number of nodes in the graph). In previous studies on RoleSim, the role similarity scores of all n × n node pairs had to be computed at each iteration. In contrast, our proposed Opt FaRS algorithm leverages the candidate pool concept to reduce the computation cost of each iteration to the information retrieval range ($|out| \times |p_l|$).
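
To make the size argument concrete, the sketch below restricts a score table to the rows in out and the columns of one tracking path element, then compares the number of entries with the full n × n table. The function names, the nested-dictionary score table, and the graph are hypothetical and only serve as an illustration.

```python
def out_nodes(out_nbrs):
    """Nodes with at least one out-neighbour, in a fixed (sorted) order."""
    return sorted(x for x, nbrs in out_nbrs.items() if nbrs)


def candidate_pool(rs, rows, cols):
    """Restrict the score table rs[i][j] to the given rows and columns."""
    return {i: {j: rs[i][j] for j in cols} for i in rows}


# Hypothetical graph with edges a->b, a->c, b->d, c->d, d->e
out_nbrs = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
nodes = sorted(out_nbrs)
rs = {i: {j: 1.0 for j in nodes} for i in nodes}   # placeholder scores

rows = out_nodes(out_nbrs)                 # 'e' is dropped: it has no out-neighbours
cp = candidate_pool(rs, rows, cols=["b", "c"])
full_size = len(nodes) * len(nodes)
pool_size = sum(len(r) for r in cp.values())
print(full_size, pool_size)                # prints "25 8"
```

Here the pool holds 4 × 2 = 8 entries instead of the 25 entries of the full table; on larger graphs the gap depends on how small each tracking path element is relative to n.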

Based on the computation formula of FaRS

(Eq. 6) and the candidate pool deﬁnition, we propose

an efﬁcient single-source similarity search algorithm

called Opt FaRS, which can be expressed mathemati-

cally as follows:

Theorem 6.1. Let G = (V, E) be a connected graph, and let q be a random query with corresponding tracking path $P = \langle p_1, p_2, \ldots, p_L \rangle$. Then, the candidate pool at iteration k can be updated as follows:

\[
RS_k(out, j) = CP_k(:, j) = (1 - C)\left(M^{(1)}_{k-1}(:, j) + \lambda M^{(2)}_{k-1}(:, j) + \cdots + \lambda^{\Gamma-1} M^{(\Gamma)}_{k-1}(:, j)\right) \oslash \left((1 + \lambda + \cdots + \lambda^{\Gamma-1})\,\mathrm{MAXDEG}^{-}(:, j)\right) + C_{|out| \times 1}, \qquad j \in p_{L-k+1} \tag{9}
\]

Here, j represents a node in the tracking path element $p_{L-k+1}$. The element $p_{L-k+1}$ determines the column index of the candidate pool, so the candidate pool at iteration k is given by $RS_k(out, p_{L-k+1}) = CP_k$. The vector $\mathrm{MAXDEG}^{-}(:, j)$ represents the maximum in-degree between node j and each node in out, respectively. The vectors $M^{(1)}_{k-1}(:, j), \ldots, M^{(\Gamma)}_{k-1}(:, j)$ represent the top Γ maximum weighted matchings of the in-neighbour similarity matrix of the node pair (i, j), where i ∈ out, at iteration (k − 1).
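
The column update in Eq. (9) can be illustrated with a short Python sketch. It assumes that the top Γ matching weights for each row node have already been computed elsewhere; the function name `update_column` and the dictionary inputs are hypothetical, and the zero-degree branch follows the remark that the result equals C when the maximum in-degree is zero.

```python
def update_column(top_matchings, max_in_deg, C, lam):
    """Apply an Eq. (9)-style update to one candidate-pool column.

    top_matchings[i] : list of matching weights for row node i, best first
                       (assumed to be precomputed; its length plays the role of Gamma)
    max_in_deg[i]    : max(in-degree of i, in-degree of the column node j)
    """
    column = {}
    for i, weights in top_matchings.items():
        if max_in_deg[i] == 0 or not weights:
            column[i] = C                  # nothing to match: the score falls back to C
            continue
        weighted = sum(lam ** r * w for r, w in enumerate(weights))
        norm = sum(lam ** r for r in range(len(weights)))
        column[i] = (1 - C) * weighted / (norm * max_in_deg[i]) + C
    return column


col = update_column(
    top_matchings={"a": [2.0, 1.0], "b": []},
    max_in_deg={"a": 2, "b": 0},
    C=0.2,
    lam=0.5,
)
print(col)
```

For row a, the weighted sum is 2.0 + 0.5 · 1.0 = 2.5 and the normaliser is (1 + 0.5) · 2 = 3, so the score is 0.8 · 2.5 / 3 + 0.2, roughly 0.867; row b falls back to C = 0.2.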

The Opt FaRS algorithm comprises two key steps. First, the algorithm retrieves the tracking path P of the connected graph G starting from the query node q. Second, the algorithm generates the candidate pool for the tracking path at each iteration k. Finally, the role similarity scores with respect to the query node q are obtained as Opt FaRS(:, q) = $RS_K(:, q) = CP_K$. Here, the size of $CP_{K-1}$ is $|out| \times |p_2|$ and the size of $CP_K$ is n × 1 (|out| ≤ n). Next, the entries for the nodes that are counted in n but do not belong to out are set to C, so that the sizes of $CP_K$ and $RS_K(:, q)$ are consistent. It is important to note that when $\mathrm{MAXDEG}^{-}(out, j) = 0$, the result of Eq. 9 is equal to $C_{|out| \times 1}$. The proof of this method is omitted here to save space.
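
One possible reading of the size-alignment step in the paragraph above is sketched here: the final column, indexed by out, is expanded to all n nodes by assigning the constant C to every node outside out. The helper name `expand_to_all_nodes` and the sample values are hypothetical, and this interpretation of the padding step is an assumption.

```python
def expand_to_all_nodes(cp_column, nodes, C):
    """Expand a column indexed by 'out' to all nodes, filling the gaps with C."""
    return {v: cp_column.get(v, C) for v in nodes}


nodes = ["a", "b", "c", "d", "e"]
cp_final = {"a": 0.31, "b": 0.64, "c": 0.64, "d": 1.0}   # made-up scores for nodes in out
scores_q = expand_to_all_nodes(cp_final, nodes, C=0.2)
print(scores_q)
```

After the expansion the result has one entry per node, so it can be compared directly with a full single-source score vector.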

In addition to Theorem 6.1, we next present exceptional cases that help speed up the Opt FaRS algorithm. We introduce two speed-up approaches to reduce the computational complexity of candidate pool computation in Opt FaRS. One approach optimizes column generation, while the other optimizes row generation.

6.2 P-Speedup Approach

Based on Eq. 9, the column indexes of candidate

pools are determined by the corresponding track path.

Therefore, we have named our speed-up approach on

column generation the P-Speedup Approach.

There are two exceptional cases of the Fast Role Selection (FRS) algorithm, where FRS can retrieve the role similarity scores without the need to use the maximum matching algorithm.

Definition 4 (Exceptional cases). Given a graph G(V, E), we define two special sets of nodes:

• One-hop. This set consists of all the nodes in G with in-degrees equal to zero. We denote this set as $V^{(1)}$, and it can be formally defined as: $V^{(1)} = \{i \mid \deg^{-}_{i} = 0,\ i \in V\}$.

• Two-hop. This set includes all the nodes in G whose in-neighbors all have in-degrees equal to zero. We define the two-hop node set (denoted as $V^{(2)}$) mathematically as: $V^{(2)} = \{i \mid \deg^{-}_{I(i)} = 0,\ i \in V\}$, where I(i) is the set of in-neighbours of node i in the graph.
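
The two sets in Definition 4 can be computed with a single pass over the in-neighbour lists. In this minimal sketch the function name `exceptional_sets` and the graph are hypothetical, and one detail is an assumption: a node is placed in the two-hop set only if it has at least one in-neighbour, so that the two sets do not overlap.

```python
def exceptional_sets(in_nbrs):
    """Return (one_hop, two_hop) node sets for a graph given as in-neighbour lists."""
    # One-hop: nodes with in-degree zero.
    one_hop = {v for v, nbrs in in_nbrs.items() if len(nbrs) == 0}
    # Two-hop: nodes whose in-neighbours all have in-degree zero
    # (assumption: at least one in-neighbour is required).
    two_hop = {
        v for v, nbrs in in_nbrs.items()
        if len(nbrs) > 0 and all(u in one_hop for u in nbrs)
    }
    return one_hop, two_hop


# Hypothetical graph with edges a->b, a->c, b->d, c->d, d->e
in_nbrs = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}
one_hop, two_hop = exceptional_sets(in_nbrs)
print(sorted(one_hop), sorted(two_hop))   # prints "['a'] ['b', 'c']"
```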

We observe that when a node is an exceptional case, i.e., it belongs to the one-hop set or the two-hop set, the candidate pool value can be generated directly.

Lemma 6.2. Let G be a connected graph, and let j be a query column ($j \in p_l$).

• If node j belongs to the one-hop set $V^{(1)}$, then the candidate pool value $CP_k(:, j)$ can be generated as follows: $CP_k(:, j) = C_{|out| \times 1}$, where $C_{|out| \times 1}$ is a column vector of length |out| with every entry equal to C. Note that the role similarity scores of $CP_k(:, j)$ will not change in subsequent iterations.

• If node j belongs to $V^{(2)}$, then the candidate pool value $CP_k(:, j)$ can be generated as follows: $CP_k(:, j) = (1 - C)\left(C \cdot \mathrm{MINDEG}^{-}(out, j) \oslash \mathrm{MAXDEG}^{-}(out, j) + 1_{|out| \times 1}\right)$. Here, $1_{|out| \times 1}$ is a column vector of length |out| containing all 1s. In this exceptional case, the role similarity scores converge at iteration k = 2.

6.3 Out-Speedup Approach

Lemma 6.2 indicates that the computation of $CP_k$ can be optimized by considering column index nodes that belong to the exceptional cases. In this subsection, we further enhance the computation of $CP_k$ by focusing on specific rows of the candidate pool matrix. The row indices of each candidate pool are determined by out, and thus we refer to the method of accelerating the row nodes as the Out-speedup approach. This optimization is based on the observation that it is unnecessary to compute a similarity score using the computationally expensive maximum weighted matching algorithm if a node can reach a root node of the graph (a node with no incoming edges) within two hops of traversal. This observation allows us to accelerate the computation of FaRS by avoiding unnecessary calls to the maximum weighted matching algorithm for certain node pairs, as described in Lemma 6.3.

Lemma 6.3. Given a graph G(V, E), an iteration number k, a query column j ($j \in p_l$), and any node i ($i \in out$), we have the following:

1. If node i belongs to the one-hop set $V^{(1)}$, then (i, j) = 0.

2. If node i belongs to the two-hop set $V^{(2)}$, then (i, j) = C.

3. Otherwise, the maximum weighted matching of $CP_{k-1}$ is generated. Before introducing the computation method, we define several notions. The maximum matching result of $CP_{k-1}$ is denoted as $M[CP_{k-1}]$, and the matched set of the maximum matching on $CP_{k-1}$ is represented by $M[CP_{k-1}]$. The in-neighbor similarity matrix of the node pair (i, j) is defined as $B_{ij}$. The maximum matching result of $B_{ij}$ is denoted as $M[B_{ij}]$, and the matched set of the maximum matching on $B_{ij}$ is denoted as $M[B_{ij}]$. The matched values of $CP_{k-1}$ in $B_{ij}$ are defined as $M[B_{ij}, CP_{k-1}]$, and the number of matched values of $CP_{k-1}$ in $B_{ij}$ is denoted as $|M[B_{ij}, CP_{k-1}]|$.

- If the minimum value between $\deg^{-}_{i}$ and $\deg^{-}_{j}$ is equal to $|M[B_{ij}, CP_{k-1}]|$, then we can assert that $M(i, j) = M[CP_{k-1}]$.

- If the minimum value between $\deg^{-}_{i}$ and $\deg^{-}_{j}$ exceeds $|M[B_{ij}, CP_{k-1}]|$, then we can decompose this into two cases:

  - The matched values in $M[B_{ij}, CP_{k-1}]$ are the maximum values of both the column and the row of the bipartite matrix $B_{ij}$. We can then calculate the value of $M(i, j)$ by $M(i, j) = \mathrm{sum}(M[B_{ij}, CP_{k-1}]) + M(B^{rem}_{ij})$. Here, the operation sum denotes the sum of the matched values between the bipartite graph $B_{ij}$ and $M[CP_{k-1}]$. We then eliminate the rows and columns of the matched elements from the bipartite graph, and denote the result as $B^{rem}_{ij}$. The value $M(B^{rem}_{ij})$ represents the maximum matching score of the remaining bipartite graph.

  - Otherwise, the value of $M(i, j)$ is as follows: $M(i, j) = M[B_{ij}]$, where $M[B_{ij}]$ is the maximum matching value of the in-neighbour similarity matrix $B_{ij}$.
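
For readers unfamiliar with the operation that the lemma tries to avoid, the following sketch computes a maximum weighted matching of a small in-neighbour similarity matrix by exhaustive search. It is an illustration only: the function name `max_weight_matching` is hypothetical, the brute-force search is a stand-in chosen for clarity rather than the matching algorithm used in the paper, and it is practical only for very small matrices.

```python
from itertools import permutations


def max_weight_matching(B):
    """Brute-force maximum weighted matching of a small matrix B.

    B[r][c] is the similarity between the r-th in-neighbour of node i and the
    c-th in-neighbour of node j. Returns (total_weight, matched_pairs), where
    matched_pairs holds min(#rows, #cols) pairs (r, c).
    """
    if not B or not B[0]:
        return 0.0, []
    transposed = len(B) > len(B[0])        # make sure rows are the shorter side
    W = [list(col) for col in zip(*B)] if transposed else B
    n_rows, n_cols = len(W), len(W[0])
    best_w, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        w = sum(W[r][c] for r, c in enumerate(cols))
        if w > best_w:
            best_w, best_pairs = w, list(enumerate(cols))
    if transposed:                         # map pairs back to the original orientation
        best_pairs = [(c, r) for r, c in best_pairs]
    return best_w, best_pairs


B = [[0.9, 0.2, 0.4],
     [0.3, 0.8, 0.5]]
weight, pairs = max_weight_matching(B)
print(weight, pairs)
```

In this example the best matching pairs row 0 with column 0 and row 1 with column 1, for a total weight of about 1.7. The number of candidate assignments grows factorially with the matrix size, which is why reducing the number of such calls matters.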

In summary, our research has revealed that not all node pairs in each iteration have a significant impact on the final role similarity scores of column q. Building upon this insight, the Opt FaRS algorithm efficiently extracts the influential information during each iteration, eliminating the calculation of unnecessary scores. These optimization techniques greatly reduce the number of calls to the maximum matching algorithm, which is particularly beneficial for large graphs where this algorithm can be computationally expensive. Furthermore, Opt FaRS captures and reuses the “shared” information, avoiding redundant computations. The key advantage of the Opt FaRS algorithm is its significant improvement in computational efficiency without compromising accuracy.

7 EXPERIMENTAL EVALUATION

We empirically evaluate the performance of our proposed algorithms, FaRS and Opt FaRS, on real-world datasets. We compare them with baseline algorithms along three dimensions: the impact of parameter choices on FaRS accuracy, overall accuracy, and time efficiency.

7.1 Experimental Settings

Datasets. We evaluate our algorithms using the publicly available email-Eu-core-temporal dataset (EU) obtained from SNAP (https://snap.stanford.edu/index.html). This dataset consists of anonymized email data from a research organization, representing the network of incoming and outgoing emails between members. The dataset sizes are summarized in Table 1.

Table 1: Description of Datasets.

Datasets                #-Nodes   #-Edges   Type
email-Eu-core (EU)      986       24,929    Directed
Department 1 (Dept-1)   309       3,031     Directed
Department 2 (Dept-2)   162       1,772     Directed
Department 3 (Dept-3)   89        1,506     Directed
Department 4 (Dept-4)   142       1,375     Directed

Compared Algorithms. We evaluated FaRS and Opt FaRS on the aforementioned real-life datasets, comparing them with three similarity search competitors (CSR, RoleSim, and FaRS N). CSR (Rothe and Schütze, 2014) calculates the CoSimRank score using the dot product of Personalized PageRank vectors. RoleSim (Lee, 2012) is a state-of-the-art role similarity search algorithm based on average maximum matching. FaRS N is an alternative version of FaRS that computes the average maximum matching of the remaining in-neighbor similarity matrix instead of using the top Γ maximum matching.

Parameters. We use the following default parame-

ters: (a) the decay factor C = 0.2; (b) the number of

iterations K = 5; (c) the order of maximum matching

Γ = 3; and (d) the relative weight λ = 0.7.
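
As a quick check on what these defaults mean for the update in Eq. (9), the normalising factor $1 + \lambda + \cdots + \lambda^{\Gamma-1}$ and the resulting relative weights of the Γ = 3 matching terms can be worked out directly. This is only arithmetic on the stated defaults, not an additional result from the experiments.

```latex
\[
1 + \lambda + \lambda^{2} = 1 + 0.7 + 0.49 = 2.19,
\qquad
\frac{1}{2.19} \approx 0.457,\quad
\frac{0.7}{2.19} \approx 0.320,\quad
\frac{0.49}{2.19} \approx 0.224 .
\]
```

With C = 0.2, the matching part of each score is additionally scaled by 1 − C = 0.8, and the constant C = 0.2 is added afterwards.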

Evaluation Metrics. The evaluation of role similar-

ity ranking on real-life datasets was performed using

k-means clustering (Arthur and Vassilvitskii, 2006;

Lloyd, 1982; Bock, 2007). To establish the ground

truth, we initially computed the role similarity score

matrix using our algorithms and other baseline al-

gorithms. Subsequently, k-means clustering was ap-

plied to the various role similarity score matrices to

group the data into multiple clusters. According to the

inherent characteristics of k-means clustering, nodes

within the same cluster exhibit higher role similarity

scores. For each query, we extracted the top 20 nodes

that displayed the greatest similarity to the query node

in each algorithm. The cluster to which the query

node belonged was identiﬁed through k-means clus-

tering. Finally, we determined the number of nodes

that were common between the top 20 nodes most

similar to the query in each algorithm and the nodes

within the query cluster. A higher overlap ratio signi-

ﬁes greater accuracy.
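
The overlap ratio described above can be sketched as follows. The function name `overlap_ratio`, the dictionary inputs, and the toy values are hypothetical; the sketch also assumes that the query itself is excluded from its own top list and that cluster labels have already been produced by a k-means run.

```python
def overlap_ratio(scores, labels, query, top_n=20):
    """Fraction of the top_n nodes most similar to the query that share its cluster.

    scores[v] : similarity of node v to the query (one column of a score matrix)
    labels[v] : cluster id of node v, e.g. taken from a k-means run
    """
    ranked = sorted(
        (v for v in scores if v != query),      # assumption: skip the query itself
        key=lambda v: scores[v],
        reverse=True,
    )
    top = ranked[:top_n]
    if not top:
        return 0.0
    hits = sum(1 for v in top if labels[v] == labels[query])
    return hits / len(top)


scores = {"q": 1.0, "a": 0.9, "b": 0.8, "c": 0.1}
labels = {"q": 0, "a": 0, "b": 1, "c": 0}
ratio = overlap_ratio(scores, labels, "q", top_n=2)
print(ratio)                                     # prints 0.5
```

In the toy example the two most similar nodes are a and b, and only a shares the query's cluster, which gives a ratio of 0.5.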

Experiments were carried out on a PC with an In-

tel Core i7-6700 3.40GHz CPU, 64GB of memory,

and Windows 10. Each experiment was repeated ﬁve

times, and the average results are reported.

7.2 Experimental Results

The experimental results consist of three parts: pa-

rameter evaluation, accuracy assessment, and time ef-

ﬁciency analysis.

Hyper-Parameter Evaluation. To demonstrate the effectiveness of the FaRS approach, we first apply it to a real-life dataset to determine the optimal values of two parameters, namely λ and Γ. We implement our algorithm on the Dept-3 dataset and evaluate its accuracy using different parameter values. To assess accuracy, we generate a role similarity score matrix for the graph and utilize the k-means clustering method.

Figure 5 shows the accuracy of the FaRS algorithm on the Dept-3 dataset for various λ values. The y-axis represents the accuracy ratio, calculated from the number of nodes shared by two sets. One set consists of the top 20 nodes with the highest role similarity scores to the query node, while the other set consists of the nodes in the same k-means group as the query node. We selected six different λ values (λ = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]) and six query nodes (Q = [83, 84, 85, 86, 87, 88]). Figure 5 shows that λ = [0.6, 0.7] yield relatively high accuracy ratios, particularly for query 87. The line chart in Figure 5 represents the average accuracy ratio for different λ values. It can be observed that the FaRS algorithm achieves the highest accuracy ratio when λ = 0.7. Similar results were obtained for other datasets, confirming the selection of λ = 0.7 for subsequent experiments.

Figure 5: Hyper-Parameter Evaluation (λ).

Figure 6: Hyper-Parameter Evaluation (Γ).

Similarly, Figure 6 presents the accuracy of the

FaRS algorithm on the Dept-3 dataset for different Γ

values. We selected ﬁve Γ values (Γ = [2, 3, 4, 5, 6])

and query nodes Q = [70, 71, 72, 73, 74]. The accu-

racy for each Γ value is the average of ﬁve trials due

to the varying k-means clustering groups. The bar

chart in Figure 6 displays the accuracy of different

queries for each Γ value, while the line chart shows

the average accuracy ratio for the ﬁve queries. The al-

gorithm achieves improved accuracy as Γ increases,

with a signiﬁcant enhancement at Γ = 3. Beyond

Γ = 3, the accuracy improvement becomes less pro-

nounced. Hence, the optimal performance is achieved

with Γ = 3, providing high accuracy with minimal

time consumption.

Accuracy Evaluation. The accuracy of the algo-

rithms is assessed using two methods: k-means clus-

tering and evaluation based on the network’s char-

acteristics. We evaluate the performance on the EU

dataset, which represents the communication network

within an organization. The dataset includes four

departments: Department 1 (Dept-1), Department 2

(Dept-2), Department 3 (Dept-3), and Department 4

(Dept-4), and the remaining employees. Given the

network structure, employees within the same depart-

ment exhibit higher role similarity compared to those

from different departments. Consequently, we divide

the EU dataset into ﬁve partitions based on the depart-

ments and the remaining employees.

To evaluate the accuracy of the algorithms, we

randomly select a node from each part of the dataset

as a query and test the number of nodes in the corre-

sponding department among the top 20 similar nodes

to the query in different algorithms. This approach

justiﬁes our choice of datasets for evaluating the al-

gorithms.

We evaluate the algorithms using k-means clus-

tering on the mail exchange network of each depart-

ment individually. With a partition number of k = 6

and a list of 20 ordered nodes for each query, we as-

sess the accuracy ratios of the different algorithms

on the four datasets. Figure 7a illustrates the results.

We observe that the CSR algorithm exhibits relatively

low accuracy in role similarity search on each dataset.

The RoleSim algorithm achieves higher accuracy than

CSR but falls short of the FaRS algorithm. Notably,

the FaRS algorithm consistently outperforms all other

algorithms in role similarity detection on the four

datasets. The FaRS N algorithm demonstrates bet-

ter accuracy than RoleSim and CSR, but it does not

match the performance of the FaRS algorithm.

Next, we evaluate the accuracy of our algorithms

and other baselines on the EU dataset. Four nodes

are randomly selected from each department to form

the query set, with each query corresponding to its re-

spective department. The top 20 nodes are ranked,

and we assess the highest role similarity scores of

each algorithm with respect to the query. The number

of nodes belonging to the query’s department among

the top 20 nodes is determined to measure the al-

gorithm’s accuracy. Figure 7b presents the results.

The FaRS algorithm consistently achieves a high

level of accuracy, followed by the RoleSim algorithm.

Conversely, the CSR algorithm consistently exhibits

lower accuracy in role similarity search. Consider-

ing the ﬁndings from Figure 7a and Figure 7b, we

can conclude that the FaRS algorithm outperforms the

well-known algorithms, CSR and RoleSim, on the ﬁve

real-life datasets.

Figure 7: Accuracy Evaluation of FaRS & Opt FaRS. (a) FaRS (Dept Dataset). (b) FaRS (EU Dataset). (c) Opt FaRS.

Figure 8: Efficiency. (a) Time Efficiency.

Lastly, we assess the accuracy of the Opt FaRS algorithm, the accelerated method based on FaRS, on the EU dataset. To compare its accuracy with that of FaRS, we focus on the Dept-4 dataset. Random query sets are selected, varying in size from 10 to 30. For each query set Q, we measure the similarity ranking results using Normalized Discounted Cumulative Gain (NDCG) (Wang et al., 2013) based on the role similarity scores obtained from Opt FaRS. An NDCG score of 1 indicates that the results of the compared algorithm perfectly match those of FaRS, without any loss in accuracy. Figure 7c illustrates the NDCGs of Opt FaRS for each query set Q, all of which are 1. This implies that Opt FaRS achieves the same level of accuracy as FaRS, affirming the correctness of Lemma 6.3.
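
The NDCG comparison can be sketched as below. The text does not state which NDCG variant was used, so this sketch assumes one common form (linear gain with a log2 discount), takes the gains from the reference scores, and takes the ranking order from the candidate scores. The function names and toy scores are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(reference, candidate, top_n=None):
    """NDCG of the candidate ranking, judged by the reference scores.

    reference, candidate : dicts mapping node -> score over the same node set
    """
    order = sorted(candidate, key=lambda v: candidate[v], reverse=True)
    ideal = sorted(reference.values(), reverse=True)
    if top_n is not None:
        order, ideal = order[:top_n], ideal[:top_n]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg([reference[v] for v in order]) / ideal_dcg


ref = {"a": 0.9, "b": 0.5, "c": 0.1}          # e.g. scores from the exact method
same = dict(ref)                               # identical ranking
swapped = {"a": 0.1, "b": 0.5, "c": 0.9}       # reversed ranking
print(ndcg(ref, same), ndcg(ref, swapped))
```

An identical ranking gives an NDCG of 1, while the reversed ranking scores lower; a value of 1 for every query set is therefore what one would expect if the accelerated method reproduces the ranking of the exact one.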

Efficiency Analysis. Figure 8a shows the efficiency of our proposed algorithms and the baseline approaches across the five real networks. The elapsed time is measured for computing the single-source role similarity scores for each query, where |Q| = 20 queries are randomly selected for each dataset. Our proposed FaRS and the RoleSim algorithm exhibit comparable efficiency on all five datasets. While the CSR algorithm incurs relatively lower time costs, its role classification accuracy is suboptimal. Notably, the Opt FaRS algorithm demonstrates significant efficiency gains, outperforming both the FaRS and RoleSim algorithms across all five real-life datasets.

8 CONCLUSION

This paper presents a precise algorithm for single-

source role similarity search, namely FaRS, which

is based on graph topology. The FaRS algorithm

is capable of capturing more information from the

node-pair in-neighbour role similarity scores matrix

than the RoleSim algorithm, thus ensuring greater

accuracy. Additionally, we provide proofs for the

convergence, uniqueness, symmetry, boundedness,

and triangular inequality of the FaRS algorithm.

Furthermore, we propose an accelerated algorithm,

Opt FaRS, based on FaRS to enable more efﬁcient

computation. Lastly, we evaluate our algorithms and

compare them with baseline algorithms using ﬁve real

datasets. The experimental results indicate that the FaRS algorithm yields a more precise role similarity value than the baseline algorithms. Furthermore, the Opt FaRS algorithm significantly enhances the computation speed of the FaRS algorithm without compromising its accuracy.

REFERENCES

Arthur, D. and Vassilvitskii, S. (2006). k-means++: The

advantages of careful seeding. Technical report, Stan-

ford.

Bock, H.-H. (2007). Clustering methods: a history of

k-means algorithms. Selected contributions in data

analysis and classiﬁcation, pages 161–172.

Chen, X., Lai, L., Qin, L., and Lin, X. (2020). Structsim:

Querying structural node similarity at billion scale.

In 2020 IEEE 36th International Conference on Data

Engineering (ICDE), pages 1950–1953. IEEE.

Chen, X., Lai, L., Qin, L., and Lin, X. (2021). Efﬁcient

structural node similarity computation on billion-scale

graphs. The VLDB Journal, 30(3):471–493.

Diao, L., Wang, H., Alsarra, S., Yen, I.-L., and Bastani, F.

(2019). A smart role mapping recommendation sys-

tem. In 2019 IEEE 43rd Annual Computer Software

and Applications Conference (COMPSAC), volume 2,

pages 135–140. IEEE.

Everett, M. G. (1985). Role similarity and complexity in

social networks. Social Networks, 7(4):353–359.

Gabow, H. N., Kaplan, H., and Tarjan, R. E. (2001). Unique

maximum matching algorithms. Journal of Algo-

rithms, 40(2):159–183.

Lee, V. E. (2012). RoleSim and RoleMatch: Role-based

similarity and graph matching. Kent State University.

Li, L., Qian, L., Lee, V. E., Leng, M., Chen, M., and Chen,

X. (2015). Fast and accurate computation of role

similarity via vertex centrality. In International Con-

ference on Web-Age Information Management, pages

123–134. Springer.

Lloyd, S. (1982). Least squares quantization in pcm. IEEE

transactions on information theory, 28(2):129–137.

COMPLEXIS 2024 - 9th International Conference on Complexity, Future Information Systems and Risk

Murty, K. G. (1968). An algorithm for ranking all the as-

signments in order of increasing cost. Operations re-

search, 16(3):682–687.

Rao, P. N., Devi, T., Kaladhar, D., Sridhar, G., and Rao,

A. A. (2009). A probabilistic neural network approach

for protein superfamily classiﬁcation. Journal of The-

oretical & Applied Information Technology, 6(1).

Rothe, S. and Schütze, H. (2014). Cosimrank: A flexible & efficient graph-theoretic similarity measure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1392–1402.

Shahabi, C., Banaei-Kashani, F., Chen, Y.-S., and McLeod,

D. (2001). Yoda: An accurate and scalable web-based

recommendation system. In International Conference

on Cooperative Information Systems, pages 418–432.

Springer.

Shao, Y., Liu, J., Shi, S., Zhang, Y., and Cui, B. (2019). Fast

de-anonymization of social networks with structural

information. Data Science and Engineering, 4(1):76–

92.

Wang, Y., Lian, X., and Chen, L. (2018). Efﬁcient sim-

rank tracking in dynamic graphs. In 2018 IEEE 34th

international conference on data engineering (ICDE),

pages 545–556. IEEE.

Wang, Y., Wang, L., Li, Y., He, D., and Liu, T.-Y. (2013). A

theoretical analysis of ndcg type ranking measures. In

Conference on learning theory, pages 25–54. PMLR.

Yang, R. (2022). Efﬁcient and effective similarity search

over bipartite graphs. In Proceedings of the ACM Web

Conference 2022, pages 308–318.
