Efﬁcient Subgraph Indexing for Biochemical Graphs

∗

Keywords:

Subgraph Indexing, Transaction Graph Database, Radix Tree, Path Compression.

Abstract:

The dynamic nature of graph-structured data demands fast subgraph query processing to solve real-world

problems such as identifying spammers in social networks, fraud detection in the ﬁnancial system, and ﬁnding

motifs in biological networks. The need for an efﬁcient subgraph search has motivated the study for ﬁltering

the candidate graphs using the ﬁlter-then-verify framework with minimal indexing size. This paper presents

an efﬁcient in-memory index structure for indexing the paths in the transaction graph database. Our radix

tree-based index structure addresses the issue of high memory consumption related to trie for representing

biochemical datasets. Furthermore, we also contrast various containers used in the radix nodes. We demon-

strate empirically the beneﬁts of compressing the common preﬁxes in the path by achieving 20% reduction in

the indexing size than a trie-based implementation.

1 INTRODUCTION

Graphs are a natural representation of datasets, widely

adopted in various areas such as mining chemical

structure information for drug discovery, prediction

of trafﬁc accidents, and recommendation of prod-

ucts on e-commerce websites. Over the decades, the

vast availability of interconnected graph datasets and

the growing popularity of graph databases has moti-

vated the study of graph indexing to speed up query

processing. Other motivation came from improving

data structures to efﬁciently store graph data for sub-

graph processing (or subgraph search) and algorithms

to solve subgraph isomorphism (or subgraph match-

ing) for large-sized and numerous numbers of small

graphs, respectively. Subgraph searching and sub-

graph isomorphism problems are some of the classic

NP-hard problems in graphs. To overcome the chal-

lenges of slow construction time and large space con-

sumption of subgraph indexes, our main contributions

are listed as follows:

1. The earlier works focused mostly on solving the

problem of subgraph query processing without

concerning themselves with space-efﬁcient index-

ing techniques, while the number of data graphs in

the database increases. In contrast to solely solv-

ing subgraph queries, we design a compressed

https://orcid.org/0000-0002-8921-398X

https://orcid.org/0000-0003-3515-9209

∗

Supported by a DAAD PhD scholarship.

sufﬁx tree indexing structure to store paths of the

graphs.

2. We propose an index structure in which each

node contains graph occurrence information. The

graph information comprised of the data graph

identiﬁer as the key and the number of occur-

rences of the path in the data graph as a value.

These keys are compressed using path compres-

sion optimization.

3. We propose a detailed comparative study of the

different index structure used for representing the

sufﬁx nodes in the compact trie data structure.

The paper is organized as follows. The section 2 pro-

vides the background on the subgraph indexing. The

section 3 brieﬂy deﬁnes some terminologies associ-

ated with graphs. Section 4 describes the compressed

sufﬁx tree data structure for graph. The section 5

discusses the experimental setup and the analysis of

the results. Finally, the section 6 provides the con-

clusion while, the section 7 discusses regarding the

challenges and objective for future works.

2 RELATED WORK

In the subgraph query processing, the main goal is to

ﬁnd all the data graphs that contain the query graph

q. Generally, subgraph pattern matching (Katsarou,

2018; Giugno et al., 2013) is a straightforward ap-

proach to ﬁnding all the subgraphs in a graph, that are

Chimi Wangmo

and Lena Wiese

Goethe University Frankfurt, Institute of Computer Science, Robert-Mayer-Str. 10, 60629

Frankfurt am Main, Germany

Wangmo, C. and Wiese, L.

Efﬁcient Subgraph Indexing for Biochemical Graphs.

DOI: 10.5220/0011350100003269

In Proceedings of the 11th International Conference on Data Science, Technology and Applications (DATA 2022), pages 533-540

ISBN: 978-989-758-583-8; ISSN: 2184-285X

533

isomorphic to the query graph q. One approach for

subgraph searching could be to exhaustively perform

subgraph matching for all the graphs g

in the graph

database G and verify using subgraph isomorphism

between q and g

, where g

∈ G (as deﬁned below).

Such a naive method is however computationally ex-

pensive.

The recent heuristics methods developed to solve the

subgraph isomorphism (or subgraph matching) prob-

lem have shown signiﬁcant performance improve-

ment. Most existing algorithms for subgraph isomor-

phism are based on the idea of backtracking, where

the query vertices are matched to vertices in the data

graph incrementally (Sun et al., 2022a; Sun et al.,

2022b; Kim et al., 2021; Min et al., 2021; Han et al.,

2019). The general framework for the subgraph iso-

morphism consists of two tasks: ﬁltering and match-

ing. In the ﬁltering phase, the number of vertices in

the data graph mapped to the vertices in the query

graph is reduced, thereby producing the potential can-

didate vertex sets. In the matching phase, the back-

tracking approach is performed by recursively extend-

ing each vertex in the candidate sets and then check-

ing for the subgraph isomorphism of the query graph.

Nevertheless, the subgraph isomorphism approaches

are generally evaluated for small-sized query graphs

and only a single data graph.

On the other hand, some approaches answer the sub-

graph query by typically utilizing the two phase ﬁlter-

then-verify (FTV) framework to build the subgraph

indices (Licheri et al., 2021; Luaces et al., 2021). The

ﬁrst phase, index construction phase, involves enu-

meration of the graph patterns (such as paths or trees

or cycles, etc.) either through mining frequent pat-

terns or exhaustive enumeration. This is then fol-

lowed by query processing phase, which includes

ﬁltering out the data graphs in the graph database

that does not contain query graph, thereby generat-

ing pruned candidate graphs. Finally, in the veriﬁ-

cation step, the subgraph matching is performed on

the candidate graphs. Therefore, the pruning power

of the indexes reduces the number of subgraph iso-

morphic veriﬁcation to the less number of candidate

graphs. However, the works on subgraph indexing

are mainly targeted towards setting where there are

many numbers of small data graphs. Further, the algo-

rithms those uses mining-based approach to generate

frequent patterns and then index them produces sta-

ble indexing structure but consumes longer construc-

tion time. While the methods that perform exhaustive

enumeration of small patterns, require more memory

space to store all the permutations of patterns in the

data graphs. Our approach is developed to provide

a compact sufﬁx tree representation for subgraph in-

dexing.

3 PRELIMINARIES AND

PROBLEM DEFINITION

In this paper, we consider undirected, connected,

vertex-labelled graphs in transaction graph database.

The transaction graph database is a set of large num-

ber of small graphs (or connected components), called

data graphs, G = {g

, g

, ..., g

}. Each graph g

is deﬁned as a triplet g = {V

, E

, L

}, composed of

three elements where V

is the set of vertices, E

the set of edges between vertices in the graph, and L

is the set of labels associated with the vertices in the

graph. More precisely, L

is a mapping function that

maps vertex to a label in σ, where σ is the distinct la-

bel list. Subgraphs correspond to a subset of nodes of

the original graphs. A graph h is said to be induced

subgraph of a data graph g if the vertices in h is the

subset of g, V

⊆ V

, and the corresponding edge set

consists of all the edges in E

that have both the end-

points in V

. The label of the vertices in h is the same

as the label of the corresponding vertex in g.

Deﬁnition 1 (Subgraph isomorphism (Kim et al.,

2021)). Given a query graph q and data graph g, an

embedding of q in g is a mapping M : V

→ V

such

that: (1) M is injective (i.e. M(u) ̸= M(u

′

) for every

u ̸= u

′

∈ V

, (2) L

(u) = L

(M(u)) for every u ∈ V

(3)

(M(u), M(u

′

)) ∈ E

for every (u, u

′

) ∈ E

. q is said to

be isomorphic to g, denoted by q ⊆ g if there exists an

embedding of q in G.

Our proposed indexing structure aims to speed

up the search for subgraphs while maintaining small-

sized indexes.

Deﬁnition 2 (Subgraph searching). For a given trans-

action graph database with n data graphs G =

, g

, ..., g

} and a query graph q, the subgraph

search is to ﬁnd all the graphs g

in the database that

contain q.

4 OUR COMPRESSED TRIE

We now derive our construction of a sufﬁx tree that

supports the two above-mentioned inner node struc-

tures.

4.1 Compressed Sufﬁx Tree

Construction

In this section, we discuss the representation and the

process of building the compressed sufﬁx tree to in-

dex the paths in the graph for obtaining the candidate

graphs. Inspired by the idea of text compression, our

method implements a radix tree like index structure

DATA 2022 - 11th International Conference on Data Science, Technology and Applications

534

A B

Query graph Q

Verification engine

Data graphs G

, G

Search for the paths in Q

A B

Query answer

Feature

extraction

Build indexes

Compress index

Indexing engine

Filtering

candidate graphs

Processing engine

Feature

extraction

Build indexes

Compress index

Return data graph id

Query workloadGraph database

Figure 1: System architecture.

for multiple paths in the data graphs. The repetition

of labels in the graph path is similar to the text rep-

etition. The general observation is that the existing

trie-based index structure for labelled paths consists

of the repeating sufﬁx nodes in multiple paths and

sufﬁx node’s graph identiﬁer and occurrences infor-

mation. As a result, our compressed sufﬁx tree pro-

vides two-fold compression. The ﬁrst one is in terms

of compressing the sufﬁx tree by merging the com-

mon sufﬁx nodes into one. The second one offers a

compact representation of the graph identiﬁer and oc-

currences information in the sufﬁx nodes. Our com-

pressed sufﬁx tree will help to compress the Trie in

terms of a common preﬁx, common labels in the sin-

gle path, and the single labelled path to one node.

4.2 Index Building Phase

The construction of the radix tree involves the inser-

tion of the extracted paths during depth-ﬁrst search

(DFS) traversals in an unsorted and incremental man-

ner. We execute a DFS up to a certain maximum

length l

starting from each vertex V

= (v

, v

, ..., v

)

in each data graph g

. Thus, all the subsequent pre-

ﬁx and sufﬁx paths are generated and indexed. Dur-

ing each path traversal in the graph, the corresponding

radix node is built, containing vertex labels as a par-

tial key and a corresponding list of graph identiﬁers

and occurrences as a value. Formally, path indexing

can be deﬁned as follows:

Deﬁnition 3 (Path indexing). Path indexing is the

process of building indexes of the labelled paths be-

ing extracted by traversing each vertex {v

, v

, ..., v

}

in each data graph {g

, g

, ..., g

} using depth-ﬁrst

search up to maximum length l

4.2.1 Structure of Compressed Sufﬁx Tree

Generally, the issue with Trie-based data structures is

the space overhead due to the large number of point-

ers associated with each inner node which are often

found to be empty. On the other hand, the radix tree

utilized fewer nodes as opposed to trie by providing a

mechanism to merge common sufﬁx and preﬁx partial

keys. The radix tree structure consists of a rooted tree

starting with an empty node.

Our method includes storing all the possible sufﬁxes

generated during depth-ﬁrst search traversal of each

data graph in the radix tree. Normally, the sufﬁxes

generated are often redundant, and the labels are al-

ready indexed, hence only the occurrences informa-

tion and graph identiﬁer are updated. Unlike the

GRAPES (Giugno et al., 2013) method, our method

implements path compression and lazy expansion.

In our implementation, each DFS generated path

inserts a new node and upon encountering common

preﬁx paths, only the graph identiﬁer and occurrences

information are updated. Further, if a new node is a

single child of a node, then the child is merged with

the parent node.

Additionally, the choice of data structure used for

the inner node should maintain the trade-off be-

tween faster retrieval and space optimization. The

typical representations are based on arrays, linked

lists, binary search tree, and hashmaps. In contrast

to GRAPES which represents inner nodes using a

linked list only, our implementation includes for both

hashmap-based radix tree and linkedlist-based radix

tree in order to enable a comparison in a uniform

environment. We now specify the basic data struc-

tures. Hashmap-based radix tree is the simplest form

Efﬁcient Subgraph Indexing for Biochemical Graphs

535

Indexing engine

2 3

Data graphs

3 4

CountGraph

Trie

Offline- Index construction

CountGraph

3 1

Compact trie

23 3

2 43

1234

12 21

123

4 2

2 1 3

4 2

2 1

2 43 234 432 123 2

1 3 1 2

2 4

42 1

123

1 2

Figure 2: Index construction.

of node providing linear search time as well as com-

pact storage space. However, it does not support

predecessor-successor relationship.

Deﬁnition 4 (Indexed node of the Hashmap-based

radix tree). An indexed node of the hashmap-based

radix tree γ

(HRTree) is a quadruplet ⟨λ, C, G

, l⟩,

which is node γ

of a hashmap-based radix tree

HRTree (see Deﬁnition 5). The element λ is

a node identiﬁer, which consists of ordered se-

quence of integers λ = {λ

, λ

, ..., λ

}, where the

length of the node identiﬁer belonging to an in-

dexed node in HRTree is length(γ

(HRTree). The

term “node identiﬁer” is used in conjunction to

an indexed node of the hashmap-based radix tree,

which should not be confused with the term “la-

bel” for a vertex in a graph. The element C is

the set of children keys C = {c

, c

, ..., c

}, where

= {λ

∈ λ(child

(γ

(HRTree))) | 1 ≤ k ≤ m}.

child

(γ

(HRTree)) is the k

child of the current in-

dexed node γ

(HRTree) inserted at the k

position.

Each child key is mapped to the children of the in-

dexed node M : c

→ child

(γ

(HRTree)), where 1 ≤

k ≤ m. The element G

is the graph information con-

sisting of graph identiﬁers and corresponding occur-

rences count. The element l has a boolean value as-

sociated with it; with 1 means that the indexed node

is a leaf whereas 0 means it is not.

Deﬁnition 5 (Hashmap-based radix tree). The

hashmap-based radix tree HRTree is a triplet

⟨R, I , L ⟩.

1. HRTree is rooted with an indexed node called as

a root node, denoted by R. Formally, R is deﬁned

as,

R = {γ

(HRTree) | λ

(R) = −1} (1)

DATA 2022 - 11th International Conference on Data Science, Technology and Applications

536

children:

: {1,1,1,1}

12343

children:

Root

(a)

children:

: {1}

children:

: {1,2}

children:

234

: {1}

: {1,1}

: {1}

: {2,2}

: {1}

: {2,1,1}

: {2}

children:

432

: {1,1,1}

(b)

Figure 3: (a) Initial insertion of the labelled path (b) Indexing of the labelled paths originated from vertex v

2. The element I is an indexed node referred to as

an inner node. Formally, I is deﬁned as,

I = {γ

(HRTree) | I = ¬R ∧ ¬L } (2)

3. The element L is an indexed node referred to as a

leaf node containing the value of the boolean leaf

as 1. Formally, L is deﬁned as,

L = {γ

(HRTree) | l = 1} (3)

LinkedList-based radix tree is a dynamic data

structure composed of node label, run-time created

dynamic children list, and next sibling pointer. We

implemented two options, namely Hashmap-based

radix tree HRTree, and Linkedlist-based radix tree

LRTree with the path compression and lazy expansion

optimization. The data structures are implemented

in C++, and the current implementation supports the

vector of integers for the labelled node. The maxi-

mum size of each node in the radix tree is dependent

on the type of data structure utilized for the node rep-

resentation and whether the node stores common pre-

ﬁxes.

4.2.2 HashMap-based Radix Tree

Deﬁnition 6 (Labelled path). A labelled path LP is

an ordered sequence of integers LP = (σ

, σ

, ..., σ

1), where each vertex identiﬁer v

∈ V

is mapped to

exactly one label σ

∈ L ∀ 1 ≤ i ≤ k +1. Formally, LP

is deﬁned as,

LP = {σ

, σ

, ..., σ

+ 1 | M : v

→ σ

, σ

∈ L,

∀1 ≤ i ≤ k + 1} (4)

where, k is the maximum length of the labelled path.

Deﬁnition 7 (Indexed path). An indexed path IP is

an ordered sequence of integers in node identiﬁers λ

associated with each indexed node γ

from a root node

R to a leaf node L . Formally, IP is deﬁned as,

IP = {λ(γ

), λ(γ

), ..., λ(γ

)|1 ≤ n ≤ N} (5)

p refers to as the length of the indexed path, which

is the total number of integers in the node identiﬁer

λ for each indexed nodes γ

in the indexed path IP.

Formally, p is deﬁned as,

p =

∑

i=1

length(γ

) (6)

Example 1. In the Figure 2, the depth-ﬁrst search tra-

versed id-path {v

, v

} of g

is represented

as a labelled path LP = {1, 2, 3, 4, 3} obtained from

graph translate to one of the indexed path IP in the

index structure. Here, each value in {1, 2, 3, 4, 3} is

the label mapped to the corresponding vertex identi-

ﬁer {v

, v

Example 2. Assuming that the aforementioned la-

belled path LP has not been indexed, the Figure 3a

shows the labelled path LP = {1, 2, 3, 4, 3} will be in-

serted as a child to the root node child

(R) with 1 as

a key c

. The value will be the new node γ

(HRTree)

containing {1, 2, 3, 4, 3} as a node identiﬁer λ.

Example 3. Given the labelled path LP in Figure

3a is traversed from g

. The indexed node with the

node identiﬁer λ = {1, 2, 3, 4, 3} will have graph oc-

currence information G

. G

is represented using un-

ordered map which contains g

as the key, denoting

Efﬁcient Subgraph Indexing for Biochemical Graphs

537

the ﬁrst data graph in the transaction database. Fur-

ther, g

is mapped with {1, 1, 1, 1} as the occurrences

count associated with each labelled sub path. Here,

the ﬁrst value in {1, 1, 1, 1} implies that occurrences

count associated with 1 sub-path, the second value

1 implies the occurrences count associated with 1, 2

sub-path, the third value implies 1 the occurrences

count associated with 1, 2, 3 sub-path.

Deﬁnition 8 (Key containment). Given a new la-

belled path LP

new

and a root node R, the key con-

tainment is true, if the ﬁrst integer of the labelled path

is contained in the children key of the root node. For-

mally, Key(LP

new

) ∈ R is deﬁned as,

Key(LP

new

) ∈ R =

(

1, if σ

(LP

new

) ∈ C(R)

0, otherwise.

(7)

Deﬁnition 9 (Full path containment). Given a new

labelled path LP

new

and an indexed path IP in the

hashmap-based radix tree HRTree, the full path con-

tainment Full(LP

new

) ∈ HRTree is true, if there ex-

ist one-to-one ordered, mapping from each integer in

LPnew to each integer in IP. Formally, Full(LP

new

) ∈

HRTree is deﬁned as,

Full(LP

new

) ∈ HRTree = 1, M : σ

(LP

new

) →

(IP) | 1 ≤ i ≤ j} (8)

Deﬁnition 10 (Longest common preﬁx). Given a

newly traversed labelled path LP

new

and an indexed

node λ(γ

), the longest common preﬁx LCP(LP

new

, γ

)

is a sub-path from 0

to min length of LP

new

that maps

to λ(γ

). Formally, LCP(LP

new

, γ

) is deﬁned as,

LCP(LP

new

, γ

) = M : σ

(LP

new

) → λ

(γ

∀ 0 ≤ i ≤ min. (9)

Deﬁnition 11 (Preﬁx of the labelled path). Given a

newly traversed labelled path LP

new

and a speciﬁed

position i, the preﬁx of the labelled path Pre(LP

new

, i)

is a sub-path that starts from 0

to the i

position of

the the labelled path. Formally, the Pre(LP

new

, i) is

deﬁned as,

Pre(LP

new

, i) = {LP

new

[0, .., i]|i = 1,if i = NULL}

(10)

Deﬁnition 12 (Sufﬁx of the labelled path). Given a

newly traversed labelled path LP

new

and indexed node

, the sufﬁx of the labelled path Su f (LP

new

, γ

) is a

sub-path that starts from j

to k

position of the la-

belled path LP

new

. Formally, the Su f (LP

new

, γ

) is de-

ﬁned as,

Su f (LP

new

, γ

) = {LP

new

[ j, ., k] | j = |LCP(LP

new

, γ

)|}

(11)

The insertion process involves performing hop

constrained depth-ﬁrst search traversal and the addi-

tion of the LP as a node γ

to the hashmap-based radix

tree HRTree. Typically, there are three scenarios con-

cerning the insertion of children nodes:

1. If the newly traversed labelled path LP

new

and root

node R(HRTree) is not able to satisfy the key

containment, create new node with LP

new

as the

node identiﬁer λ. The child of the root node is rep-

resented using an unordered map container. The

child will contain the ﬁrst character λ

∈ λ as key

and the new child node as the value. Moreover,

we will update the G

by incrementing the occur-

rences count of the labelled path LP in the g

2. If the newly traversed path LP

new

is fully

contained in the hashmap-based radix tree

Full(LP

new

) ∈ HRTree, then only the g

∈ G

as-

sociated with indexed node γ

(LRTree) is incre-

mented. However, if the length of the newly tra-

versed path k(LP

new

) is longer than the lenght of

the indexed path p(IP). In such scenario, we

would have to increment the occurrences count

associated with g

∈ G

for the longest common

preﬁx between the λ

, λ

, ..., λ

. In addition, we

would perform recursive insertion with the suf-

ﬁx of the newly traversed path Su f (LP

new

, γ

) as

a new child’s node identiﬁer λ.

3. If a preﬁx of the newly traversed path is

in the indexed node of the hashmap-based

radix tree Pre(LP

new

, i) ∈ γ

(HRTree), but

only partially shares a certain common pre-

ﬁx LCP(LP

new

, γ

(HRTree)) = {σ

, σ

, ..., σ

}

to the label λ of an existing indexed node

(HRTree). This causes an splitting of the

indexed node into two nodes: new node γ

new

and the existing indexed node γ

(HRTree).

The new node γ

new

contains the common

preﬁx λ(γ

new

) = LCP(LP

new

, γ

(HRTree)). Fur-

ther, the existing indexed node γ

(HRTree)

is updated as the child to the new node

λ(γ

(HRTree))child

k+1

(γ

new

) = γ

(HRTree).

In addition, the node identiﬁer of an ex-

isting indexed node is revised to its sufﬁx

Su f (LP

new

, γ

(HRTree)). Similarly, the graph

information G

is updated by incrementing the

occurrences count for the common preﬁx label of

the newly indexed node γ

new

in g

The lookup operation is similar to how insertion oc-

curs.

4.2.3 Linkedlist-based Radix Tree

In case of a linkedlist-based radix tree LRTree, each

indexed node γ

(LRTree) has a node identiﬁer λ rep-

DATA 2022 - 11th International Conference on Data Science, Technology and Applications

538

resented using the vector of integers. The node identi-

ﬁer of the indexed node λ(γ

(LRTree)) corresponds to

the traversed labelled path LP

new

. Further, the pointer

to a child node is stored in the indexed node. In addi-

tion, pointer to a next sibling node, if any, is as well

stored in the indexed node. Most importantly, graph

occurrences information G

associated with the in-

dexed node γ

(LRTree) is inserted. The insertion pro-

cess is similar to the implementation for the hashmap-

based radix tree HRTree. Initially, the linkedList-

based radix tree LRTree contains only the root node

R with node identiﬁer as -1. The children C and sib-

ling S pointers of the root node will contain null val-

ues.

5 PERFORMANCE EVALUATION

This section shows the experimental results to eval-

uate the effectiveness of our indexing structure on

datasets of increasing size. The experiment is run on

a machine with Seven Intel Core 1185G7@ 3.00 GHz

1.80 GHz CPUs and 32 GB of RAM running Win-

dows operating system.

5.1 Datasets

The input dataset is a text ﬁle that contains V

and

related to each g

∈ G. The ﬁrst line starts with

the character ‘t’, which denotes a transaction as a

data graph. The character ‘t’ is followed by integer i,

which represents the graph identiﬁer g

. The follow-

ing lines which begin with the character ‘v’ indicate

the vertices. The letter ‘v’ is followed by a vertex

identiﬁer v

∈ V

and its corresponding label l

∈ L.

Additionally, the next lines which are preceded by ‘e’

indicate edges ∈ E

, which contain the source ver-

tex identiﬁer v

∈ V

and the destination vertex iden-

tiﬁer v

∈ V

. We ran our experiments on the two

datasets obtained from (Sun and Luo, 2019). The ﬁrst

dataset, AIDS, comprises of 40,000 small graphs with

62 unique labels. Each graph contains on average 45

vertices and 46.95 edges. The second dataset, PDBS,

has 600 data graphs that represent DNA, RNA, and

proteins. Each data graph has an average of 2939 ver-

tices with an average degree of 2.06 per vertex, and

3,064 edges. Besides that, the graph consists of 10

unique labels.

5.2 Performance Measurement

We have measured the index construction time as well

as the index size for three cases: an uncompressed

index as well as our linkedlist- and hashmap-based

sufﬁx trees.

The barcharts in Figures 4 to 5 visualize the in-

dexing size and timing measurements. In the ﬁgures,

the Hashmap-based Radix tree and LinkedList-based

Radix tree are denoted as HRTree and LRTree respec-

tively. Moreover, for each dataset, we have generated

its subgroup as small, medium, and large, which con-

tains 20%, 50%, and 100% of the original data graphs.

In addition, each experiment is run four times and the

average result has been reported for the indexing size

and construction time.

Our experiments in Figure 4 and 5 show that the

choice of the data structure to represent the node im-

pacts the performance. Further, the optimization us-

ing path compression can overcome the drawbacks of

trie indexing structure, thereby reducing the memory

consumption.

In the AIDS dataset, the ﬁnding indicates the ben-

eﬁts of path compression. For the large subgroup of

the AIDS dataset, the hashmap-based radix tree occu-

pies approximately 20% and 12% less space in hard

disk than the trie and linkedlist-based radix tree re-

spectively. In particular, the construction time for the

linkedlist-based radix tree is the longest, followed by

an uncompressed trie.

In contrast to the noticeable impact of compres-

sion on various AIDS data subgroups, the index struc-

ture size has levelled out for the PDBS data set. This

can be because for the trie and compressed trie, as

soon as all the enumeration of paths up to a certain

maximum length has been indexed only the graph oc-

currences information is updated. A general trend is

that both the index size and construction time increase

as we add more data graphs.

6 CONCLUSIONS

Graph indexing is essential to enable efﬁcient query

processing for graph database. Hence, it is well stud-

ied and adopted in various ﬁelds ranging from bioin-

formatics to social network analysis. In this work we

presented a compact radix tree for indexing biochem-

ical datasets. We have demonstrated through bench-

marking the beneﬁts of path compression, over the

regular trie data structure to reduce the index size for

storing biochemical datasets.

7 FUTURE WORK

In future work, we consider the integration of query

processing to enable run time indexing of a query

Efﬁcient Subgraph Indexing for Biochemical Graphs

539

Small (P) Medium (P) Large (P) Small (A) Medium (A) Large (A)

100

150

200

6.8

7.1

7.6

110

119

Dataset type

Index size in KB

HRTree

LRTree

Uncompressed Trie

Figure 4: Average size of indexes (A: AIDS, P: PDBS).

Small (P) Medium (P) Large (P) Small (A) Medium (A) Large (A)

0.7

1.4

·10

1,793

2,964

10,002

1,613

2,075

4,298

1,954

5,293

10,431

1,870

4,781

9,844

1,821

4,258

13,004

1,340

4,649

9,272

Dataset type

Time in seconds

HRTree

LRTree

Uncompressed Trie

Figure 5: Average construction time of indexes (A: AIDS, P: PDBS).

graph and generation of candidate graphs with our in-

dex structure. On the technical side, it would be in-

teresting to implement the maintenance algorithm to

support an incremental update operation.

REFERENCES

Giugno, R., Bonnici, V., Bombieri, N., Pulvirenti, A., Ferro,

A., and Shasha, D. (2013). Grapes: A software

for parallel searching on biological graphs targeting

multi-core architectures. PloS one, 8(10):e76911.

Han, M., Kim, H., Gu, G., Park, K., and Han, W. (2019).

Efﬁcient subgraph matching: Harmonizing dynamic

programming, adaptive matching order, and failing set

together. In Boncz, P. A., Manegold, S., Ailamaki,

A., Deshpande, A., and Kraska, T., editors, Proceed-

ings of the 2019 International Conference on Man-

agement of Data, SIGMOD Conference 2019, Amster-

dam, The Netherlands, June 30 - July 5, 2019, pages

1429–1446, Amsterdam. ACM.

Katsarou, F. (2018). Improving the performance and scal-

ability of pattern subgraph queries. PhD thesis, Uni-

versity of Glasgow, UK.

Kim, H., Choi, Y., Park, K., Lin, X., Hong, S., and Han,

W. (2021). Versatile equivalences: Speeding up sub-

graph query processing and subgraph matching. In

Li, G., Li, Z., Idreos, S., and Srivastava, D., editors,

SIGMOD ’21: International Conference on Manage-

ment of Data, Virtual Event, China, June 20-25, 2021,

pages 925–937, China. ACM.

Licheri, N., Bonnici, V., Beccuti, M., and Giugno, R.

(2021). GRAPES-DD: exploiting decision diagrams

for index-driven search in biological graph databases.

BMC Bioinform., 22(1):209.

Luaces, D., Viqueira, J. R., Cotos, J. M., and Flores,

J. C. (2021). Efﬁcient access methods for very large

distributed graph databases. Information Sciences,

573:65–81.

Min, S., Park, S. G., Park, K., Giammarresi, D., Italiano,

G. F., and Han, W. (2021). Symmetric continuous sub-

graph matching with bidirectional dynamic program-

ming. Proc. VLDB Endow., 14(8):1298–1310.

Sun, S. and Luo, Q. (2019). Scaling up subgraph query

processing with efﬁcient subgraph matching. In 35th

IEEE International Conference on Data Engineering,

ICDE 2019, Macao, China, April 8-11, 2019, pages

220–231, China. IEEE.

Sun, X., Sun, S., Luo, Q., and He, B. (2022a). An in-

depth study of continuous subgraph matching (com-

plete version). CoRR, abs/2203.06913.

Sun, Y., Li, G., Du, J., Ning, B., and Chen, H. (2022b).

A subgraph matching algorithm based on subgraph

index for knowledge graph. Frontiers Comput. Sci.,

16(3):163606.

DATA 2022 - 11th International Conference on Data Science, Technology and Applications

540