Vertical and Horizontal Distances to Approximate Edit Distance for Rooted Labeled Caterpillars

Kohei Muraka, Takuya Yoshino and Kouichi Hirata

Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan

Keywords: Edit Distance, Rooted Labeled Caterpillar, Vertical Distance, Horizontal Distance, String Edit Distance, Multiset Edit Distance.

Abstract: A rooted labeled caterpillar (caterpillar, for short) is a rooted labeled tree that is transformed to a rooted path (called a backbone) by removing all of its leaves. The edit distance between caterpillars can be computed in quartic time. In this paper, we introduce two vertical distances and two horizontal distances for caterpillars. The former are based on a string edit distance between the string representations of the backbones and the latter on a multiset edit distance between the multisets of labels occurring in all the leaves. Then, we show that these distances give both a lower bound and an upper bound of the edit distance, and that we can compute the vertical distances in quadratic time and the horizontal distances in linear time under the unit cost function.

1 INTRODUCTION

Comparing tree-structured data such as HTML and XML data for web mining or RNA and glycan data for bioinformatics is one of the important tasks for data mining. The most famous distance measure between rooted labeled unordered trees (trees, for short) is the edit distance (Tai, 1979). The edit distance is formulated as the minimum cost of edit operations, consisting of a substitution, a deletion and an insertion, applied to transform a tree to another tree. Unfortunately, the problem of computing the edit distance between trees is MAX SNP-hard (Zhang and Jiang, 1994), even if the trees are binary or have height 2 (Akutsu et al., 2013; Hirata et al., 2011).

A caterpillar (cf. (Gallian, 2007)) is a tree trans-

formed to a rooted path after removing all the leaves

in it. Recently, Muraka et al. (Muraka et al., 2018)

have shown that we can compute the edit distance

between caterpillars in O(h

) time, where h is the

maximum height and λ is the maximum number of

leaves in caterpillars. Hence, the problem is tractable in quartic time with respect to the maximum number of nodes, which is still not efficient enough.

As efficient distances for comparing caterpillars, histogram distances such as a path histogram distance (Kawaguchi et al., 2018), a complete subtree histogram distance (Akutsu et al., 2013; Yoshino et al., 2018) and an LCA histogram distance (Yoshino et al., 2018) have been developed. Whereas these distances are metrics for caterpillars and we can compute them more efficiently (in linear or quadratic time) than the edit distance (in quartic time), they are incomparable with the edit distance both theoretically and experimentally.

In order to approximate the edit distance for caterpillars efficiently, in this paper, we introduce two vertical distances $d_v$ and $d_v^*$ based on a string edit distance and two horizontal distances $d_h$ and $d_h^*$ based on a multiset edit distance. Here, the multiset edit distance coincides with the well-known bag distance (Deza and Deza, 2016) if we adopt a unit cost function.

Let $C_1$ and $C_2$ be caterpillars. Then, $d_v(C_1,C_2)$ is the string edit distance between the string representations of the backbones of $C_1$ and $C_2$, and $d_v^*(C_1,C_2)$ is the sum of $d_v(C_1,C_2)$, the multiset edit distance between the multisets of labels occurring in the leaves of the endpoints of the backbones in $C_1$ and $C_2$, and the costs of deleting the remaining leaves in $C_1$ and inserting the remaining leaves in $C_2$. Also $d_h(C_1,C_2)$ is the multiset edit distance between the multisets of labels occurring in all the leaves of $C_1$ and $C_2$, and $d_h^*(C_1,C_2)$ is the sum of $d_h(C_1,C_2)$, the cost of the correspondence between the roots of $C_1$ and $C_2$, and the costs of deleting nodes in the backbone in $C_1$ and inserting nodes in the backbone in $C_2$.

Then, we show that these distances provide the following lower bound and upper bound of the edit distance $\tau_{\mathrm{TAI}}(C_1,C_2)$ between $C_1$ and $C_2$:

$$\max\{d_v(C_1,C_2),\, d_h(C_1,C_2)\} \le \tau_{\mathrm{TAI}}(C_1,C_2) \le \min\{d_v^*(C_1,C_2),\, d_h^*(C_1,C_2)\}.$$


Muraka, K., Yoshino, T. and Hirata, K.

Vertical and Horizontal Distances to Approximate Edit Distance for Rooted Labeled Caterpillars.

DOI: 10.5220/0007387205900597

In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pages 590-597

ISBN: 978-989-758-351-3

 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reser ved

Furthermore, if we adopt the unit cost function, then we can compute $d_v(C_1,C_2)$, $d_v^*(C_1,C_2)$, $d_h(C_1,C_2)$ and $d_h^*(C_1,C_2)$ in $O(h^2)$ time, $O(h^2 + \lambda)$ time, $O(\lambda)$ time and $O(\lambda + h)$ time, respectively. Hence, we can compute the vertical distances in quadratic time and the horizontal distances in linear time with respect to the number of nodes.

Finally, we give experimental results to evaluate

the running time and the approximation for caterpil-

lars in real data.

2 PRELIMINARIES

A tree T is a connected graph (V,E) without cycles,

where V is the set of vertices and E is the set of edges.

We denote V and E by V(T) and E(T). The size of

T is |V| and denoted by |T|. We sometimes denote $v \in V(T)$ by $v \in T$. We denote an empty tree $(\emptyset, \emptyset)$ by $\emptyset$. A rooted tree is a tree with one node r chosen as its

root. We denote the root of a rooted tree T by r(T).

Let T be a rooted tree such that $r = r(T)$ and $u, v, w \in T$. We denote the unique path from r to v, that is, the tree $(V', E')$ such that $V' = \{v_1, \ldots, v_k\}$, $v_1 = r$, $v_k = v$ and $(v_i, v_{i+1}) \in E'$ for every $i$ ($1 \le i \le k-1$), by $UP_r(v)$.

The parent of $v$ ($\ne r$), which we denote by $par(v)$, is its adjacent node on $UP_r(v)$, and the ancestors of $v$ ($\ne r$) are the nodes on $UP_r(v) - \{v\}$. We say that u is a child of v if v is the parent of u, and that u is a descendant of v if v is an ancestor of u. We denote the set of children of v by $ch(v)$, and we write $u \le v$ if v is an ancestor of u. We call a node with no children a leaf and denote the set of all the leaves in T by $lv(T)$.

A rooted path P is a rooted tree $(\{v_1, \ldots, v_n\}, \{(v_i, v_{i+1}) \mid 1 \le i \le n-1\})$ such that $r(P) = v_1$. We call the node $v_n$ (the leaf of P) an endpoint of P and denote it by $e(P)$.

The degree of v, denoted by $d(v)$, is the number of children of v, and the degree of T, denoted by $d(T)$, is $\max\{d(v) \mid v \in T\}$. The height of v, denoted by $h(v)$, is $\max\{|UP_v(w)| \mid w \in lv(T[v])\}$, where $T[v]$ is the complete subtree of T rooted at v, and the height of T, denoted by $h(T)$, is $\max\{h(v) \mid v \in T\}$.

We say that u is to the left of v in T if pre(u) ≤

pre(v) for the preorder number pre in T and post(u) ≤

post(v) for the postorder number post in T. We say

that a rooted tree is ordered if a left-to-right order

among siblings is given; unordered otherwise. We say

that a rooted tree is labeled if each node is assigned a

symbol from a ﬁxed ﬁnite alphabet Σ. For a node v,

we denote the label of v by l(v), and sometimes identify v with l(v). In this paper, we simply call a rooted labeled unordered tree a tree.

Definition 1 (Caterpillar (cf. (Gallian, 2007))). We say that a tree is a caterpillar if it is transformed to a rooted path after removing all the leaves in it. For a caterpillar C, we call the remaining rooted path the backbone of C and denote it by bb(C).

It is obvious that r(C) = r(bb(C)) and V(C) =

bb(C) ∪ lv(C) for a caterpillar C, that is, every node

in a caterpillar is either a leaf or an element of the

backbone.
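As an illustration of Definition 1, the following minimal Python sketch tests whether a rooted tree is a caterpillar and, if so, returns its backbone. The encoding (a dict `children` that maps each internal node to the list of its children) and the function name `backbone` are assumptions made for this example and are not notation from the paper; a single-node tree is treated here as having an empty backbone.

```python
def backbone(children, root):
    """Return the backbone (root to endpoint) as a list of nodes,
    or None if removing all leaves does not leave a rooted path.

    `children` maps each internal node to a list of its children;
    leaves are simply absent from the dict (assumed encoding).
    """
    path = []
    node = root
    while children.get(node):                 # node has at least one child
        path.append(node)
        internal = [c for c in children[node] if children.get(c)]
        if len(internal) > 1:                 # two internal children: not a path
            return None
        if not internal:                      # all children are leaves: endpoint
            break
        node = internal[0]
    return path


caterpillar = {"r": ["a", "x"], "a": ["b", "y"], "b": ["z", "w"]}
branching = {"r": ["a", "b"], "a": ["x"], "b": ["y"]}
print(backbone(caterpillar, "r"), backbone(branching, "r"))
```

The walk visits each backbone node once and inspects its children once, so it runs in time linear in the number of nodes.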

Next, we introduce a tree edit distance and a Tai

mapping.

Definition 2 (Edit operations (Tai, 1979)). The edit operations of a tree T are defined as follows (see Figure 1).

1. Substitution: Change the label of the node v in T.

2. Deletion: Delete a node v in T with parent v′, making the children of v become the children of v′. The children are inserted in the place of v as a subset of the children of v′. In particular, if v is the root in T, then the result of applying the deletion is a forest consisting of the children of the root.

3. Insertion: The complement of deletion. Insert a node v as a child of v′ in T, making v the parent of a subset of the children of v′.

Figure 1: Edit operations for trees: substitution $(v \mapsto w)$, deletion $(v \mapsto \varepsilon)$ and insertion $(\varepsilon \mapsto v)$.
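The deletion and insertion operations of Definition 2 can be sketched on a simple tree encoding. The two dicts `children` and `parent` and the function names below are hypothetical choices for this example; deleting the root (which yields a forest) is deliberately not handled.

```python
def delete_node(children, parent, v):
    """Delete non-root node v: its children take v's place under v's parent."""
    p = parent.pop(v)
    kids = children.pop(v, [])
    i = children[p].index(v)
    children[p][i:i + 1] = kids               # splice v's children where v was
    for k in kids:
        parent[k] = p


def insert_node(children, parent, p, v, adopted):
    """Insert v as a child of p, making v the parent of `adopted`,
    a subset of the current children of p."""
    children[p] = [c for c in children[p] if c not in adopted] + [v]
    children[v] = list(adopted)
    parent[v] = p
    for c in adopted:
        parent[c] = v


children = {"r": ["a", "x"], "a": ["b", "c"]}
parent = {"a": "r", "x": "r", "b": "a", "c": "a"}
delete_node(children, parent, "a")            # b and c become children of r
insert_node(children, parent, "r", "a", ["b", "c"])   # undo the deletion
print(children, parent)
```

Since the trees in this paper are unordered, the position at which the inserted node is appended among its siblings carries no meaning.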

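Let $\varepsilon \notin \Sigma$ denote a special blank symbol and define $\Sigma_\varepsilon = \Sigma \cup \{\varepsilon\}$. Then, we represent each edit operation by $(l_1 \mapsto l_2)$, where $(l_1, l_2) \in (\Sigma_\varepsilon \times \Sigma_\varepsilon - \{(\varepsilon,\varepsilon)\})$. The operation is a substitution if $l_1 \ne \varepsilon$ and $l_2 \ne \varepsilon$, a deletion if $l_2 = \varepsilon$, and an insertion if $l_1 = \varepsilon$. For nodes v and w, we also denote $(l(v) \mapsto l(w))$ by $(v \mapsto w)$. We define a cost function $\gamma : (\Sigma_\varepsilon \times \Sigma_\varepsilon \setminus \{(\varepsilon,\varepsilon)\}) \to \mathbb{R}^{+}$ on pairs of labels. We often constrain a cost function $\gamma$ to be a metric, that is, $\gamma(l_1,l_2) \ge 0$, $\gamma(l_1,l_2) = 0$ iff $l_1 = l_2$, $\gamma(l_1,l_2) = \gamma(l_2,l_1)$ and $\gamma(l_1,l_3) \le \gamma(l_1,l_2) + \gamma(l_2,l_3)$. In particular, we call the cost function such that $\gamma(l_1,l_2) = 1$ if $l_1 \ne l_2$ a unit cost function.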


Definition 3 (Edit distance (Tai, 1979)). For a cost function $\gamma$, the cost of an edit operation $e = l_1 \mapsto l_2$ is given by $\gamma(e) = \gamma(l_1, l_2)$. The cost of a sequence $E = e_1, \ldots, e_k$ of edit operations is given by $\gamma(E) = \sum_{i=1}^{k} \gamma(e_i)$. Then, an edit distance $\tau_{\mathrm{TAI}}(T_1,T_2)$ between trees $T_1$ and $T_2$ is defined as follows:

$$\tau_{\mathrm{TAI}}(T_1,T_2) = \min\{\gamma(E) \mid E \text{ is a sequence of edit operations transforming } T_1 \text{ to } T_2\}.$$

Definition 4 (Tai mapping (Tai, 1979)). Let $T_1$ and $T_2$ be trees. We say that a triple $(M, T_1, T_2)$ is a Tai mapping (a mapping, for short) from $T_1$ to $T_2$ if $M \subseteq V(T_1) \times V(T_2)$ and every pair $(v_1, w_1)$ and $(v_2, w_2)$ in M satisfies the following conditions.

1. $v_1 = v_2$ iff $w_1 = w_2$ (one-to-one condition).

2. $v_1 \le v_2$ iff $w_1 \le w_2$ (ancestor condition).

We will use M instead of $(M, T_1, T_2)$ when there is no confusion, and denote it by $M \in \mathcal{M}_{\mathrm{TAI}}(T_1, T_2)$.

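Let M be a mapping from $T_1$ to $T_2$. Let $I$ and $J$ be the sets of nodes in $T_1$ and $T_2$ but not in M, that is, $I = \{v \in T_1 \mid (v,w) \notin M\}$ and $J = \{w \in T_2 \mid (v,w) \notin M\}$. Then, the cost $\gamma(M)$ of M is given as follows.

$$\gamma(M) = \sum_{(v,w) \in M} \gamma(v,w) + \sum_{v \in I} \gamma(v,\varepsilon) + \sum_{w \in J} \gamma(\varepsilon,w).$$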

Theorem 1 (Tai, 1979). $\tau_{\mathrm{TAI}}(T_1,T_2) = \min\{\gamma(M) \mid M \in \mathcal{M}_{\mathrm{TAI}}(T_1,T_2)\}$.
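The two conditions of a Tai mapping and the cost $\gamma(M)$ can be checked mechanically. The sketch below assumes the unit cost function, trees given by hypothetical `parent` dicts (non-root node to its parent) and `label` dicts (node to label), and reads $u \le v$ as "v is an ancestor of u or v = u"; these encodings and names are illustrative only.

```python
def ancestors_or_self(parent, v):
    """Return v together with all of its ancestors."""
    chain = {v}
    while v in parent:
        v = parent[v]
        chain.add(v)
    return chain


def is_tai_mapping(M, parent1, parent2):
    """Check the one-to-one and ancestor conditions for every two pairs of M."""
    for v1, w1 in M:
        for v2, w2 in M:
            if (v1 == v2) != (w1 == w2):                      # one-to-one
                return False
            below1 = v2 in ancestors_or_self(parent1, v1)     # v1 <= v2
            below2 = w2 in ancestors_or_self(parent2, w1)     # w1 <= w2
            if below1 != below2:                              # ancestor condition
                return False
    return True


def mapping_cost(M, label1, label2):
    """Unit-cost gamma(M): substitutions plus unmapped nodes on both sides."""
    substitutions = sum(label1[v] != label2[w] for v, w in M)
    deletions = len(label1) - len({v for v, _ in M})
    insertions = len(label2) - len({w for _, w in M})
    return substitutions + deletions + insertions


# T1 is the path r1 - a1 - b1; T2 has root r2 with two children a2 and b2.
parent1 = {"a1": "r1", "b1": "a1"}
parent2 = {"a2": "r2", "b2": "r2"}
label1 = {"r1": "a", "a1": "b", "b1": "c"}
label2 = {"r2": "a", "a2": "x", "b2": "c"}
M = [("r1", "r2"), ("a1", "a2")]
print(is_tai_mapping(M, parent1, parent2), mapping_cost(M, label1, label2))
```

By Theorem 1, the cost of any valid mapping found this way is an upper bound on the edit distance between the two trees.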

For computing the edit distance between trees, the

following theorem is well-known.

Theorem 2 (Akutsu et al., 2013; Hirata et al., 2011; Zhang and Jiang, 1994). Let $T_1$ and $T_2$ be trees. Then, the problem of computing $\tau_{\mathrm{TAI}}(T_1,T_2)$ is MAX SNP-hard, even if both $T_1$ and $T_2$ are binary or have height 2.

On the other hand, Muraka et al. (Muraka et al.,

2018) have recently shown the following theorem.

Theorem 3 (Muraka et al., 2018). Let $C_1$ and $C_2$ be caterpillars, where $h = \max\{h(C_1), h(C_2)\}$ and $\lambda = \max\{|lv(C_1)|, |lv(C_2)|\}$. Then, we can compute

TAI

) in O(h

) time.

Finally, we introduce the notions of multisets. A multiset on $\Sigma$ is a mapping $S : \Sigma \to \mathbb{N}$. For a multiset S on $\Sigma$, we say that $a \in \Sigma$ is an element of S if $S(a) > 0$ and denote it by $a \in S$ (as for a standard set). The cardinality of S, denoted by $|S|$, is defined as $\sum_{a \in \Sigma} S(a)$.

Let $S_1$ and $S_2$ be multisets on $\Sigma$. Then, we define the intersection $S_1 \sqcap S_2$ and the difference $S_1 \setminus S_2$ as the multisets satisfying $(S_1 \sqcap S_2)(a) = \min\{S_1(a), S_2(a)\}$ and $(S_1 \setminus S_2)(a) = \max\{S_1(a) - S_2(a), 0\}$ for every $a \in \Sigma$. Note that $S_1 \setminus S_2 = S_1 \setminus (S_1 \sqcap S_2)$ and $|S_1 \setminus S_2| = |S_1 \setminus (S_1 \sqcap S_2)| = |S_1| - |S_1 \sqcap S_2|$.
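The multiset intersection and difference defined above correspond to the `&` and `-` operators of Python's `collections.Counter`, which take the elementwise minimum and the elementwise difference truncated at zero. The following short sketch, with variable names chosen only for this example, checks the cardinality identity on one small pair of multisets.

```python
from collections import Counter

S1 = Counter({"a": 3})            # S1(a) = 3, S1(b) = 0
S2 = Counter({"a": 2, "b": 2})    # S2(a) = 2, S2(b) = 2


def size(S):
    """Cardinality |S|: the sum of all multiplicities."""
    return sum(S.values())


meet = S1 & S2                    # intersection: elementwise minimum
diff12 = S1 - S2                  # difference: max(S1(a) - S2(a), 0)
diff21 = S2 - S1

# |S1 \ S2| = |S1| - |S1 meet S2|, and symmetrically for S2 \ S1.
print(size(diff12) == size(S1) - size(meet))
print(size(diff21) == size(S2) - size(meet))
```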

3 VERTICAL AND HORIZONTAL DISTANCES FOR CATERPILLARS

Theorem 3 claims that the problem of computing $\tau_{\mathrm{TAI}}(C_1,C_2)$ for caterpillars $C_1$ and $C_2$ is tractable in quartic time, which is still not efficient enough. In this section, we give simple and efficient approximations of $\tau_{\mathrm{TAI}}(C_1,C_2)$ by using vertical and horizontal distances.

The vertical distance is based on a string edit distance (cf. (Deza and Deza, 2016)) for the string representation of the backbones. For strings $s_1$ and $s_2$, we denote the string edit distance between $s_1$ and $s_2$ by $\sigma(s_1, s_2)$. For a rooted path $P = (\{v_1, \ldots, v_n\}, \{(v_i, v_{i+1}) \mid 1 \le i \le n-1\})$ such that $r(P) = v_1$, we define the string representation of P as the string $l(v_1) \cdots l(v_n)$ and denote it by $s(P)$.
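For reference, the string edit distance $\sigma$ under the unit cost function can be computed by the textbook dynamic programming recurrence in time proportional to the product of the two string lengths. The sketch below is a standard implementation written for this example (the function name is our own); it is not taken from the authors' code.

```python
def string_edit_distance(s1, s2):
    """Unit-cost edit distance between two sequences of labels.

    Row-by-row dynamic programming: prev[j] holds the distance between
    the prefix of s1 processed so far and the first j symbols of s2.
    """
    prev = list(range(len(s2) + 1))           # distance from the empty prefix
    for i, a in enumerate(s1, 1):
        cur = [i]                             # delete the first i symbols of s1
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # substitute or match
        prev = cur
    return prev[-1]


print(string_edit_distance("kitten", "sitting"))
```

For two backbones of height at most h, this takes $O(h^2)$ time, which is the source of the quadratic bound for the vertical distances.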

On the other hand, the horizontal distance is based on a multiset edit distance, which is defined similarly to the tree edit distance (cf. Definition 3).

The edit operations of a multiset S on $\Sigma$ are defined analogously to those of a tree. Let $a, b \in \Sigma$ such that $S(a) > 0$ and $a \ne b$. Then, a substitution $(a \mapsto b)$ changes $S(a)$ to $S(a) - 1$ and $S(b)$ to $S(b) + 1$, a deletion $(a \mapsto \varepsilon)$ changes $S(a)$ to $S(a) - 1$, and an insertion $(\varepsilon \mapsto b)$ changes $S(b)$ to $S(b) + 1$. Also we assume a cost function $\gamma$ as in Section 2.

Definition 5 (Multiset edit distance). Let $S_1$ and $S_2$ be multisets on $\Sigma$ and $\gamma$ a cost function. Then, a multiset edit distance $\mu(S_1, S_2)$ between $S_1$ and $S_2$ is defined as follows.

$$\mu(S_1,S_2) = \min\{\gamma(E) \mid E \text{ is a sequence of edit operations transforming } S_1 \text{ to } S_2\}.$$

For multisets $S_1$ and $S_2$ such that $|S_1| \le |S_2|$ (resp., $|S_1| > |S_2|$), we can consider an injection $\pi$ from $S_1$ to $S_2$ (resp., from $S_2$ to $S_1$). For example, let $S_1$ and $S_2$ be multisets such that $S_1(a) = 3$, $S_1(b) = 0$, $S_2(a) = 2$ and $S_2(b) = 2$. Then, by regarding $S_1$ and $S_2$ as the sequences $[a^{(1)}, a^{(2)}, a^{(3)}]$ and $[a^{(1)}, a^{(2)}, b^{(1)}, b^{(2)}]$ (where the superscript denotes the order of the element), the function $\pi$ such that $\pi(a^{(1)}) = a^{(2)}$, $\pi(a^{(2)}) = b^{(2)}$ and $\pi(a^{(3)}) = a^{(1)}$ is an injection from $S_1$ to $S_2$. When $|S_1| \le |S_2|$ (resp., $|S_1| > |S_2|$), we denote the set of all the injections from $S_1$ to $S_2$ (resp., from $S_2$ to $S_1$) by $\Pi_{1,2}$ (resp., $\Pi_{2,1}$).

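Lemma 1. The following equation holds.

$$\mu(S_1,S_2) = \begin{cases} \displaystyle\min_{\pi \in \Pi_{1,2}} \Bigl\{ \sum_{a \in S_1} \gamma(a,\pi(a)) + \sum_{b \in S_2 \setminus \pi(S_1)} \gamma(\varepsilon,b) \Bigr\} & \text{if } |S_1| \le |S_2|, \\[2ex] \displaystyle\min_{\pi \in \Pi_{2,1}} \Bigl\{ \sum_{b \in S_2} \gamma(\pi(b),b) + \sum_{a \in S_1 \setminus \pi(S_2)} \gamma(a,\varepsilon) \Bigr\} & \text{otherwise.} \end{cases}$$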

Proof. Suppose that $|S_1| \le |S_2|$. By the minimality of Definition 5, an injection $\pi \in \Pi_{1,2}$ maps $a \in S_1$ to the same $a \in S_2$ whenever possible, that is, $\pi(a) = a$ with the cost $\gamma(a,\pi(a)) = 0$, and each remaining $c \in S_1$ to $\pi(c) \in S_2$ with the cost $\gamma(c,\pi(c))$. Then, the sum of the costs is represented by $\sum_{a \in S_1} \gamma(a,\pi(a))$. Furthermore, every $b \in S_2 \setminus \pi(S_1)$ is inserted, with the total cost $\sum_{b \in S_2 \setminus \pi(S_1)} \gamma(\varepsilon,b)$. Hence, the total cost is given by the first formula.

Suppose that $|S_1| > |S_2|$. By the minimality of Definition 5, an injection $\pi \in \Pi_{2,1}$ maps $b \in S_2$ to the same $b \in S_1$ whenever possible, that is, $\pi(b) = b$ with the cost $\gamma(\pi(b),b) = 0$, and each remaining $c \in S_2$ to $\pi(c) \in S_1$ with the cost $\gamma(\pi(c),c)$. Then, the sum of the costs is represented by $\sum_{b \in S_2} \gamma(\pi(b),b)$. Furthermore, every $a \in S_1 \setminus \pi(S_2)$ is deleted, with the total cost $\sum_{a \in S_1 \setminus \pi(S_2)} \gamma(a,\varepsilon)$. Hence, the total cost is given by the second formula.
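For very small multisets, the formula of Lemma 1 can be evaluated directly by enumerating all injections. The brute-force sketch below is only an illustration under stated assumptions: multisets are passed as lists of labels, `gamma` is any cost function on labels with `None` standing for the blank symbol $\varepsilon$, and the function names are our own. Its running time is exponential, so it is not a substitute for a matching-based algorithm.

```python
from itertools import permutations


def unit_cost(x, y):
    """Unit cost function: 0 for equal labels, 1 otherwise (None plays epsilon)."""
    return 0 if x == y else 1


def multiset_edit_distance(S1, S2, gamma, eps=None):
    """Evaluate the right-hand side of Lemma 1 by brute force.

    S1 and S2 are lists of labels; an injection from the smaller list
    into the larger one is encoded as a tuple of distinct positions.
    """
    swap = len(S1) > len(S2)
    small, large = (S2, S1) if swap else (S1, S2)

    def cost(x, y):
        # x is on the smaller side, y on the larger side; restore (S1, S2) order.
        return gamma(y, x) if swap else gamma(x, y)

    best = float("inf")
    for image in permutations(range(len(large)), len(small)):
        used = set(image)
        total = sum(cost(x, large[j]) for x, j in zip(small, image))
        # Elements of the larger side outside the image are inserted or deleted.
        total += sum(cost(eps, large[j]) for j in range(len(large)) if j not in used)
        best = min(best, total)
    return best


print(multiset_edit_distance(list("aaa"), list("aabb"), unit_cost))
```

The example call uses the multisets $S_1$ and $S_2$ with $S_1(a) = 3$, $S_2(a) = 2$ and $S_2(b) = 2$ from the text above.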

If we adopt a unit cost function, then we can give the following simpler form of Lemma 1, which coincides with a bag distance (Deza and Deza, 2016) between multisets.

Lemma 2. If $\gamma$ is a unit cost function, then the following statement holds.

$$\mu(S_1,S_2) = \max\{|S_1 \setminus S_2|,\, |S_2 \setminus S_1|\}.$$

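Proof. Suppose that $|S_1| \le |S_2|$. Then, by Lemma 1, it holds that:

$$\sum_{a \in S_1} \gamma(a,\pi(a)) = \underbrace{\sum_{a \in S_1 \sqcap S_2} \gamma(a,a)}_{=\,0} + \sum_{\substack{a \in S_1 \setminus (S_1 \sqcap S_2),\\ b = \pi(a) \in S_2 \setminus (S_1 \sqcap S_2),\ a \ne b}} \gamma(a,b) = |S_1 \setminus (S_1 \sqcap S_2)| = |S_1| - |S_1 \sqcap S_2|.$$

On the other hand, since $\pi$ is an injection, it holds that $\sum_{b \in S_2 \setminus \pi(S_1)} \gamma(\varepsilon,b) = |S_2 \setminus \pi(S_1)| = |S_2| - |S_1|$. As a result, it holds that $\mu(S_1,S_2) = |S_1| - |S_1 \sqcap S_2| + |S_2| - |S_1| = |S_2| - |S_1 \sqcap S_2| = |S_2 \setminus S_1|$.

Furthermore, in this case, by the supposition that $|S_1| \le |S_2|$ and since $|S_1 \setminus S_2| = |S_1 \setminus (S_1 \sqcap S_2)| = |S_1| - |S_1 \sqcap S_2|$ and $|S_2 \setminus S_1| = |S_2 \setminus (S_1 \sqcap S_2)| = |S_2| - |S_1 \sqcap S_2|$, it holds that $|S_2 \setminus S_1| \ge |S_1 \setminus S_2|$. Hence, $|S_2 \setminus S_1| = \max\{|S_1 \setminus S_2|, |S_2 \setminus S_1|\}$.

By the same discussion, if $|S_1| > |S_2|$, then $\mu(S_1,S_2) = |S_1 \setminus S_2| = \max\{|S_1 \setminus S_2|, |S_2 \setminus S_1|\}$.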
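Under the unit cost function, Lemma 2 reduces the multiset edit distance to two multiset differences, which a single pass over the labels can compute. A minimal sketch using `collections.Counter` (the function name `bag_distance` is ours) could look as follows.

```python
from collections import Counter


def bag_distance(labels1, labels2):
    """max{|S1 \\ S2|, |S2 \\ S1|} for the multisets of the given labels."""
    S1, S2 = Counter(labels1), Counter(labels2)
    only1 = sum((S1 - S2).values())   # |S1 \ S2|
    only2 = sum((S2 - S1).values())   # |S2 \ S1|
    return max(only1, only2)


print(bag_distance("aaa", "aabb"))
```

Building the two counters and subtracting them touches every label a constant number of times, which matches the linear running time stated for the unit cost case.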

Lemma 3. We can compute µ(S

) in

O(m

M) time, where m = min{|S

|,|S

|} and

M = max{|S

|,|S

|}. Furthermore, if we adopt the

unit cost function, then we can compute µ(S

) in

O(m+ M) time.

Proof. By Lemma 1 and by using the same technique

based on the maximum weighted bipartite matching

algorithm for the complete bipartite graph consisting

of S

and S

(cf., (Yamamoto et al., 2014; Zhang et al.,

1996)), we can compute µ(S

) in O(m

M) time.

On the other hand, by Lemma 2, we can compute

µ(S

) in O(m+ M) time.

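Hence, we formulate vertical and horizontal distances between caterpillars. Here, we regard a set L of leaves as a multiset of labels on $\Sigma$ occurring in L, which we denote by $\widetilde{L}$.

Definition 6 (Vertical and horizontal distances). For $i = 1, 2$, let $C_i$ be a caterpillar such that $r_i = r(C_i)$, $B_i = bb(C_i)$, $L_i = lv(C_i)$ and $E_i = ch(e(B_i))$. Then, we define two vertical distances $d_v$ and $d_v^*$ as follows.

$$d_v(C_1,C_2) = \sigma(s(B_1), s(B_2)).$$

$$d_v^*(C_1,C_2) = d_v(C_1,C_2) + \mu(\widetilde{E}_1, \widetilde{E}_2) + \sum_{v \in L_1 \setminus E_1} \gamma(v,\varepsilon) + \sum_{w \in L_2 \setminus E_2} \gamma(\varepsilon,w).$$

Also we define two horizontal distances $d_h$ and $d_h^*$ as follows.

$$d_h(C_1,C_2) = \mu(\widetilde{L}_1, \widetilde{L}_2).$$

$$d_h^*(C_1,C_2) = d_h(C_1,C_2) + \gamma(r_1, r_2) + \sum_{v \in B_1 \setminus \{r_1\}} \gamma(v,\varepsilon) + \sum_{w \in B_2 \setminus \{r_2\}} \gamma(\varepsilon,w).$$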
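The four distances can be sketched in a few lines under the unit cost function. In the example below a caterpillar is encoded, purely for illustration, as a list of pairs (backbone label, list of labels of the leaves attached to that backbone node) ordered from the root to the endpoint; this encoding and all function names are assumptions of the sketch, and the variables `d_v`, `d_v_star`, `d_h` and `d_h_star` stand for the two vertical and two horizontal distances.

```python
from collections import Counter


def string_edit_distance(s, t):
    """Unit-cost string edit distance by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def bag_distance(x, y):
    """Unit-cost multiset edit distance (bag distance)."""
    sx, sy = Counter(x), Counter(y)
    return max(sum((sx - sy).values()), sum((sy - sx).values()))


def caterpillar_distances(c1, c2):
    """Return (d_v, d_v_star, d_h, d_h_star) under the unit cost function."""
    b1, b2 = [b for b, _ in c1], [b for b, _ in c2]          # backbone labels
    l1 = [x for _, leaves in c1 for x in leaves]             # all leaf labels
    l2 = [x for _, leaves in c2 for x in leaves]
    e1, e2 = c1[-1][1], c2[-1][1]                            # leaves of the endpoints

    d_v = string_edit_distance(b1, b2)
    # Endpoint leaves are matched; all other leaves are deleted or inserted.
    d_v_star = d_v + bag_distance(e1, e2) + (len(l1) - len(e1)) + (len(l2) - len(e2))
    d_h = bag_distance(l1, l2)
    # Roots correspond; all other backbone nodes are deleted or inserted.
    d_h_star = d_h + (b1[0] != b2[0]) + (len(b1) - 1) + (len(b2) - 1)
    return d_v, d_v_star, d_h, d_h_star


c1 = [("a", ["c"]), ("b", ["d", "e"])]
c2 = [("a", ["c"]), ("x", ["d", "f"])]
print(caterpillar_distances(c1, c2))
```

For this pair, the maximum of the first and third values is a lower bound and the minimum of the second and fourth values is an upper bound on the edit distance.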

Theorem 4. Let $C_1$ and $C_2$ be caterpillars. Then, the following statement holds.

$$\max\{d_v(C_1,C_2),\, d_h(C_1,C_2)\} \le \tau_{\mathrm{TAI}}(C_1,C_2) \le \min\{d_v^*(C_1,C_2),\, d_h^*(C_1,C_2)\}.$$

Proof. In order to show the left inequality, it is sufficient to show how the values of $d_v(C_1,C_2)$ and $d_h(C_1,C_2)$ change when $C_2$ is obtained by applying one edit operation to $C_1$.

If $C_2$ is obtained by substituting an element in $bb(C_1)$, then it holds that $d_v(C_1,C_2) = 1$ and $d_h(C_1,C_2) = 0$. If $C_2$ is obtained by substituting a leaf in $lv(C_1)$, then it holds that $d_v(C_1,C_2) = 0$ and $d_h(C_1,C_2) = 1$. If $C_2$ is obtained by deleting an element in $bb(C_1)$, then it holds that $d_v(C_1,C_2) = 1$ and $d_h(C_1,C_2) = 0$. If $C_2$ is obtained by deleting a leaf in $lv(C_1)$, then it holds that $d_v(C_1,C_2) = 0$ and $d_h(C_1,C_2) = 1$.

As a result, if $C_2$ is obtained by applying one edit operation to $C_1$, then both values of $d_v(C_1,C_2)$ and $d_h(C_1,C_2)$ change by at most one. Hence, it holds that $d_v(C_1,C_2) \le \tau_{\mathrm{TAI}}(C_1,C_2)$ and $d_h(C_1,C_2) \le \tau_{\mathrm{TAI}}(C_1,C_2)$, which implies the left inequality.

On the other hand, in order to show the right inequality, by regarding the correspondences between $B_1$ and $B_2$ in $\sigma(s(B_1), s(B_2))$ and those between $L_1$ and $L_2$ in the multiset edit distance between their label multisets as pairs in $V(C_1) \times V(C_2)$, the sets of correspondences between nodes in $d_v(C_1,C_2)$ and $d_h(C_1,C_2)$ form Tai mappings. Moreover, it is obvious that all the correspondences in $d_v^*(C_1,C_2)$ and $d_h^*(C_1,C_2)$ are one-to-one.

Since the correspondences in $d_v(C_1,C_2)$ preserve the ancestor relation and every node in $E_i$ is a descendant of the node $e(B_i)$ ($i = 1, 2$), all the correspondences in $d_v^*(C_1,C_2)$ preserve the ancestor relation. Also, since every leaf in $L_i$ is a descendant of the root $r_i$ in $C_i$ ($i = 1, 2$), all the correspondences in $d_h^*(C_1,C_2)$ preserve the ancestor relation.

As a result, all the correspondences in $d_v^*(C_1,C_2)$ and $d_h^*(C_1,C_2)$ form Tai mappings between $C_1$ and $C_2$, respectively, which implies that $\tau_{\mathrm{TAI}}(C_1,C_2) \le d_v^*(C_1,C_2)$ and $\tau_{\mathrm{TAI}}(C_1,C_2) \le d_h^*(C_1,C_2)$ by Theorem 1. Hence, the right inequality holds.
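To see how the two bounds of Theorem 4 can pin down the edit distance exactly, consider a small example of our own under the unit cost function (it is not taken from the experiments). Let $C_1$ have the backbone with labels $a, b$ (root $a$, endpoint $b$) and a single leaf $c$ below $b$, and let $C_2$ have the same backbone with a single leaf $d$ below $b$. Writing $d_v, d_v^*$ for the vertical and $d_h, d_h^*$ for the horizontal distances, the endpoint leaf sets are $\{c\}$ and $\{d\}$, no other leaves remain, and the non-root backbone node $b$ is deleted and re-inserted in $d_h^*$:

```latex
\begin{align*}
d_v(C_1,C_2)   &= \sigma(ab,\, ab) = 0, \\
d_h(C_1,C_2)   &= \mu(\{c\},\{d\}) = 1, \\
d_v^*(C_1,C_2) &= 0 + \mu(\{c\},\{d\}) + 0 + 0 = 1, \\
d_h^*(C_1,C_2) &= 1 + \gamma(a,a) + \gamma(b,\varepsilon) + \gamma(\varepsilon,b) = 1 + 0 + 1 + 1 = 3, \\
\max\{0, 1\} = 1 &\le \tau_{\mathrm{TAI}}(C_1,C_2) \le \min\{1, 3\} = 1 .
\end{align*}
```

Here the lower and upper bounds coincide, so $\tau_{\mathrm{TAI}}(C_1,C_2) = 1$, which corresponds to the single substitution of the leaf label $c$ by $d$.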

Theorem 5. Let $C_1$ and $C_2$ be caterpillars, where $h = \max\{h(C_1), h(C_2)\}$ and $\lambda = \max\{|lv(C_1)|, |lv(C_2)|\}$.

Then, we can compute d

), d

∗

) and d

∗

) in O(h

) time, O(h

+ λ

)

time, O(λ

) time and O(λ

+ h) time, respectively.

Furthermore, if we adopt the unit cost function, then we can compute $d_v(C_1,C_2)$, $d_v^*(C_1,C_2)$, $d_h(C_1,C_2)$ and $d_h^*(C_1,C_2)$ in $O(h^2)$ time, $O(h^2 + \lambda)$ time, $O(\lambda)$ time and $O(\lambda + h)$ time, respectively.

Proof. It is obvious by Lemma 3 and since we can compute $\sigma(s(B_1), s(B_2))$ in $O(h^2)$ time (cf. (Deza and Deza, 2016)).

Hence, if we adopt the unit cost function, then we can compute the vertical distances $d_v(C_1,C_2)$ and $d_v^*(C_1,C_2)$ in quadratic time and the horizontal distances $d_h(C_1,C_2)$ and $d_h^*(C_1,C_2)$ in linear time.

4 EXPERIMENTAL RESULTS

In this section, we give experimental results to evaluate the inequality in Theorem 4 and the running time in Theorem 5 (under the unit cost function). Here, concerning Theorem 4, we denote the lower bound distance $\max\{d_v, d_h\}$ of $\tau_{\mathrm{TAI}}$ by lbd and the upper bound distance $\min\{d_v^*, d_h^*\}$ of $\tau_{\mathrm{TAI}}$ by ubd. Also let diff = ubd − lbd.

In this paper, we use the real data shown in Table 1, which gives the number of caterpillars in N-glycans and all-glycans from KEGG, CSLOGS and dblp. Here, #cat is the number of caterpillars and #data is the total number of data.

Table 1: The number of caterpillars in N-glycans and all-glycans from KEGG, CSLOGS and dblp.

dataset        #cat         #data        %
N-glycans      514          2,142        23.996
all-glycans    8,005        10,704       74.785
CSLOGS         41,592       59,691       69.679
dblp           5,154,295    5,154,530    99.995

We deal with caterpillars for N-glycans, all-glycans, CSLOGS and the largest 5,154 caterpillars (0.1%) in dblp (which we refer to as dblp$^-$). Table 2 gives the information of such caterpillars. Here, # is the number of caterpillars, n is the average number of nodes, d is the average degree, h is the average height, λ is the average number of leaves and β is the average number of labels.

Table 2: The information of caterpillars in N-glycans, all-glycans, CSLOGS and dblp$^-$.

dataset        #         n        d        h       λ        β
N-glycans      514       6.40     1.84     4.22    2.18     4.50
all-glycans    8,005     4.74     1.49     3.02    1.72     2.84
CSLOGS         41,592    5.84     3.05     2.20    3.64     5.18
dblp$^-$       5,154     41.74    40.73    1.01    40.73    10.62

First, Table 3 gives the running time to compute the vertical distances $d_v$ and $d_v^*$, the horizontal distances $d_h$ and $d_h^*$ and the edit distance $\tau_{\mathrm{TAI}}$ (Muraka et al., 2018) for all the pairs of caterpillars in Table 2.

Table 3: The running time of computing distances $d_v$, $d_v^*$, $d_h$, $d_h^*$ and $\tau_{\mathrm{TAI}}$ (sec).

dataset        $d_v$     $d_v^*$     $d_h$       $d_h^*$     $\tau_{\mathrm{TAI}}$
N-glycans      0.15      0.26        0.17        0.19        635.97
all-glycans    20.35     48.08       29.98       20.35       57,011.10
CSLOGS         336.72    1,821.36    1,564.28    1,788.53    -
dblp$^-$       2.86      149.17      137.20      143.22      6,363.79

Kyoto Encyclopedia of Genes and Genomes (KEGG): http://www.kegg.jp/

CSLOGS: http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software/Software

dblp computer science bibliography: http://dblp.uni-trier.de/


Table 3 shows that, as the experimental evaluation of Theorems 5 and 3, the running time of computing each of the distances $d_v$, $d_v^*$, $d_h$ and $d_h^*$ is much smaller than that of the edit distance $\tau_{\mathrm{TAI}}$, and the running time of computing the horizontal distance $d_h^*$ is smaller than that of the vertical distance $d_v^*$.

Note that the running time of computing $d_v$ for dblp$^-$ is extremely small because the height of every caterpillar in dblp$^-$ is either 1 or 2, so that the running time of $\sigma(s(B_1), s(B_2))$ is small. Also, the height of 88% of the caterpillars in CSLOGS is from 1 to 3, which is the reason why the running time of computing $d_v$ is smaller than that of the other distances for CSLOGS. Furthermore, in contrast to Theorem 5, the running time of computing $d_v$ and $d_v^*$ ($O(h^2)$ and $O(h^2 + \lambda)$ time in theory) is not much larger than that of $d_h$ and $d_h^*$ ($O(\lambda)$ and $O(\lambda + h)$ time in theory); we conjecture that the height of the caterpillars in all the data is too small to influence the running time.

Next, we compare the distances $d_v$, $d_v^*$, $d_h$, $d_h^*$ and $\tau_{\mathrm{TAI}}$. Figure 2 illustrates the distributions of the distances for N-glycans and all-glycans. Also Figures 3 and 4 illustrate the distributions of the distances up to 10, from 10 to 30, from 30 to 100 and from 100, for CSLOGS and dblp$^-$, respectively. Since we cannot compute $\tau_{\mathrm{TAI}}$ for CSLOGS, Figure 3 presents the distances $d_v$, $d_v^*$, $d_h$ and $d_h^*$. Since the vertical distance $d_v$ for more than 99% of the pairs of caterpillars in dblp$^-$ is 0 or 1, Figure 4 presents the distances $d_v^*$, $d_h$, $d_h^*$ and $\tau_{\mathrm{TAI}}$.

Figure 2 shows that the forms of all the distributions are nearly normal, that lbd lies to the left of $\tau_{\mathrm{TAI}}$ and that $\tau_{\mathrm{TAI}}$ lies to the left of ubd. On the other hand, Figures 3 and 4 show that the forms of the distributions are not normal, but are concentrated on small values. Figure 3 shows that more than 90% of the pairs of caterpillars for CSLOGS are concentrated on distances within 30, where the maximum values of $d_v$, $d_v^*$, $d_h$ and $d_h^*$ are 70, 579, 403 and 473, respectively. Also Figure 4 shows that more than 90% of the pairs of caterpillars for dblp$^-$ are concentrated on distances within 40, where the maximum values of $\tau_{\mathrm{TAI}}$, $d_v^*$, $d_h$ and $d_h^*$ are 746, 813, 745 and 746, respectively.

Figure 5 illustrates the scatter charts of lbd, ubd and $\tau_{\mathrm{TAI}}$ for N-glycans, all-glycans, CSLOGS and dblp$^-$. Here, the representation $d_1/d_2$ means that the number of pairs of caterpillars with the distance $d_1$ is plotted on the x-axis and that with the distance $d_2$ on the y-axis.

Since the number of caterpillars in N-glycans is small, the scatter charts for N-glycans in Figure 5 are sparse. For N-glycans and all-glycans, the difference between any pair of ubd, lbd and $\tau_{\mathrm{TAI}}$ is almost always within 10. For CSLOGS and dblp$^-$, the difference is not large.

[Figure 2 panels: N-glycans; all-glycans. Horizontal axis: distance.]

Figure 2: The distributions of distances for N-glycans and all-glycans.

In order to confirm this in more detail, we evaluate how well the lower bound distances and the upper bound distances approximate the edit distance. Table 4 gives the difference diff for N-glycans, all-glycans, dblp$^-$ and CSLOGS.

Table 4 shows that more than 93% of the pairs of caterpillars for N-glycans satisfy diff ≤ 5, more than 94% of the pairs for all-glycans satisfy diff ≤ 4, more than 99% of the pairs for dblp$^-$ satisfy diff ≤ 1 and more than 92% of the pairs for CSLOGS satisfy diff ≤ 5.
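The cumulative percentages quoted above can be recomputed from the counts in Table 4. As a small arithmetic check, the following sketch does this for dblp$^-$; the helper name `cumulative_share` is ours, and the counts are copied from the dblp$^-$ part of Table 4.

```python
def cumulative_share(counts, k):
    """Fraction of pairs with diff <= k, where counts[i] is the number
    of pairs of caterpillars whose diff equals i."""
    return sum(counts[: k + 1]) / sum(counts)


# Number of pairs with diff = 0, 1, 2, 3 for dblp^- (Table 4).
dblp_minus = [6_960_854, 6_198_038, 119_889, 500]

# The counts add up to the number of unordered pairs of 5,154 caterpillars.
print(sum(dblp_minus) == 5_154 * 5_153 // 2)
print(f"{100 * cumulative_share(dblp_minus, 1):.2f}% of the pairs have diff <= 1")
```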

Hence, since more than 90% (resp., 98%) of the pairs of caterpillars satisfy diff ≤ 5 (resp., diff ≤ 10), we can conclude that $\max\{d_v, d_h\}$ and $\min\{d_v^*, d_h^*\}$ succeed in approximating $\tau_{\mathrm{TAI}}$ within 5 (resp., 10). This result is important for the case that the running time of computing $\tau_{\mathrm{TAI}}$ is large, as for CSLOGS.


[Figure 3 panels: d ≤ 10; 10 ≤ d ≤ 30; 30 ≤ d ≤ 100; d ≥ 100. Horizontal axis: distance.]

Figure 3: The distributions of distances for CSLOGS.

[Figure 4 panels: d ≤ 10; 10 ≤ d ≤ 30; 30 ≤ d ≤ 100; d ≥ 100. Horizontal axis: distance.]

Figure 4: The distributions of distances for dblp$^-$.

5 CONCLUSION

In this paper, we have formulated the vertical distances $d_v$ and $d_v^*$ and the horizontal distances $d_h$ and $d_h^*$ to approximate the edit distance $\tau_{\mathrm{TAI}}$. Then, we have shown the following inequality:

$$\max\{d_v, d_h\} \le \tau_{\mathrm{TAI}} \le \min\{d_v^*, d_h^*\}.$$

Furthermore, we have shown that, if we adopt the unit cost function, then we can compute $d_v$ and $d_v^*$ in quadratic time and $d_h$ and $d_h^*$ in linear time.

Finally, we have given experimental results to evaluate the inequality and the running time for N-glycans, all-glycans, CSLOGS and dblp$^-$. Then, we can conclude that, by combining $d_v$, $d_v^*$, $d_h$ and $d_h^*$, we can approximate the edit distance well, such that

$$\min\{d_v^*, d_h^*\} - \max\{d_v, d_h\} \le 5$$

for more than 90% of the pairs of caterpillars.

It is a future work to give experimental results for other data such as SwissProt, TPC-H, Auction, University, Protein and Nasa from the UW XML Repository. Note that, whereas the last four data contain no caterpillars, we can obtain many caterpillars by deleting the root (cf. (Muraka et al., 2018)).

[Figure 5 panels: lbd/$\tau_{\mathrm{TAI}}$, ubd/$\tau_{\mathrm{TAI}}$ and lbd/ubd for N-glycans, all-glycans and dblp$^-$; lbd/ubd for CSLOGS.]

Figure 5: The scatter charts of lbd, ubd and $\tau_{\mathrm{TAI}}$ for N-glycans, all-glycans, CSLOGS and dblp$^-$.

UW XML Repository: http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html


Table 4: The difference diff for N-glycans, all-glycans, dblp$^-$ and CSLOGS (# is the number of pairs of caterpillars).

N-glycans
diff    #              %
0       2,448          1.86
1       17,091         12.96
2       32,404         24.58
3       33,949         25.75
4       24,340         18.46
5       13,420         10.18
6       5,801          4.40
7       1,751          1.33
8       475            0.36
9       109            0.08
10      47             0.04
11      6              0.00

dblp$^-$
diff    #              %
0       6,960,854      52.42
1       6,198,038      46.67
2       119,889        0.90
3       500            0.00

all-glycans
diff    #              %
0       1,105,515      3.47
1       11,619,644     36.46
2       10,547,139     33.10
3       4,633,275      14.54
4       2,108,501      6.62
5       1,001,311      3.14
6       458,637        1.44
7       203,334        0.64
8       110,184        0.35
9       49,385         0.16
10      20,461         0.06
11      6,999          0.02
12      2,393          0.01
13      801            0.00
14      350            0.00
15      147            0.00
16      30             0.00
17      18             0.00
18      8              0.00
19      3              0.00
20      1              0.00

CSLOGS
diff    #              %
0       10,513,132     1.22
1       174,777,470    20.21
2       301,960,142    34.91
3       175,761,327    20.32
4       90,141,737     10.42
5       42,955,474     4.97
6       23,342,365     2.70
7       14,094,693     1.63
8       8,791,664      1.02
9       5,472,715      0.63
10      3,612,677      0.42
11      2,667,528      0.31
12      2,046,998      0.24
13      1,567,370      0.18
14      1,247,637      0.14
≥ 15    5,973,407      0.69

One reason that the approximation succeeds is that every node in a caterpillar is either an element of the backbone or a leaf, that is, V(C) = bb(C) ∪ lv(C). Also $d_v$ and $d_v^*$ are based on a string edit distance for bb(C), and $d_h$ and $d_h^*$ are based on a multiset edit distance for lv(C). In order to extend these distances to standard trees, it is necessary to determine how to choose a backbone and how to deal with internal nodes, which is a future work.

Concerning the horizontal distances, we can consider repeating the bag distance between leaves, removing the leaves from the trees as long as possible. Then, it is a future work to analyze such a distance.

ACKNOWLEDGMENTS

This work is partially supported by Grant-in-Aid

for Scientiﬁc Research 17H00762, 16H02870 and

16H01743 from the Ministry of Education, Culture,

Sports, Science and Technology, Japan.

REFERENCES

Akutsu, T., Fukagawa, D., Halldórsson, M. M., Takasu, A., and Tanaka, K. (2013). Approximation and parameterized algorithms for common subtrees and edit distance between unordered trees. Theoret. Comput. Sci., 470:10–22.

Deza, M. M. and Deza, E. (2016). Encyclopedia of distances (4th ed.). Springer.

Gallian, J. A. (2007). A dynamic survey of graph labeling. Electron. J. Combin., 14:DS6.

Hirata, K., Yamamoto, Y., and Kuboyama, T. (2011). Improved MAX SNP-hard results for finding an edit distance between unordered trees. In Proc. CPM'11 (LNCS 6661), pages 402–415.

Kawaguchi, T., Yoshino, T., and Hirata, K. (2018). Path histogram distance for rooted labeled caterpillars. In Proc. ACIIDS'18 (LNAI 10751), pages 276–286.

Muraka, K., Yoshino, T., and Hirata, K. (2018). Computing edit distance between rooted labeled caterpillars. In Proc. FedCSIS'18, pages 245–252.

Tai, K.-C. (1979). The tree-to-tree correction problem. J. ACM, 26:422–433.

Yamamoto, Y., Hirata, K., and Kuboyama, T. (2014). Tractable and intractable variations of unordered tree edit distance. Internat. J. Found. Comput. Sci., 25:307–329.

Yoshino, T., Muraka, K., and Hirata, K. (2018). LCA histogram distance for rooted labeled caterpillars. In Proc. KDIR'18, pages 307–314.

Zhang, K. and Jiang, T. (1994). Some MAX SNP-hard results concerning unordered labeled trees. Inform. Process. Lett., 49:249–254.

Zhang, K., Wang, J., and Shasha, D. (1996). On the editing distance between undirected acyclic graphs. Internat. J. Found. Comput. Sci., 7:43–58.
