ample that their claim is false, even if depth information
is given, a fact that is not well known. On
the other hand, we show that the LCA histogram distance
is a metric for caterpillars. By using the LCA
histogram distance, we can avoid not only the above
extreme cases but also the case in which both the path
histogram distance and the complete subtree histogram
distance attain their maximum values while the edit
distance does not. We can compute the LCA histogram
distance in quadratic time.
Then, using the caterpillars in the real data listed in Table 3
in Section 4, we give experimental results on computing
the LCA histogram distance in comparison with the
path histogram distance and the complete subtree
histogram distance. Note that the maximum values of
the path histogram distance, the complete subtree
histogram distance and the LCA histogram distance are
different. Hence, after normalizing the distances so that
they are comparable, we compare the running times,
distributions and scatter charts of the
three distances.
2 PRELIMINARIES
A tree T is a connected graph (V, E) without cycles,
where V is the set of vertices and E is the set of edges.
We denote V and E by V(T) and E(T). The size of
T is |V| and is denoted by |T|. We sometimes denote
v ∈ V(T) by v ∈ T. We denote an empty tree (∅, ∅)
by ∅. A rooted tree is a tree with one vertex r chosen
as its root. We denote the root of a rooted tree T by
r(T).
Let T be a rooted tree such that r = r(T) and
u, v, w ∈ T. We denote the unique path from r to v, that
is, the tree (V′, E′) such that V′ = {v_1, ..., v_k}, v_1 = r,
v_k = v and (v_i, v_{i+1}) ∈ E′ for every i (1 ≤ i ≤ k − 1),
by UP_r(v). The depth of v, denoted by d(v), is the
number of edges in UP_r(v).
The parent of v (≠ r), which we denote by par(v),
is its adjacent vertex on UP_r(v) and the ancestors of
v (≠ r) are the vertices on UP_r(v) − {v}. We say that
u is a child of v if v is the parent of u and u is a de-
scendant of v if v is an ancestor of u. We call a vertex
with no children a leaf and denote the set of all the
leaves in T by lv(T).
We denote the set of all the children of v in T by
ch(v). The degree of v, denoted by g(v), is the number
of children of v, that is, |ch(v)|, and the degree of T,
denoted by g(T), is max{g(v) | v ∈ T}. The height of
v, denoted by h(v), is max{|UP_v(w)| | w ∈ lv(T[v])},
and the height of T, denoted by h(T), is max{h(v) |
v ∈ T}.
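The quantities defined above can be illustrated with a minimal Python sketch. The example tree, the child-to-parent dictionary `PARENT` and all function names are assumptions made for illustration and do not appear in the paper. The sketch reads |UP_v(w)| as the number of vertices on the path, so that a leaf has height 1.

```python
from collections import defaultdict

# Hypothetical example tree, stored as child -> parent; "r" is the root.
PARENT = {"a": "r", "b": "r", "c": "a", "d": "a", "e": "c"}
ROOT = "r"


def children(parent):
    """Build ch(v) for every vertex that has children."""
    ch = defaultdict(list)
    for child, par in parent.items():
        ch[par].append(child)
    return ch


def depth(v, parent):
    """d(v): number of edges on the path from the root to v."""
    d = 0
    while v in parent:
        v = parent[v]
        d += 1
    return d


def height(v, ch):
    """h(v): largest number of vertices on a path from v to a leaf of T[v]."""
    kids = ch.get(v, [])
    if not kids:
        return 1
    return 1 + max(height(c, ch) for c in kids)


def tree_degree(vertices, ch):
    """g(T): maximum number of children over all vertices."""
    return max(len(ch.get(v, [])) for v in vertices)


CH = children(PARENT)
VERTICES = set(PARENT) | {ROOT}
```

For this tree, `depth("e", PARENT)` is 3, `height(ROOT, CH)` is 4 and `tree_degree(VERTICES, CH)` is 2.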
We use the ancestor orders < and ≤, that is, u < v
if v is an ancestor of u and u ≤ v if u < v or u = v.
We say that w is the least common ancestor (LCA, for
short) of u and v, denoted by u ⊔ v, if u ≤ w, v ≤ w and
there exists no vertex w′ ∈ T such that w′ < w, u ≤ w′
and v ≤ w′.
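As an illustration of the LCA definition, the following sketch computes u ⊔ v by walking up the parent pointers. It is a naive method chosen for clarity, not the procedure used in the paper. The example tree and the names `PARENT`, `ancestors_or_self` and `lca` are hypothetical.

```python
# Hypothetical example tree, stored as child -> parent; "r" is the root.
PARENT = {"a": "r", "b": "r", "c": "a", "d": "a", "e": "c"}


def ancestors_or_self(v, parent):
    """Vertices w with v <= w, listed from v up to the root."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(u, v, parent):
    """Lowest vertex w with u <= w and v <= w."""
    above_u = set(ancestors_or_self(u, parent))
    # The first vertex on the way up from v that also lies above u
    # has no common ancestor strictly below it.
    for w in ancestors_or_self(v, parent):
        if w in above_u:
            return w
    raise ValueError("u and v do not belong to the same rooted tree")
```

Here `lca("e", "d", PARENT)` is `"a"`, and the LCA of a vertex and one of its ancestors is the ancestor itself.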
Let T be a rooted tree (V, E) and v a vertex in T.
A complete subtree of T at v, denoted by T[v], is a
rooted tree T′ = (V′, E′) such that r(T′) = v, V′ =
{u ∈ V | u ≤ v} and E′ = {(u, w) ∈ E | u, w ∈ V′}. For
a tree T′, we say that T′ occurs in T at v if T′ = T[v].
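The complete subtree T[v] can be extracted directly from its definition, as the following sketch suggests. The child-to-parent dictionary `PARENT`, the edge orientation (parent, child) and the function name `complete_subtree` are assumptions made for illustration.

```python
# Hypothetical example tree, stored as child -> parent; "r" is the root.
PARENT = {"a": "r", "b": "r", "c": "a", "d": "a", "e": "c"}


def complete_subtree(v, parent):
    """Return (V', E') of T[v]: all u with u <= v and the edges among them."""

    def below_or_equal(u):
        # Walk up from u until v or the root is reached.
        while u != v and u in parent:
            u = parent[u]
        return u == v

    all_vertices = set(parent) | set(parent.values())
    vertices = {u for u in all_vertices if below_or_equal(u)}
    # Edges are written as (parent, child) pairs.
    edges = {(p, c) for c, p in parent.items()
             if c in vertices and p in vertices}
    return vertices, edges
```

For v = "a" the sketch yields the vertex set {a, c, d, e} and the three edges among these vertices. The edge from "r" to "a" is dropped because "r" is not in V′.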
For a vertex v ∈ T, we call the occurrence number
of v in the preorder (resp., postorder) traversal on T
the preorder (resp., postorder) number of v and de-
note it by pre(v) (resp., post(v)). We say that u is to
the left of v in T if pre(u) ≤ pre(v) and post(u) ≤
post(v). We say that a rooted tree is ordered if a left-
to-right order among siblings is given; unordered oth-
erwise. We say that a rooted tree is labeled if each
vertex is assigned a symbol from a fixed finite alphabet Σ.
For a vertex v, we denote the label of v by l(v),
and sometimes identify v with l(v). In this paper, we
simply call a rooted labeled unordered tree a tree.
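The preorder and postorder numbers, and the resulting "to the left of" relation, can be sketched as follows. Since the trees in this paper are unordered, the numbering depends on the child order chosen for the traversal. The ordered child lists in `CHILDREN` and the function names are assumptions made for illustration.

```python
# Hypothetical example tree with an arbitrarily fixed child order.
CHILDREN = {"r": ["a", "b"], "a": ["c", "d"], "c": ["e"]}


def traversal_numbers(root, ch):
    """Return (pre, post): 1-based preorder and postorder numbers."""
    pre, post = {}, {}

    def visit(v):
        pre[v] = len(pre) + 1          # numbered when first reached
        for c in ch.get(v, []):
            visit(c)
        post[v] = len(post) + 1        # numbered after all descendants

    visit(root)
    return pre, post


def is_left_of(u, v, pre, post):
    """u is to the left of v if pre(u) <= pre(v) and post(u) <= post(v)."""
    return pre[u] <= pre[v] and post[u] <= post[v]


PRE, POST = traversal_numbers("r", CHILDREN)
```

With this child order, "c" is to the left of "d", whereas "a" is not to the left of its descendant "c" because post("a") > post("c").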
As a restricted form of trees, we introduce the
rooted labeled caterpillar (caterpillar, for short),
which is the main subject of this paper.
Definition 1 (Caterpillar (cf. (Gallian, 2007))). We
say that a tree is a caterpillar if it is transformed into a
path by removing all of its leaves. For a caterpillar C,
we call the remaining path the backbone of C and
denote it by bb(C).
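Definition 1 can be checked mechanically, as in the following sketch. It takes the definition literally: after all leaves (vertices without children) are removed, every remaining vertex may have at most two remaining neighbours. A stricter reading that requires the backbone to start at the root would allow at most one non-leaf child per vertex. That choice, the treatment of a single-vertex tree as a caterpillar with an empty backbone, the example trees and the function names are all assumptions made for illustration.

```python
# Hypothetical example trees, stored as vertex -> list of children.
CAT = {"r": ["x", "a"], "a": ["y", "b", "z"], "b": ["w"]}
NOT_CAT = {"r": ["a"], "a": ["b", "c", "d"],
           "b": ["x"], "c": ["y"], "d": ["z"]}


def backbone(root, ch):
    """Return the set of non-leaf vertices if they form a path, else None."""
    internal = {v for v, kids in ch.items() if kids}
    for v in internal:
        # Neighbours of v that survive the removal of all leaves.
        neighbours = sum(1 for c in ch[v] if c in internal)
        if v != root:
            neighbours += 1            # the parent of a non-leaf is a non-leaf
        if neighbours > 2:
            return None                # v branches, so the rest is not a path
    return internal


def is_caterpillar(root, ch):
    return backbone(root, ch) is not None
```

The non-leaf vertices of a rooted tree are always connected, so bounding the number of remaining neighbours by two suffices to guarantee a path.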
Next, we introduce an edit distance for trees.
Definition 2 (Edit operations (Tai, 1979)). The edit
operations of a tree T are defined as follows.
1. Substitution: Change the label of the vertex v in
T.
2. Deletion: Delete a non-root vertex v in T with parent
v′, making the children of v become the children
of v′. The children are inserted in the place
of v as a subset of the children of v′.
3. Insertion: The complement of deletion. Insert a
vertex v as a child of v′ in T, making v the parent
of a subset of the children of v′.
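The three edit operations above can be sketched on a simple dictionary representation. The representation (`ch` maps a vertex to its list of children, `labels` maps a vertex to its label) and the function names are assumptions made for illustration, not the data structures of the paper.

```python
def find_parent(ch, v):
    """Return the parent of v, or None if v has no parent in ch."""
    for p, kids in ch.items():
        if v in kids:
            return p
    return None


def substitute(labels, v, new_label):
    """Substitution: change the label of v."""
    labels[v] = new_label


def delete(ch, labels, v):
    """Deletion: remove non-root v; its children move up to its parent."""
    p = find_parent(ch, v)
    if p is None:
        raise ValueError("v is the root or not in the tree")
    kids = ch.pop(v, [])
    i = ch[p].index(v)
    ch[p][i:i + 1] = kids              # children of v take the place of v
    del labels[v]


def insert(ch, labels, v, label, p, adopted=()):
    """Insertion: add v under p and make v the parent of `adopted`."""
    siblings = ch.setdefault(p, [])
    if not set(adopted) <= set(siblings):
        raise ValueError("adopted vertices must be children of p")
    ch[p] = [c for c in siblings if c not in adopted] + [v]
    ch[v] = list(adopted)
    labels[v] = label


# Usage on a hypothetical tree: deleting "a" and inserting it again.
ch = {"r": ["a", "b"], "a": ["c", "d"]}
labels = {x: x.upper() for x in "rabcd"}
delete(ch, labels, "a")
after_delete = {k: list(val) for k, val in ch.items()}
insert(ch, labels, "a", "A", "r", adopted=["c", "d"])
```

After the deletion the children of "r" are "c", "d" and "b". The subsequent insertion restores the original tree up to the order of siblings, which is irrelevant for unordered trees.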
Let ε ∉ Σ denote a special blank symbol and define
Σ_ε = Σ ∪ {ε}. Then, we represent each edit operation
by (l_1 ↦ l_2), where (l_1, l_2) ∈ (Σ_ε × Σ_ε − {(ε, ε)}). The
operation is a substitution if l_1 ≠ ε and l_2 ≠ ε, a deletion
if l_2 = ε, and an insertion if l_1 = ε. For vertices v
and w, we also denote (l(v) ↦ l(w)) by (v ↦ w). We
define a cost function γ : (Σ_ε × Σ_ε \ {(ε, ε)}) ↦ R^+ on
pairs of labels. We often constrain a cost function γ to
be a metric, that is, γ(l_1, l_2) ≥ 0, γ(l_1, l_2) = 0 iff l_1 = l_2,
γ(l_1, l_2) = γ(l_2, l_1) and γ(l_1, l_3) ≤ γ(l_1, l_2) + γ(l_2, l_3). In