Caterpillar Inclusion: Inclusion Problem for Rooted Labeled Caterpillars
Tomoya Miyazaki, Manami Hagihara and Kouichi Hirata
Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
Keywords:
Caterpillar Inclusion, Tree Inclusion, Rooted Labeled Caterpillar, Rooted Labeled Unordered Tree.
Abstract:
In this paper, we investigate an inclusion problem for rooted labeled caterpillars (caterpillars, for short), which we call the caterpillar inclusion. The caterpillar inclusion is the problem of determining whether or not a text caterpillar T can be transformed into a pattern caterpillar P by deleting vertices in T. We then design an algorithm that solves the caterpillar inclusion for P and T in O((h + H)σ) time, where h is the height of P, H is the height of T and σ is the number of labels occurring in P and T. We also give experimental results for the algorithm on real data for caterpillars.
1 INTRODUCTION
Pattern matching for tree-structured data, such as HTML and XML documents for web mining or DNA and glycan data for bioinformatics, is one of the fundamental tasks for information retrieval or query processing. As one such pattern matching for rooted labeled unordered trees, an unordered tree inclusion (an inclusion, for short) is the problem of determining whether or not an unordered tree P, called a pattern tree, is included in an unordered tree T, called a text tree, that is, whether T can be transformed into P by deleting vertices in T. However, the inclusion is known to be intractable, that is, NP-complete (Kilpeläinen and Mannila, 1995; Matoušek and Thomas, 1992).
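To make the deletion operation concrete, the following minimal Python sketch deletes a non-root vertex from a rooted unordered tree and reattaches its children to its parent, which is the usual meaning of vertex deletion in tree inclusion. The dictionary-based representation and the names `delete_vertex`, `children` and `parent` are illustrative assumptions, not notation from this paper, and the sketch does not handle deletion of the root.

```python
def delete_vertex(children, parent, v):
    """Delete non-root vertex v; its children become children of v's parent.

    children: dict mapping each vertex to a list of its children
              (the list order carries no meaning, since the tree is unordered)
    parent:   dict mapping each non-root vertex to its parent
    Both dicts are modified in place (hypothetical representation).
    """
    if v not in parent:
        raise ValueError("this sketch does not delete the root")
    p = parent.pop(v)
    children[p].remove(v)
    for c in children.pop(v):
        parent[c] = p
        children[p].append(c)


# Hypothetical text tree: r has children a and b; a has children c and d.
children = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
parent = {"a": "r", "b": "r", "c": "a", "d": "a"}

# After deleting a, the vertices c and d hang directly below r.
delete_vertex(children, parent, "a")
```

Deciding the inclusion then amounts to asking whether some sequence of such deletions turns T into P, which is the problem known to be NP-complete for general unordered trees.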
In order to overcome such intractability, several researchers have developed tractable variations of the inclusion, such as a top-down inclusion (Shamir and Tsur, 1999), a bottom-up inclusion (Valiente, 2002), an LCA-preserving inclusion (Valiente, 2005)¹ and an isolated-subtree inclusion (Hokazono et al., 2012). The first three variations are formulated by restricting the vertices that may be deleted to just leaves, just roots and just either leaves or vertices with one child, respectively. Also, the top-down (resp., bottom-up) inclusion coincides with the top-down (resp., bottom-up) unordered subtree isomorphism (cf. Valiente, 2002). On the other hand, several exact exponential-time algorithms for computing the unordered tree inclusion have been designed (Akutsu et al., 2021; Kilpeläinen and Mannila, 1995).

¹ While Valiente (2005) called it a constrained inclusion, his definition corresponds to an LCA-preserving distance or a degree-2 distance (Zhang et al., 1996). Hence, we call it an LCA-preserving inclusion.
Note that the proofs of NP-completeness for the inclusion imply a structural limit of its tractability: the inclusion remains intractable even if the height of a text tree is at most 2 or the degree of a text tree is bounded by some constant (Kilpeläinen and Mannila, 1995; Matoušek and Thomas, 1992). In this paper, we give another structural restriction that marks the limit of the tractability of the inclusion, namely a rooted labeled caterpillar (a caterpillar, for short) (cf. Gallian, 2007). A caterpillar is an unordered tree that becomes a rooted path after all of its leaves are removed.
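The definition of a caterpillar can be checked directly. The following minimal Python sketch walks down from the root and verifies that the non-leaf vertices form a single path. The dictionary representation and the names `is_caterpillar`, `cat` and `not_cat` are illustrative assumptions, not notation from this paper. By convention (also an assumption), the sketch accepts a tree consisting of a single vertex.

```python
def is_caterpillar(children, root):
    """Return True if the rooted tree becomes a rooted path once all leaves are removed.

    children: dict mapping each vertex to a list of its children
              (hypothetical representation; leaves map to an empty list).
    """
    v = root
    while True:
        # Only non-leaf children of v survive the removal of all leaves.
        internal = [c for c in children[v] if children[c]]
        if len(internal) > 1:
            return False  # the remaining structure branches, so it is not a path
        if not internal:
            return True   # reached the last vertex of the remaining path
        v = internal[0]


# A caterpillar: the non-leaf vertices r, a, b form the path r - a - b.
cat = {"r": ["a", "x"], "a": ["b", "y"], "b": ["z"], "x": [], "y": [], "z": []}

# Not a caterpillar: r has two non-leaf children, a and b.
not_cat = {"r": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
```

Because every ancestor of a non-leaf vertex is itself a non-leaf, a single walk from the root visits every non-leaf vertex, so the check runs in time linear in the number of vertices.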
The caterpillar is known to provide a structural restriction that makes computing the edit distance for unordered trees tractable (Muraka et al., 2018). It is known that the problem of computing the edit distance between unordered trees is MAX SNP-hard (Zhang and Jiang, 1994). This statement holds even if the two trees are binary, the maximum height is at most 3 or the cost function is the unit cost function (Akutsu et al., 2013; Hirata et al., 2011). On the other hand,
we can compute the edit distance between caterpillars in $O(h^2 \lambda^3)$ time under the general cost function and in $O(h^2 \lambda)$ time under the unit cost function, where h is the maximum height of the two caterpillars and λ is the maximum number of leaves in the two caterpillars (Muraka et al., 2018)².
² This time complexity differs from the result stated in (Muraka et al., 2018), because the latter contains some errors. See