Grammar-based Compression for Directed and Undirected Generalized
Series-parallel Graphs using Integer Linear Programming
Morihiro Hayashida¹, Hitoshi Koyano² and Tatsuya Akutsu³
¹ National Institute of Technology, Matsue College, 14-4, Nishiikumacho, Matsue, Shimane 690–8518, Japan
² Quantitative Biology Center, Riken, 2-2-3, Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650–0047, Japan
³ Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611–0011, Japan
Keywords:
Generalized Series-parallel Graph, Grammar-based Compression, Integer Linear Programming.
Abstract:
We address the problem of finding generation rules from biological data, in particular data represented as
directed and undirected generalized series-parallel graphs (GSPGs), which include trees, outerplanar graphs,
and series-parallel graphs. In a previous study, grammars for edge-labeled rooted ordered and unordered trees,
called SEOTG and SEUTG, respectively, were defined, and integer linear programming-based methods were
developed for finding the minimum SEOTG and SEUTG that produce only the given trees. These methods were
used to extract generation rules from glycans and RNAs, which can be represented by rooted tree structures.
In nature and in organisms, however, there are various kinds of structures, such as gene regulatory networks,
metabolic pathways, and chemical structures, that cannot be represented as rooted trees. In this study, we
relax the restriction on the structures to be compressed and propose grammars representing edge-labeled
directed and undirected GSPGs, based on context-free grammars, by extending SEOTG and SEUTG. In addition,
we propose an integer linear programming-based method for finding the minimum GSPG grammar, in order to
analyze more complicated biological networks and structures.
1 INTRODUCTION
Data compression of a structure is related to the
amount of information that the structure contains. If
the compressed data are still large, the amount of
information is large. Otherwise, the data contain
redundancy, and the amount of information is small.
Our purpose is to extract useful information and
knowledge from data through compression. In particular,
we focus on structured biological data formed in
nature. Such structures can often be explained by
several simple generation rules.
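As a rough illustration of this relationship (not part of the method proposed in this paper), the following sketch compares how well regular and irregular byte strings compress. The general-purpose compressor zlib is an assumption made purely for illustration; it stands in for the grammar-based compression considered here, and the helper name compressed_size is hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the number of bytes after zlib (DEFLATE) compression."""
    return len(zlib.compress(data, 9))


n = 10_000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Highly regular data: a short motif repeated many times.
repetitive = b"ABCD" * (n // 4)
# Irregular data: pseudo-random bytes with no exploitable pattern.
irregular = bytes(rng.getrandbits(8) for _ in range(n))

# Ratio of compressed size to original size (smaller means more redundancy).
ratio_repetitive = compressed_size(repetitive) / n
ratio_irregular = compressed_size(irregular) / n
print(f"repetitive: {ratio_repetitive:.3f}, irregular: {ratio_irregular:.3f}")
```

The repetitive input, which is fully described by one simple rule, shrinks to a small fraction of its size, whereas the irregular input hardly shrinks at all.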
In previous studies, biological data represented by
rooted trees, such as glycans and RNAs, were compressed
and analyzed (Zhao et al., 2010; Zhao et al., 2015).
Glycans are known to be composed of multiple
monosaccharides bound by glycosidic bonds and to take
various structures in accordance with biosynthetic
reactions, and the function of a glycan depends on its
structure. Hence, it is important to analyze glycan
structures and to extract the rules of their
biosynthesis. Zhao et al. developed integer linear
programming-based methods, called the minimum SEOTG and
SEUTG, for finding the minimum grammar that produces
only a given single ordered or unordered rooted tree,
and applied them to biological data such as glycans
with up to 36 nodes and 5 distinct labels. These
methods are based on a kind of tree grammar, the simple
elementary ordered (unordered) tree grammar (SEO(U)TG)
(Akutsu, 2010). Furthermore, they extended the methods
to multiple rooted trees, because generation rules are
commonly used across multiple trees. Structures
generated in nature, however, cannot always be
represented by rooted trees. In this paper, we extend
their grammar to directed and undirected generalized
series-parallel graphs (GSPGs), which include trees
and outerplanar graphs. In addition, we propose an
integer linear programming-based method for finding
the minimum GSPG grammar that produces only a
given generalized series-parallel graph.
A series-parallel graph is defined by two operations,
called series and parallel compositions, and has two
special nodes, the source and the sink, which are
designated as terminal nodes (Eppstein, 1992; Eikel
et al., 2015). A generalized series-parallel graph is
defined by adding another series-type composition,
called generalized series composition, where