similarity measure could classify the organisms very
well. In future work, we plan to analyze more organisms,
to find the minimum grammar for generating the proteomes
of more organisms, and to investigate evolutionary
processes more comprehensively.
ACKNOWLEDGEMENTS
This work was partially supported by JSPS KAKENHI
Grant Number JP19K12228.
BIOINFORMATICS 2020 - 11th International Conference on Bioinformatics Models, Methods and Algorithms