Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis

Daisuke Ikeda, Osamu Maruyama, Satoru Kuhara

Abstract

With the growing number of available sequences, comparative genomics is becoming increasingly important. Its basic approach is to find similar subsequences in the sequences of different species and then examine in detail the differences among the similar parts found. Instead of focusing on similar parts, this paper is devoted to finding different parts directly from whole DNA sequences. This is challenging because the large size of the sequences prohibits computationally expensive methods and because there exist so many differences in a genome-wide comparison. To cope with this, we exploit the algorithm of (Ikeda and Suzuki, 2009), which finds unexpected, infrequent patterns. However, the patterns found had not been evaluated from a biological viewpoint. In this paper, we show that patterns discovered by the algorithm from bacterial genome sequences match biological features such as RNA and transposons well. Therefore, treating these features as relevant regions, we compute F-measure values and show that some species achieve about 90%, which is one order of magnitude better than the values for patterns found by an existing method. Thus, we conclude that the algorithm can find these infrequent but biologically meaningful patterns from genome-wide sequences.
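
The F-measure evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not state how overlap between discovered patterns and annotated features is counted, so this sketch assumes base-level overlap between half-open intervals. The function names and the toy coordinates are hypothetical and are not taken from the paper.

```python
def covered_positions(intervals):
    """Return the set of positions covered by half-open intervals [start, end)."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def f_measure(predicted, relevant):
    """Base-level precision, recall, and F-measure (harmonic mean of the two).

    Assumption: a position counts as correct if it lies in both a predicted
    pattern occurrence and a relevant (annotated) region.
    """
    pred = covered_positions(predicted)
    rel = covered_positions(relevant)
    overlap = len(pred & rel)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(rel) if rel else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example (made-up coordinates): one predicted occurrence, one annotated region.
predicted = [(0, 10)]   # hypothetical pattern occurrence
relevant = [(5, 15)]    # hypothetical annotated feature, e.g. an RNA gene
precision, recall, f = f_measure(predicted, relevant)
print(precision, recall, f)
```

Building a set of positions is simple but memory-hungry for whole genomes; an interval-sweep computation would give the same counts with less memory.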

References

  1. Apostolico, A., Bock, M. E., Lonardi, S., and Xu, X. (2000). Efficient Detection of Unusual Words. J. of Comput. Biol., 7(1/2):71-94.
  2. Beißbarth, T. and Speed, T. P. (2004). GOstat: Find Statistically Overrepresented Gene Ontologies within a Group of Genes. Bioinformatics, 20(9):1464-1465.
  3. Horng, J.-T., Huang, H.-D., Huang, S.-L., Yang, U.-C., and Chang, Y.-C. (2002). Mining Putative Regulatory Elements in Promoter Regions of Saccharomyces Cerevisiae. In Silico Biology, 2(3):263-273.
  4. Huang, H.-D., Chang, H.-L., Tsou, T.-S., Liu, B.-J., Kao, C.-Y., and Horng, J.-T. (2003). A Data Mining Method to Predict Transcriptional Regulatory Sites Based on Differentially Expressed Genes in Human Genome. J. of Info. Sci. and Eng., 19(6):923-942.
  5. Ikeda, D. and Suzuki, E. (2009). Mining Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts. In Proc. of ECML PKDD, pages 596-611.
  6. Ji, X., Bailey, J., and Dong, G. (2005). Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints. In Proc. of ICDM, pages 194-201.
  7. Leung, M.-Y., Marsh, G. M., and Speed, T. P. (1996). Over- and Underrepresentation of Short DNA Words in Herpesvirus Genomes. J. of Comput. Biol., 3(3):345-360.
  8. Marschall, T. and Rahmann, S. (2009). Efficient Exact Motif Discovery. Bioinformatics, 25(12):i356-i364.
  9. Parida, L. (2007). Pattern Discovery in Bioinformatics: Theory & Algorithms. Chapman & Hall/CRC.
  10. Robin, S., Rodolphe, F., and Schbath, S. (2005). DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press.
  11. Schbath, S. (1997). An Efficient Statistic to Detect Over- and Under-represented Words in DNA Sequences. J. of Comput. Biol., 4(2):189-192.


Paper Citation


in Harvard Style

Ikeda D., Maruyama O. and Kuhara S. (2013). Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 308-311. DOI: 10.5220/0004241203080311


in Bibtex Style

@conference{bioinformatics13,
author={Daisuke Ikeda and Osamu Maruyama and Satoru Kuhara},
title={Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},
year={2013},
pages={308-311},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004241203080311},
isbn={978-989-8565-35-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis
SN - 978-989-8565-35-8
AU - Ikeda D.
AU - Maruyama O.
AU - Kuhara S.
PY - 2013
SP - 308
EP - 311
DO - 10.5220/0004241203080311