Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles

Matteo Comin, Morris Antonelli

Abstract

Enhancers are stretches of DNA (100-1000 bp) that play a major role in developmental gene expression, evolution, and disease. It has recently been shown that in higher eukaryotes enhancers rarely work alone; instead, they collaborate by forming clusters of cis-regulatory modules (CRMs). Even though the binding of transcription factors is sequence-specific, identifying functionally similar enhancers is very difficult and cannot be carried out with traditional alignment-based techniques. In this paper we study the use of alignment-free measures for the classification of CRMs. Alignment-free measures, however, are generally tied to a fixed resolution k (the word length). Here we propose an alignment-free statistic based on multiple-resolution patterns derived from Entropic Profiles. An Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. We evaluate several alignment-free statistics on simulated data and on real mouse ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, outperforms fixed-resolution methods.
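
To make the notion of a fixed resolution k concrete, the following minimal Python sketch computes a classical fixed-resolution alignment-free score, the D2 statistic (the inner product of the k-mer count vectors of two sequences). It is an illustration only and not the statistic proposed in the paper; the function names `kmer_counts` and `d2` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """D2 score: sum over all k-mers of count_in_a * count_in_b.

    Illustrative fixed-resolution measure; k must be chosen in advance.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for words absent from seq_b, so only shared
    # k-mers contribute to the sum.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    a = "ACGTACGTTTGACA"
    b = "TTGACAACGTAC"
    # The score changes with k, which is why a single fixed resolution
    # can miss similarity that is visible at another word length.
    for k in (2, 4, 6):
        print(k, d2(a, b, k))
```

Because the score depends on the chosen k, a measure of this kind has to be re-evaluated for each resolution, which is the limitation a multiple-resolution statistic is meant to address.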


Paper Citation


in Harvard Style

Comin M. and Antonelli M. (2015). Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015) ISBN 978-989-758-070-3, pages 171-177. DOI: 10.5220/0005251001710177


in Bibtex Style

@conference{bioinformatics15,
author={Matteo Comin and Morris Antonelli},
title={Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},
year={2015},
pages={171-177},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005251001710177},
isbn={978-989-758-070-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - Fast Alignment-free Comparison for Regulatory Sequences using Multiple Resolution Entropic Profiles
SN - 978-989-758-070-3
AU - Comin M.
AU - Antonelli M.
PY - 2015
SP - 171
EP - 177
DO - 10.5220/0005251001710177