LINEAR-TIME MATCHING OF POSITION WEIGHT MATRICES

Nikola Stojanovic

Abstract

Position Weight Matrices (PWMs) are a popular way of representing variable motifs in genomic sequences, and they have been widely used for describing the binding sites of transcriptional proteins. However, the standard implementation of PWM matching, while reasonably efficient on shorter sequences, is too expensive for whole-genome searches. In this paper we present an algorithm we have developed for efficient matching of PWMs in long target sequences. After an initial pre-processing of the matrix, it runs in time linear in the size of the genomic segment.
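
For context, the standard matching that the abstract contrasts against can be sketched as a sliding-window scan: every window of the target is scored against every matrix column, so the cost grows with the product of sequence length and motif length. The sketch below is only that baseline, not the algorithm presented in the paper; the matrix layout, the function name `scan_pwm`, the toy scores, and the threshold are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Assumed layout: one dict of per-base scores for each motif position.
PWM = List[Dict[str, float]]


def scan_pwm(sequence: str, pwm: PWM, threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every window whose summed score reaches the threshold.

    Baseline sliding-window scan: O(n * m) for a sequence of length n
    and a matrix with m columns.
    """
    m = len(pwm)
    hits = []
    for start in range(len(sequence) - m + 1):
        score = 0.0
        for offset in range(m):
            # Bases missing from a column (e.g. 'N') disqualify the window.
            score += pwm[offset].get(sequence[start + offset], float("-inf"))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-column matrix favouring the motif "ACG" (with "T" tolerated last).
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]

hits = scan_pwm("TTACGACTA", toy_pwm, threshold=4)
```

The inner loop over matrix columns is what makes this baseline costly on whole genomes; a linear-time method has to avoid rescoring each window from scratch.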


Paper Citation


in Harvard Style

Stojanovic N. (2010). LINEAR-TIME MATCHING OF POSITION WEIGHT MATRICES. In Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010) ISBN 978-989-674-019-1, pages 66-73. DOI: 10.5220/0002750500660073


in Bibtex Style

@conference{bioinformatics10,
author={Nikola Stojanovic},
title={LINEAR-TIME MATCHING OF POSITION WEIGHT MATRICES},
booktitle={Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010)},
year={2010},
pages={66-73},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002750500660073},
isbn={978-989-674-019-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Bioinformatics - Volume 1: BIOINFORMATICS, (BIOSTEC 2010)
TI - LINEAR-TIME MATCHING OF POSITION WEIGHT MATRICES
SN - 978-989-674-019-1
AU - Stojanovic N.
PY - 2010
SP - 66
EP - 73
DO - 10.5220/0002750500660073