Learning Advanced TFBS Models from Chip-Seq Data - diChIPMunk: Effective Construction of Dinucleotide Positional Weight Matrices

Ivan V. Kulakovskiy, Victor G. Levitsky, Dmitry G. Oschepkov, Ilya E. Vorontsov, Vsevolod J. Makeev

Abstract

Identification and subsequent analysis of DNA sequence motifs recognized by transcription factors is an important component of studying transcriptional regulation in higher eukaryotes. In particular, motif discovery methods are applied to construct models of transcription factor binding sites (TFBSs). The TFBS models are then used to predict putative binding sites in genomic regions of interest. The most popular TFBS model is the positional weight matrix (PWM). A PWM is usually constructed from nucleotide positional frequencies estimated from a gapless multiple local alignment of experimentally identified TFBS sequences. Modern high-throughput experiments, such as ChIP-Seq, provide enough data for careful training of more advanced models with more parameters. Nevertheless, most existing tools for TFBS prediction in ChIP-Seq data still rely on PWMs with independent positions. This is partly explained by the only marginal improvement in specificity and sensitivity of TFBS recognition that advanced models have shown over traditional PWMs when trained on ChIP-Seq data. Here we present a novel computational tool, diChIPMunk (http://autosome.ru/dichipmunk/), which constructs dinucleotide PWMs that account for correlations between neighboring nucleotides in the input sequences. diChIPMunk retains the advantages of the published ChIPMunk algorithm, including the use of ChIP-Seq peak shape and overall computational efficiency. Using public ChIP-Seq data for several TFs, we show that carefully trained dinucleotide PWMs perform significantly better than PWMs based on mononucleotide frequencies.
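To illustrate what a dinucleotide PWM is, the following minimal Python sketch builds one from a gapless alignment of sites and scores a candidate word. It is not the diChIPMunk algorithm: the function names, the pseudocount of 1, the uniform dinucleotide background of 1/16, and the toy input sites are all assumptions made for illustration, and the sketch expects uppercase A/C/G/T input only.

```python
import math
from itertools import product

# All 16 dinucleotides over the DNA alphabet.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_pwm(aligned_sites, pseudocount=1.0):
    """Build a log-odds dinucleotide matrix from gapless aligned sites.

    Assumptions (not from the paper): a flat pseudocount and a uniform
    dinucleotide background of 1/16.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    background = 1.0 / 16
    matrix = []
    # A site of length L has L - 1 overlapping dinucleotide columns.
    for pos in range(length - 1):
        counts = {d: pseudocount for d in DINUCLEOTIDES}
        for site in aligned_sites:
            counts[site[pos:pos + 2]] += 1
        total = sum(counts.values())
        matrix.append(
            {d: math.log((c / total) / background) for d, c in counts.items()}
        )
    return matrix


def score(matrix, word):
    """Sum the weights of the overlapping dinucleotides of `word`."""
    if len(word) != len(matrix) + 1:
        raise ValueError("word length must match the matrix")
    return sum(column[word[i:i + 2]] for i, column in enumerate(matrix))


# Toy alignment (hypothetical data, for illustration only).
sites = ["ACGT", "ACGT", "ACGA", "TCGT"]
pwm = dinucleotide_pwm(sites)
consensus_score = score(pwm, "ACGT")
unrelated_score = score(pwm, "TTTT")
```

Because each column is indexed by a pair of adjacent nucleotides, the model can reward or penalize specific neighbor combinations, which a mononucleotide PWM with independent positions cannot express. In the toy example, the word matching the alignment consensus scores higher than an unrelated word.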


Paper Citation


in Harvard Style

Kulakovskiy I. V., Levitsky V. G., Oschepkov D. G., Vorontsov I. E. and Makeev V. J. (2013). Learning Advanced TFBS Models from Chip-Seq Data - diChIPMunk: Effective Construction of Dinucleotide Positional Weight Matrices. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 146-150. DOI: 10.5220/0004238201460150


in Bibtex Style

@conference{bioinformatics13,
author={Ivan V. Kulakovskiy and Victor G. Levitsky and Dmitry G. Oschepkov and Ilya E. Vorontsov and Vsevolod J. Makeev},
title={Learning Advanced TFBS Models from Chip-Seq Data - diChIPMunk: Effective Construction of Dinucleotide Positional Weight Matrices},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},
year={2013},
pages={146-150},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004238201460150},
isbn={978-989-8565-35-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - Learning Advanced TFBS Models from Chip-Seq Data - diChIPMunk: Effective Construction of Dinucleotide Positional Weight Matrices
SN - 978-989-8565-35-8
AU - Kulakovskiy, Ivan V.
AU - Levitsky, Victor G.
AU - Oschepkov, Dmitry G.
AU - Vorontsov, Ilya E.
AU - Makeev, Vsevolod J.
PY - 2013
SP - 146
EP - 150
DO - 10.5220/0004238201460150