GRAPHLET DATA MINING OF ENERGETICAL INTERACTION PATTERNS IN PROTEIN 3D STRUCTURES

Carsten Henneges, Marc Röttig, Oliver Kohlbacher, Andreas Zell

2010

Abstract

Interactions between secondary structure elements (SSEs) in the core of proteins are evolutionarily conserved and define the overall fold of a protein. They can thus be used to classify protein families. Using a graph representation of SSE interactions and data mining techniques, we identify overrepresented graphlets that can be used for protein classification. In total, we find 627 significant graphlets within the ICGEB Protein Benchmark database (SCOP40mini) and the Super-Secondary Structure database (SSSDB). Based on these graphlets, decision trees predict the four SCOP levels and the SSSDB (sub)motif classes with a mean Area Under Curve (AUC) better than 0.89 in 5-fold cross-validation (CV). Regularized decision trees reveal that about 20 graphlets suffice for reliable predictions in each classification task. Graphlets composed of five secondary structure interactions are the most informative. Finally, we find that graphlets can themselves be predicted from secondary structure using decision trees (5-fold CV), with a Matthews Correlation Coefficient (MCC) reaching up to 0.7.
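To make the MCC figure easier to read, the following minimal sketch computes the Matthews Correlation Coefficient from binary confusion counts. The function name `mcc` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper, and this is not the authors' evaluation code.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Hypothetical helper for illustration. Returns 0.0 when any marginal
    sum is zero, which is a common convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only (not results from the paper):
# 85 true positives, 85 true negatives, 15 false positives, 15 false negatives.
print(mcc(85, 85, 15, 15))  # 0.7
# A perfect predictor scores 1.0, a chance-level one scores 0.0.
print(mcc(50, 50, 0, 0))    # 1.0
print(mcc(25, 25, 25, 25))  # 0.0
```

Unlike plain accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when one class is much rarer than the other, as can be expected for individual graphlets.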

Paper Citation


in Harvard Style

Henneges C., Röttig M., Kohlbacher O. and Zell A. (2010). GRAPHLET DATA MINING OF ENERGETICAL INTERACTION PATTERNS IN PROTEIN 3D STRUCTURES. In Proceedings of the International Conference on Fuzzy Computation and 2nd International Conference on Neural Computation - Volume 1: ICNC, (IJCCI 2010), ISBN 978-989-8425-32-4, pages 190-195. DOI: 10.5220/0003077501900195


in Bibtex Style

@conference{icnc10,
author={Carsten Henneges and Marc Röttig and Oliver Kohlbacher and Andreas Zell},
title={GRAPHLET DATA MINING OF ENERGETICAL INTERACTION PATTERNS IN PROTEIN 3D STRUCTURES},
booktitle={Proceedings of the International Conference on Fuzzy Computation and 2nd International Conference on Neural Computation - Volume 1: ICNC, (IJCCI 2010)},
year={2010},
pages={190-195},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003077501900195},
isbn={978-989-8425-32-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Fuzzy Computation and 2nd International Conference on Neural Computation - Volume 1: ICNC, (IJCCI 2010)
TI - GRAPHLET DATA MINING OF ENERGETICAL INTERACTION PATTERNS IN PROTEIN 3D STRUCTURES
SN - 978-989-8425-32-4
AU - Henneges C.
AU - Röttig M.
AU - Kohlbacher O.
AU - Zell A.
PY - 2010
SP - 190
EP - 195
DO - 10.5220/0003077501900195