Authors:
Carsten Henneges; Marc Röttig; Oliver Kohlbacher and Andreas Zell
Affiliation:
Eberhard Karls Universität Tübingen, Germany
Keyword(s):
Graphlets, Data Mining, Relative Neighbourhood Graph, Secondary Structure Elements, Decision Tree Model Selection.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Biometrics; Computational Intelligence; Data Manipulation; Health Engineering and Technology Applications; Human-Computer Interaction; Methodologies and Methods; Neural Networks; Neurocomputing; Neuroinformatics and Bioinformatics; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Sensor Networks; Signal Processing; Soft Computing; Supervised and Unsupervised Learning; Theory and Methods
Abstract:
Interactions between secondary structure elements (SSEs) in the core of proteins are evolutionarily conserved and define the overall fold of proteins. They can thus be used to classify protein families. Using a graph representation of SSE interactions and data mining techniques, we identify overrepresented graphlets that can be used for protein classification. We find, in total, 627 significant graphlets within the ICGEB Protein Benchmark database (SCOP40mini) and the Super-Secondary Structure database (SSSDB). Based on graphlets, decision trees are able to predict the four SCOP levels and SSSDB (sub)motif classes with a mean Area Under Curve (AUC) better than 0.89 (5-fold CV). Regularized decision trees reveal that for each classification task about 20 graphlets suffice for reliable predictions. Graphlets composed of five secondary structure interactions are most informative. Finally, we find that graphlets can be predicted from secondary structure using decision trees (5-fold CV) with a Matthews Correlation Coefficient (MCC) reaching up to 0.7.