USE OF DOMAIN KNOWLEDGE FOR DIMENSION REDUCTION - Application to Mining of Drug Side Effects

Emmanuel Bresso, Sidahmed Benabderrahmane, Malika Smail-Tabbone, Gino Marchetti, Arnaud Sinan Karaboga, Michel Souchet, Amedeo Napoli, Marie-Dominique Devignes


High-dimensional datasets can impair the execution of most data mining programs and yield numerous, complex patterns that are unsuitable for interpretation by domain experts. Dimension reduction is therefore an important research direction, and one in which domain knowledge plays an essential role. We present a new approach for reducing the dimensions of a dataset by exploiting semantic relationships between the terms of an ontology structured as a rooted directed acyclic graph. Terms are clustered using the recently described IntelliGO similarity measure, and the resulting term clusters are then used as descriptors for data representation. The strategy is applied to a set of drugs associated with their side effects, collected from the SIDER database. The terms describing side effects belong to the MedDRA terminology. Hierarchical clustering of about 1,200 MedDRA terms into an optimal collection of 112 term clusters leads to a reduced data representation. Two data mining experiments are then conducted to illustrate the advantage of using this reduced representation.
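
The general idea of replacing individual terms by term clusters can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the similarity values are made up (they are not IntelliGO scores), average linkage stands in for whatever hierarchical clustering criterion the paper uses, and the names `average_linkage_clusters`, `reduce_representation`, `drug_A` and `drug_B` are hypothetical.

```python
from itertools import combinations


def average_linkage_clusters(terms, similarity, n_clusters):
    """Agglomerative clustering: repeatedly merge the two most similar clusters."""
    clusters = [[t] for t in terms]

    def sim(a, b):
        # Average pairwise similarity between the members of two clusters.
        total = sum(similarity[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the pair of clusters (i < j) with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: sim(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def reduce_representation(drug_terms, clusters):
    """Describe each drug by the indices of the term clusters it touches."""
    term_to_cluster = {t: k for k, members in enumerate(clusters) for t in members}
    return {
        drug: sorted({term_to_cluster[t] for t in ts})
        for drug, ts in drug_terms.items()
    }


# Toy, made-up pairwise similarities (assumption: NOT IntelliGO output).
terms = ["nausea", "vomiting", "rash", "pruritus"]
similarity = {
    frozenset(("nausea", "vomiting")): 0.9,
    frozenset(("rash", "pruritus")): 0.8,
    frozenset(("nausea", "rash")): 0.1,
    frozenset(("nausea", "pruritus")): 0.1,
    frozenset(("vomiting", "rash")): 0.2,
    frozenset(("vomiting", "pruritus")): 0.1,
}

# Hypothetical drugs, each described by a set of side-effect terms.
drug_terms = {
    "drug_A": {"nausea", "vomiting"},
    "drug_B": {"vomiting", "rash", "pruritus"},
}

clusters = average_linkage_clusters(terms, similarity, n_clusters=2)
reduced = reduce_representation(drug_terms, clusters)
print(clusters)
print(reduced)
```

In this toy setting, four term-level attributes collapse into two cluster-level attributes. The paper reports the same kind of reduction at scale, from about 1,200 MedDRA terms to 112 term clusters.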


Paper Citation

in Harvard Style

Bresso E., Benabderrahmane S., Smail-Tabbone M., Marchetti G., Sinan Karaboga A., Souchet M., Napoli A. and Devignes M. (2011). USE OF DOMAIN KNOWLEDGE FOR DIMENSION REDUCTION - Application to Mining of Drug Side Effects. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011), ISBN 978-989-8425-79-9, pages 263-268. DOI: 10.5220/0003662602710276
