Semantic Enrichment of Relevant Feature Selection Methods for Data Mining in Oncology

Adriana Da Silva Jacinto, Ricardo Da Silva Santos, José Maria Parente De Oliveira


This project proposes a computational method for capturing the semantic importance of each feature. The proposal enriches traditional feature selection methods by using Natural Language Processing, the NCI ontology, WordNet, and medical documents. A prototype of this approach was implemented and tested on five data sets related to cancer patients. The results show that the use of semantics improves pre-processing by selecting the most semantically relevant features.
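
As a rough illustration of the general idea (not the authors' prototype), the sketch below scores each feature name by how prominently its terms appear in a small set of medical documents, then blends that semantic score with a score from a traditional selection method. Everything here is an assumption made for illustration: the function names, the TF-IDF-style weighting, the linear blend with weight `alpha`, and the toy data. The NCI ontology and WordNet lookups mentioned in the abstract are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def semantic_scores(features, documents):
    """Hypothetical semantic score: TF-IDF-style weight of each feature
    name's terms over the documents, scaled so the best feature gets 1.0."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    raw = {}
    for feature in features:
        score = 0.0
        for term in tokenize(feature):
            df = sum(1 for counts in doc_counts if term in counts)
            if df == 0:
                continue  # term never appears in the documents
            tf = sum(counts[term] for counts in doc_counts)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += tf * idf
        raw[feature] = score
    top = max(raw.values(), default=0.0)
    return {f: (s / top if top > 0 else 0.0) for f, s in raw.items()}


def enriched_ranking(statistical, semantic, alpha=0.5):
    """Rank features by a linear blend (an assumption) of a traditional
    statistical score and the semantic score, both expected in [0, 1]."""
    combined = {
        f: alpha * statistical[f] + (1 - alpha) * semantic.get(f, 0.0)
        for f in statistical
    }
    return sorted(combined, key=combined.get, reverse=True)


# Toy data, invented for this example only.
documents = [
    "Tumor size and tumor grade guide staging.",
    "Patient age affects prognosis.",
]
statistical = {"tumor_size": 0.6, "age": 0.5, "record_id": 0.9}
semantic = semantic_scores(list(statistical), documents)
ranking = enriched_ranking(statistical, semantic)
```

In this toy setup, `record_id` has the highest statistical score but no support in the documents, so the blended ranking places `tumor_size` ahead of it. A real system would need a principled way to choose `alpha` and a richer notion of semantic relatedness than term matching.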



Paper Citation

in Harvard Style

Da Silva Jacinto A., Da Silva Santos R. and Parente De Oliveira J. (2014). Semantic Enrichment of Relevant Feature Selection Methods for Data Mining in Oncology. In Doctoral Consortium - DC3K, (IC3K 2014) ISBN Not Available, pages 24-30. DOI: 10.5220/0005172400240030

in Bibtex Style

author={Adriana Da Silva Jacinto and Ricardo Da Silva Santos and José Maria Parente De Oliveira},
title={Semantic Enrichment of Relevant Feature Selection Methods for Data Mining in Oncology},
booktitle={Doctoral Consortium - DC3K, (IC3K 2014)},
isbn={Not Available},

in EndNote Style

JO - Doctoral Consortium - DC3K, (IC3K 2014)
TI - Semantic Enrichment of Relevant Feature Selection Methods for Data Mining in Oncology
SN - Not Available
AU - Da Silva Jacinto A.
AU - Da Silva Santos R.
AU - Parente De Oliveira J.
PY - 2014
SP - 24
EP - 30
DO - 10.5220/0005172400240030