A Centroid-based Approach for Hierarchical Classification

Mauri Ferrandin, Fabrício Enembreck, Julio César Nievola, Edson Emílio Scalabrin, Bráulio Coelho Ávila

Abstract

Classification is a common task in Machine Learning and Data Mining. Some classification problems must take into account a hierarchical taxonomy that establishes an order among the classes involved; these are called hierarchical classification problems. Protein function prediction can be considered a hierarchical classification problem because protein functions may be arranged in a hierarchical taxonomy of classes. This paper presents an algorithm for hierarchical classification using a centroid-based approach, in two versions named HCCS and HCCSic. Centroid-based techniques have been widely used for text classification, and in this work we explore their adoption in a hierarchical classification scenario. The proposed algorithm was evaluated on eight real datasets and compared against two other recent algorithms from the literature. Preliminary results show that the proposed approach is a viable alternative for hierarchical classification, its main advantages being simplicity and low computational complexity combined with good accuracy.
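To make the general idea concrete, the following is a minimal sketch of a generic centroid-based classifier applied to a class hierarchy. It is not the HCCS or HCCSic algorithm, whose details are not given in this abstract. The sketch assumes dot-separated path labels such as "1.2", cosine similarity as the comparison function, one mean-vector centroid per class built from the examples of that class and its descendants, and a top-down descent that always continues to a leaf. The function names (`fit_centroids`, `predict`, and so on) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def ancestors_and_self(label):
    """'1.2.3' -> ['1', '1.2', '1.2.3'] (dot-separated labels are an assumption)."""
    parts = label.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def fit_centroids(X, y):
    """Build one centroid (mean vector) per class, counting each example
    towards its own class and every ancestor class."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        for cls in ancestors_and_self(label):
            if cls not in sums:
                sums[cls] = [0.0] * len(x)
            sums[cls] = [s + xi for s, xi in zip(sums[cls], x)]
            counts[cls] += 1
    return {cls: [s / counts[cls] for s in vec] for cls, vec in sums.items()}


def children(centroids, parent):
    """Direct children of `parent`; the empty string stands for the root."""
    prefix = parent + "." if parent else ""
    return [c for c in centroids
            if c.startswith(prefix) and "." not in c[len(prefix):]]


def predict(centroids, x):
    """Descend from the root, choosing the most similar child centroid at each
    level, and return the path of predicted classes down to a leaf."""
    path, node = [], ""
    while True:
        kids = children(centroids, node)
        if not kids:
            return path
        node = max(kids, key=lambda c: cosine(centroids[c], x))
        path.append(node)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
         [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
    y = ["1.1", "1.1", "1.2", "1.2", "2"]
    model = fit_centroids(X, y)
    print(predict(model, [1.0, 0.2, 0.0]))
```

Training in this sketch is a single pass over the data and prediction compares the query only against the centroids along one root-to-leaf path, which illustrates why centroid-based methods tend to be simple and computationally cheap.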



Paper Citation


in Harvard Style

Ferrandin M., Enembreck F., Nievola J., Scalabrin E. and Ávila B. (2015). A Centroid-based Approach for Hierarchical Classification. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 25-33. DOI: 10.5220/0005339000250033


in Bibtex Style

@conference{iceis15,
author={Mauri Ferrandin and Fabrício Enembreck and Julio César Nievola and Edson Emílio Scalabrin and Bráulio Coelho Ávila},
title={A Centroid-based Approach for Hierarchical Classification},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2015},
pages={25-33},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005339000250033},
isbn={978-989-758-096-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - A Centroid-based Approach for Hierarchical Classification
SN - 978-989-758-096-3
AU - Ferrandin M.
AU - Enembreck F.
AU - Nievola J.
AU - Scalabrin E.
AU - Ávila B.
PY - 2015
SP - 25
EP - 33
DO - 10.5220/0005339000250033