A Methodology for Optimizing the Cost Matrix in Cost Sensitive Learning Models applied to Prediction of Molecular Functions in Embryophyta Plants

S. García-López, J. A. Jaramillo-Garzón, L. Duque-Muñoz, C. G. Castellanos-Domínguez

2013

Abstract

Genomics and proteomics research generates large amounts of data, and computational methods have become an important support tool for analyzing it. However, tools based on machine learning face several problems associated with the nature of the data, one of which is the class-imbalance problem. Several balancing techniques, such as boosting and resampling, exist to improve prediction performance, but they have multiple weaknesses in difficult data spaces. Cost-sensitive learning is an alternative solution; however, obtaining an appropriate cost matrix to induce a good prediction model is complex and remains an open problem. In this paper, a methodology to obtain an optimal cost matrix for training models based on cost-sensitive learning is proposed. The results show that cost-sensitive learning with a proper cost matrix can be very competitive, and can even outperform many state-of-the-art class-balancing strategies. Tests were applied to the prediction of molecular functions in Embryophyta plants.
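As a rough illustration of the idea summarized in the abstract, the sketch below shows how a cost matrix can change the decision of a probabilistic binary classifier, and how one entry of that matrix might be tuned on validation data. It is not the methodology proposed in the paper. The minimum-expected-cost rule follows the usual cost-sensitive learning formulation, while the exhaustive search over candidate costs, the geometric-mean score, the function names (`predict`, `g_mean`, `search_fn_cost`), and the toy data are all assumptions made for illustration only.

```python
from math import sqrt


def expected_costs(probs, cost):
    """Expected cost of predicting each class.

    cost[i][j] is the (assumed) cost of predicting class i
    when the true class is j; probs[j] is P(true class = j).
    """
    return [sum(p * cost[i][j] for j, p in enumerate(probs))
            for i in range(len(cost))]


def predict(probs, cost):
    """Pick the class with the minimum expected cost (ties go to the lower index)."""
    costs = expected_costs(probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (class 1 = minority)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return sqrt((tp / pos) * (tn / neg))


def search_fn_cost(prob_pos, y_true, candidates):
    """Hypothetical stand-in optimizer: try each false-negative cost
    and keep the one with the best g-mean on the validation set."""
    best_cost, best_score = None, -1.0
    for c in candidates:
        # False-positive cost fixed at 1; only the false-negative cost varies.
        cost = [[0.0, c], [1.0, 0.0]]
        y_pred = [predict([1.0 - p, p], cost) for p in prob_pos]
        score = g_mean(y_true, y_pred)
        if score > best_score:
            best_cost, best_score = c, score
    return best_cost, best_score


# Toy validation data (made up): predicted P(class 1) and true labels.
prob_pos = [0.1, 0.2, 0.3, 0.05, 0.15, 0.4, 0.35, 0.6]
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
best_cost, best_score = search_fn_cost(prob_pos, y_true, [1.0, 2.0, 5.0])
```

With symmetric costs, a sample with a 0.3 probability of belonging to the minority class is assigned to the majority class. Raising the false-negative cost shifts the decision toward the minority class, which is why the choice of cost matrix matters for imbalanced data.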



Paper Citation


in Harvard Style

García-López S., Jaramillo-Garzón J. A., Duque-Muñoz L. and Castellanos-Domínguez C. G. (2013). A Methodology for Optimizing the Cost Matrix in Cost Sensitive Learning Models applied to Prediction of Molecular Functions in Embryophyta Plants. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 71-80. DOI: 10.5220/0004250900710080


in Bibtex Style

@conference{bioinformatics13,
author={S. García-López and J. A. Jaramillo-Garzón and L. Duque-Muñoz and C. G. Castellanos-Domínguez},
title={A Methodology for Optimizing the Cost Matrix in Cost Sensitive Learning Models applied to Prediction of Molecular Functions in Embryophyta Plants},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},
year={2013},
pages={71-80},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004250900710080},
isbn={978-989-8565-35-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - A Methodology for Optimizing the Cost Matrix in Cost Sensitive Learning Models applied to Prediction of Molecular Functions in Embryophyta Plants
SN - 978-989-8565-35-8
AU - García-López S.
AU - Jaramillo-Garzón J. A.
AU - Duque-Muñoz L.
AU - Castellanos-Domínguez C. G.
PY - 2013
SP - 71
EP - 80
DO - 10.5220/0004250900710080