Towards an Approach to Select Features from Low Quality Datasets

José Manuel Cadenas, María del Carmen Garrido, Raquel Martínez

2012

Abstract

Feature selection is an active research area in machine learning. Its main idea is to choose a subset of the available features by eliminating those with little or no predictive information and those that are strongly correlated with one another. There are many approaches to feature selection, but most of them can only work with crisp data. To our knowledge, few approaches can work directly with both crisp and low quality (imprecise and uncertain) data. We therefore propose a new feature selection method that can handle both crisp and low quality data. The proposed approach integrates filter and wrapper methods into a sequential search procedure that improves the classification accuracy obtained with the selected features. It consists of the following steps: (1) scaling and discretization of the feature set, with feature pre-selection based on the discretization process (filter); (2) ranking of the pre-selected features using a Fuzzy Random Forest ensemble; (3) wrapper feature selection using a Fuzzy Decision Tree technique based on cross-validation. The efficiency and effectiveness of the approach are demonstrated through several experiments with low quality datasets. The approach shows excellent performance, not only in classification accuracy but also in the number of features selected.
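The three-step filter, ranking and wrapper sequence described in the abstract can be illustrated with a small sketch. This is not the authors' method: it uses crisp data only, and every component is a swapped-in assumption. Equal-width binning with an information-gain threshold stands in for the discretization-based filter, the same information gain stands in for the Fuzzy Random Forest ranking, and a cross-validated nearest-centroid classifier stands in for the Fuzzy Decision Tree wrapper. All function names, the `min_gain` threshold, and the toy data are hypothetical.

```python
import math
from collections import Counter

def scale(column):
    """Min-max scale one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

def discretize(column, bins=3):
    """Equal-width binning of a column already scaled to [0, 1]."""
    return [min(int(v * bins), bins - 1) for v in column]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(binned, labels):
    """Reduction in class entropy obtained by splitting on the binned feature."""
    n = len(labels)
    groups = {}
    for b, lab in zip(binned, labels):
        groups.setdefault(b, []).append(lab)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def centroid_accuracy(X, y, feats, folds=3):
    """k-fold cross-validated accuracy of a nearest-centroid classifier
    (assumed stand-in for the Fuzzy Decision Tree used in the paper)."""
    idx = list(range(len(y)))
    correct = 0
    for f in range(folds):
        test = idx[f::folds]
        train = [i for i in idx if i % folds != f]
        cents = {}
        for c in sorted(set(y[i] for i in train)):
            rows = [X[i] for i in train if y[i] == c]
            cents[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
        for i in test:
            pred = min(cents, key=lambda c: sum(
                (X[i][j] - m) ** 2 for j, m in zip(feats, cents[c])))
            correct += pred == y[i]
    return correct / len(y)

def select_features(X, y, min_gain=0.05):
    cols = [scale([row[j] for row in X]) for j in range(len(X[0]))]
    Xs = [list(r) for r in zip(*cols)]
    # Step 1 (filter): scale, discretize, drop features with negligible gain.
    gains = {j: info_gain(discretize(c), y) for j, c in enumerate(cols)}
    kept = [j for j in gains if gains[j] > min_gain]
    # Step 2 (ranking): order the pre-selected features, best first.
    ranked = sorted(kept, key=lambda j: -gains[j])
    # Step 3 (wrapper): add a feature only if cross-validated accuracy improves.
    chosen, best = [], 0.0
    for j in ranked:
        acc = centroid_accuracy(Xs, y, chosen + [j])
        if acc > best:
            chosen, best = chosen + [j], acc
    return chosen, best

# Toy data: feature 0 separates the classes, feature 1 is constant,
# feature 2 is uninformative, feature 3 is a redundant copy of feature 0.
y = [0] * 6 + [1] * 6
X = [[i if i < 6 else i + 10, 7, i % 2, 2 * (i if i < 6 else i + 10)]
     for i in range(12)]
selected, acc = select_features(X, y)
print(selected, acc)  # [0] 1.0
```

On this toy data the filter discards the constant and uninformative features, and the wrapper rejects the redundant copy because it does not raise the cross-validated accuracy, which mirrors the two goals stated in the abstract: removing features with little predictive information and removing strongly correlated ones.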



Paper Citation


in Harvard Style

Cadenas J. M., Garrido M. C. and Martínez R. (2012). Towards an Approach to Select Features from Low Quality Datasets. In Proceedings of the 4th International Joint Conference on Computational Intelligence - Volume 1: FCTA, (IJCCI 2012) ISBN 978-989-8565-33-4, pages 357-366. DOI: 10.5220/0004153503570366


in Bibtex Style

@conference{fcta12,
author={José Manuel Cadenas and María del Carmen Garrido and Raquel Martínez},
title={Towards an Approach to Select Features from Low Quality Datasets},
booktitle={Proceedings of the 4th International Joint Conference on Computational Intelligence - Volume 1: FCTA, (IJCCI 2012)},
year={2012},
pages={357-366},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004153503570366},
isbn={978-989-8565-33-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 4th International Joint Conference on Computational Intelligence - Volume 1: FCTA, (IJCCI 2012)
TI - Towards an Approach to Select Features from Low Quality Datasets
SN - 978-989-8565-33-4
AU - Cadenas, José Manuel
AU - Garrido, María del Carmen
AU - Martínez, Raquel
PY - 2012
SP - 357
EP - 366
DO - 10.5220/0004153503570366