WRAPPER AND FILTER METRICS FOR PSO-BASED CLASS BALANCE APPLIED TO PROTEIN SUBCELLULAR LOCALIZATION

S. Garcia López, J. A. Jaramillo-Garzón, J. C. Higuita-Vásquez, C. G. Castellanos-Domínguez

Abstract

Recent advances in proteomic research have generated an unprecedented amount of stored data. Given the size of current databases, manual annotation has become almost intractable, paving the way for computational methods. Because a single protein can belong to several functional classes, annotation becomes a multi-label classification problem. The most common way to handle such problems is to train one classifier per class, so that an independent decision is made on the membership of each protein in each class. Nevertheless, this methodology leads to a high degree of imbalance between classes, magnifying the disparity already present in their sizes. Current balancing techniques optimize a criterion to select a subset that better represents the data, and most of these sample selection criteria are wrapper-type metrics, which are computationally expensive. This work presents a comparative analysis of wrapper and filter metrics as sample selection criteria in balancing techniques. To this end, a subsampling technique based on Particle Swarm Optimization is used to obtain the optimal balanced subset. The results show that filter metrics notably reduce the computational cost while achieving performance similar to that of wrapper-type metrics.
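
The subsampling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which PSO variant or which filter metric was used, so the code below assumes a binary PSO with a sigmoid transfer function and a Fisher-style separability score as the filter metric. All names (`fisher_score`, `fitness`, `pso_undersample`), the balance penalty, and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def fisher_score(X_maj, X_min):
    """Assumed filter metric: mean per-feature Fisher ratio between classes."""
    mu_a, mu_b = X_maj.mean(axis=0), X_min.mean(axis=0)
    var_a, var_b = X_maj.var(axis=0), X_min.var(axis=0)
    return float(np.mean((mu_a - mu_b) ** 2 / (var_a + var_b + 1e-12)))

def fitness(mask, X_maj, X_min):
    """Score a candidate subset of majority samples (higher is better)."""
    k = int(mask.sum())
    if k < 2:  # too few samples to estimate a variance
        return -np.inf
    # Hypothetical penalty that pulls the subset size toward the minority size.
    balance_penalty = abs(k - len(X_min)) / len(X_maj)
    return fisher_score(X_maj[mask], X_min) - balance_penalty

def pso_undersample(X_maj, X_min, n_particles=20, n_iter=50,
                    w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO over keep/drop masks for the majority class."""
    rng = np.random.default_rng(seed)
    n = len(X_maj)
    # Positions are 0/1 vectors; start close to a balanced subset size.
    pos = (rng.random((n_particles, n)) < len(X_min) / n).astype(float)
    vel = rng.normal(0.0, 1.0, (n_particles, n))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p.astype(bool), X_maj, X_min) for p in pos])
    g = int(np.argmax(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, n))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -4.0, 4.0)
        # Sigmoid transfer: velocity becomes the probability of keeping a sample.
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random((n_particles, n)) < prob).astype(float)
        fit = np.array([fitness(p.astype(bool), X_maj, X_min) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(np.argmax(pbest_fit))
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest.astype(bool), float(gbest_fit)

# Synthetic data for demonstration only (not protein features).
demo_rng = np.random.default_rng(1)
X_majority = demo_rng.normal(0.0, 1.0, (200, 5))
X_minority = demo_rng.normal(1.5, 1.0, (20, 5))
keep_mask, best_fit = pso_undersample(X_majority, X_minority, n_iter=30)
print("majority samples kept:", int(keep_mask.sum()), "of", len(X_majority))
```

In this sketch, a wrapper metric would replace `fisher_score` with the cross-validated performance of a trained classifier. Every particle evaluation would then require a training run, which is the kind of computational cost the abstract attributes to wrapper metrics.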


Paper Citation


in Harvard Style

Garcia López, S., Jaramillo-Garzón, J. A., Higuita-Vásquez, J. C. and Castellanos-Domínguez, C. G. (2012). WRAPPER AND FILTER METRICS FOR PSO-BASED CLASS BALANCE APPLIED TO PROTEIN SUBCELLULAR LOCALIZATION. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012), ISBN 978-989-8425-90-4, pages 214-219. DOI: 10.5220/0003782702140219


in Bibtex Style

@conference{bioinformatics12,
author={S. Garcia López and J. A. Jaramillo-Garzón and J. C. Higuita-Vásquez and C. G. Castellanos-Domínguez},
title={WRAPPER AND FILTER METRICS FOR PSO-BASED CLASS BALANCE APPLIED TO PROTEIN SUBCELLULAR LOCALIZATION},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)},
year={2012},
pages={214-219},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003782702140219},
isbn={978-989-8425-90-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)
TI - WRAPPER AND FILTER METRICS FOR PSO-BASED CLASS BALANCE APPLIED TO PROTEIN SUBCELLULAR LOCALIZATION
SN - 978-989-8425-90-4
AU - Garcia López, S.
AU - Jaramillo-Garzón, J. A.
AU - Higuita-Vásquez, J. C.
AU - Castellanos-Domínguez, C. G.
PY - 2012
SP - 214
EP - 219
DO - 10.5220/0003782702140219