A COMPARISON OF MULTIVARIATE MUTUAL INFORMATION ESTIMATORS FOR FEATURE SELECTION

Gauthier Doquire, Michel Verleysen

Abstract

Mutual information estimation is an important task for many data mining and machine learning applications. In particular, many feature selection algorithms make use of the mutual information criterion and could thus benefit greatly from a reliable way to estimate it. More precisely, the multivariate mutual information (computed between multivariate random variables) can naturally be combined with very popular search procedures such as greedy forward selection to build a subset of the most relevant features. However, estimating the mutual information between high-dimensional variables (especially through density function estimation) is a hard task in practice, due to the limited number of data points available for real-world problems. This paper compares popular mutual information estimators and shows how a nearest neighbors-based estimator largely outperforms its competitors when used with high-dimensional data.
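As a concrete illustration of the procedure the abstract describes, the sketch below (not the authors' code) combines the nearest-neighbor mutual information estimator of Kraskov et al. (2004, algorithm 1) with a greedy forward search. It assumes NumPy and SciPy are available; the function names `kraskov_mi` and `greedy_forward` are illustrative, not from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kraskov_mi(x, y, k=3):
    """Kraskov-Stoegbauer-Grassberger MI estimate (algorithm 1, max-norm).

    x, y: 2-D arrays of shape (n_samples, n_dims); k: number of neighbors.
    """
    n = len(x)
    joint = np.hstack([x, y])
    # Distance to the k-th nearest neighbor in the joint space (Chebyshev norm);
    # query k+1 points because the query point itself is its own nearest neighbor.
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    tx, ty = cKDTree(x), cKDTree(y)
    # Count neighbors strictly within eps in each marginal space (excluding self).
    nx = np.array([len(tx.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(ty.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

def greedy_forward(X, target, n_features, k=3):
    """Greedy forward selection: repeatedly add the feature that maximizes the
    estimated multivariate MI between the selected subset and the target."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_features):
        best = max(remaining,
                   key=lambda f: kraskov_mi(X[:, selected + [f]], target, k))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note that the Chebyshev (max) norm in the joint space is part of the estimator's derivation, not a free choice, and that the multivariate estimate `kraskov_mi(X[:, selected + [f]], target, k)` is exactly where high dimensionality stresses the estimator, which is the comparison the paper carries out.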



Paper Citation


in Harvard Style

Doquire G. and Verleysen M. (2012). A COMPARISON OF MULTIVARIATE MUTUAL INFORMATION ESTIMATORS FOR FEATURE SELECTION. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8425-98-0, pages 176-185. DOI: 10.5220/0003726101760185


in Bibtex Style

@conference{icpram12,
author={Gauthier Doquire and Michel Verleysen},
title={A COMPARISON OF MULTIVARIATE MUTUAL INFORMATION ESTIMATORS FOR FEATURE SELECTION},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2012},
pages={176-185},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003726101760185},
isbn={978-989-8425-98-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - A COMPARISON OF MULTIVARIATE MUTUAL INFORMATION ESTIMATORS FOR FEATURE SELECTION
SN - 978-989-8425-98-0
AU - Doquire G.
AU - Verleysen M.
PY - 2012
SP - 176
EP - 185
DO - 10.5220/0003726101760185