Highly Robust Classification: A Regularized Approach for Omics Data

Jan Kalina, Jaroslav Hlinka

2016

Abstract

Various regularized approaches to linear discriminant analysis suffer from sensitivity to outlying measurements in the data. This work aims to propose new versions of regularized linear discriminant analysis that remain suitable for high-dimensional data contaminated by outliers. Using principles of robust statistics, we propose classification methods suited to data in which the number of variables exceeds the number of observations. In particular, we propose two robust regularized versions of linear discriminant analysis with a high breakdown point. For this purpose, we propose a regularized version of the minimum weighted covariance determinant estimator, one of the highly robust estimators of multivariate location and scatter. It assigns implicit weights to individual observations and represents a unique attempt to combine regularization with high robustness. We also propose algorithms for the efficient computation of the new classification methods and illustrate their performance on real data sets.
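The abstract's starting point, regularized linear discriminant analysis, addresses the singularity of the pooled covariance matrix when the number of variables exceeds the number of observations by shrinking it toward a scaled identity matrix. The following is a minimal generic sketch of that regularization step only; it is not the authors' robust MWCD-based method, and the function names and shrinkage weight `lam` are invented for illustration.

```python
import numpy as np

def regularized_lda_fit(X, y, lam=0.5):
    """Two-class LDA with a shrunken covariance matrix:
    S_reg = (1 - lam) * S + lam * (trace(S)/p) * I.
    Illustrative sketch only; names and defaults are not from the paper."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # Pooled within-class covariance; singular when p > n, hence the shrinkage
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    S = centered.T @ centered / (len(X) - len(classes))
    target = np.trace(S) / S.shape[0] * np.eye(S.shape[0])
    S_inv = np.linalg.inv((1 - lam) * S + lam * target)
    return classes, means, S_inv

def regularized_lda_predict(model, X):
    classes, means, S_inv = model
    # Assign each observation to the class with the smaller
    # Mahalanobis-type distance under the regularized covariance
    scores = np.stack([
        np.einsum('ij,jk,ik->i', X - means[c], S_inv, X - means[c])
        for c in classes
    ], axis=1)
    return classes[np.argmin(scores, axis=1)]
```

The shrinkage target (a scaled identity) guarantees an invertible matrix even with far fewer observations than variables; the paper's contribution is to make this kind of estimator additionally robust to outliers via implicit observation weights.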



Paper Citation


in Harvard Style

Kalina J. and Hlinka J. (2016). Highly Robust Classification: A Regularized Approach for Omics Data. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016) ISBN 978-989-758-170-0, pages 17-26. DOI: 10.5220/0005623500170026


in Bibtex Style

@conference{bioinformatics16,
author={Jan Kalina and Jaroslav Hlinka},
title={Highly Robust Classification: A Regularized Approach for Omics Data},
booktitle={Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)},
year={2016},
pages={17-26},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005623500170026},
isbn={978-989-758-170-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)
TI - Highly Robust Classification: A Regularized Approach for Omics Data
SN - 978-989-758-170-0
AU - Kalina J.
AU - Hlinka J.
PY - 2016
SP - 17
EP - 26
DO - 10.5220/0005623500170026