A Nonlinear Mixture Model based Unsupervised Variable Selection in Genomics and Proteomics

Ivica Kopriva

doi:10.5220/0005161700850092

2015

Abstract

Typical scenarios in genomics and proteomics involve a small number of samples and a large number of variables. Variable selection is therefore necessary for building disease prediction models that are robust to overfitting. We propose an unsupervised variable selection method based on a sparseness-constrained decomposition of a sample. The decomposition is based on a nonlinear mixture model composed of a test sample and a reference sample representing the negative (healthy) class. The geometry of the model enables automatic selection of the component composed of disease-related variables. The proposed method is compared with three supervised and one unsupervised variable selection methods on two-class problems using three genomic and two proteomic data sets. The results suggest that the proposed method could perform better than the supervised methods on unseen data of the same cancer type.
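The abstract does not give the algorithm itself, so the following is only a minimal sketch of the geometric intuition behind pairing a test sample with a healthy reference, and it is not the paper's nonlinear, sparseness-constrained decomposition. It assumes a simplified linear picture in which each variable is a point in the (reference, test) plane and variables lying well above the 45-degree diagonal are treated as over-expressed in the test sample. The function name `select_by_angle`, the `margin_deg` parameter, and the toy values are all hypothetical.

```python
import numpy as np


def select_by_angle(reference, test, margin_deg=15.0):
    """Return indices of variables that lie more than `margin_deg`
    degrees above the 45-degree diagonal of the (reference, test) plane.

    Hypothetical, simplified linear illustration; not the paper's method.
    """
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    # Angle of each variable's (reference, test) point, in degrees.
    angles = np.degrees(np.arctan2(test, reference))
    return np.flatnonzero(angles > 45.0 + margin_deg)


# Toy non-negative expression levels for five variables (made-up numbers).
reference = np.array([1.0, 2.0, 0.5, 3.0, 1.0])
test = np.array([1.1, 2.0, 2.5, 0.5, 4.0])

selected = select_by_angle(reference, test)
print(selected)
```

In this toy case only the variables whose test level is much larger than the reference level pass the angular threshold. No class labels are used, which is the sense in which such a selection is unsupervised.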

Paper Citation

in Harvard Style

Kopriva I. (2015). A Nonlinear Mixture Model based Unsupervised Variable Selection in Genomics and Proteomics. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015) ISBN 978-989-758-070-3, pages 85-92. DOI: 10.5220/0005161700850092

in Bibtex Style

@conference{bioinformatics15,
author={Ivica Kopriva},
title={A Nonlinear Mixture Model based Unsupervised Variable Selection in Genomics and Proteomics},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},
year={2015},
pages={85-92},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005161700850092},
isbn={978-989-758-070-3},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - A Nonlinear Mixture Model based Unsupervised Variable Selection in Genomics and Proteomics
SN - 978-989-758-070-3
AU - Kopriva I.
PY - 2015
SP - 85
EP - 92
DO - 10.5220/0005161700850092