Optimized Linear Imputation

Yehezkel S. Resheff, Daphna Weinshall

2017

Abstract

Real-world datasets, especially high-dimensional ones, often have missing feature values. Since most data analysis and statistical methods do not handle missing values gracefully, the first step in the analysis is to impute them. Indeed, there has been long-standing interest in methods for the imputation of missing values as a pre-processing step. One recent and effective approach, the IRMI stepwise regression imputation method, fits a linear regression model for each real-valued feature on the basis of all other features in the dataset. However, the proposed iterative formulation lacks a convergence guarantee. Here we propose a closely related method, stated as a single optimization problem, together with a block coordinate-descent solution that is guaranteed to converge to a local minimum. Experiments on both synthetic and benchmark datasets show results comparable to those of the IRMI method whenever it converges. IRMI often diverges in the set of experiments described here, however, whereas the performance of our method is shown to be markedly superior to that of the other methods compared.
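To make the idea of regressing each feature on all the others concrete, the following is a minimal Python sketch of a generic iterative regression imputation. It is an illustration only: it is neither the IRMI procedure nor the optimization method proposed in the paper, it carries no convergence guarantee, and the function name `iterative_linear_impute` and its parameters `n_iters` and `tol` are hypothetical choices made for this example.

```python
import numpy as np


def iterative_linear_impute(X, n_iters=20, tol=1e-6):
    """Fill NaN entries by repeatedly regressing each column on all others.

    Illustrative sketch only; not the authors' algorithm.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from a simple guess: the mean of the observed values per column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    n, d = filled.shape

    for _ in range(n_iters):
        previous = filled.copy()
        for j in range(d):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            # Design matrix: intercept plus every column except column j.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(n), others])
            # Fit ordinary least squares on rows where column j was observed.
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Overwrite only the originally missing entries with predictions.
            filled[rows, j] = design[rows] @ coef
        # Stop once the imputed values no longer change appreciably.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled
```

Observed entries are never modified; only the originally missing cells are updated on each sweep. The stopping rule here is a heuristic check on the change between sweeps, which is exactly the kind of ad hoc criterion that a single well-defined objective with a convergent block coordinate-descent solver is meant to replace.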


Paper Citation


in Harvard Style

Resheff, Y. S. and Weinshall, D. (2017). Optimized Linear Imputation. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-222-6, pages 17-25. DOI: 10.5220/0006092900170025


in Bibtex Style

@conference{icpram17,
author={Yehezkel S. Resheff and Daphna Weinshall},
title={Optimized Linear Imputation},
booktitle={Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2017},
pages={17-25},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006092900170025},
isbn={978-989-758-222-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Optimized Linear Imputation
SN - 978-989-758-222-6
AU - Resheff, Y. S.
AU - Weinshall, D.
PY - 2017
SP - 17
EP - 25
DO - 10.5220/0006092900170025