# Sparse-Reduced Computation - Enabling Mining of Massively-large Data Sets

### Philipp Baumann, Dorit S. Hochbaum, Quico Spaen

#### Abstract

Machine learning techniques that rely on pairwise similarities have proven to be leading algorithms for classification. Despite their good and robust performance, similarity-based techniques are rarely chosen for large-scale data mining because the time required to compute all pairwise similarities grows quadratically with the size of the data set. To address this scalability issue, we previously introduced a method called sparse computation, which efficiently generates a sparse similarity matrix that contains only the significant similarities. Sparse computation substantially reduces running time with minimal and often no loss in accuracy. However, for massively large data sets, even such a sparse similarity matrix may lead to considerable running times. In this paper, we propose an extension of sparse computation called sparse-reduced computation, which not only avoids computing very low similarities but also avoids computing similarities between highly similar or identical objects by compressing them into a single object. Our computational results show that sparse-reduced computation allows highly accurate classification of data sets with millions of objects in seconds.
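
The two ideas named in the abstract, skipping negligible similarities and merging near-identical objects, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the uniform grid over features scaled to [0, 1], the hypothetical `resolution` and `epsilon` parameters, the centroid representatives, and the Gaussian similarity. Objects that fall into the same grid block are compressed into one weighted representative, and similarities are computed only between representatives in adjacent blocks.

```python
import math
from collections import defaultdict
from itertools import product


def sparse_reduced_similarities(points, resolution=4, epsilon=1.0):
    """Illustrative sketch only; not the authors' implementation.

    points: list of equal-length tuples with coordinates assumed in [0, 1].
    resolution: grid intervals per dimension (hypothetical parameter).
    epsilon: scaling of the Gaussian similarity (hypothetical parameter).
    Returns (reps, sims), where reps maps a block id to (centroid, count)
    and sims maps a pair of block ids to a similarity value.
    """
    # Reduction: objects that land in the same grid block are merged.
    blocks = defaultdict(list)
    for p in points:
        key = tuple(min(int(x * resolution), resolution - 1) for x in p)
        blocks[key].append(p)

    # Each block is represented by its centroid; the count records how
    # many original objects were compressed into that representative.
    reps = {}
    for key, members in blocks.items():
        dim = len(members[0])
        centroid = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
        reps[key] = (centroid, len(members))

    # Sparsification: similarities are computed only between
    # representatives of adjacent blocks; all other pairs are treated
    # as negligible and are never evaluated.
    sims = {}
    for key in reps:
        for offset in product((-1, 0, 1), repeat=len(key)):
            other = tuple(k + o for k, o in zip(key, offset))
            if other in reps and key < other:  # each unordered pair once
                a, _ = reps[key]
                b, _ = reps[other]
                sims[(key, other)] = math.exp(-epsilon * math.dist(a, b) ** 2)
    return reps, sims


points = [(0.10, 0.10), (0.12, 0.11), (0.30, 0.10), (0.90, 0.90)]
reps, sims = sparse_reduced_similarities(points, resolution=4)
print(len(reps), len(sims))
```

In this toy input, the first two points share a block and are merged, and the distant fourth point has no occupied neighboring block, so no similarity involving it is computed. A coarser grid merges more objects and yields a smaller matrix, at the cost of a coarser approximation of the original data.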

#### Paper Citation

#### in Harvard Style

Baumann P., Hochbaum D. and Spaen Q. (2016). **Sparse-Reduced Computation - Enabling Mining of Massively-large Data Sets**. In *Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM*, ISBN 978-989-758-173-1, pages 224-231. DOI: 10.5220/0005690402240231

#### in Bibtex Style

@conference{icpram16,

author={Philipp Baumann and Dorit S. Hochbaum and Quico Spaen},

title={Sparse-Reduced Computation - Enabling Mining of Massively-large Data Sets},

booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},

year={2016},

pages={224-231},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0005690402240231},

isbn={978-989-758-173-1},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM

TI - Sparse-Reduced Computation - Enabling Mining of Massively-large Data Sets

SN - 978-989-758-173-1

AU - Baumann P.

AU - Hochbaum D.

AU - Spaen Q.

PY - 2016

SP - 224

EP - 231

DO - 10.5220/0005690402240231