Sampling and Evaluating the Big Data for Knowledge Discovery

Andrew H. Sung, Bernardete Ribeiro, Qingzhong Liu


The era of the Internet of Things and big data has seen individuals, businesses, and organizations increasingly rely on data for routine operations, decision making, intelligence gathering, and knowledge discovery. As big data is generated by all sorts of sources at accelerating velocity, in increasing volumes, and with unprecedented variety, it is also increasingly traded as a commodity in the new “data economy”. With regard to data analytics for knowledge discovery, this raises the question, among various others, of how much data is really necessary and/or sufficient to obtain analytic results that reasonably satisfy the requirements of an application. In this work-in-progress paper, we address the sampling problem in big data analytics and propose that (1) the problem of sampling big data for analytics is “hard” (specifically, it is theoretically intractable when formal measures are incorporated into performance evaluation); therefore, (2) heuristic, rather than algorithmic, methods are needed for data sampling, and a plausible heuristic method is proposed; and (3) a measure of dataset quality is proposed to facilitate evaluating the worthiness of datasets with respect to model building and knowledge discovery in big data analytics.



Paper Citation

in Harvard Style

Sung A., Ribeiro B. and Liu Q. (2016). Sampling and Evaluating the Big Data for Knowledge Discovery. In Proceedings of the International Conference on Internet of Things and Big Data - Volume 1: IoTBD, ISBN 978-989-758-183-0, pages 378-382. DOI: 10.5220/0005932703780382

in Bibtex Style

author={Andrew H. Sung and Bernardete Ribeiro and Qingzhong Liu},
title={Sampling and Evaluating the Big Data for Knowledge Discovery},
booktitle={Proceedings of the International Conference on Internet of Things and Big Data - Volume 1: IoTBD},

in EndNote Style

JO - Proceedings of the International Conference on Internet of Things and Big Data - Volume 1: IoTBD
TI - Sampling and Evaluating the Big Data for Knowledge Discovery
SN - 978-989-758-183-0
AU - Sung A.
AU - Ribeiro B.
AU - Liu Q.
PY - 2016
SP - 378
EP - 382
DO - 10.5220/0005932703780382