Authors:
Andrew H. Sung 1; Bernardete Ribeiro 2 and Qingzhong Liu 3
Affiliations:
1 The University of Southern Mississippi, United States; 2 University of Coimbra, Portugal; 3 Sam Houston State University, United States
Keyword(s):
Data Analytics, Knowledge Discovery, Sampling Methods, Quality of Datasets.
Abstract:
The era of the Internet of Things and big data has seen individuals, businesses, and organizations increasingly rely on data for routine operations, decision making, intelligence gathering, and knowledge discovery. As big data is generated by all sorts of sources at accelerated velocity, in increasing volumes, and with unprecedented variety, it is also increasingly traded as a commodity for utilization in the new “data economy”. With regard to data analytics for knowledge discovery, this raises the question, among others, of how much data is really necessary and/or sufficient to obtain analytic results that reasonably satisfy the requirements of an application. In this work-in-progress paper, we address the sampling problem in big data analytics and propose that (1) the problem of sampling big data for analytics is “hard”; specifically, it is theoretically intractable when formal measures are incorporated into performance evaluation; therefore, (2) heuristic, rather than algorithmic, methods are necessarily needed for data sampling, and a plausible heuristic method is proposed; and (3) a measure of dataset quality is proposed to facilitate evaluating the worthiness of datasets with respect to model building and knowledge discovery in big data analytics.
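The abstract does not describe the proposed heuristic or the dataset quality measure. As a purely illustrative sketch of the underlying question of “how much data is enough” (not the authors’ method), the following Python snippet grows a simple random sample until held-out model performance stops improving by more than a chosen tolerance; the model, dataset, and thresholds are all placeholder assumptions.

```python
# Illustrative only: a learning-curve style stopping heuristic for judging
# sample sufficiency. This is NOT the heuristic proposed in the paper,
# which the abstract does not specify.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def sufficient_sample_size(X, y, start=500, growth=2.0, tol=0.005, seed=0):
    """Grow a random sample until held-out accuracy improves by less than tol."""
    rng = np.random.default_rng(seed)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    n, prev_acc = start, 0.0
    while n <= len(X_pool):
        # Simple random sample of size n from the training pool.
        idx = rng.choice(len(X_pool), size=n, replace=False)
        model = LogisticRegression(max_iter=1000).fit(X_pool[idx], y_pool[idx])
        acc = model.score(X_test, y_test)
        if acc - prev_acc < tol:   # marginal gain too small -> declare "enough"
            return n, acc
        prev_acc, n = acc, int(n * growth)
    return len(X_pool), prev_acc


# Synthetic stand-in for a large dataset.
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
print(sufficient_sample_size(X, y))
```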