be found experimentally, with the respective heuristic
methods employed to find them), indicating that the
dataset is inadequate for the purpose of model
building to achieve acceptable performance. At the
other extreme, Q_D = 1 when ν = n and μ = p,
indicating that the dataset D_{n,p} is indeed optimal, in
terms of both the number of features and the number
of data points, when it is evaluated with respect to the
data mining task of model building and knowledge
discovery for the problem under study.
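As an illustration, the boundary behaviour of Q_D described above can be sketched in code. This is a hypothetical reconstruction, not the paper's definition: it assumes Q_D is the fraction of the n × p data matrix covered by a critical sample of ν data points and a critical feature set of μ features, i.e. Q_D = (νμ)/(nμ·p... rather, (ν·μ)/(n·p). Only the boundary cases (Q_D = 0 when no critical sets exist, Q_D = 1 when ν = n and μ = p) are fixed by the text; the function name `dataset_quality` and the exact formula are assumptions for illustration.

```python
# Hypothetical sketch of the dataset quality metric Q_D discussed above.
# Assumption: Q_D = (nu * mu) / (n * p), the fraction of the n x p data
# matrix covered by the critical sample (nu of n data points) and the
# critical feature set (mu of p features). The paper only pins down the
# boundary cases; the exact functional form here is illustrative.

def dataset_quality(nu: int, mu: int, n: int, p: int) -> float:
    """Return Q_D in [0, 1] for a dataset with n data points and p features.

    nu -- size of a critical sample found experimentally (0 if none exists)
    mu -- size of a critical feature set (0 if none exists)
    """
    if n <= 0 or p <= 0:
        raise ValueError("dataset dimensions must be positive")
    if not (0 <= nu <= n and 0 <= mu <= p):
        raise ValueError("critical sizes must not exceed dataset dimensions")
    return (nu * mu) / (n * p)

# Boundary behaviour matches the text:
print(dataset_quality(0, 0, 1000, 50))      # 0.0 -> dataset inadequate
print(dataset_quality(1000, 50, 1000, 50))  # 1.0 -> dataset optimal
print(dataset_quality(100, 10, 1000, 50))   # 0.02 -> much "unimportant" data
```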
4 CONCLUDING REMARKS
This position paper addresses the dataset sampling
problem in model building for knowledge discovery.
The issue of data mining, association rule
extraction, etc. from small samples of large datasets
has been studied by many authors before (Domingo
et al., 2002; Provost et al., 1999), and sampling
techniques have been studied extensively. However,
the problem of the critical sampling size of a dataset
has not been studied thoroughly. It is shown in this
paper that the problem of searching for an optimal
sampling is, unsurprisingly, intractable, as in many
other optimality problems encountered in machine
learning and data mining. Inspired by previous
successful results of using heuristic methods to find
an optimal feature set, we similarly propose a heuristic
method for finding an optimal sample of a dataset.
Finally, a quality metric for a dataset, which is
computed experimentally by finding a critical feature
set and a critical sample, is proposed. This measure
indicates the percentage of data in the dataset that is
essential for model building in knowledge discovery
to meet performance requirements. A low value (close
to 0) indicates that the dataset contains much
“unimportant” data; an exact zero (0) value indicates
that the dataset is in fact inadequate for model
building; while a high value (close to 1) indicates that
the dataset is “knowledge-intensive” and contains
very little unimportant data.
It is our position that the proposed quality
metric, perhaps combined with some of the statistics-
based "static" measures (i.e., those computed from the
dataset alone, without having to carry out experiments
using learning machines), holds great promise to be
refined into practically useful quality measures for
datasets, which will be very helpful to the big data
industry in the emerging "data economy".
The authors’ ongoing study includes investigating
the possible interrelation between critical feature
dimension and critical sampling, as well as conducting
experiments on different datasets for proof of concept.
REFERENCES
Datanami, 2016. Finding Your Way in the New Data
Economy (by A. Woodie).
http://www.datanami.com/2016/01/25/finding-your-way-in-the-new-data-economy/
Liu, Q., Sung, A.H., Chen, Z. and Chen, L., 2015.
Exploring Image Tampering with the Same
Quantization Matrix, in Multimedia Data Mining and
Analytics: Disruptive Innovation (Editors: Baughman,
A.K., Gao, J., Pan, J-Y., and Petrushin, V.), Springer,
pp.327-343.
Wiederhold, G., 2016. Unbalanced Data Leads to Obsolete
Economic Advice, Communications of the ACM, Vol.
59 No. 1, pp.45-46.
Hastie, T., Tibshirani, R. and Friedman, J., 2009. The
Elements of Statistical Learning: Data Mining,
Inference, and Prediction. 2nd Edition, Springer.
Guyon, I., Elisseeff, A., 2003. An Introduction to Variable
and Feature Selection, Journal of Machine Learning
Research, Vol 3, pp.1157-1182.
Liu, Q., Sung, A.H., Xu, J., Liu, J. and Chen, Z., 2006.
Microarray Gene Expression Classification based on
Supervised Learning and Similarity Measures.
Proceedings of 2006 IEEE International Conference on
Systems, Man, and Cybernetics, Vol. 6, pp.5094-5099.
Garey, M.R., Johnson, D.S., 1979. Computers and
Intractability: A Guide to the Theory of NP-
Completeness, W. H. Freeman and Company.
Suryakumar, D., Sung, A.H. and Liu, Q., 2014. The Critical
Dimension Problem: No Compromise Feature
Selection. Proceedings of eKNOW 2014, The Sixth
International Conference on Information, Process, and
Knowledge Management, pp.145-151.
Domingo, C., Gavaldà, R. and Watanabe, O., 2002.
Adaptive sampling methods for scaling up knowledge
discovery algorithms, Data Mining and Knowledge
Discovery, Kluwer Academic Publishers, Vol. 6 No. 2,
pp.131-152.
National Research Council, 2013. Frontiers in Massive
Data Analysis, The National Academies Press.
Papadimitriou, C.H., Yannakakis, M., 1984. The complexity
of facets (and some facets of complexity), Journal of
Computer and System Sciences, Vol. 28, pp.244-259.
Provost, F., Jensen, D. and Oates, T., 1999. Efficient
Progressive Sampling. Proceedings of the Fifth
International Conference on Knowledge Discovery and
Data Mining, ACM KDD-99, pp.23-32.