estimation techniques leading to loss of FIs and ARs.
We found that these techniques are not well suited to association mining.
Examining the absolute errors, we verified that the larger the sample size, the smaller the absolute error. Riondato FI_abs and AR_abs return the smallest absolute errors owing to their large sample sizes. We could not find any acceptance threshold for errors in the studies of this domain. In future work, we aim to determine such a threshold.
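As an illustration of the error measure discussed above, the following minimal Python sketch computes the largest absolute difference in itemset support between a universe and a sample. The function names, the toy transactions and the product group labels are hypothetical and are not taken from the experiments; the sample is simply the first two rows and stands in for a randomly drawn sample.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def max_absolute_error(universe, sample, itemsets):
    """Largest |support in universe - support in sample| over `itemsets`."""
    return max(
        abs(support(universe, i) - support(sample, i)) for i in itemsets
    )


# Hypothetical toy data: each transaction is the set of product groups
# owned by one customer.
universe = [
    frozenset({"cards", "loans"}),
    frozenset({"cards"}),
    frozenset({"loans"}),
    frozenset({"cards", "loans"}),
]
sample = universe[:2]  # stand-in for a randomly drawn sample
itemsets = [("cards",), ("loans",), ("cards", "loans")]

error = max_absolute_error(universe, sample, itemsets)
```

On this toy data the single-item supports differ by 0.25 between universe and sample, while the support of the pair is identical in both, so the maximum absolute error is 0.25.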
In addition, we compared the time elapsed during AR generation from the universe with the total time elapsed during sample size estimation, sample creation with simple random sampling or stratified random sampling, and AR generation from the sample. Every technique outperformed the universe in terms of time consumption. In terms of absolute error, Riondato FI_abs and Riondato AR_abs are the best performers. When smaller sample size and lower time consumption are also taken into account, Riondato FI_abs is the leading sample size estimation technique. Among the sample size estimation techniques considered, we identified Riondato FI_abs as the most suitable for our retail banking data.
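The two sampling methods mentioned above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: proportional allocation across strata is assumed (the allocation rule is not specified here), and the `stratum_of` callback, the function names and the seeds are hypothetical.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement, each row equally likely."""
    return random.Random(seed).sample(rows, n)


def stratified_random_sample(rows, n, stratum_of, seed=0):
    """Draw about n rows, allocating to each stratum in proportion
    to its size (assumed allocation rule)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)

    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from n.
        k = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

With 80 rows in one stratum and 20 in another, a request for 10 rows yields 8 and 2 rows respectively, so the sample preserves the stratum proportions of the universe.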
The dataset contains retail bank customers and their product group ownership information. Rather than the individual products owned by the customers, the groups of these products are taken into consideration. The main reason for this decision is to reduce sparsity in the dataset and speed up the test phase. Using product groups led to low time consumption even on the universe. Although the time gain appears to be in the order of seconds, larger gains can be obtained if larger product sets are tested. In future studies, the dataset will be expanded and a robustness check will be carried out with an alternative dataset.
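The effect of replacing products with product groups can be illustrated with a small Python sketch. The product names, the group mapping and the helper functions below are hypothetical; density is taken here as the share of ones in the binary customer-by-item matrix, which is one simple way to quantify sparsity.

```python
def to_group_transactions(ownership, product_to_group):
    """Map each customer's products to the set of their product groups."""
    return {
        customer: frozenset(product_to_group[p] for p in products)
        for customer, products in ownership.items()
    }


def density(transactions, n_columns):
    """Share of ones in the binary customer-by-item matrix."""
    ones = sum(len(items) for items in transactions.values())
    return ones / (len(transactions) * n_columns)


# Hypothetical toy data.
ownership = {
    "c1": {"gold_card", "classic_card"},
    "c2": {"mortgage"},
    "c3": {"classic_card", "car_loan"},
}
product_to_group = {
    "gold_card": "cards",
    "classic_card": "cards",
    "mortgage": "loans",
    "car_loan": "loans",
}

grouped = to_group_transactions(ownership, product_to_group)
product_density = density(ownership, n_columns=4)  # 5 ones in 12 cells
group_density = density(grouped, n_columns=2)      # 4 ones in 6 cells
```

In this toy case the matrix becomes denser after grouping (4 of 6 cells set instead of 5 of 12), and the number of candidate items drops from four to two.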
Moreover, systematic, cluster and multistage sampling methods can be applied when constructing the association rule mining data. For instance, the customers who will be subjects of ARM can be drawn according to their clusters (e.g. geographical areas). Our research focuses on the association rule mining data, which contains binary values. Extraction of this data from the customers’ dataset will be examined in further studies.
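Two of the sampling methods named above, systematic and one-stage cluster sampling, can be sketched in Python as follows. This is only an illustrative sketch: the function names, the `cluster_of` callback and the use of a region label as the cluster key are assumptions, and multistage sampling is not shown.

```python
import random
from collections import defaultdict


def systematic_sample(customers, step, seed=0):
    """Pick a random start in [0, step) and then every step-th customer."""
    start = random.Random(seed).randrange(step)
    return customers[start::step]


def cluster_sample(customers, cluster_of, n_clusters, seed=0):
    """Pick n_clusters clusters at random and keep all of their customers."""
    clusters = defaultdict(list)
    for customer in customers:
        clusters[cluster_of(customer)].append(customer)
    # Sort the keys so the draw is reproducible for a given seed.
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return [c for key in chosen for c in clusters[key]]
```

For example, with customers represented as `(id, region)` tuples, `cluster_sample(customers, lambda c: c[1], 2)` would keep every customer from two randomly chosen regions.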
REFERENCES
Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM.
Agresti, A. (1996). An introduction to categorical data analysis, volume 135. Wiley New York.
Chakaravarthy, V. T., Pandit, V., and Sabharwal, Y. (2009). Analysis of sampling techniques for association rule mining. In Proceedings of the 12th international conference on database theory, pages 276–283. ACM.
Durbin, J. (1973). Distribution theory for tests based on the sample distribution function, volume 9. SIAM.
Har-Peled, S. and Sharir, M. (2011). Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462–496.
Hidber, C. (1999). Online association rule mining, volume 28. ACM.
Hipp, J., Güntzer, U., and Nakhaeizadeh, G. (2000). Algorithms for association rule mining - a general survey and comparison. ACM SIGKDD Explorations Newsletter, 2(1):58–64.
Löffler, M. and Phillips, J. M. (2009). Shape fitting on point sets with probability distributions. In Algorithms-ESA 2009, pages 313–324. Springer.
Mannila, H., Toivonen, H., and Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In KDD-94: AAAI workshop on Knowledge Discovery in Databases, pages 181–192.
Pei, J., Han, J., Lu, H., Nishio, S., Tang, S., and Yang, D. (2007). H-mine: Fast and space-preserving frequent pattern mining in large databases. IIE Transactions, 39(6):593–605.
Pei, J., Han, J., Mao, R., et al. (2000). Closet: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD workshop on research issues in data mining and knowledge discovery, volume 4, pages 21–30.
Riondato, M. and Upfal, E. (2012). Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. In Machine Learning and Knowledge Discovery in Databases, pages 25–41. Springer.
Toivonen, H. et al. (1996). Sampling large databases for association rules. In VLDB, volume 96, pages 134–145.
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280.
Zaki, M. J. and Hsiao, C.-J. (2002). Charm: An efficient algorithm for closed itemset mining. In SDM, volume 2, pages 457–473. SIAM.
Zaki, M. J., Parthasarathy, S., Li, W., and Ogihara, M. (1997). Evaluation of sampling for data mining of association rules. In Research Issues in Data Engineering, 1997. Proceedings. Seventh International Workshop on, pages 42–50. IEEE.
Zhang, H., Zhao, Y., Cao, L., and Zhang, C. (2008). Combined association rule mining. In Advances in Knowledge Discovery and Data Mining, pages 1069–1074. Springer.
KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval