Figure 8: Comparison on a sparse database with 100000 transactions.
support value, in seconds. In Figures 3-6 we used a log scale for the Y axis in order to better show the differences between small running times.
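The plotting setup is straightforward to reproduce. The sketch below is a minimal illustration of such a log-scale running-time plot using matplotlib; the support values and timings in it are placeholders, not the measurements reported in the figures.

```python
import matplotlib.pyplot as plt

# Placeholder support thresholds (as fractions) and running times in seconds;
# the real measurements are those shown in Figures 3-8.
support = [0.5, 0.4, 0.3, 0.2, 0.1]
hashmax_sec = [0.02, 0.05, 0.2, 1.1, 6.0]
genmax_sec = [0.03, 0.09, 0.5, 3.8, 25.0]

plt.plot(support, hashmax_sec, marker="o", label="HashMax")
plt.plot(support, genmax_sec, marker="s", label="Genmax")
plt.yscale("log")               # log scale keeps small running-time differences visible
plt.gca().invert_xaxis()        # plot from high support to low support
plt.xlabel("minimum support")
plt.ylabel("running time (s)")
plt.legend()
plt.show()
```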
We tested both algorithms on several datasets of different sizes and characteristics from the UCI machine learning repository (the datasets are available at (UCI, 2011)). The chess.dat file contains 3196 transactions and is a dense dataset (i.e., a dataset whose transactions share many common items). The number of maximal frequent itemsets in chess.dat varies from tens for very high support values to over ten thousand for lower support values. The mushroom.dat file contains 8124 transactions and is a relatively sparse dataset; it contains thousands of maximal frequent itemsets for low support values. The connect.dat file contains 67557 transactions and is a dense dataset (for low support values it contains up to 17000 maximal frequent itemsets). The number of items in each transaction is large and constant for chess.dat, mushroom.dat and connect.dat. The datasets T10I4D100K.dat and T40I10D100K.dat have variable transaction sizes and are very sparse. These datasets contain no maximal frequent itemsets of size larger than 2 for higher support values and several thousand maximal frequent itemsets for very low support values.
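For readers who want to reproduce this dataset characterization, the following sketch computes basic statistics for one of the files. It assumes the common FIMI-style format in which each line is one transaction given as space-separated integer item IDs; that format, and the file path, are assumptions made for illustration rather than details given here.

```python
from collections import Counter

def dataset_stats(path):
    """Return (transaction count, average transaction size, per-item frequencies)."""
    sizes = []
    item_counts = Counter()
    with open(path) as f:
        for line in f:
            items = line.split()
            if not items:
                continue
            sizes.append(len(items))
            item_counts.update(set(items))   # count each item once per transaction
    n = len(sizes)
    avg_size = sum(sizes) / n if n else 0.0
    return n, avg_size, item_counts

# A dataset is "dense" when many items appear in a large fraction of the
# transactions, so the most common items are a quick density indicator.
n, avg_size, counts = dataset_stats("chess.dat")
print(n, avg_size, counts.most_common(5))
```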
Figure 3 shows a comparison of both algorithms on the chess.dat dataset. We see that, due to the density of this dataset, HashMax always shows substantially better times than Genmax. Figure 4 shows a comparison of both algorithms on the mushroom.dat dataset. Because this dataset is sparse, HashMax gains an advantage over Genmax for lower support values. Figure 5 shows a comparison of both algorithms on the connect.dat dataset. As this dataset is quite dense, HashMax consistently shows better times than Genmax. Figures 6 and 7 show a comparison of the two algorithms on parts of T40I10D100K.dat of different sizes (20000 and 40000 transactions, respectively). Since the original dataset is very sparse, HashMax shows better results for lower support values. Figure 8 shows a comparison of the two algorithms on the large (100000 transactions) sparse dataset T10I4D100K.dat. The algorithms show similar times for medium support values, but HashMax times are much better for low support values. In conclusion, we have found that HashMax outperforms Genmax throughout on dense datasets (i.e., when the total number of maximal frequent itemsets is significant) and for low support values on sparse datasets. For support values in the range of 0-0.1%, the difference in running time was quite noticeable.
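A timing harness for this kind of comparison can be sketched as follows. Here run_hashmax and run_genmax are hypothetical stand-ins for the two implementations, assumed to take a transaction list and a minimum support fraction and to return the set of maximal frequent itemsets; this is an illustrative sketch, not the benchmarking code used for the figures.

```python
import time

def time_miner(miner, transactions, min_support):
    """Run one miner once and return (elapsed seconds, number of maximal itemsets found)."""
    start = time.perf_counter()
    maximal = miner(transactions, min_support)
    return time.perf_counter() - start, len(maximal)

def compare(transactions, supports, miners):
    """Collect running times for every miner at every support threshold."""
    results = {name: [] for name in miners}
    for s in supports:
        for name, miner in miners.items():
            elapsed, count = time_miner(miner, transactions, s)
            results[name].append((s, elapsed, count))
    return results

# Usage with the hypothetical implementations:
# results = compare(transactions, [0.005, 0.002, 0.001],
#                   {"HashMax": run_hashmax, "Genmax": run_genmax})
```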
ACKNOWLEDGEMENTS
The authors thank the Lynn and William Fraenkel Center for Computer Science for partially supporting this work.
REFERENCES
Agarwal, R., Aggarwal, C., and Prasad, V. (2000). Depth
first generation of long patterns. In ACM SIGKDD
Conf.
Bayardo, R. J. (1998). Efficiently mining long patterns from
databases. In ACM SIGMOD Conf. on Management of
Data, pages 85–93.
Burdick, D., Calimlim, M., and Gehrke, J. (2001). Mafia, a
maximal frequent itemset algorithm for transactional
databases. In IEEE Intl. Conf. on Data Engineering,
pages 443–452.
Genmax (2011). Genmax implementation. http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software.
Gouda, K. and Zaki, M. J. (2005). Genmax: An efficient
algorithm for mining maximal frequent itemsets. Data
Mining and Knowledge Discovery 11(3), pages 223–
242.
Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns
without candidate generation. In ACM SIGMOD Conf.
on Management of Data, pages 1–12.
Hu, T., Sung, S. Y., Xiong, H., and Fu, Q. (2008). Discovery of maximum length frequent itemsets. Inf. Sci. 178(1), pages 69–87.
Lin, D.-I. and Kedem, Z. M. (1998). Pincer search: A new
algorithm for discovering the maximum frequent set.
In EDBT, pages 105–119.
UCI (2011). UCI machine learning data repository. http://archive.ics.uci.edu/ml/index.html.
Yang, G. (2004). The complexity of mining maximal fre-
quent itemsets and maximal frequent patterns. In
KDD, pages 344–353.
Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997). New algorithms for fast discovery of association rules. In Third Int'l Conf. on Knowledge Discovery in Databases and Data Mining, pages 283–286.