Figure 5: Left: Test tree 2. Right: Tree found by WEKA
J48.
Figure 6: Left: Test tree 3. Right: Tree found by WEKA
J48.
Figure 4 shows, on the left-hand side, a simple tree
with only 6 nodes that was built into a generated
dataset and, on the right-hand side, the tree found by
the J48 algorithm in Weka. Figures 5 and 6 show the
same procedure for slightly more complicated tree
structures. In all test cases, the designed tree
structure was successfully recovered from the
corresponding generated dataset.
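The check described above can be illustrated with a minimal Python
sketch. It is not the generator proposed in this paper: it labels
uniformly sampled rows with a small hand-written tree and then verifies,
via information gain, that the designed root and second-level splits are
the ones a learner would prefer. The tree, the attribute names (outlook,
windy, noise), the class labels and all function names are hypothetical,
and plain information gain stands in for the gain-ratio criterion used
by J48.

```python
import math
import random
from collections import Counter

# Hypothetical designed tree: an internal node is (attribute, {value: subtree}),
# a leaf is a class label string.
TREE = ("outlook", {
    "sunny": "play",
    "overcast": "stay",
    "rainy": ("windy", {"yes": "stay", "no": "play"}),
})

# Hypothetical attribute domains; "noise" is irrelevant to the class.
DOMAINS = {
    "outlook": ["sunny", "overcast", "rainy"],
    "windy": ["yes", "no"],
    "noise": ["a", "b"],
}


def classify(tree, row):
    """Follow the branches of the tree until a leaf (class label) is reached."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


def generate(n, seed=0):
    """Sample attribute values uniformly and label each row with the tree."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {attr: rng.choice(values) for attr, values in DOMAINS.items()}
        row["class"] = classify(TREE, row)
        rows.append(row)
    return rows


def entropy(rows):
    """Shannon entropy (in bits) of the class column."""
    counts = Counter(r["class"] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def info_gain(rows, attr):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in DOMAINS[attr]:
        subset = [r for r in rows if r[attr] == value]
        if subset:
            remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder


rows = generate(10_000)

# Attribute with the highest gain on the full dataset (candidate root).
root = max(DOMAINS, key=lambda a: info_gain(rows, a))

# Attribute with the highest gain inside the "rainy" branch.
rainy = [r for r in rows if r["outlook"] == "rainy"]
second = max((a for a in DOMAINS if a != "outlook"),
             key=lambda a: info_gain(rainy, a))

print(root, second)
```

With uniform sampling, the expected gain at the root is about 0.67 bits
for outlook and about 0.08 bits for windy, so the designed root is
clearly preferred. Trees whose root split carries little or no gain
under uniform sampling would not pass this check.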
4.2 Evaluation
The test cases show that it is possible to generate
data that matches a data mining pattern. In some
cases, the entropy step width had to be altered or
additional “hidden nodes” had to be introduced into
the tree in order to produce some splits. This is most
likely because the implementation of the pattern
generator algorithm is not yet mature and can be
improved in future versions.
Furthermore, a module should be developed that
reads trees from XML files (or a similar format) and
automatically builds the tree structure needed to
generate the data. This would greatly increase the
versatility of the synthetic data generator.
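As a rough illustration of what such a module might look like, the
following Python sketch parses a tree from XML into a nested structure.
The XML layout (node, branch and leaf elements with attribute, value and
class attributes) and the function name parse_tree are assumptions made
for this example only; no such format is defined by the prototype.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for a decision tree (an assumption of this sketch).
XML_TEXT = """
<node attribute="outlook">
  <branch value="sunny"><leaf class="play"/></branch>
  <branch value="rainy">
    <node attribute="windy">
      <branch value="yes"><leaf class="stay"/></branch>
      <branch value="no"><leaf class="play"/></branch>
    </node>
  </branch>
</node>
"""


def parse_tree(elem):
    """Convert a <node>/<leaf> element into (attribute, {value: subtree}) or a label."""
    if elem.tag == "leaf":
        return elem.get("class")
    if elem.tag != "node":
        raise ValueError(f"unexpected element <{elem.tag}>")
    branches = {}
    for branch in elem.findall("branch"):
        children = list(branch)
        if len(children) != 1:
            raise ValueError("each <branch> must contain exactly one child")
        branches[branch.get("value")] = parse_tree(children[0])
    return (elem.get("attribute"), branches)


tree = parse_tree(ET.fromstring(XML_TEXT.strip()))
print(tree)
```

The nested result could then be handed to the data generation step in
place of a hand-built tree structure.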
In summary, the test results show that the
proposed synthetic data generator is able to generate
datasets with intrinsic patterns, such as decision trees.
Additionally, the data generator performed well: it
was possible to create almost a million rows in a few
seconds on a laptop with basic specifications.
5 CONCLUSIONS AND FUTURE WORK
In this paper, a novel approach for developing a
synthetic data generator for matching decision trees
has been proposed, and a prototype of such a
generator has been implemented. The results of the
test runs show that a large dataset with patterns such
as decision trees can be generated automatically
within seconds.
While the prototype meets all requirements set out
within the aims of the project, the work opens up a
number of directions for further investigation: a) to
add more classification algorithms to the generator; b)
to add further algorithms to the generator that allow
patterns of association rules, clustering and
regression to be created; c) to develop a
comprehensive, user-friendly interface that allows
users to select algorithms from different categories
and to define the number of attributes and other
parameters.
The successful outcome of such future work would
result in a comprehensive synthetic data generator,
which is able to generate big datasets with patterns for
data mining research and training.