Figure 7: Accuracy comparison between StreamGP and the one-pass boosting method.
We also wanted to compare the performance of the algorithm against the simple one-pass algorithm that receives the entire data set at once. To this end, we ran StreamGP with an ensemble size of 50 and simulated the one-pass boosting method by using the entire data set scanned so far as a single block. However, since 5 boosting rounds are executed on 5 nodes, the ensemble generated by the one-pass method contains only 25 classifiers; for a fair comparison, the one-pass method was therefore run for 10 rounds, so as to generate 50 classifiers. Figure 7 shows the classification accuracy for an increasing number of tuples, expressed in millions. The figure points out the better performance of the streaming approach. Another advantage worth stressing is that the streaming method works on 1k tuples at a time, discarding them as soon as they have been processed. On the contrary, the one-pass method must maintain the entire data set considered so far, with considerable storage and time requirements. For example, on a data set of 2,500,000 tuples the one-pass boosting method needs 45,280 seconds, while StreamGP, with τ = 0.01, requires 7,186 seconds, almost an order of magnitude less.
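To make the difference between the two processing models concrete, the following sketch contrasts them on a simulated stream. It is a minimal illustration, not the StreamGP implementation itself: the block size (1,000 tuples), the 5 nodes, and the 5 versus 10 boosting rounds come from the setup described above, while train_weak_classifier and the plain-list data structures are hypothetical stand-ins.

BLOCK_SIZE = 1_000    # the streaming method works on 1k tuples at a time
NODES = 5             # boosting is distributed over 5 nodes
STREAM_ROUNDS = 5     # 5 rounds x 5 nodes = 25 classifiers per pass
ONE_PASS_ROUNDS = 10  # 10 rounds x 5 nodes = 50 classifiers, matching
                      # StreamGP's ensemble size of 50

def train_weak_classifier(data):
    """Hypothetical stand-in for fitting one weak GP classifier."""
    return object()

def streaming_model(stream):
    """Each 1k-tuple block is used for training and then discarded,
    so memory use is bounded by BLOCK_SIZE regardless of stream length."""
    ensemble = []
    for block in stream:              # block: a list of BLOCK_SIZE tuples
        for _ in range(STREAM_ROUNDS * NODES):
            ensemble.append(train_weak_classifier(block))
        ensemble = ensemble[-50:]     # keep a fixed-size ensemble of 50
        # `block` is not retained beyond this point
    return ensemble

def one_pass_model(stream):
    """The entire data set scanned so far is treated as a single block,
    so storage (and hence training time) grow with the stream."""
    seen = []
    for block in stream:
        seen.extend(block)            # every tuple seen so far is retained
    return [train_weak_classifier(seen)
            for _ in range(ONE_PASS_ROUNDS * NODES)]

Under these settings both methods end up with ensembles of 50 classifiers, so the accuracy curves in Figure 7 compare ensembles of equal size; what remains different is how much data each method must keep, which is reflected in the 45,280 versus 7,186 seconds reported above (a ratio of about 6.3).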
6 CONCLUSIONS
The paper presented an adaptive GP boosting ensemble method able to deal with distributed streaming data and to handle concept drift via change detection. The approach is efficient since each node of the network works with its local streaming data, and the ensemble is updated only when concept drift is detected.