# SEMI INTERACTIVE METHOD FOR DATA MINING

### Lydia Boudjeloud-Assala, François Poulet

#### Abstract

Common visualization techniques for multidimensional data sets, such as parallel coordinates and scatter-plot matrices, do not scale well to large numbers of dimensions. A common approach to this problem is dimensionality selection. We present a new semi-interactive method for dimensionality selection that selects pertinent dimension subsets without losing information. Our cooperative approach combines automatic algorithms, interactive algorithms and visualization methods. An evolutionary algorithm searches for optimal dimension subsets that represent the original data set without losing information for unsupervised tasks (clustering or outlier detection), using a new validity criterion. A visualization method presents the results of the interactive evolutionary algorithm to the user, who can then participate actively in the search; this makes the search more efficient and speeds up the convergence of the evolutionary algorithm. We have implemented our approach and applied it to real data sets to confirm that it effectively supports the user in exploring high-dimensional data sets, and to evaluate the visual data representation.
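To make the idea of an evolutionary search over dimension subsets concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the validity criterion, the genetic operators or the interactive step, so everything here is an assumption made for illustration. In particular, `placeholder_score` (total variance of the selected dimensions) merely stands in for the paper's unspecified criterion, and the names `select_dimensions`, `mutate` and `crossover` are hypothetical.

```python
import random
from statistics import pvariance


def placeholder_score(data, dims):
    """Stand-in for the paper's (unspecified) validity criterion:
    total population variance over the chosen dimensions."""
    return sum(pvariance([row[d] for row in data]) for d in dims)


def mutate(dims, n_dims, rng):
    """Swap one selected dimension for one that is not currently selected."""
    unused = [d for d in range(n_dims) if d not in dims]
    if not unused:
        return dims
    kept = list(dims)
    kept[rng.randrange(len(kept))] = rng.choice(unused)
    return tuple(sorted(kept))


def crossover(a, b, k, rng):
    """Child draws k distinct dimensions from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k)))


def select_dimensions(data, k, pop_size=12, generations=30, seed=0,
                      score=placeholder_score):
    """Evolve subsets of k dimensions; return the best subset found and
    the best score seen at the start of each generation."""
    rng = random.Random(seed)
    n_dims = len(data[0])
    population = [tuple(sorted(rng.sample(range(n_dims), k)))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda s: score(data, s), reverse=True)
        history.append(score(data, ranked[0]))
        parents = ranked[: pop_size // 2]   # truncation selection
        children = [ranked[0]]              # elitism: keep the current best
        while len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents), k, rng)
            if rng.random() < 0.3:          # assumed mutation rate
                child = mutate(child, n_dims, rng)
            children.append(child)
        population = children
    best = max(population, key=lambda s: score(data, s))
    return best, history


if __name__ == "__main__":
    # Small synthetic data set with 6 dimensions, for illustration only.
    data = [[(i % 2) * 10.0, 1.0, 0.5 * i, 1.0, float(i % 3), 2.0]
            for i in range(12)]
    best, history = select_dimensions(data, k=2)
    print("selected dimensions:", best)
```

The `score` argument is left pluggable so that a clustering or outlier-detection criterion could replace the placeholder. The interactive part of the method, where the user inspects visualized subsets and steers the search, is not modeled in this sketch.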

#### Paper Citation

#### in Harvard Style

Boudjeloud-Assala L. and Poulet F. (2006). **SEMI INTERACTIVE METHOD FOR DATA MINING**. In *Proceedings of the Eighth International Conference on Enterprise Information Systems - Volume 2: ICEIS,* ISBN 978-972-8865-42-9, pages 3-10. DOI: 10.5220/0002454600030010

#### in Bibtex Style

@conference{iceis06,

author={Lydia Boudjeloud-Assala and François Poulet},

title={SEMI INTERACTIVE METHOD FOR DATA MINING},

booktitle={Proceedings of the Eighth International Conference on Enterprise Information Systems - Volume 2: ICEIS},

year={2006},

pages={3-10},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0002454600030010},

isbn={978-972-8865-42-9},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the Eighth International Conference on Enterprise Information Systems - Volume 2: ICEIS

TI - SEMI INTERACTIVE METHOD FOR DATA MINING

SN - 978-972-8865-42-9

AU - Boudjeloud-Assala L.

AU - Poulet F.

PY - 2006

SP - 3

EP - 10

DO - 10.5220/0002454600030010