# CLUSTERING OF HETEROGENEOUSLY TYPED DATA WITH SOFT COMPUTING

### Angel Kuri-Morales, Luis Enrique Cortes-Berrueco, Daniel Trejo-Baños

#### Abstract

The problem of finding clusters in arbitrary sets of data has been attempted using different approaches. In most cases, the use of metrics to determine the adequacy of the clusters is assumed. That is, the criterion yielding a measure of cluster quality depends on the distance between the elements of each cluster. Typically, one considers a cluster to be adequately characterized if its elements are close to one another while, simultaneously, they are far from those of other clusters. This intuitive approach fails if the variables of the elements of a cluster are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications where several variables correspond to categories. The usual tendency is to encode the categories, i.e., to assign an arbitrary number to every category. This, however, may result in spurious patterns: relationships between the variables which were not really there at the outset. It is evident that no assignment can ensure a universally valid numerical value for variables of this kind. But there is a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that this is the best strategy in terms of the statistical behaviour of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters.
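The spurious-pattern problem mentioned above can be illustrated with a minimal sketch. It does not implement the strategy proposed in the paper; it only shows, under assumed toy data, that the category judged "nearest" to another depends entirely on which arbitrary integer codes were chosen. The category names and the helper functions `nearest_category` and `nearest_under_all_encodings` are hypothetical and introduced here for illustration only.

```python
from itertools import permutations

# Hypothetical categorical variable with three unordered values.
CATEGORIES = ["red", "green", "blue"]


def nearest_category(target, encoding):
    """Return the category whose integer code is closest to the target's code.

    `encoding` maps each category name to an arbitrary integer.
    Ties are broken by the order of the keys in `encoding`.
    """
    others = [c for c in encoding if c != target]
    return min(others, key=lambda c: abs(encoding[c] - encoding[target]))


def nearest_under_all_encodings(target):
    """Collect the 'nearest' category to `target` over every possible
    assignment of the codes 0..n-1 to the categories."""
    results = set()
    for codes in permutations(range(len(CATEGORIES))):
        encoding = dict(zip(CATEGORIES, codes))
        results.add(nearest_category(target, encoding))
    return results


if __name__ == "__main__":
    # The answer changes with the encoding, so any distance-based
    # structure found this way is an artifact of the chosen codes.
    print(nearest_under_all_encodings("red"))
```

Because the set of answers contains more than one category, a distance-based clustering run on such codes may reflect the encoding rather than the data.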


#### Paper Citation

#### in Harvard Style

Kuri-Morales A., Cortes-Berrueco L. and Trejo-Baños D. (2011). **CLUSTERING OF HETEROGENEOUSLY TYPED DATA WITH SOFT COMPUTING**. In *Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)* ISBN 978-989-8425-79-9, pages 491-494. DOI: 10.5220/0003690304990502

#### in Bibtex Style

@conference{kdir11,

author={Angel Kuri-Morales and Luis Enrique Cortes-Berrueco and Daniel Trejo-Baños},

title={CLUSTERING OF HETEROGENEOUSLY TYPED DATA WITH SOFT COMPUTING},

booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},

year={2011},

pages={491-494},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0003690304990502},

isbn={978-989-8425-79-9},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)

TI - CLUSTERING OF HETEROGENEOUSLY TYPED DATA WITH SOFT COMPUTING

SN - 978-989-8425-79-9

AU - Kuri-Morales A.

AU - Cortes-Berrueco L.

AU - Trejo-Baños D.

PY - 2011

SP - 491

EP - 494

DO - 10.5220/0003690304990502