# HIERARCHICAL MODEL-BASED CLUSTERING FOR RELATIONAL DATA

### Jianzhong Chen, Mary Shapcott, Sally McClean, Kenny Adamson

#### Abstract

Relational data mining deals with datasets that contain multiple types of objects and relationships and are stored in relational formats, e.g. relational databases with multiple tables. This paper proposes a propositional, hierarchical, model-based method for clustering relational data. We first define an object-relational star schema to model composite objects, and then present a method for flattening composite objects into aggregate objects by introducing a new type of aggregate, the frequency aggregate, which records not only the observed values of an attribute but also their distribution. A hierarchical agglomerative clustering algorithm with a log-likelihood distance is then applied to produce a tentative clustering of the aggregated data. Once the agglomeration stops at a coarse estimate of the number of clusters, a mixture model-based method using the EM algorithm performs a further relocation clustering, in which the Bayesian Information Criterion is used to determine the optimal number of clusters. Finally, we evaluate our approach on a real-world dataset.
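The idea of a frequency aggregate can be illustrated with a minimal sketch. The abstract does not specify how the aggregate is represented, so the following assumes a simple one-to-many layout (a child table whose rows point to a parent object) and stores, for each parent, the relative frequency of each observed value. The function name `frequency_aggregate`, the field names, and the sample `purchases` data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def frequency_aggregate(rows, key, attribute):
    """Flatten one-to-many rows into one value distribution per parent object.

    rows      -- iterable of dicts, one per related (child) record
    key       -- name of the field identifying the parent object
    attribute -- name of the child attribute to aggregate
    Returns {parent_id: {value: relative_frequency}}.
    """
    # Count how often each attribute value occurs for each parent object.
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key]][row[attribute]] += 1

    # Normalise the counts so each parent carries a distribution over values.
    aggregates = {}
    for parent_id, counter in counts.items():
        total = sum(counter.values())
        aggregates[parent_id] = {
            value: n / total for value, n in counter.items()
        }
    return aggregates


# Hypothetical child table: several purchase records per customer.
purchases = [
    {"customer": "c1", "category": "books"},
    {"customer": "c1", "category": "books"},
    {"customer": "c1", "category": "music"},
    {"customer": "c2", "category": "music"},
]

aggregated = frequency_aggregate(purchases, key="customer", attribute="category")
print(aggregated)
```

Each parent object is thereby reduced to a single flat record that a propositional clustering algorithm can consume, while the distribution of the child values is retained rather than collapsed into a single summary such as a mode or a count.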

#### Paper Citation

#### in Harvard Style

Chen J., Shapcott M., McClean S. and Adamson K. (2004). **HIERARCHICAL MODEL-BASED CLUSTERING FOR RELATIONAL DATA**. In *Proceedings of the Sixth International Conference on Enterprise Information Systems - Volume 2: ICEIS*, ISBN 972-8865-00-7, pages 92-97. DOI: 10.5220/0002624300920097

#### in Bibtex Style

@conference{iceis04,

author={Jianzhong Chen and Mary Shapcott and Sally McClean and Kenny Adamson},

title={HIERARCHICAL MODEL-BASED CLUSTERING FOR RELATIONAL DATA},

booktitle={Proceedings of the Sixth International Conference on Enterprise Information Systems - Volume 2: ICEIS},

year={2004},

pages={92-97},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0002624300920097},

isbn={972-8865-00-7},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the Sixth International Conference on Enterprise Information Systems - Volume 2: ICEIS

TI - HIERARCHICAL MODEL-BASED CLUSTERING FOR RELATIONAL DATA

SN - 972-8865-00-7

AU - Chen J.

AU - Shapcott M.

AU - McClean S.

AU - Adamson K.

PY - 2004

SP - 92

EP - 97

DO - 10.5220/0002624300920097