SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences
Alexis Gabadinho, Gilbert Ritschard, Matthias Studer, Nicolas S. Müller
2009
Abstract
This paper is concerned with summarizing a set of categorical sequence data. More specifically, the problem studied is to determine the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of the sequences in their neighborhood. The goal is to yield a representative set that exhibits the key features of the whole sequence data set and permits easy, sound interpretation. We propose a heuristic for determining the representative set that first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in TraMineR, our R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless by no means limited to social science data and should prove useful in many other domains.
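The two-step idea described in the abstract (rank candidates by a representativeness score, then eliminate redundant ones until the requested coverage is reached) can be illustrated with a minimal Python sketch. It is not the TraMineR implementation. The score used here (neighborhood density), the Hamming distance, and the names `select_representatives`, `radius` and `coverage` are all assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def select_representatives(
    seqs: List[Sequence[str]],
    dist: Callable[[Sequence[str], Sequence[str]], float],
    radius: float,
    coverage: float,
) -> Tuple[List[int], float]:
    """Greedy sketch: returns indices of chosen sequences and achieved coverage.

    Assumptions (not taken from the paper): the representativeness score is
    the neighborhood density, and a candidate is redundant when it lies
    within `radius` of an already chosen representative.
    """
    n = len(seqs)
    if n == 0:
        return [], 0.0

    # Pairwise distance matrix.
    d = [[dist(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

    # Step 1: score every sequence and sort candidates, best first.
    # The sort is stable, so ties keep their original order.
    density = [sum(1 for j in range(n) if d[i][j] <= radius) for i in range(n)]
    candidates = sorted(range(n), key=lambda i: -density[i])

    # Step 2: walk the list, skipping redundant candidates, until the
    # requested share of sequences has a representative in its neighborhood.
    chosen: List[int] = []
    covered = set()
    for i in candidates:
        if len(covered) >= coverage * n:
            break
        if any(d[i][r] <= radius for r in chosen):
            continue  # redundant with an earlier representative
        chosen.append(i)
        covered.update(j for j in range(n) if d[i][j] <= radius)

    return chosen, len(covered) / n


# Toy data: states coded as single characters, one character per time point.
toy = ["AAAA", "AAAB", "AABA", "BBBB", "BBBA", "CCCC"]
reps, achieved = select_representatives(toy, hamming, radius=1, coverage=0.8)
print([toy[i] for i in reps], round(achieved, 3))
```

On this toy set, two representatives are enough for 80% coverage, while asking for full coverage forces the isolated sequence `CCCC` into the representative set as well. This shows the trade-off between the size of the set and the coverage it achieves.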
Paper Citation
in Harvard Style
Gabadinho A., Ritschard G., Studer M. and Müller N. (2009). SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 62-69. DOI: 10.5220/0002300400620069
in Bibtex Style
@conference{kdir09,
author={Alexis Gabadinho and Gilbert Ritschard and Matthias Studer and Nicolas S. Müller},
title={SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={62-69},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002300400620069},
isbn={978-989-674-011-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences
SN - 978-989-674-011-5
AU - Gabadinho A.
AU - Ritschard G.
AU - Studer M.
AU - Müller N.
PY - 2009
SP - 62
EP - 69
DO - 10.5220/0002300400620069