SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences

Alexis Gabadinho, Gilbert Ritschard, Matthias Studer, Nicolas S. Müller

2009

Abstract

This paper is concerned with the summarization of a set of categorical sequence data. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of the sequences in their neighborhood. The goal is to yield a representative set that exhibits the key features of the whole sequence data set and permits easy and sound interpretation. We propose a heuristic for determining the representative set that first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in TraMineR, our R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless by no means limited to social science data and should prove useful in many other domains.
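The two-step heuristic described in the abstract (score the candidates, then eliminate redundancy until the requested coverage is reached) can be illustrated with a minimal Python sketch. This is not the TraMineR implementation: the Hamming distance, the use of neighborhood size as the representativeness score, and the names `hamming`, `select_representatives`, `radius` and `coverage` are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length sequences differ.

    Assumption: the abstract does not name a dissimilarity measure;
    Hamming distance is used here only because it is simple.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def select_representatives(
    seqs: List[Sequence[str]], radius: int, coverage: float
) -> Tuple[List[int], float]:
    """Greedy sketch: return indices of representatives and achieved coverage.

    A sequence is 'covered' when it lies within `radius` of a representative.
    """
    n = len(seqs)
    if n == 0:
        return [], 0.0

    dist = [[hamming(s, t) for t in seqs] for s in seqs]
    # Neighborhood of each sequence: everything within `radius` (itself included).
    neighbors = [{j for j in range(n) if dist[i][j] <= radius} for i in range(n)]

    # Step 1: candidate list sorted by a representativeness score.
    # Assumption: the score is the neighborhood size (ties broken by index).
    candidates = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

    chosen: List[int] = []
    covered: set = set()
    for i in candidates:
        if len(covered) >= coverage * n:
            break  # requested coverage reached
        # Step 2: skip candidates redundant with an already chosen representative.
        if any(dist[i][r] <= radius for r in chosen):
            continue
        chosen.append(i)
        covered |= neighbors[i]

    return chosen, len(covered) / n


if __name__ == "__main__":
    data = ["AAAA", "AAAB", "AABA", "CCCC", "CCCD", "BDBD"]
    reps, achieved = select_representatives(data, radius=1, coverage=0.8)
    print([data[i] for i in reps], round(achieved, 3))
```

Being greedy, this sketch yields a small representative set but does not guarantee the smallest possible one; it only shows how a score-ordered candidate list and a redundancy check interact with a coverage threshold.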


Paper Citation


in Harvard Style

Gabadinho A., Ritschard G., Studer M. and Müller N. (2009). SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009), ISBN 978-989-674-011-5, pages 62-69. DOI: 10.5220/0002300400620069


in Bibtex Style

@conference{kdir09,
author={Alexis Gabadinho and Gilbert Ritschard and Matthias Studer and Nicolas S. Müller},
title={SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={62-69},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002300400620069},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences
SN - 978-989-674-011-5
AU - Gabadinho A.
AU - Ritschard G.
AU - Studer M.
AU - Müller N.
PY - 2009
SP - 62
EP - 69
DO - 10.5220/0002300400620069