LEARNING SIMILARITY FUNCTIONS FOR EVENT IDENTIFICATION USING SUPPORT VECTOR MACHINES
Timo Reuter, Philipp Cimiano
2011
Abstract
Every clustering algorithm requires a similarity measure, ideally optimized for the task in question. In this paper we are concerned with the task of identifying events in social media data and address the question of how a suitable similarity function can be learned from training data for this task. The task essentially consists of grouping social media documents by the event they belong to. In order to learn a similarity measure using machine learning techniques, we extract relevant events from last.fm and match the unique machine tags for these events to pictures uploaded to Flickr, thus obtaining a gold standard where each picture is assigned to its corresponding event. We evaluate the similarity measure with respect to accuracy on the task of assigning a picture to its correct event. We use SVMs to train an appropriate similarity measure and investigate the performance of different types of SVMs (Ranking SVMs vs. standard SVMs), different strategies for creating training data, as well as the impact of the amount of training data and of the kernel used. Our results show that a suitable similarity measure can be learned from only a few examples, given a suitable strategy for creating training data. We also show that Ranking SVMs i) can learn from fewer examples, ii) are more robust than standard SVMs, in the sense that their performance does not vary significantly for different sizes and samples of training data, and iii) are not as prone to overfitting as standard SVMs.
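To illustrate the idea of learning a similarity function as a ranking problem, the following is a minimal sketch and not the authors' implementation. It assumes that each (picture, event) pair is described by a small feature vector (the three features, the toy values and all function names here are hypothetical), and it trains a linear ranking SVM by subgradient descent on the pairwise hinge loss, so that the correct event scores higher than a wrong one. A picture is then assigned to the candidate event with the highest learned similarity.

```python
import random


def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_ranking_svm(pairs, dim, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Learn w such that w.x_pos > w.x_neg for every training pair.

    pairs: list of (x_pos, x_neg), where x_pos describes a picture with its
    correct event and x_neg the same picture with a wrong event.
    Minimizes reg/2 * ||w||^2 + max(0, 1 - w.(x_pos - x_neg)) per pair.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for x_pos, x_neg in pairs:
            diff = [p - n for p, n in zip(x_pos, x_neg)]
            violated = dot(w, diff) < 1.0  # margin not yet satisfied
            for i in range(dim):
                grad = reg * w[i] - (diff[i] if violated else 0.0)
                w[i] -= lr * grad
    return w


def assign_event(w, candidates):
    """Return the event id whose (picture, event) features score highest."""
    return max(candidates, key=lambda event: dot(w, candidates[event]))


# Hypothetical features: [tag overlap, time closeness, geo closeness].
training_pairs = [
    ([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]),
    ([0.6, 0.9, 0.8], [0.5, 0.1, 0.2]),
    ([0.8, 0.7, 0.9], [0.2, 0.3, 0.1]),
]
w = train_ranking_svm(training_pairs, dim=3)

# Candidate events for one new picture (toy values).
candidates = {"event_a": [0.7, 0.8, 0.9], "event_b": [0.2, 0.1, 0.3]}
best = assign_event(w, candidates)
```

Accuracy in the sense used above would then be the fraction of held-out pictures for which `assign_event` returns the gold-standard event. A standard (classification) SVM would instead label each (picture, event) pair as matching or non-matching; the pairwise formulation only asks that the correct event be ranked above the incorrect ones.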
Paper Citation
in Harvard Style
Reuter T. and Cimiano P. (2011). LEARNING SIMILARITY FUNCTIONS FOR EVENT IDENTIFICATION USING SUPPORT VECTOR MACHINES. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 200-207. DOI: 10.5220/0003654602080215
in Bibtex Style
@conference{kdir11,
author={Timo Reuter and Philipp Cimiano},
title={LEARNING SIMILARITY FUNCTIONS FOR EVENT IDENTIFICATION USING SUPPORT VECTOR MACHINES},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={200-207},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003654602080215},
isbn={978-989-8425-79-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - LEARNING SIMILARITY FUNCTIONS FOR EVENT IDENTIFICATION USING SUPPORT VECTOR MACHINES
SN - 978-989-8425-79-9
AU - Reuter T.
AU - Cimiano P.
PY - 2011
SP - 200
EP - 207
DO - 10.5220/0003654602080215