SCALABLE CORPUS ANNOTATION BY GRAPH CONSTRUCTION AND LABEL PROPAGATION

Thomas Lansdall-Welfare, Ilias Flaounas, Nello Cristianini

Abstract

The efficient annotation of documents in vast corpora calls for scalable methods of text classification. Representing the documents as graph vertices, rather than as vectors in a bag-of-words space, allows the necessary information to be pre-computed and stored. It also fundamentally changes the problem definition, from a content-based to a relation-based classification problem. Efficiently creating a graph in which nearby documents are likely to have the same annotation is the central task of this paper. We compare the effectiveness of various approaches to graph construction by building graphs of 800,000 vertices based on the Reuters corpus, showing that relation-based classification is competitive with Support Vector Machines, which can be considered state of the art. We further show that combining our relation-based approach with Support Vector Machines improves on either method used individually.
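To make the relation-based idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarity over raw term counts, a brute-force k-nearest-neighbour graph, and a simple clamped averaging scheme for label propagation; the function names (`cosine`, `knn_graph`, `propagate`) and the toy documents are hypothetical. The brute-force pairwise comparison is used only for brevity and would not scale to the corpus sizes discussed in the abstract.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_graph(docs, k):
    """Link each document to its k most similar documents (brute force, O(n^2)).

    The result is a directed graph: vertex -> list of (neighbour, weight).
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    graph = {}
    for i, v in enumerate(vecs):
        sims = [(cosine(v, w), j) for j, w in enumerate(vecs) if j != i]
        sims.sort(reverse=True)
        # Keep only neighbours that share at least one term.
        graph[i] = [(j, s) for s, j in sims[:k] if s > 0]
    return graph


def propagate(graph, seeds, iterations=20):
    """Spread scores in [-1, 1] from labelled seed vertices along weighted edges."""
    scores = {i: seeds.get(i, 0.0) for i in graph}
    for _ in range(iterations):
        updated = {}
        for i, nbrs in graph.items():
            if i in seeds:  # labelled vertices stay clamped to their label
                updated[i] = seeds[i]
                continue
            total = sum(w for _, w in nbrs)
            updated[i] = (
                sum(w * scores[j] for j, w in nbrs) / total if total else 0.0
            )
        scores = updated
    return scores


# Toy corpus (hypothetical): two labelled and two unlabelled documents.
docs = [
    "stocks fall as markets react to bank results",  # 0: labelled +1
    "bank shares and stocks rally on markets",       # 1: unlabelled
    "team wins the football match in extra time",    # 2: labelled -1
    "football team loses match after late goal",     # 3: unlabelled
]
graph = knn_graph(docs, k=1)
scores = propagate(graph, seeds={0: 1.0, 2: -1.0})
```

In this toy run, document 1 inherits a positive score from document 0 and document 3 a negative score from document 2. The classifier never inspects the unlabelled documents' content directly at prediction time; it uses only the pre-computed edges, which is the sense in which the problem becomes relation-based.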



Paper Citation


in Harvard Style

Lansdall-Welfare T., Flaounas I. and Cristianini N. (2012). SCALABLE CORPUS ANNOTATION BY GRAPH CONSTRUCTION AND LABEL PROPAGATION. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8425-98-0, pages 25-34. DOI: 10.5220/0003728700250034


in Bibtex Style

@conference{icpram12,
author={Thomas Lansdall-Welfare and Ilias Flaounas and Nello Cristianini},
title={SCALABLE CORPUS ANNOTATION BY GRAPH CONSTRUCTION AND LABEL PROPAGATION},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2012},
pages={25-34},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003728700250034},
isbn={978-989-8425-98-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - SCALABLE CORPUS ANNOTATION BY GRAPH CONSTRUCTION AND LABEL PROPAGATION
SN - 978-989-8425-98-0
AU - Lansdall-Welfare T.
AU - Flaounas I.
AU - Cristianini N.
PY - 2012
SP - 25
EP - 34
DO - 10.5220/0003728700250034