graph is continuously learning from new data and
improving its quality of annotation, while the SVM is
fixed in its classification after the initial training
period.
In this paper, we investigated two local neighbourhood
methods, ε-Neighbourhood and k-Nearest Neighbour, for
constructing graphs over text. We showed that sparse
graphs can be constructed from large text corpora in
O(N log N) time, and that the cost of propagating
labels is linear in the size of the graph, i.e. O(N).
Our results show that the graph-based methods are
competitive with content-based SVM methods, and that
combining the graph-based and content-based methods
improves performance further.
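As an illustration of the two neighbourhood definitions and of a single label propagation pass, the following is a minimal sketch rather than the implementation evaluated in this paper. It assumes documents are sparse bag-of-words dictionaries compared by cosine similarity, and it finds neighbours by brute-force all-pairs comparison, which costs O(N^2) and so does not reproduce the O(N log N) construction discussed above. The function names (`knn_graph`, `epsilon_graph`, `propagate`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_graph(docs, k):
    """k-Nearest Neighbour graph: each node links to its k most similar nodes.

    Brute-force search, used here only for illustration (assumption).
    """
    graph = {}
    for i, u in enumerate(docs):
        sims = [(cosine(u, v), j) for j, v in enumerate(docs) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = [(j, s) for s, j in sims[:k] if s > 0.0]
    return graph


def epsilon_graph(docs, eps):
    """Epsilon-Neighbourhood graph: link every pair with similarity >= eps."""
    graph = {}
    for i, u in enumerate(docs):
        edges = []
        for j, v in enumerate(docs):
            if j != i:
                s = cosine(u, v)
                if s >= eps:
                    edges.append((j, s))
        graph[i] = edges
    return graph


def propagate(graph, labels):
    """One propagation pass over the edges (cost linear in the edge count).

    Each unlabelled node takes the similarity-weighted majority label of its
    labelled neighbours; nodes with no labelled neighbour stay unlabelled.
    """
    result = dict(labels)
    for node, edges in graph.items():
        if node in labels:
            continue
        votes = Counter()
        for neighbour, sim in edges:
            if neighbour in labels:
                votes[labels[neighbour]] += sim
        if votes:
            result[node] = votes.most_common(1)[0][0]
    return result


# Toy corpus (hypothetical): two labelled and two unlabelled documents.
docs = [
    {"football": 1, "goal": 1},
    {"football": 1, "match": 1, "goal": 1},
    {"election": 1, "vote": 1},
    {"election": 1, "vote": 1, "poll": 1},
]
labels = {0: "sport", 2: "politics"}

annotated = propagate(knn_graph(docs, k=1), labels)
```

With k fixed, the kNN graph has at most kN edges, so the propagation pass touches each edge once; the ε-Neighbourhood graph gives no such bound on node degree, which is one practical difference between the two constructions.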
The proposed methods can easily be scaled out to a
distributed setting using currently available open
source software such as Apache Solr² or Katta³,
allowing a user to handle millions of texts with
similarly effective performance.
Research into novel ways of combining the
relation-based and content-based methods could lead to
further improvements in categorisation performance
while keeping the cost of building the graph and
propagating labels on it to a minimum.
ACKNOWLEDGEMENTS
I. Flaounas and N. Cristianini are supported by FP7
under grant agreement no. 231495. N. Cristianini is
supported by a Royal Society Wolfson Research Merit
Award. All authors are supported by the Pascal2
Network of Excellence.
² Open source implementation of a distributed inverted
index. Available at: http://lucene.apache.org/solr/
³ Open source implementation of a distributed inverted
index. Available at: http://katta.sourceforge.net/
SCALABLE CORPUS ANNOTATION BY GRAPH CONSTRUCTION AND LABEL PROPAGATION