TEXT CLASSIFICATION THROUGH TIME - Efficient Label Propagation in Time-Based Graphs

Shumeet Baluja, Deepak Ravichandran, D. Sivakumar

Abstract

A fundamental assumption of machine-learning-based text classification systems is that the labeled text and the text to be labeled are drawn from the same underlying distribution. In live news aggregation sites, however, this assumption rarely holds: the events and topics discussed in news stories change dramatically over time. Rather than ignoring this phenomenon, we explicitly model the transitions of news stories and classifications over time in order to label stories that may be acquired months after the initial examples are labeled. We test our system, which is based on efficiently propagating labels in time-based graphs, on recently published news stories collected over an eighty-day period. The experiments presented in this paper range from using training labels for every story gathered within the first several days to using a single story as a label.
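The abstract does not spell out the algorithm, so the following is only a minimal sketch of what label propagation over a time-based graph could look like, not the authors' implementation. It assumes stories are nodes, that edges link stories published within a few days of each other whose word sets overlap (Jaccard similarity), and that seed labels are clamped while unlabeled stories repeatedly average their neighbors' label scores. The names `build_time_graph`, `propagate_labels`, `window`, and `min_sim`, as well as the toy stories, are hypothetical.

```python
def build_time_graph(stories, window=2, min_sim=0.2):
    """Link stories that are close in time and share vocabulary.

    stories: list of (day, set_of_words) tuples (a hypothetical input format).
    Returns a dict: node index -> {neighbor index: edge weight}.
    """
    edges = {i: {} for i in range(len(stories))}
    for i, (day_i, words_i) in enumerate(stories):
        for j in range(i + 1, len(stories)):
            day_j, words_j = stories[j]
            if abs(day_i - day_j) > window:
                continue  # only connect stories published close together
            union = words_i | words_j
            sim = len(words_i & words_j) / len(union) if union else 0.0
            if sim >= min_sim:
                edges[i][j] = sim
                edges[j][i] = sim
    return edges


def propagate_labels(edges, seeds, iterations=20):
    """Generic iterative label propagation with clamped seed nodes.

    seeds: dict mapping node index -> label for the few labeled stories.
    Returns a dict mapping node index -> predicted label (or None).
    """
    labels = sorted(set(seeds.values()))
    scores = {n: {lab: 0.0 for lab in labels} for n in edges}
    for node, lab in seeds.items():
        scores[node][lab] = 1.0

    for _ in range(iterations):
        updated = {}
        for node, neighbors in edges.items():
            total = sum(neighbors.values())
            if node in seeds or total == 0:
                updated[node] = dict(scores[node])  # seeds stay fixed
                continue
            # Weighted average of the neighbors' current label scores.
            updated[node] = {
                lab: sum(w * scores[nb][lab] for nb, w in neighbors.items()) / total
                for lab in labels
            }
        scores = updated

    predicted = {}
    for node, dist in scores.items():
        best = max(labels, key=lambda lab: dist[lab])
        predicted[node] = best if dist[best] > 0 else None
    return predicted


# Toy data: two labeled stories on day 0, later stories unlabeled.
stories = [
    (0, {"election", "vote", "senate"}),
    (0, {"game", "score", "team"}),
    (1, {"election", "vote", "poll"}),
    (2, {"team", "score", "coach"}),
    (3, {"poll", "vote", "debate"}),
    (4, {"coach", "team", "trade"}),
]
graph = build_time_graph(stories)
predicted = propagate_labels(graph, seeds={0: "politics", 1: "sports"})
print(predicted)
```

In this toy setup the stories from days 3 and 4 share no direct edge with a labeled story; they receive a label only through intermediate stories, which is the drift-over-time situation the abstract describes.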



Paper Citation


in Harvard Style

Baluja S., Ravichandran D. and Sivakumar D. (2009). TEXT CLASSIFICATION THROUGH TIME - Efficient Label Propagation in Time-Based Graphs. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 174-182. DOI: 10.5220/0002303001740182


in Bibtex Style

@conference{kdir09,
author={Shumeet Baluja and Deepak Ravichandran and D. Sivakumar},
title={TEXT CLASSIFICATION THROUGH TIME - Efficient Label Propagation in Time-Based Graphs},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={174-182},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002303001740182},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - TEXT CLASSIFICATION THROUGH TIME - Efficient Label Propagation in Time-Based Graphs
SN - 978-989-674-011-5
AU - Baluja S.
AU - Ravichandran D.
AU - Sivakumar D.
PY - 2009
SP - 174
EP - 182
DO - 10.5220/0002303001740182