Machine Learning Techniques for Topic Spotting

Nadia Shakir, Erum Iftikhar, Imran Sarwar Bajwa

Abstract

Automatically choosing topics that describe the contents of a text document is a useful technique for text categorization. For example, queries sent on the web can use this technique to identify the query topic and forward the query to a small group of people accordingly. Similarly, online blogs can be categorized according to the topics they relate to. In this paper, we applied machine learning techniques to the problem of topic spotting. We used supervised learning techniques, which are highly dependent on the training data and the particular training algorithm used. Our approach differs from automatic text clustering in two respects. First, clustering uses unsupervised learning, whereas our approach is supervised. Second, the topics are known in advance and come from an exhaustive list of words. The machine learning techniques we applied are 1) a neural network, 2) a Naïve Bayes classifier, 3) instance-based learning using k-nearest neighbours, and 4) the decision tree method. We used the Reuters-21578 text categorization dataset for our experiments.
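As a rough illustration of supervised topic spotting with a fixed topic list, the sketch below trains a multinomial Naïve Bayes classifier with add-one smoothing on a tiny invented corpus. It is not the authors' implementation and does not use Reuters-21578; the documents, the topic names, and the function names (train_naive_bayes, spot_topic) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (document text, topic) pairs, invented for
# illustration only; the paper itself uses the Reuters-21578 collection.
TRAIN = [
    ("wheat harvest grain export", "grain"),
    ("grain shipment wheat prices", "grain"),
    ("crude oil barrel opec", "crude"),
    ("oil prices crude output", "crude"),
]


def train_naive_bayes(examples):
    """Count documents per topic and word occurrences per topic."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, topic in examples:
        topic_counts[topic] += 1
        for word in text.split():
            word_counts[topic][word] += 1
            vocab.add(word)
    return topic_counts, word_counts, vocab


def spot_topic(text, model):
    """Return the known topic with the highest log-posterior score."""
    topic_counts, word_counts, vocab = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in sorted(topic_counts):
        # Log prior: fraction of training documents labelled with this topic.
        score = math.log(topic_counts[topic] / total_docs)
        total_words = sum(word_counts[topic].values())
        for word in text.split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[topic][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


model = train_naive_bayes(TRAIN)
print(spot_topic("opec cuts oil output", model))  # prints "crude"
```

Because the topics are fixed in advance, the classifier only ever chooses among labels seen during training, which is the main contrast with unsupervised clustering drawn in the abstract.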



Paper Citation


in Harvard Style

Shakir N., Iftikhar E. and Bajwa I. (2014). Machine Learning Techniques for Topic Spotting. In Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-027-7, pages 450-455. DOI: 10.5220/0004881604500455


in Bibtex Style

@conference{iceis14,
author={Nadia Shakir and Erum Iftikhar and Imran Sarwar Bajwa},
title={Machine Learning Techniques for Topic Spotting},
booktitle={Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2014},
pages={450-455},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004881604500455},
isbn={978-989-758-027-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Machine Learning Techniques for Topic Spotting
SN - 978-989-758-027-7
AU - Shakir N.
AU - Iftikhar E.
AU - Bajwa I.
PY - 2014
SP - 450
EP - 455
DO - 10.5220/0004881604500455