Machine Learning Techniques for Topic Spotting
Nadia Shakir¹, Erum Iftikhar² and Imran Sarwar Bajwa²
¹Department of Computer Science, Quaid-i-Azam University, Islamabad, Pakistan
²Department of Computer Science & IT, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
Keywords: Machine Learning, Topic Spotting, Decision Tree, Neural Networks, K-Nearest Neighbours, Naive Bayes.
Abstract: Automatically choosing topics that describe the contents of text documents is a useful technique
for text categorization. For example, queries sent on the web can use this technique to identify the query
topic and forward the query to a small group of people accordingly. Similarly, online blogs can be categorized
according to the topics they are related to. In this paper, we applied machine learning techniques to the
problem of topic spotting. We used supervised learning techniques, which are highly dependent on the training
data and the particular training algorithm used. Our approach differs from automatic text clustering, which
uses unsupervised learning to cluster the text. In addition, the topics are known in advance and come from
an exhaustive list of words. The machine learning techniques we applied are 1) neural networks, 2) the Naïve
Bayes classifier, 3) instance-based learning using k-nearest neighbours and 4) the decision tree method. We
used the Reuters-21578 text categorization dataset for our experiments.
1 INTRODUCTION
Given a text document, can it be classified as being
related to one or more given topics? Automatic topic
assignment to text documents, or topic spotting, has
applications in information retrieval systems and
enterprise portals (Hotho et al., 2003; Wiener et al.,
1995). Two machine learning approaches, namely
supervised learning and unsupervised learning, can be
applied to the problem of topic spotting (Hotho et al.,
2003). These types of learning differ only in the
structure of the model used. In the supervised
learning approach, a model describes the effect of
one set of observations (such as inputs) on another
set of observations (such as outputs). As a result,
the chain starts at the inputs and ends at the
outputs, with mediating variables introduced between
the inputs and the outputs.
In this paper, four supervised learning techniques
are applied to the problem of topic spotting:
artificial neural networks, the naïve Bayes
classifier, instance-based learning using the
k-nearest neighbour technique, and decision trees. We
used a simplified version of the Reuters-21578 data
set to evaluate these techniques empirically. In this
simplified data set, words with high information gain
have already been extracted and are used as
attributes for each of the 10 topic classes. We
merged the documents of all classes into a single
file, arranged the documents randomly, and then
divided the data set into four equal parts. We
applied the k-fold cross-validation technique to all
four approaches described above.
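The evaluation protocol described above (random ordering, four equal parts, k-fold cross-validation) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `k_fold_splits`, the fixed random seed, and the toy document identifiers are hypothetical, and k = 4 is inferred from the four equal parts mentioned in the text.

```python
import random

def k_fold_splits(items, k=4, seed=0):
    """Shuffle items, cut them into k near-equal parts, and yield (train, test) pairs."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)        # arrange the documents randomly
    folds = [shuffled[i::k] for i in range(k)]   # k parts whose sizes differ by at most one
    for i in range(k):
        test = folds[i]                          # one part is held out for testing
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test

docs = [f"doc{n}" for n in range(12)]            # hypothetical document identifiers
splits = list(k_fold_splits(docs, k=4))
for train, test in splits:
    print(len(train), len(test))
```

Each document appears in exactly one test part, so every classifier is trained and tested k times and its scores can be averaged over the k runs.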
Researchers have been addressing the problem of
automatic text categorization since 1961 (Huang et
al., 2009). In the text classification problem,
documents are normally represented by a vector of
numeric values extracted from the document (Hotho
et al., 2003). Genkin et al. (2007) use a Bayesian
logistic regression approach for text categorization
whose results are comparable with those of the
support vector machine technique. Most text
clustering algorithms are based on the vector space
model (Huang et al., 2008). Some researchers use the
term bag of words (BOW) for the vector space model
(Hotho et al., 2003). The BOW or vector space model
has the drawback that it considers only term
frequencies and ignores semantic relationships
between the terms (Huang et al., 2008; Wiener et al.,
1995).
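The limitation of the bag-of-words representation noted above can be illustrated with a small sketch. The helper names `bag_of_words` and `cosine_similarity`, the whitespace tokenization, and the three toy documents are assumptions made for illustration; they are not taken from the paper or from the Reuters-21578 data set.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a document to its term frequencies; word order and meaning are discarded."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_car = bag_of_words("car prices rise")
doc_auto = bag_of_words("automobile costs increase")   # paraphrase with no shared terms
doc_fall = bag_of_words("car prices fall")             # opposite meaning, two shared terms

sim_paraphrase = cosine_similarity(doc_car, doc_auto)
sim_opposite = cosine_similarity(doc_car, doc_fall)
print(sim_paraphrase, sim_opposite)
```

Because the representation only counts terms, the paraphrase shares no dimension with the first document and scores zero, while the document with the opposite meaning scores higher simply because it reuses two of the three words.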
Dimensionality reduction is also important for
the problem of document classification because
high-dimensional documents require more computation
than low-dimensional documents (Huang
et al., 2008). Document dimensionality normally
Shakir N., Iftikhar E. and Bajwa I..
Machine Learning Techniques for Topic Spotting.
DOI: 10.5220/0004881604500455
In Proceedings of the 16th International Conference on Enterprise Information Systems (ICEIS-2014), pages 450-455
ISBN: 978-989-758-027-7
Copyright © 2014 SCITEPRESS (Science and Technology Publications, Lda.)