SUPPORTING INFORMATION RETRIEVAL IN RSS FEEDS
Georges Dubus, Mathieu Bruyen and Nacéra Bennacer
E3S - SUPELEC, 3 rue Joliot-Curie, F-91192, Gif-sur-Yvette, France
Keywords:
Information retrieval, Text mining, Partitioning clustering, k-means, RSS feeds, XML, TFIDF.
Abstract:
Really Simple Syndication (RSS) information feeds present new challenges to information retrieval
technologies. In this paper we propose an RSS feed retrieval approach that aims to give the user a
personalized view of feed items and to make their content easier to access. We define several filters
to construct the vocabulary used in the text describing feed items. This filtering takes into account
both the lexical category and the frequency of terms. The set of feed items is then represented in an
m-dimensional vector space. The k-means clustering algorithm, with an adapted centroid computation and
distance measure, is applied to find clusters automatically. The clusters, indexed by relevant terms,
can then be refined, labeled and browsed by the user. We evaluate the approach on a collection of feed
items gathered from news sites. The resulting clusters show good cohesion and separation. This provides
meaningful classes for organizing the information and for classifying new feed items.
1 INTRODUCTION
Really Simple Syndication (RSS) information feeds
present new challenges to information retrieval
technologies. These feeds keep people who regularly
use the web informed of the latest updates from the
sites they are interested in. The number of sites that
syndicate their content as RSS feeds increases
continuously. Aggregator tools allow users to grab the
feeds from various sites and to display them. However,
the subscriber can be overwhelmed by the number of
news items provided. Besides, different feed items may
report the same information, so it is worthwhile to
group them and give the user a more complete and less
scattered view. For example, the items about the
Iranian war should be grouped in one cluster, and
those about ecology and the environment in Europe
should be found in another cluster.
In this paper we propose an RSS Organizing and
Classification System (ROCS) approach, which aims
to give the user a personalized view of items and
to make their content easier to access.
Many works investigate different aspects of text
information retrieval, such as knowledge mining,
information organization and search. In the vector
space model proposed in (Salton et al., 1975), a text
is represented by a bag of terms (words or phrases).
Each term then becomes an independent dimension in a
very high-dimensional vector space. The vocabulary
selection depends strongly on the processed collection
and may be based on statistical techniques, natural
language processing, document structures and
ontologies (Cimiano et al., 2005; Etzioni et al., 2005;
Thiam et al., 2009). Unsupervised clustering methods
based on such a representation have
been used for automatic information extraction (Jain
et al., 1999).
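As an illustration of the bag-of-terms representation described above, the following minimal sketch maps a few short texts to term-count vectors, one dimension per vocabulary term. The tokenizer, the sample texts and the function names are hypothetical, and the cosine similarity is an assumption chosen only to show that texts sharing terms lie closer together; it is not necessarily the measure used in this work.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphabetic characters (naive tokenizer)."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def build_vocabulary(documents):
    """Assign one vector dimension (index) to each distinct term of the collection."""
    terms = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def to_vector(text, vocabulary):
    """Represent a text as an m-dimensional vector of raw term counts."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed measure; 0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample texts standing in for feed item descriptions.
items = [
    "Flood warning issued in Europe",
    "Europe flood damage rises",
    "New phone released today",
]
vocabulary = build_vocabulary(items)
vectors = [to_vector(item, vocabulary) for item in items]

# The first two items share the terms "europe" and "flood"; the third shares none.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

With raw counts, every distinct term adds a dimension, which is why the space becomes very high-dimensional and why the vocabulary selection matters.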
In our proposal, we define different filters to select
the vocabulary that will be used to construct the
clustering model. The lexico-syntactic filter selects
words according to their lexical category. The
stop-words filter discards words that are considered
non-informative. The statistical filter selects
words according to their frequency over all
the items. Term weighting, using the tfidf measure,
represents the discriminatory power of terms. The
unsupervised clustering algorithm k-means is applied
with the k-means++ centroid computation (Arthur and
Vassilvitskii, 2007), which is a way of avoiding poor or
overly large clusters. The distance metric used on the
vector space evaluates the similarity/dissimilarity
between items by taking into account the terms that
these items share. Once the clusters are automatically
generated, they can be validated, refined and
Dubus G., Bruyen M. and Bennacer N.
SUPPORTING INFORMATION RETRIEVAL IN RSS FEEDS.
DOI: 10.5220/0002809103070312
In Proceedings of the 6th International Conference on Web Information Systems and Technology (WEBIST 2010), page
ISBN: 978-989-674-025-2
Copyright © 2010 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved