as prediction of words related to a single term query.
Experiments on different small domains (web repositories) are presented and discussed.
The paper is organized as follows. In Section 2 we introduce basic notions about traditional and probabilistic indexing techniques. A probabilistic model, namely the topic model, is presented in Section 3, together with a procedure for single- and multi-word prediction. An algorithm for building a semantic index is illustrated in Section 4, together with illustrative examples from real environments. Finally, in Section 5 we present some conclusions.
2 FROM TRADITIONAL TO
PROBABILISTIC INDEXING
TECHNIQUES
Several proposals have been made by researchers for the information retrieval (IR) problem (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by IR researchers for text corpora, a methodology successfully deployed in modern Internet search engines, reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. Following this methodology we obtain the popular term frequency-inverse document frequency (tf-idf) scheme (Salton and McGill, 1983): a basic vocabulary of "words" or "terms" is chosen and, for each document in the corpus, the number of occurrences of each word is counted. After suitable normalization, the term frequency count is compared with the inverse document frequency count, yielding the term-by-document matrix W whose columns contain the tf-idf values for each of the documents in the corpus.
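To make the scheme concrete, the following minimal sketch builds such a term-by-document matrix W from a toy corpus. The corpus, the normalization, and the idf variant used here are illustrative assumptions, not necessarily the exact formulation adopted in our system.

```python
# Minimal tf-idf sketch: toy corpus, normalized term frequency and a
# plain logarithmic idf are illustrative choices.
import math
from collections import Counter

corpus = [
    "probabilistic indexing of web pages",
    "semantic indexing techniques for web repositories",
    "topic models for word prediction",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# df[w]: number of documents containing term w
df = {w: sum(1 for doc in docs if w in doc) for w in vocab}

# Term-by-document matrix W: rows are terms, columns are documents,
# entries are tf-idf values.
W = [[0.0] * n_docs for _ in vocab]
for j, doc in enumerate(docs):
    counts = Counter(doc)
    for i, w in enumerate(vocab):
        tf = counts[w] / len(doc)        # normalized term frequency
        idf = math.log(n_docs / df[w])   # inverse document frequency
        W[i][j] = tf * idf
```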
The tf-idf scheme thus reduces documents of arbitrary length to fixed-length lists of numbers, but it provides only a relatively small reduction in description length and reveals little in the way of inter- or intra-document statistical structure. The latent semantic indexing (LSI) technique (Deerwester et al., 1990) has been proposed in order to address these shortcomings. This method uses a singular value decomposition of the matrix W to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. This approach can achieve significant compression in large collections.
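A minimal sketch of this decomposition, with a random matrix standing in for W and an illustrative number k of retained dimensions, could look as follows.

```python
# LSI sketch: a truncated SVD of the term-by-document matrix W
# projects documents onto a k-dimensional latent subspace.
import numpy as np

W = np.random.rand(50, 10)  # stand-in for a tf-idf term-by-document matrix
k = 3                       # number of latent dimensions to keep

U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Rank-k approximation of W: keeps the directions of largest variance.
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reduced document representations: each column is a k-dimensional
# description of the corresponding document.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]
```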
A significant step toward a fully probabilistic approach to dimensionality reduction was then made by Hofmann (Hofmann, 1999), who presented the probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI.
The pLSI approach models each word in a document
as a sample from a mixture model, where the mixture
components are multinomial random variables that
can be viewed as representations of “topics”. Thus
each word is generated from a single topic, and differ-
ent words in a document may be generated from dif-
ferent topics. Each document is represented as a list
of mixing proportions for these mixture components
and thereby reduced to a probability distribution on
a fixed set of topics. This distribution is the reduced
description associated with the document.
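The following sketch illustrates this reduced description with toy (not learned) distributions: the word distribution of a document is the mixture P(w|d) = sum_z P(z|d) P(w|z).

```python
# pLSI/aspect-model view: each word probability in a document is a
# mixture over topics. All numbers below are toy values.
import numpy as np

# P(w|z): one multinomial over a 5-word vocabulary per topic.
p_w_given_z = np.array([
    [0.4, 0.3, 0.1, 0.1, 0.1],  # topic 0
    [0.1, 0.1, 0.2, 0.3, 0.3],  # topic 1
])

# P(z|d): the reduced description of a document is its vector of
# mixing proportions over the fixed set of topics.
p_z_given_d = np.array([0.7, 0.3])

# Word distribution induced by the mixture for this document.
p_w_given_d = p_z_given_d @ p_w_given_z
assert abs(p_w_given_d.sum() - 1.0) < 1e-9
```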
While Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. This leads to several problems: the model is prone to overfitting, and it is unclear how to assign probability to a document outside the training set. To overcome these problems a new probabilistic method, called Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has been introduced; we exploit it in this paper to capture the essential statistical relationships between the words contained in web pages' indices. This method is based on the bag-of-words assumption: the order of words in a document can be neglected. In the language of probability theory, this is an assumption of exchangeability for the words in a document (Aldous, 1985), and it holds for documents as well: the specific ordering of the documents in a corpus can also be neglected. A classic representation theorem establishes that any collection of exchangeable random variables has a representation as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. In this paper we propose a hybrid approach in which the LDA technique is embedded in a traditional procedure, the tf-idf scheme. More details are discussed next, and a sketch of the hybrid pipeline follows.
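One plausible reading of such a hybrid scheme, sketched here with the gensim library, is to use tf-idf weights to prune the vocabulary before fitting LDA on the reduced bag-of-words corpus. The pruning threshold and topic count are illustrative assumptions, not necessarily the procedure detailed in the following sections.

```python
# Hybrid sketch (an assumption, not this paper's exact procedure):
# tf-idf prunes low-weight terms, then LDA is fit on what remains.
from gensim import corpora, models

texts = [
    ["probabilistic", "indexing", "web", "pages"],
    ["semantic", "indexing", "web", "repositories"],
    ["topic", "models", "word", "prediction"],
]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Step 1 (traditional): score terms with tf-idf.
tfidf = models.TfidfModel(bow)

# Keep only terms whose tf-idf weight exceeds a chosen threshold.
pruned = []
for doc in bow:
    weights = dict(tfidf[doc])
    pruned.append([(i, c) for (i, c) in doc if weights.get(i, 0.0) > 0.1])

# Step 2 (probabilistic): fit LDA on the pruned corpus.
lda = models.LdaModel(pruned, num_topics=2, id2word=dictionary)
print(lda.show_topics())
```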
3 PROBABILISTIC TOPIC
MODEL: LDA MODEL
As discussed above, a variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words. These models all rely on the same fundamental idea, that a document is a mixture of topics, but make slightly different statistical assumptions. In this paper we use the topic model discussed in (Griffiths et al., 2007), based on the LDA algorithm (Blei et al., 2003), where statistical dependence among words is assumed. Following this approach, four problems have to be solved: word