USAGE BASED INDEXING OF WEB RESOURCES WITH NATURAL
LANGUAGE PROCESSING
Armelle Brun and Anne Boyer
INRIA Lorraine - Université Nancy 2, France
Keywords:
Recommender systems, collaborative filtering, analysis of usage, statistical language model, resource indexing.
Abstract:
Due to the huge amount of information available on the Internet, identifying reliable and interesting items becomes more and more difficult and time consuming. This position paper describes our intended work on multimedia information retrieval by browsing techniques within web navigation. It relies on a usage-based indexing of resources: we ignore the nature, the content and the structure of resources. We describe a new approach taking advantage of the similarity between statistical language modeling and document retrieval systems. A syntax of usage is computed, which defines a Statistical Grammar of Usage (SGU). An SGU enables resource classification in order to build a personalized navigation assistant tool. It relies both on collaborative filtering, to compute virtual communities of users, and on a new distance-dependent trigger model. The resulting SGU is a community-dependent SGU.
1 INTRODUCTION
On the Internet, the identification of reliable and interesting items becomes more and more difficult and time consuming, even for skilled people using dedicated tools such as powerful search engines. Due to the huge amount of online resources, the major difficulty is no longer to know whether a pertinent document is available, but to identify the most reliable and interesting items among the overwhelming stream of available information. A key factor of success in information retrieval and delivery is the development of powerful tools that are easy to use for a large audience.
Different approaches for resource retrieval have been explored, such as content analysis, keyword indexing and identification, topic detection, etc. (Baeza-Yates and Ribeiro-Neto, 1999). A major difficulty inherent to such approaches is that one keyword may have different meanings depending on the user, his/her context and the history of his/her past navigations. Moreover, two different keywords may have similar meanings, depending on the context. Expressing a query is a difficult task for many people, and many research and industrial projects deal with query assistance. Furthermore, automatic indexing of multimedia resources is still a hard research problem. To cope with these difficulties, we decide to investigate another way by ignoring the content, the nature, the format and the structure of resources.
This position paper describes our intended work, relying on past research both on collaborative filtering (Castagnos and Boyer, 2006a; Castagnos and Boyer, 2006b) and on statistical language modeling (Smaïli et al., 1999; Brun et al., 2002). We aim at providing a new web browsing tool based on an analysis of usage. This tool enables multimedia information retrieval by browsing techniques without expressing any query: users are modeled without requiring any preference elicitation. This approach easily manages heterogeneous items (video, audio, textual, multimedia) with a single treatment, whereas classical methods require dedicated tools for resource tagging.
We plan to extract frequent patterns of consultations by taking advantage of the analogy between language-based statistical modeling and resource retrieval. These frequent patterns allow the design of a syntax of usage, relying on the hypothesis that there is a logic and coherency defining implicit “rules” inside a navigation. The resulting Statistical Grammar of Usage enables the classification, clustering and selection of resources in order to design personalized filtering.
In the next section, the problem of retrieving re-
sources when browsing is stated and our approach
based on the use of statistical language models is de-
tailed. The following section presents the most popu-
lar statistical language models and their appropriate-
ness to web browsing. An adaptation of the trigger
language model is introduced. Section 4 puts forward
the community-based Statistical Grammar of Usage.
Discussion and perspectives conclude this paper.

Brun A. and Boyer A. (2007).
USAGE BASED INDEXING OF WEB RESOURCES WITH NATURAL LANGUAGE PROCESSING.
In Proceedings of the Third International Conference on Web Information Systems and Technologies - Web Interfaces and Applications, pages 220-225
DOI: 10.5220/0001278902200225
Copyright © SciTePress
2 OUR APPROACH
Our web browsing tool helps users during a navigation process: it suggests pertinent items to a specific user, given his/her past navigation and context. The aim is to compute the pertinence of any resource. The pertinence of a resource is the interest of a user for it, and it is used to predict resources (the higher the pertinence of a resource, the higher its probability of being suggested).
First, we hypothesize an implicit search, which means that the active user has no explicit query to formulate. Secondly, we define a consultation as a sequence of one or more items dedicated to a given search. A multi-navigation is the mix of different consultations within a single browsing process. One single consultation is called a mono-navigation.
A resource is any item (textual, audio, video or multimedia document, web page, hyperlink, forum, blog, website, etc.), viewed as an elementary and indivisible entity without any information about its format, its content or any semantic or topic indexing. The only data describing a resource a priori is a normalized mark called an identifier, which makes it possible to identify and locate it.
Our approach relies on an analysis of usage. A usage is any data explicitly or implicitly left by the user during navigation. For example, consultation history, click-stream or log files are implicit data about the interest of the visited items for the active user. We call appreciation any measure of the user’s satisfaction. This measure can be either explicit information, such as votes or annotations, or any estimation computed from implicit data (Chan, 1999).
An advantage of our approach is that it only takes
into account a measure of the user’s interest for a
given resource, which is directly linked to the perti-
nence criterion: the user’s satisfaction. Our approach
computes a personalized indexing of resources not in
terms of its intrinsic nature but in terms of a more
subjective but more reliable and pertinent criterion,
i.e. the user’s context, preferences and habits. This approach thus manages heterogeneous resources with a single treatment.
The question is how to estimate the a priori pertinence of a resource for a given user. The difficulty lies in the sparsity of data: we do not have any appreciation of a resource if the user has not seen it, and usually most resources have not been seen by this user. As we design a personalized tool, the pertinence cannot be a context-independent measure (the context is both the user’s profile and history).
To compute the a priori pertinence of a resource, we plan to design a grammar of usage. Just as a grammar of language is the set of rules describing the relations between words, a grammar of usage is the set of rules describing the relations between resources. A grammar of language estimates whether a word is pertinent given the beginning of a sentence. A grammar of usage makes it possible to estimate whether a resource is relevant for a specific user given his/her previous consultations.
There is no a priori grammar of usage, as the Internet is a dynamic and moving environment. A means to cope with the difficulty of designing an a priori grammar is the use of a statistical approach based on usage analysis. As huge usage corpora are available (log files and clickstream), it is possible to make regularities in resource consultations explicit. This statistical approach can be investigated in a similar way to language modeling based on statistical models (defining a Statistical Grammar of Language).
The resulting grammar is called a Statistical
Grammar of Usage (SGU). It enables the computa-
tion of the probability of a resource given the active
user and his/her sequence of navigation. This proba-
bility measures the pertinence of the resource.
An SGU, if trained on the whole usage corpus, is a general grammar since it is learned for all users in all contexts. The accuracy of such a grammar is insufficient and, furthermore, the presupposed logic and coherency between users becomes too strong and unrealistic a hypothesis. Given two users, it seems unlikely that they exhibit the same resource consultation behavior: the SGU has to be personalized. Nevertheless, learning a user-specific SGU requires a large amount of data for each user, and it is unrealistic to wait until enough data has been collected to train it. This is the reason why we determine groups of users with similar behavior, called communities. We compute an SGU for each community and thus design a community-based SGU. A similar approach is used in statistical language modeling, where the corpus is split into topic sub-corpora.
Users are preclassified into a set of communities that are coherent in terms of resource consultation behavior. Collaborative filtering techniques are a suitable means to build coherent communities in terms of usage. The principle of collaborative filtering techniques (Herlocker et al., 2004) amounts to associating the active user with a set of users having the same tastes, based on his/her preferences and past visited resources. This approach relies on the hypothesis that users who like the same documents have the same topics of interest. Another hypothesis is that people have relatively constant likings. Thus, it is possible to predict resources likely to match a user’s expectations by taking advantage of the experience of his/her community.
A first comment on usual collaborative filtering techniques is that the structure of navigation is ignored. However, this aspect can be crucial in some applications such as web browsing. For example, a user may not like a resource because he/she has not previously read a prerequisite resource. Thus the SGU suggests a resource when it becomes pertinent for a user, for example when he/she has read all its prerequisites. As statistical language models emphasize the order of words in sentences, it seems interesting to determine whether such models and collaborative filtering can be used together to improve the quality of suggestions.
3 STATISTICAL LANGUAGE
MODELS
3.1 Overview
The role of a statistical language model (SLM) is to assign a likelihood to a given sentence (or sequence of words) in a language (Jelinek and Mercer, 1980; Rosenfeld, 2000). An SLM is defined as a set of probabilities associated with sequences of words. These probabilities reflect the likelihood of those sequences. SLMs are widely used in various natural language applications such as optical character recognition, automatic speech recognition, etc.
Let $W = w_1, \ldots, w_S$ be a word sequence. The probability of $W$ is computed as the product of the conditional probabilities of each word $w_i$ in the sequence:

$$P(W) = \prod_{i=1}^{S} P(w_i \mid h_i) = \prod_{i=1}^{S} P(w_i \mid w_1, \ldots, w_{i-1}) \tag{1}$$

where $w_i$ is the $i$-th word of $W$ and $h_i$ is the history of $w_i$. To estimate these probabilities, a vocabulary $V = \{w_j\}$ is stated. The probabilities of sequences of words are trained on a text corpus, the training corpus.
3.2 Advantage of SLM for Web Browsing
Web browsing and statistical language modeling are similar in several respects. First, statistical language modeling uses a vocabulary made up of words. This set can be viewed as similar to the set of resources R of the web. Then, the text corpus is made up of sentences of words, which can be viewed as similar to the sequences of consultations of the usage corpus. A sequence of S words in a sentence is similar to a sequence of consultations of S resources. Finally, the presence of a word in a sentence mainly depends on its previous words, just as the consultation of a resource mainly depends on the preceding consultations.
Given these similarities, we can naturally investigate the use of models from statistical language modeling in a web browsing assistant. As noticed in the previous section, these models have the characteristic that the order of the elements in the history is crucial. This aspect may be important for specific resources in web browsing.
However, web browsing and natural language processing have two major differences. The first one is that a user may mix different queries within a single history (“multi-navigation”), whereas it is unrealistic to mix different sentences when speaking or writing. This first remark leads us to consider a generalization of SLMs that takes “multi-navigation” into account in the browsing process. The second one is that natural language exhibits stronger constraints: each word in a sentence is important, and deleting or adding a word may change the meaning of the sentence. Web browsing is not so sensitive, and adding or deleting a specific resource within a navigation may have no impact. We therefore have to consider permissive models, able to manage less constrained histories.
3.3 N-grams Language Models
Due to computational constraints and to the reliability of probability estimates, the whole history $h_i$ of a word $w_i$ cannot be systematically used to compute the probability of $W$. Classical SLMs aim at reducing the size of the history while not decreasing performance.

n-grams models (Jurafsky and Martin, 2000) reduce the history of a word to its $n-1$ previous words. These models are the most commonly used in natural language applications. The probability of a word $w_i$ given its history $h_i$ is computed as follows:

$$P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{N(w_{i-n+1} \ldots w_{i-1}, w_i)}{N(w_{i-n+1} \ldots w_{i-1})} \tag{2}$$

where $N(\cdot)$ is the number of occurrences of the argument in the training corpus.
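As a small worked instance of equation (2), with purely illustrative counts that are assumptions and not taken from any corpus: consider a bigram model ($n = 2$) and a training corpus in which the word $a$ occurs 4 times and is followed by the word $b$ in 3 of these occurrences. The estimate is then:

```latex
P(b \mid a) = \frac{N(a, b)}{N(a)} = \frac{3}{4} = 0.75
```

Whatever precedes $a$ in the sentence has no influence on this estimate, which is exactly the history reduction described above.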
An n-grams model can be directly used in web browsing assistance. In the previous section, we put forward that the quality of the model will be increased if it is dedicated to a community and trained on the corresponding community usage corpus. Thus, the usage corpus is split into community usage corpora and a model is trained on each community corpus.
WEBIST 2007 - International Conference on Web Information Systems and Technologies
Let $c_j$ be a community and $h_j = R_{j_1}, \ldots, R_{j_{i-1}}$ a sequence of consultations of resources. The n-grams model computes the probability for each resource $R_i \in R$:

$$P_n(R_i \mid R_{i-n+1}, \ldots, R_{i-1}, c_j) = \frac{N_{c_j}(R_{i-n+1}, \ldots, R_{i-1}, R_i)}{N_{c_j}(R_{i-n+1}, \ldots, R_{i-1})} \tag{3}$$

where $N_{c_j}(\cdot)$ is the number of occurrences of the parameter in the community usage corpus $c_j$. The history $h_j$ has been reduced to the $n-1$ last resources consulted; other resources are discarded. Thus, this model assumes that the consultation of a resource $R_i$ does not mainly depend on resources consulted far from $R_i$.
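Equation (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the community labels and the toy corpora are hypothetical, no smoothing is applied, and a history is counted only when it is followed by a resource so that the estimates for one history sum to 1.

```python
from collections import Counter

def train_ngram(sequences, n):
    """Count n-grams and their (n-1)-resource histories in one community corpus."""
    ngram_counts, history_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            history = tuple(seq[i - n + 1:i])
            ngram_counts[history + (seq[i],)] += 1
            # Assumption: a history is counted only when a resource follows it.
            history_counts[history] += 1
    return ngram_counts, history_counts

def ngram_prob(resource, history, model, n):
    """Relative-frequency estimate of P_n(resource | last n-1 resources)."""
    ngram_counts, history_counts = model
    context = tuple(history[-(n - 1):]) if n > 1 else ()
    if history_counts[context] == 0:
        return 0.0  # unseen history: no estimate without smoothing
    return ngram_counts[context + (resource,)] / history_counts[context]

# Hypothetical usage corpora: one list of consultation sequences per community.
corpora = {
    "c1": [["home", "news", "sport"],
           ["home", "news", "weather"],
           ["home", "news", "sport"]],
    "c2": [["home", "shop", "cart"]],
}
models = {c: train_ngram(seqs, n=2) for c, seqs in corpora.items()}
p = ngram_prob("sport", ["home", "news"], models["c1"], n=2)
```

Training one model per community keeps the counts of each community separate, so the same history can lead to different suggestions in different communities.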
As previously mentioned, adding or deleting a resource in a sequence of consultations has a lower influence on the result of the search than adding or deleting a word in a sentence. Thus, this model does not ideally match our retrieval problem, since the history considered is the exact sequence of consultations $R_{i-n+1} \ldots R_{i-1}$, which may be too restrictive in the general case. However, this model may be suitable for frequent sequences of consultations, which can be considered as “patterns of consultation”. They are assigned a high probability, thus increasing the probability of resources inside such sequences. It would be interesting to take such “patterns of consultation” into account in a more adequate way.
As n-grams models exhibit strong constraints, we are also interested in more permissive models. Trigger-based language models seem to be more adequate for less constrained histories such as navigation.
3.4 Trigger-based Language Models
Trigger-based models (Rosenfeld, 1996) aim at considering long-time dependence between two words ($w_x$ and $w_y$ for instance). Dependence is measured by Mutual Information (MI) (Abramson, 1963). This measure can easily integrate long-time dependence by using a distance parameter $d$: $d$ is the maximum number of words occurring between $w_x$ and $w_y$, so a window of $d$ words is considered. MI between words $w_x$ and $w_y$, in a window of $d$ words, is computed as:

$$MI(w_x, w_y, d) = \log \frac{P_d(w_x, w_y)}{P_d(w_x) P_d(w_y)} \tag{4}$$

where $P_d(w_x, w_y)$ is the probability of $w_x$ preceding $w_y$ at a distance at most $d$, in the training corpus.

A couple $(w_x, w_y)$ with a high MI value means that $w_x$ and $w_y$ are highly correlated and that the presence of $w_x$ raises the probability of occurrence of $w_y$, at a maximal distance of $d$ words. $(w_x, w_y)$ is named a trigger. This model considers only highly correlated pairs of words (corresponding to high MI values); useless pairs are discarded.
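The computation of equation (4) can be sketched as follows. The paper does not state how the probabilities are estimated, so this sketch makes a loudly labeled assumption: each training sequence is one event, $P_d(w_x, w_y)$ is the fraction of sequences in which $w_x$ precedes $w_y$ with at most $d$ items in between, and $P_d(w)$ is the fraction of sequences containing $w$. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def trigger_mi(sequences, d):
    """MI of ordered pairs (x, y) where x precedes y with at most d items in between."""
    n_seq = len(sequences)
    single, pair = Counter(), Counter()
    for seq in sequences:
        single.update(set(seq))  # each item counted once per sequence
        seen = set()
        for i, x in enumerate(seq):
            # Positions i+1 .. i+d+1 have between 0 and d items after position i.
            for y in seq[i + 1:i + d + 2]:
                if x != y:
                    seen.add((x, y))
        pair.update(seen)  # each ordered pair counted once per sequence
    return {
        (x, y): math.log((c / n_seq) / ((single[x] / n_seq) * (single[y] / n_seq)))
        for (x, y), c in pair.items()
    }

# Toy corpus (hypothetical): "a" tends to precede "b".
sequences = [["a", "b"], ["a", "x", "b"], ["c", "d"], ["c", "e"]]
mi = trigger_mi(sequences, d=1)
```

Because only pairs in which the first item precedes the second are counted, `("a", "b")` receives a value while `("b", "a")` does not, which reflects the order sensitivity of triggers. Keeping only the pairs with the highest values would give the trigger set.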
In our web browsing assistant tool, the trigger model is made up of triggers of resources $(R_x, R_y)$. The consultation of $R_x$ triggers the consultation of $R_y$, at a maximal distance of $d$ resources. As the MI measure is not symmetric ($MI(R_x, R_y) \neq MI(R_y, R_x)$), this model integrates the order between resources, which may be crucial for specific resources.
The advantage of this model is the long-time dependence between resources. In a consultation, two resources can be viewed at various distances without changing the meaning of the consultation. Trigger models make it possible to model this kind of influence, when the distance between items is not discriminant but the order of occurrence is meaningful. Such a model is less constrained than n-grams models and seems to be adequate for the navigation problem.
Similarly to the n-grams model, a trigger model is developed for each community $c_j$. MI values are computed for each couple of resources and for each community. A set of most related triggers is extracted for each community $c_j$. This set is called $S_{c_j}$.
The probability of a resource $R_i$, given the community $c_j$, its corresponding set of triggers $S_{c_j}$ and the sequence of consultations of resources $h_j = R_1, \ldots, R_{i-1}$, is:

$$P_t(R_i \mid h_j, c_j) = \frac{\sum_{R_x \in h_j} \delta_{R_x, R_i, h_j, S_{c_j}}}{\sum_{R_x \in h_j} \sum_{R_y \in R} \delta_{R_x, R_y, h_j, S_{c_j}}} \tag{5}$$

with

$$\delta_{R_x, R_i, h_j, S_{c_j}} = \begin{cases} 1 & \text{if } (R_x, R_i) \in S_{c_j} \text{ and } d_j(R_x, R_i) \le d \\ 0 & \text{otherwise} \end{cases}$$

where $d_j(R_x, R_i)$ is the distance between $R_x$ and $R_i$ in $h_j$.
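A minimal sketch of equation (5) is given below. It assumes that the distance $d_j(R_x, R_i)$ is the number of resources consulted between $R_x$ and the resource to be predicted (0 when $R_x$ is the last consultation); the function name and the toy trigger set are hypothetical.

```python
def trigger_prob(candidate, history, triggers, resources, d):
    """P_t(candidate | history) for a set of trigger pairs and a maximal distance d."""
    def delta(pos, y):
        # Assumption: distance = number of resources between history[pos]
        # and the next (predicted) consultation.
        distance = len(history) - pos - 1
        return 1 if (history[pos], y) in triggers and distance <= d else 0

    positions = range(len(history))
    numerator = sum(delta(p, candidate) for p in positions)
    denominator = sum(delta(p, y) for p in positions for y in resources)
    return numerator / denominator if denominator else 0.0

# Hypothetical trigger set S_c for one community.
triggers = {("a", "c"), ("b", "c"), ("b", "d")}
resources = ["a", "b", "c", "d"]
history = ["a", "b"]
p_c = trigger_prob("c", history, triggers, resources, d=5)
p_d = trigger_prob("d", history, triggers, resources, d=5)
```

With this history, two active triggers point to `c` and one to `d`, so `c` receives twice the probability of `d`. With `d=0` only the last consultation `b` can fire a trigger.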
3.5 Distance-dependent Trigger Model
State-of-the-art trigger models, as previously presented, aim at considering long-distance relations (distance between 0 and $d$). Each couple of resources appearing “frequently” at a maximal distance $d$ is included in the model.

However, this kind of relation between resources is too general and we assume that finer relations are present in the corpus. Consider two resources $R_x$ and $R_y$ that always cooccur at a maximal distance $u$, where $u \ll d$. During training, each cooccurrence (at a distance at most $u$) of this couple is counted as a cooccurrence at a maximal distance $d$, and the corresponding MI value is computed. If this MI value is high, the pair is selected as a trigger pair. During test, if $R_x$ occurs, the probability of $R_y$ is increased until $d$ resources have been consulted.
This trigger is misused: since $R_x$ and $R_y$ appear at a distance at most $u$ during training, this distance $u$ should also be considered during test, i.e. the probability of $R_y$ should be increased only at a distance lower than $u$.

Thus, we assume that taking into account finer relations, by using the actual training distance between $R_x$ and $R_y$, will lead to a better modeling of relations between resources and thus increase the quality of the trigger model.
We propose a model able to consider several kinds of relations, such as:
1. The above relation: two resources are mainly consulted at a distance lower than $u$, with $u \ll d$.
2. The converse relation: two resources are mainly consulted at a distance larger than $l$, where $0 \ll l$.
3. Two resources are mainly consulted at a distance between $l$ and $u$, with $l \le u \le d$.
However, fixing the same value for $l$ and the same value for $u$ for all triggers can be suboptimal: obviously, these values depend on both resources of the trigger.
3.5.1 Computing Optimal Values for l and u
Given a community $c_j$ and a pair $(R_x, R_y)$, the optimal values $l^*$ and $u^*$ are the ones maximizing:

$$(l^*, u^*) = \operatorname*{argmax}_{l, u} MI_{c_j}(R_x, R_y, l, u) \tag{6}$$

where $l$ and $u$ range from 0 to $d$. $MI_{c_j}(R_x, R_y, l, u)$ is the mutual information of resources $R_x$ and $R_y$ at a distance ranging from $l$ to $u$ in the community $c_j$ and is computed as follows:

$$MI_{c_j}(R_x, R_y, l, u) = \log \frac{P_{c_j,l,u}(R_x, R_y)}{P_{c_j,l,u}(R_x) \, P_{c_j,l,u}(R_y)} \tag{7}$$

$P_{c_j,l,u}(R_x, R_y)$ is the probability of cooccurrence of $R_x$ and $R_y$ at a distance ranging from $l$ to $u$ in the community corpus $c_j$.
Let us notice that the MI value is not reliable if the values in the denominator are low. Indeed, when those values are too low, the MI value is abnormally high and does not represent the real correlation between the two resources. Thus, MI will not be computed for pairs with low denominator values.
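The search of equation (6) can be sketched as an exhaustive scan over all windows $[l, u]$. The estimation of the probabilities in equation (7) is not detailed in the text, so the sketch below relies on an assumption: all probabilities are relative frequencies over the ordered position pairs of the corpus whose gap (number of resources in between) lies in $[l, u]$. The function names, the `min_count` threshold and the toy corpus are hypothetical.

```python
import math

def windowed_mi(sequences, x, y, l, u, min_count=1):
    """MI of (x, y) over ordered position pairs whose gap lies in [l, u]."""
    total = joint = first_x = second_y = 0
    for seq in sequences:
        for i in range(len(seq)):
            # j - i - 1 is the gap, i.e. the number of resources in between.
            for j in range(i + 1 + l, min(i + u + 2, len(seq))):
                total += 1
                first_x += seq[i] == x
                second_y += seq[j] == y
                joint += seq[i] == x and seq[j] == y
    # Skip unreliable estimates (low denominator counts) and unseen pairs.
    if joint == 0 or first_x < min_count or second_y < min_count:
        return None
    return math.log(joint * total / (first_x * second_y))

def best_window(sequences, x, y, d, min_count=1):
    """Return (l, u, MI) maximizing the windowed MI, or None if no window qualifies."""
    best = None
    for l in range(d + 1):
        for u in range(l, d + 1):
            mi = windowed_mi(sequences, x, y, l, u, min_count)
            if mi is not None and (best is None or mi > best[2]):
                best = (l, u, mi)
    return best

# Toy community corpus (hypothetical): "b" always follows "a" with one resource in between.
corpus = [["a", "n1", "b", "n2", "n3"],
          ["n4", "a", "n5", "b", "n6"],
          ["a", "n7", "b", "n8", "n9"]]
window = best_window(corpus, "a", "b", d=3)
```

On this toy corpus the selected window is the narrow one that matches the observed gap, rather than the full range from 0 to `d`, which is the behavior the distance-dependent model aims for.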
3.5.2 Formalization of the New Trigger Model
The trigger model we propose here is made up of highly correlated and distance-dependent pairs: a lower value of distance $l$ and an upper value of distance $u$ are considered for each pair. Given the trigger $(R_x, R_y, l, u)$, the probability of $R_y$ is increased if $R_x$ occurs in the history $h_j$ and the distance $d_j(R_x, R_y)$ between $R_x$ and $R_y$ is between $l$ and $u$.

Thus, given the community $c_j$, the corresponding set of triggers $S_{c_j}$ and a history $h_j$, the probability assigned to a given resource $R_i$ by the trigger model is defined as in equation (5), with

$$\delta_{R_x, R_i, h_j, S_{c_j}} = \begin{cases} 1 & \text{if } (R_x, R_i) \in S_{c_j} \text{ and } l \le d_j(R_x, R_i) \le u \\ 0 & \text{otherwise} \end{cases}$$
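The only change with respect to the standard trigger model is the test on the distance. A minimal sketch, assuming that the trigger set is stored as a mapping from a pair to its own bounds $(l, u)$ and that the distance is the number of resources between $R_x$ and the resource to be predicted; all names and values are hypothetical.

```python
def dd_trigger_prob(candidate, history, triggers, resources):
    """Distance-dependent trigger probability; triggers maps (R_x, R_y) -> (l, u)."""
    def delta(pos, y):
        bounds = triggers.get((history[pos], y))
        if bounds is None:
            return 0
        l, u = bounds
        # Assumption: distance = number of resources after history[pos].
        return 1 if l <= len(history) - pos - 1 <= u else 0

    positions = range(len(history))
    numerator = sum(delta(p, candidate) for p in positions)
    denominator = sum(delta(p, y) for p in positions for y in resources)
    return numerator / denominator if denominator else 0.0

# Hypothetical triggers: "a" announces "c" immediately, but "d" only later.
dd_triggers = {("a", "c"): (0, 0), ("a", "d"): (1, 3), ("b", "c"): (0, 1)}
all_resources = ["a", "b", "c", "d"]
just_after_a = dd_trigger_prob("c", ["a"], dd_triggers, all_resources)
one_step_later = dd_trigger_prob("d", ["a", "b"], dd_triggers, all_resources)
```

Right after `a`, only the trigger towards `c` is active; one consultation later, the trigger towards `d` becomes active while the one from `a` towards `c` has expired.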
Both n-grams and distance-dependent trigger models are candidates for integration into a web browsing tool. An n-grams model computes the probability of sequences of consultations, whereas a trigger model extracts pairs of distant resources. Consequently, both are interesting to achieve our goal and will be integrated in the community-based SGU we propose.
4 TOWARDS A
COMMUNITY-BASED SGU
The SGU we propose has the advantage of considering both the community of the active user and his/her consultation history, whereas state-of-the-art models usually exploit the set of consultations. The use of this model relies on two steps:
1. Determination of the community $c_j$ of user $U_j$.
2. Computation of the probability of each resource $R_i$, given $c_j$ and the history $h_j$ of $U_j$.
4.1 Community Determination
The objective is to compute a set of user communities based on an analysis of usage. To achieve this goal, we use collaborative filtering techniques. The set of users is split into classes by using a recursive k-means-like algorithm (Castagnos and Boyer, 2006a); the similarity between two users is estimated as the mean of the distance for each commonly voted resource.

The whole corpus is then split into a set of community sub-corpora. Each community corpus is made up of the usage of all users in the community. A user is then assigned to the closest community using the same similarity measure.
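The assignment step can be sketched as follows. This is not the recursive k-means-like algorithm of the cited work; it only illustrates the last sentence above, under two assumptions: each community is summarized by a hypothetical vote profile (resource to mean vote), and the distance on a commonly voted resource is the absolute difference of votes.

```python
def mean_vote_distance(votes_a, votes_b):
    """Mean absolute vote difference over the resources voted in both profiles."""
    common = votes_a.keys() & votes_b.keys()
    if not common:
        return float("inf")  # nothing in common: treated as maximally distant
    return sum(abs(votes_a[r] - votes_b[r]) for r in common) / len(common)

def closest_community(user_votes, community_profiles):
    """Label of the community whose profile is closest to the user's votes."""
    return min(community_profiles,
               key=lambda c: mean_vote_distance(user_votes, community_profiles[c]))

# Hypothetical community profiles and active user (votes on a 1-5 scale).
profiles = {"c1": {"r1": 5, "r2": 1}, "c2": {"r1": 1, "r2": 5}}
community = closest_community({"r1": 4, "r3": 2}, profiles)
```

Only commonly voted resources enter the distance, so a user with few votes can still be assigned as long as at least one resource is shared with a profile.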
4.2 Probability Computation
Given the community $c_j$ of user $U_j$ and his/her history $h_j$, the computation of the probability of a resource $R_i$ relies on three sub-models based on the language models presented in section 3.

The first sub-model computes the probability $P_n(R_i \mid h_j, c_j)$ by exploiting the probabilities of resource sequences of the n-grams model. The second sub-model is the distance-dependent trigger model; it computes the probability $P_t(R_i \mid h_j, c_j)$. The last sub-model is devoted to resources out of the training corpus: an a priori probability $P_a(R_i \mid c_j)$ is assigned to each resource $R_i \in R$.
The resulting model, which can be viewed as a community-based Statistical Grammar of Usage, computes the linear combination of the three previously described sub-models:

$$P(R_i \mid h_j, c_j) = \lambda_n P_n(R_i \mid h_j, c_j) + \lambda_t P_t(R_i \mid h_j, c_j) + \lambda_a P_a(R_i \mid c_j) \tag{8}$$

where $\lambda_n$, $\lambda_t$ and $\lambda_a$ sum to 1 and are optimized with the EM algorithm on a development corpus.
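The interpolation of equation (8) and the estimation of its weights can be sketched as below. The EM update shown is the usual one for mixture weights with fixed components; the function names, the number of iterations and the development data are assumptions made for illustration.

```python
def interpolate(weights, probs):
    """Linear combination of sub-model probabilities, as in equation (8)."""
    return sum(w * p for w, p in zip(weights, probs))

def em_weights(dev_probs, iterations=50):
    """Estimate interpolation weights from per-event sub-model probabilities.

    dev_probs holds one tuple (P_n, P_t, P_a) per consultation of the
    development corpus, each being the probability a sub-model gave to the
    resource that was actually consulted.
    """
    k = len(dev_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in dev_probs:
            mix = interpolate(weights, probs)
            if mix == 0.0:
                continue  # event unexplained by every sub-model
            for m in range(k):
                # E-step: share of this event attributed to sub-model m.
                totals[m] += weights[m] * probs[m] / mix
        norm = sum(totals)
        if norm == 0.0:
            break
        weights = [t / norm for t in totals]  # M-step: renormalize the shares
    return weights

# Hypothetical development data in which the n-grams sub-model predicts best.
dev = [(0.9, 0.1, 0.01)] * 10
lambdas = em_weights(dev)
```

The a priori term guarantees a non-zero probability for resources that neither the n-grams nor the trigger sub-model can score, provided its weight stays positive.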
Thus, given a user $U_j$ and his/her history $h_j$, we first have to determine the community $c_j$ he/she belongs to. The probability of any available resource is then computed with the SGU learned for this community. The $N$ most likely resources are selected.
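The final selection step can be sketched in one function, assuming the SGU probabilities of the candidate resources have already been computed and stored in a dictionary (the scores below are hypothetical).

```python
import heapq

def top_n(scores, n):
    """Return the n resources with the highest probability, best first."""
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical SGU probabilities for the active user.
suggestions = top_n({"r1": 0.1, "r2": 0.5, "r3": 0.4}, 2)
```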
5 CONCLUSION AND
PERSPECTIVES
This paper describes a new web browsing assistant, based on usage and natural language processing. This approach avoids the difficult task of content, structure or format indexing and facilitates the management of heterogeneous resources. Similarities between SLMs and web browsing are put forward; therefore the integration of usual models from the statistical language modeling domain is investigated. The resulting model is a Statistical Grammar of Usage (SGU). As a single SGU may be inefficient, it has to be personalized. To tackle the sparsity of data, a preclassification of users into communities is performed. Community-based SGUs are then proposed. Moreover, a new model is introduced, managing variable distance-dependent triggers.
This new model is a first contribution to increasing the quality of resource prediction in web browsing. A second contribution consists in the design of the community-based SGU, which predicts the sequentiality of resources during navigation. Moreover, a community-based SGU builds an a posteriori structure of navigation based on the subjective but reliable measure of pertinence of a resource for a user. Consequently, it performs a personalized indexing of resources, based on usage analysis.
Collaborative filtering techniques, used to build communities, and triggers, used to suggest resources, have both proved their efficiency in their respective domains. A first perspective is the validation of the community-based SGU in terms of quality of predictions in web browsing. This evaluation can be performed by measuring the perplexity of the model. A second perspective is the use of the community-based SGU to compute a personalized classification of resources, depending not only on topics but also on the user’s preferences and context.
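For reference, perplexity is the inverse geometric mean of the probabilities that a model assigns to the events of a test corpus, so lower values indicate better predictions. A minimal sketch, assuming the probabilities assigned by the SGU to each actually consulted resource are available as a list (the function name and values are hypothetical):

```python
import math

def perplexity(probabilities):
    """Perplexity of a model over a test corpus of consulted resources."""
    if any(p == 0.0 for p in probabilities):
        return float("inf")  # one impossible event makes the corpus impossible
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

# A model that always hesitates between 4 equally likely resources.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```

A perplexity of 4 can be read as the model hesitating, on average, between 4 equally likely resources at each step.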
REFERENCES
Abramson, N. (1963). Information Theory and Coding. McGraw-Hill, New York.
Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press, New York.
Brun, A., Smaïli, K., and Haton, J. (2002). Contribution to topic identification by using word similarity. In ICSLP 2002.
Castagnos, S. and Boyer, A. (2006a). A client/server user-based collaborative filtering algorithm model and implementation. In Proceedings of ECAI 2006, Italy.
Castagnos, S. and Boyer, A. (2006b). Frac+: A distributed collaborative filtering model for client/server architectures. In WEBIST 2006, Portugal.
Chan, P. (1999). A non-invasive learning approach to building web user profiles. In KDD 1999 - Workshop on Web Usage Analysis and User Profiling, USA.
Herlocker, J., Konstan, J., Terveen, L., and Riedl, J. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53.
Jelinek, F. and Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In Wk. on Pattern Recognition in Practice, pages 381–397.
Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing: an Introduction to Natural Language Processing. Prentice-Hall.
Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187–228.
Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here.
Smaïli, K., Brun, A., Zitouni, I., and Haton, J. (1999). Automatic and manual clustering for large vocabulary speech recognition: A comparative study. In Eurospeech’99, Hungary.