the community-based Statistical Grammar of Usage.
Discussion and perspectives conclude this paper.
2 OUR APPROACH
Our web browsing tool helps users during a navigation
process: it suggests pertinent items to a specific user,
given his/her past navigation and context. The aim is to
compute the pertinence of any resource. The pertinence of
a resource is the interest a user has in it, and it is used
to predict which resources to suggest (the higher the
pertinence of a resource, the higher its probability of
being suggested).
First, we assume an implicit search, meaning that the
active user has no explicit query to formulate. Second, we
define a consultation as a sequence of one or more items
dedicated to a given search. A multi-navigation is the mix
of different consultations within a single browsing
process. A single consultation is called a mono-navigation.
A resource is any item (textual, audio, video or
multimedia document, web page, hyperlink, forum, blog,
website, etc.), viewed as an elementary and indivisible
entity without any information about its format, its
content or any semantic or topic indexing. The only data
describing a resource a priori is a normalized mark called
an identifier, which makes it possible to identify and
locate it.
Our approach relies on an analysis of usage. A usage is
any data, explicitly or implicitly left by the user during
navigation. For example, consultation history, click-stream
or log files are implicit data about the interest of the
visited items for the active user. We call appreciation any
measure of the user’s satisfaction. This measure can be
either explicit information, such as votes or annotations,
or an estimate computed from implicit data (Chan, 1999).
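As an illustration only, an appreciation could be derived as sketched below. The function name, the five-point vote scale, the use of dwell time as the implicit indicator and the 120-second saturation threshold are all assumptions made for this example; the text above does not specify how the estimate is computed.

```python
def appreciation(explicit_vote=None, dwell_seconds=None,
                 max_vote=5.0, saturation_seconds=120.0):
    """Return a satisfaction score in [0, 1], or None when nothing is known.

    Hypothetical sketch: an explicit vote takes priority; otherwise the
    score is estimated from dwell time (an assumed implicit indicator).
    """
    if explicit_vote is not None:
        # Explicit data: rescale the vote to [0, 1].
        return max(0.0, min(explicit_vote / max_vote, 1.0))
    if dwell_seconds is not None:
        # Implicit data: longer viewing is read as higher interest,
        # capped once the assumed saturation threshold is reached.
        return max(0.0, min(dwell_seconds / saturation_seconds, 1.0))
    # The user has left no usage data for this resource.
    return None
```

Returning `None` for a resource without any usage data keeps "unknown" distinct from "not appreciated".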
An advantage of our approach is that it only takes into
account a measure of the user’s interest in a given
resource, which is directly linked to the pertinence
criterion: the user’s satisfaction. Our approach computes
a personalized indexing of resources, not in terms of their
intrinsic nature but in terms of a more subjective yet more
reliable and pertinent criterion, i.e. the user’s context,
preferences and habits. As a result, this approach handles
heterogeneous resources with a single treatment.
The question is how to estimate the a priori pertinence
of a resource for a given user. The difficulty lies in the
sparsity of data: we have no appreciation of a resource if
the user has not seen it, and usually most resources have
not been seen by this user. As we design a personalized
tool, the pertinence cannot be a context-independent
measure (the context is both the user’s profile and
history).
To compute the a priori pertinence of a resource, we plan
to design a grammar of usage. Just as a grammar of language
is the set of rules describing the relations between words,
a grammar of usage is the set of rules describing the
relations between resources. A grammar of language
estimates whether a word is pertinent given the beginning
of a sentence. A grammar of usage makes it possible to
estimate whether a resource is relevant for a specific user
given his/her previous consultations.
There is no a priori grammar of usage, as the Internet is
a dynamic and changing environment. One way to cope with
the difficulty of designing an a priori grammar is to use a
statistical approach based on usage analysis. As huge usage
corpora are available (log files and clickstream), it
becomes possible to extract regularities in terms of
resource consultations. This statistical approach can be
investigated in a similar way to language modeling based on
statistical models (defining a Statistical Grammar of
Language).
The resulting grammar is called a Statistical Grammar of
Usage (SGU). It enables the computation of the probability
of a resource given the active user and his/her sequence of
navigation. This probability measures the pertinence of the
resource.
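By analogy with n-gram language models, a minimal sketch of such a grammar is a bigram model over resource identifiers. The function names, the restriction to a history of one resource, the add-alpha smoothing and the toy sessions are assumptions made for illustration, not the model defined in this paper; this simplified version also ignores the identity of the active user.

```python
from collections import Counter, defaultdict


def train_bigram_sgu(sessions):
    """Count how often each resource follows another in navigation sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return counts


def pertinence(counts, prev, candidate, vocab_size, alpha=1.0):
    """Estimate P(candidate | prev) with add-alpha smoothing (an assumption)."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return (following[candidate] + alpha) / (total + alpha * vocab_size)


def suggest(counts, prev, resources, k=3):
    """Rank candidate resources by estimated pertinence, highest first."""
    vocab_size = len(resources)
    ranked = sorted(
        resources,
        key=lambda r: pertinence(counts, prev, r, vocab_size),
        reverse=True,
    )
    return ranked[:k]


# Toy usage corpus: each list is one sequence of resource identifiers.
sessions = [
    ["r1", "r2", "r3"],
    ["r1", "r2", "r4"],
    ["r2", "r3", "r1"],
]
resources = ["r1", "r2", "r3", "r4"]
counts = train_bigram_sgu(sessions)
top = suggest(counts, "r2", resources, k=2)  # ["r3", "r4"]
```

Smoothing gives unseen resources a small non-zero probability, which is one common way to soften the data sparsity mentioned earlier.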
An SGU, if trained on the whole usage corpus, is a general
grammar since it is learned for all users in all contexts.
The accuracy of such a grammar is insufficient and,
furthermore, the presupposed logic and coherence between
users becomes too strong and unrealistic a hypothesis.
Given two users, it seems unlikely that they exhibit the
same resource consultation behavior: the SGU has to be
personalized. Nevertheless, learning a user-specific SGU
requires a large amount of data for each user, and it is
unrealistic to wait until enough data has been collected to
train it. This is why we determine groups of users with
similar behavior, called communities. We compute an SGU for
each community and design a community-based SGU. This
approach is used in statistical language modeling, where
the corpus is split into topic sub-corpora.
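The idea of one SGU per community can be sketched as follows, assuming the assignment of users to communities is already known. The names `train_community_sgus` and `community_of`, the bigram form of each grammar and the toy data are hypothetical choices for this example.

```python
from collections import Counter, defaultdict


def train_community_sgus(sessions_by_user, community_of):
    """Pool each user's sessions into his/her community and count transitions."""
    sgus = defaultdict(lambda: defaultdict(Counter))
    for user, sessions in sessions_by_user.items():
        model = sgus[community_of[user]]
        for session in sessions:
            for prev, nxt in zip(session, session[1:]):
                model[prev][nxt] += 1
    return sgus


def community_pertinence(sgus, community_of, user, prev, candidate):
    """Relative frequency of candidate after prev within the user's community."""
    model = sgus.get(community_of.get(user), {})
    following = model.get(prev)
    if not following:
        return 0.0  # no usage data for this context in this community
    return following[candidate] / sum(following.values())


# Hypothetical pre-computed communities and toy navigation sessions.
community_of = {"alice": "c1", "bob": "c1", "carol": "c2"}
sessions_by_user = {
    "alice": [["r1", "r2"]],
    "bob": [["r1", "r2"], ["r1", "r3"]],
    "carol": [["r1", "r4"]],
}
sgus = train_community_sgus(sessions_by_user, community_of)
p_alice = community_pertinence(sgus, community_of, "alice", "r1", "r2")  # 2/3
p_carol = community_pertinence(sgus, community_of, "carol", "r1", "r2")  # 0.0
```

In this toy corpus, alice benefits from bob's sessions because both belong to the same community, while carol's estimate depends only on her own community's data.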
Users are preclassified into a set of communities that are
coherent in terms of resource consultation behavior.
Collaborative filtering techniques are a promising means to
build coherent communities in terms of usage. The principle
of collaborative filtering techniques (Herlocker et al.,
2004) amounts to matching the active user to a set of users
having the same tastes, based on his/her preferences and
previously visited resources. This approach relies on the
hypothesis that users who like the same documents have the
same topics of interest. Another hypothesis is that people
have relatively constant likings. Thus, it is possible to
predict