USAGE BASED INDEXING OF WEB RESOURCES WITH NATURAL LANGUAGE PROCESSING

Armelle Brun and Anne Boyer

INRIA Lorraine - Université Nancy 2, France

Keywords:

Recommender systems, collaborative filtering, analysis of usage, statistical language model, resource indexing.

Abstract:

Due to the huge amount of information available via the Internet, identifying reliable and interesting items becomes more and more difficult and time consuming. This position paper describes our intended work in the framework of multimedia information retrieval by browsing techniques within web navigation. It relies on a usage-based indexing of resources: we ignore the nature, the content and the structure of resources. We describe a new approach taking advantage of the similarity between statistical modeling of language and document retrieval systems. A syntax of usage is computed, which defines a Statistical Grammar of Usage (SGU). A SGU enables the classification of resources in order to build a personalized navigation assistant tool. It relies both on collaborative filtering, to compute virtual communities of users, and on a new distance-dependent trigger model. The resulting SGU is a community-dependent SGU.

1 INTRODUCTION

On the Internet, the identification of reliable and interesting items becomes more and more difficult and time consuming, even for skilled people using dedicated tools such as powerful search engines. Due to the huge amount of online resources, the major difficulty is no longer to know whether a pertinent document is available, but to identify the most reliable and interesting items among the overwhelming stream of available information. A key factor of success in information retrieval and delivery is the development of powerful tools that are easy to use for a large audience.

Different approaches for resource retrieval are explored, such as content analysis, keyword indexing and identification, topic detection, etc. (Baeza-Yates and Ribeiro-Neto, 1999). A major difficulty inherent to such approaches is that one keyword may have different meanings depending on the user, his/her context and the history of his/her past navigations. Moreover, two different keywords may have similar meanings, depending on the context. Expressing a query is a difficult task for many people, and many research and industrial projects deal with query assistance. Furthermore, automatic indexing of multimedia resources is still a hard research problem. To cope with these difficulties, we decide to investigate another way, by ignoring the content, the nature, the format and the structure of resources.

This position paper describes our intended work, relying on past research both on collaborative filtering (Castagnos and Boyer, 2006a; Castagnos and Boyer, 2006b) and on statistical language modeling (Smaïli et al., 1999; Brun et al., 2002). We aim at providing a new web browsing tool based on an analysis of usage. This tool enables multimedia information retrieval by browsing techniques without expressing any query: users are modeled without requiring any elicitation of preferences. This approach easily manages heterogeneous items (video, audio, textual, multimedia) with a single treatment, whereas classical methods require dedicated tools for resource tagging.

We plan to extract frequent patterns of consultations by taking advantage of the analogy between language-based statistical modeling and resource retrieval. These frequent patterns allow the design of a syntax of usage, relying on the hypothesis that there is logic and coherence defining implicit "rules" inside a navigation. The resulting Statistical Grammar of Usage enables the classification, clustering and selection of resources to design personalized filtering.

In the next section, the problem of retrieving re-

sources when browsing is stated and our approach

based on the use of statistical language models is de-

tailed. The following section presents the most popu-

lar statistical language models and their appropriate-

ness to web browsing. An adaptation of the trigger

language model is introduced. Section 4 puts forward the community-based Statistical Grammar of Usage. Discussion and perspectives conclude this paper.

Brun A. and Boyer A. (2007). USAGE BASED INDEXING OF WEB RESOURCES WITH NATURAL LANGUAGE PROCESSING. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Web Interfaces and Applications, pages 220-225. DOI: 10.5220/0001278902200225. SciTePress.

2 OUR APPROACH

Our web browsing tool helps users during a navigation process: it suggests pertinent items to a specific user, given his/her past navigation and context. The aim is to compute the pertinence of any resource. The pertinence of a resource is the interest of a user for it; it is used to predict which resources to suggest (the higher the pertinence of a resource, the higher its probability of being suggested).

First, we hypothesize an implicit search, meaning that the active user has no explicit query to formulate. Secondly, we define a consultation as a sequence of one or more items dedicated to a given search. A multi-navigation is the mix of different consultations within a single browsing process. A single consultation is called a mono-navigation.

A resource is any item (textual, audio, video or

multimedia document, web page, hyperlink, forum,

blog, website, etc.), viewed as an elementary and in-

divisible entity without any information about its for-

mat, its content or any semantic or topic indexing.

The only data describing a resource a priori is a normalized mark called an identifier, which makes it possible to identify and locate it.

Our approach relies on an analysis of usage. A usage is any data, explicitly or implicitly left by the user during navigation. For example, consultation history, click-stream or log files are implicit data about the interest of the visited items for the active user. We call appreciation any measure of the user's satisfaction. This measure can be either explicit information, such as votes or annotations, or an estimation computed from implicit data (Chan, 1999).

An advantage of our approach is that it only takes into account a measure of the user's interest for a given resource, which is directly linked to the pertinence criterion: the user's satisfaction. Our approach computes a personalized indexing of resources, not in terms of their intrinsic nature but in terms of a more subjective yet more reliable and pertinent criterion, i.e. the user's context, preferences and habits. As a consequence, this approach manages heterogeneous resources with a single treatment.

The question is how to estimate the a priori pertinence of a resource for a given user. The difficulty lies in the sparsity of data: we do not have any appreciation of a resource if the user has not seen it, and usually most resources have not been seen by this user. As we design a personalized tool, the pertinence cannot be a context-independent measure (the context is both the user's profile and history).

To compute the a priori pertinence of a resource, we plan to design a grammar of usage. Just as a grammar of language is the set of rules describing the relations between words, a grammar of usage is the set of rules describing the relations between resources. A grammar of language estimates whether a word is pertinent given the beginning of a sentence. A grammar of usage makes it possible to estimate whether a resource is relevant for a specific user given his/her previous consultations.

There is no a priori grammar of usage, as the Internet is a dynamic and moving environment. A means to cope with the difficulty of designing an a priori grammar is the use of a statistical approach based on usage analysis. As huge usage corpora are available (log files and clickstream), it is possible to make regularities in resource consultations explicit. This statistical approach can be investigated in a similar way to language modeling based on statistical models (defining a Statistical Grammar of Language).

The resulting grammar is called a Statistical

Grammar of Usage (SGU). It enables the computa-

tion of the probability of a resource given the active

user and his/her sequence of navigation. This proba-

bility measures the pertinence of the resource.

A SGU, if trained on the whole usage corpus, is a general grammar since it is learned for all users in all contexts. The accuracy of such a grammar is insufficient and, furthermore, the presupposed logic and coherence between users becomes too strong and unrealistic a hypothesis. Given two users, it seems unlikely that they exhibit the same resource consultation behavior: the SGU has to be personalized. Nevertheless, learning a user-specific SGU requires a large amount of data for each user, and it is unrealistic to wait until enough data has been collected to train it. This is the reason why we determine groups of users with similar behavior, called communities. We compute a SGU for each community and design a community-based SGU. A similar approach is used in statistical language modeling, where the corpus is split into topic sub-corpora.

Users are preclassified into a set of coherent communities, in terms of resource consultation behavior. Collaborative filtering techniques are a suitable means to build coherent communities in terms of usage. The principle of collaborative filtering techniques (Herlocker et al., 2004) amounts to matching the active user with a set of users having the same tastes, based on his/her preferences and previously visited resources. This approach relies on the hypothesis that users who like the same documents have the same topics of interest. Another hypothesis is that people have relatively constant tastes. Thus, it is possible to predict resources likely to match a user's expectations by taking advantage of the experience of his/her community.

A first comment on usual collaborative filtering techniques is that the structure of navigation is ignored. However, this aspect can be crucial in some applications such as web browsing. For example, a user may not like a resource because he/she has not previously read a prerequisite resource. The SGU therefore suggests a resource when it becomes pertinent for a user, for example when he/she has read all its prerequisites. As statistical language models emphasize the order of words in sentences, it seems interesting to determine whether such models and collaborative filtering can be used together to improve the quality of suggestions.

3 STATISTICAL LANGUAGE MODELS

3.1 Overview

The role of a statistical language model (SLM) is to

assign a likelihood to a given sentence (or sequence

of words) in a language (Jelinek and Mercer, 1980;

Rosenfeld, 2000). A SLM is deﬁned as a set of proba-

bilities associated to sequences of words. These prob-

abilities reﬂect the likelihood of those sequences.

SLM are widely used in various natural language

applications such as optical character recognition, au-

tomatic speech recognition, etc.

Let the word sequence $W = w_1, \ldots, w_S$. The probability of W is computed as the product of the conditional probabilities of each word $w_i$ in the sequence:

$$P(W) = \prod_{i=1}^{S} P(w_i \mid h_i) = \prod_{i=1}^{S} P(w_i \mid w_1, \ldots, w_{i-1}) \tag{1}$$

where $w_i$ is the $i$-th word of W and $h_i$ is the history of $w_i$. To estimate these probabilities, a vocabulary $V = \{w_1, \ldots, w_{|V|}\}$ is stated. The probabilities of sequences of words are trained on a text corpus, the training corpus.
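As a small worked instance of equation (1), the product can be written out for a sequence of three words. The length 3 is chosen only for illustration and is not taken from the paper:

$$P(w_1, w_2, w_3) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)$$

Each factor conditions on the full history of the word, which is why the history has to be shortened in practice as soon as sequences become long.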

3.2 Advantage of SLM for Web Browsing

Web browsing and statistical language modeling are similar in several respects. First, statistical language modeling uses a vocabulary made up of words. This set can be viewed as similar to the set of resources R of the web. Then, the text corpus is made up of sentences of words, which can be viewed as similar to the sequences of consultations of the usage corpus. A sequence of S words in a sentence is similar to a sequence of consultations of S resources. Finally, the presence of a word in a sentence mainly depends on its previous words, just as the consultation of a resource mainly depends on the preceding consultations.

Given these similarities, we can naturally investigate

the exploitation of models used in statistical language

modeling into a web browsing assistant. As noticed

in the previous section, these models have the char-

acteristic that the order of the elements in the history

is crucial. This aspect may be important for speciﬁc

resources in web browsing.

However, web browsing and natural language processing have two major differences. The first one is that a user may mix different queries within a single history ("multi-navigation"), whereas it is unrealistic to mix different sentences when speaking or writing. This first remark leads us to consider a generalization of SLM that takes "multi-navigation" into account in the browsing process. The second one is that natural language exhibits stronger constraints: each word in a sentence is important, and deleting or adding a word may change the meaning of the sentence. Web browsing is not so sensitive, and adding or deleting a specific resource within a navigation may have no impact. We therefore have to consider permissive models, able to manage less constrained histories.

3.3 N-grams Language Models

Due to computational constraints and to the reliability of probability estimates, the whole history $h_i$ of a word $w_i$ cannot be systematically used to compute the probability of W. Classical SLM aim at reducing the size of the history while not decreasing performance.

n-grams models (Jurafsky and Martin, 2000) reduce the history of words to their n − 1 previous words. These models are the most commonly used in natural language applications. The probability of a given word $w_i$ given history $h_i$ is computed as follows:

$$P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{N(w_{i-n+1} \ldots w_{i-1}, w_i)}{N(w_{i-n+1} \ldots w_{i-1})} \tag{2}$$

where N(.) is the number of occurrences of the argument in the training corpus.

A n-grams model can be directly used in web browsing assistance. In the previous section, we put forward

that the quality of the model will be increased if it is

dedicated to a community and trained on the corre-

sponding community usage corpus. Thus, the usage

corpus is split into community usage corpora and a

model is trained on each community corpus.

WEBIST 2007 - International Conference on Web Information Systems and Technologies


Let a community $c_k$ and a sequence of consultations of resources $h_i = R_1, \ldots, R_{i-1}$. The n-grams model computes the probability for each resource $R_i \in R$:

$$P_n(R_i \mid R_{i-n+1}, \ldots, R_{i-1}, c_k) = \frac{N_{c_k}(R_{i-n+1}, \ldots, R_{i-1}, R_i)}{N_{c_k}(R_{i-n+1}, \ldots, R_{i-1})} \tag{3}$$

where $N_{c_k}(.)$ is the number of occurrences of the parameter in the community usage corpus $c_k$. The history $h_i$ has been reduced to the n − 1 last resources consulted; other resources are discarded. Thus, this model assumes that the consultation of a resource $R_i$ does not mainly depend on resources consulted far from $R_i$.
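The relative-frequency estimate of equation (3) can be illustrated with a minimal sketch. It assumes that a community usage corpus is available as a list of sessions, each session being an ordered list of resource identifiers; the function names and the toy identifiers (`r1`, `r2`, ...) are hypothetical and not part of the paper.

```python
from collections import Counter


def train_ngram_counts(sessions, n):
    """Count n-grams and their (n-1)-resource histories in one community corpus."""
    ngram_counts = Counter()
    history_counts = Counter()
    for session in sessions:
        for i in range(n - 1, len(session)):
            history = tuple(session[i - n + 1:i])
            ngram_counts[history + (session[i],)] += 1
            # Histories are counted only when a resource follows them,
            # so the estimates sum to 1 over the observed continuations.
            history_counts[history] += 1
    return ngram_counts, history_counts


def ngram_probability(resource, history, ngram_counts, history_counts, n):
    """Relative-frequency estimate; returns 0.0 for a history never seen in training."""
    context = tuple(history[-(n - 1):]) if n > 1 else ()
    denominator = history_counts[context]
    if denominator == 0:
        return 0.0
    return ngram_counts[context + (resource,)] / denominator


# Toy community corpus (assumption): three sessions of one community.
community_sessions = [
    ["r1", "r2", "r3"],
    ["r1", "r2", "r4"],
    ["r1", "r2", "r3"],
]
ngrams, histories = train_ngram_counts(community_sessions, n=3)
p = ngram_probability("r3", ["r1", "r2"], ngrams, histories, n=3)
```

In this toy corpus the history (`r1`, `r2`) is followed by `r3` in two sessions out of three. The sketch applies no smoothing, so any unseen sequence receives a zero probability, which is one practical reason for combining this model with other sub-models.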

As previously mentioned, adding or deleting a resource in a sequence of consultations has a lower influence on the result of the search than adding or deleting a word in a sentence. Thus, this model does not ideally match our retrieval problem, since the history considered is the exact sequence of consultations $R_{i-n+1} \ldots R_{i-1}$, which may be too restrictive in the general case. However, this model may be suitable for frequent sequences of consultations, which can be considered as "patterns of consultation". They are assigned a high probability, thus increasing the probability of resources inside such sequences. It would be interesting to take such "patterns of consultation" into account in a more adequate way.

As n-grams models exhibit strong constraints, we are also interested in more permissive models. Trigger-based language models seem to be more adequate for less constrained histories such as navigation.

3.4 Trigger-based Language Models

Trigger-based models (Rosenfeld, 1996) aim at considering long-distance dependence between two words ($w_j$ and $w_i$ for instance). Dependence is measured by Mutual Information (MI) (Abramson, 1963). This measure can easily integrate long-distance dependence by using a distance parameter d: d is the maximum number of words occurring between $w_j$ and $w_i$, so a window of d words is considered. MI between words $w_j$ and $w_i$, in a window of d words, is computed as:

$$MI(w_j, w_i, d) = \log \frac{P_d(w_j, w_i)}{P(w_j)\, P(w_i)} \tag{4}$$

where $P_d(w_j, w_i)$ is the probability of $w_j$ preceding $w_i$ at a distance at most d, in the training corpus.

A couple $(w_j, w_i)$ with a high MI value means that $w_j$ and $w_i$ are highly correlated and that the presence of $w_j$ raises the probability of occurrence of $w_i$, at a maximal distance of d words. $(w_j, w_i)$ is named a trigger. This model considers only highly correlated pairs of words (corresponding to high MI values); useless pairs are discarded.
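The selection of triggers by windowed mutual information can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: probabilities are estimated as plain relative frequencies, the distance between two items is the number of items occurring between them, and the names `windowed_mi`, `select_triggers` and `threshold` are hypothetical.

```python
import math
from collections import Counter


def windowed_mi(sequences, d):
    """MI of ordered pairs (a, b) where a precedes b with at most d items in between."""
    unigram = Counter()
    pair = Counter()
    for seq in sequences:
        unigram.update(seq)
        for pos, a in enumerate(seq):
            # The slice covers the items located 0..d positions after the next one.
            for b in seq[pos + 1:pos + d + 2]:
                pair[(a, b)] += 1
    total_items = sum(unigram.values())
    total_pairs = sum(pair.values())
    mi = {}
    for (a, b), count in pair.items():
        p_ab = count / total_pairs
        p_a = unigram[a] / total_items
        p_b = unigram[b] / total_items
        mi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return mi


def select_triggers(mi, threshold):
    """Keep only the pairs whose MI reaches the (assumed) threshold."""
    return {pair for pair, value in mi.items() if value >= threshold}


# Toy corpus (assumption): "a" usually precedes "b".
sequences = [["a", "x", "b"], ["a", "b", "y"], ["b", "a", "z"]]
mi = windowed_mi(sequences, d=1)
triggers = select_triggers(mi, threshold=0.5)
```

Because only ordered pairs are counted, the value obtained for (`a`, `b`) differs from the one obtained for (`b`, `a`), so the order of occurrence is preserved by the selected triggers.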

In our web browsing assistant tool, the trigger model is made up of triggers of resources $(R_j, R_i)$. The consultation of $R_j$ triggers the consultation of $R_i$ at a maximal distance of d resources. As the MI measure is not symmetric ($MI(R_j, R_i) \neq MI(R_i, R_j)$), this model integrates the order between resources, which may be crucial for specific resources.

The advantage of this model is the long-distance dependence between resources. In a consultation, two resources can be viewed at various distances without changing the meaning of the consultation. Trigger models make it possible to model this kind of influence, when the distance between items is not discriminant but the order of occurrence is meaningful. Such a model is less constrained than n-grams models and seems to be adequate for the navigation problem.

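Similarly to the n-grams model, a trigger model is developed for each community $c_k$. MI values are computed for each couple of resources and for each community. A set of the most related triggers is extracted for each community $c_k$. This set is called $S_{c_k}$.

The probability of a resource $R_i$, given the community $c_k$, its corresponding set of triggers $S_{c_k}$ and the sequence of consultations of resources $h_i = R_1, \ldots, R_{i-1}$, is:

$$P_t(R_i \mid h_i, c_k) = \frac{\sum_{R_j \in h_i} \delta(R_j, R_i)}{\sum_{R_j \in h_i} \sum_{R_m \in R} \delta(R_j, R_m)} \tag{5}$$

with

$$\delta(R_j, R_i) = \begin{cases} 1 & \text{if } (R_j, R_i) \in S_{c_k} \text{ and } d(R_j, R_i) \le d \\ 0 & \text{otherwise} \end{cases}$$

where $d(R_j, R_i)$ is the distance between $R_j$ and $R_i$ in $h_i$.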
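Equation (5) can be read as a normalized count of the triggers fired by the history. The sketch below is one possible reading, under the assumption that the distance between a history item and the resource to predict is the number of resources consulted in between; the names `delta`, `gap` and `trigger_probability` are hypothetical.

```python
def gap(history, j):
    """Number of resources consulted between history[j] and the resource to predict."""
    return len(history) - 1 - j


def trigger_probability(candidate, history, triggers, resources, d):
    """Share of fired triggers that point to `candidate`, following equation (5)."""

    def delta(j, target):
        # 1 when (history[j], target) is a trigger and lies within the window d.
        return 1 if (history[j], target) in triggers and gap(history, j) <= d else 0

    positions = range(len(history))
    numerator = sum(delta(j, candidate) for j in positions)
    denominator = sum(delta(j, r) for j in positions for r in resources)
    return numerator / denominator if denominator else 0.0


# Toy trigger set of one community (assumption).
triggers = {("r1", "r3"), ("r2", "r3"), ("r2", "r4")}
resources = ["r1", "r2", "r3", "r4"]
history = ["r1", "r2"]
p_r3 = trigger_probability("r3", history, triggers, resources, d=5)
p_r3_short = trigger_probability("r3", history, triggers, resources, d=0)
```

With a wide window both `r1` and `r2` fire a trigger towards `r3`; with `d=0` only the last consulted resource can fire, which changes the resulting distribution.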

3.5 Distance-dependent Trigger Model

State of the art trigger models, as previously pre-

sented, aim at considering long distance relations

(distance between 0 and d). Each couple of resources

appearing ”frequently” at a maximal distance d is set

in the model.

However, this kind of relation between resources is too general, and we assume that finer relations are present in the corpus. Consider two resources $R_j$ and $R_i$ always cooccurring at a maximal distance u, where u << d. During training, each cooccurrence (at a distance at most u) of this couple is taken into account as a cooccurrence at a maximal distance d, and the corresponding MI value is computed. If this MI value is high, the pair is selected as a trigger pair. During test, if $R_j$ occurs, the probability of $R_i$ is increased as long as d resources have not been consulted.


This trigger is misused: $R_j$ and $R_i$ appear at a distance at most u during training, so this distance u should also be considered during test: the probability of $R_i$ should be increased only at a distance lower than u.

Thus, we assume that taking into account finer relations, by using the actual training distance between $R_j$ and $R_i$, will correspond to a better modeling of relations between resources and thus increase the quality of the trigger model.

We propose a model able to consider several kinds

of relations, such as

1. The above relation: two resources are mainly con-

sulted at a distance lower than u with u ≤ d.

2. The converse relation : two resources are mainly

consulted at a distance larger than l where 0 ≤ l

3. Two resources are mainly consulted at a distance

between l and u with l ≤ u ≤ d

However, fixing the same value of l and the same value of u for all triggers can be suboptimal: these values obviously depend on both resources of the trigger.

3.5.1 Computing Optimal Values for l and u

Given a community $c_k$ and a pair $(R_j, R_i)$, the optimal values $l^*$ and $u^*$ are the ones maximizing the mutual information:

$$(l^*, u^*) = \underset{l, u}{\operatorname{argmax}}\; MI_{c_k}(R_j, R_i, l, u) \tag{6}$$

where l and u range from 0 to d. $MI_{c_k}(R_j, R_i, l, u)$ is the mutual information of resources $R_j$ and $R_i$ at a distance ranging from l to u in the community $c_k$ and is computed as follows:

$$MI_{c_k}(R_j, R_i, l, u) = \log \frac{P_{c_k,l,u}(R_j, R_i)}{P_{c_k,l,u}(R_j)\, P_{c_k,l,u}(R_i)} \tag{7}$$

$P_{c_k,l,u}(R_j, R_i)$ is the probability of cooccurrence of $R_j$ and $R_i$ at a distance ranging from l to u in the community corpus $c_k$.

Let us notice that the MI value is not reliable if the values in the denominator are low. Indeed, when those values are too low, the MI value is abnormally high and does not represent the real correlation between the two resources. Thus, MI will not be computed for pairs with low denominator values.
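The search for the optimal distance bounds can be sketched as an exhaustive scan over all windows [l, u]. Several points are assumptions made for the illustration: the joint and marginal probabilities are estimated from the ordered pairs whose gap lies in [l, u], the reliability rule is implemented as a hypothetical `min_count` threshold on the marginal counts, and all names are illustrative.

```python
import math
from collections import Counter


def gap_pair_counts(sessions, d):
    """counts[g][(a, b)]: times a precedes b with exactly g resources in between."""
    counts = [Counter() for _ in range(d + 1)]
    for session in sessions:
        for pos, a in enumerate(session):
            for g in range(d + 1):
                nxt = pos + 1 + g
                if nxt >= len(session):
                    break
                counts[g][(a, session[nxt])] += 1
    return counts


def best_window(sessions, first, second, d, min_count=2):
    """Return (MI, l, u) maximizing MI for the pair (first, second), or None."""
    counts = gap_pair_counts(sessions, d)
    best = None
    for l in range(d + 1):
        for u in range(l, d + 1):
            window = Counter()
            for g in range(l, u + 1):
                window.update(counts[g])
            total = sum(window.values())
            joint = window[(first, second)]
            left = sum(c for (a, _), c in window.items() if a == first)
            right = sum(c for (_, b), c in window.items() if b == second)
            # Skip windows where the pair is absent or the marginals are unreliable.
            if joint == 0 or min(left, right) < min_count:
                continue
            mi = math.log((joint / total) / ((left / total) * (right / total)))
            if best is None or mi > best[0]:
                best = (mi, l, u)
    return best


# Toy community corpus (assumption): "b" always follows "a" with one resource in between.
sessions = [
    ["a", "x", "b", "y"],
    ["a", "z", "b", "w"],
    ["a", "y", "b", "x"],
]
best = best_window(sessions, "a", "b", d=2)
```

On this toy corpus the scan selects l = u = 1, the only gap at which the pair actually cooccurs, whereas wider windows dilute the mutual information.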

3.5.2 Formalization of the New Trigger Model

The trigger model we propose here is made up of highly correlated and distance-dependent pairs: a lower value of distance l and an upper value of distance u are considered for each pair. Given the trigger $(R_j, R_i, l, u)$, the probability of $R_i$ is increased if $R_j$ occurs in the history $h_i$ and the distance $d(R_j, R_i)$ between $R_j$ and $R_i$ is between l and u.

Thus, given the community $c_k$, the corresponding set of triggers $S_{c_k}$ and a history $h_i$, the probability assigned to a given resource $R_i$ by the trigger model is defined as in equation (5), with

$$\delta(R_j, R_i) = \begin{cases} 1 & \text{if } (R_j, R_i) \in S_{c_k} \text{ and } l \le d(R_j, R_i) \le u \\ 0 & \text{otherwise} \end{cases}$$

Both n-grams and distance-dependent trigger models are candidates for integration into a web browsing tool. A n-grams model computes the probability of sequences of consultations, whereas a trigger model extracts pairs of distant resources. Consequently, both are interesting to achieve our goal and will be integrated in the community-based SGU we propose.

4 TOWARDS A COMMUNITY-BASED SGU

The SGU we propose has the advantage of considering both the community of the active user and his/her consultation history, whereas state of the art models usually exploit the set of consultations. The use of this model relies on two steps:

1. Determination of the community $c_k$ of user $U$.

2. Computation of the probability of each resource $R_i$, given $c_k$ and the history $h_i$ of $U$.

4.1 Community Determination

The objective is to compute a set of user communities based on an analysis of usage. To achieve this goal, we use collaborative filtering techniques. The set of users is split into classes by using a recursive k-means-like algorithm (Castagnos and Boyer, 2006a); the similarity between two users is estimated as the mean of the distance for each commonly voted resource.

The whole corpus is then split into a set of community sub-corpora. Each community corpus is made up of the usage of all users in the community. A user is then assigned to the closest community using the same similarity measure.
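The community determination step can be illustrated with a simplified sketch. It is a plain k-means-like loop, not the recursive algorithm of Castagnos and Boyer (2006a); it assumes that each user is described by a dictionary of votes per resource, that the distance is the mean absolute vote difference over commonly voted resources, and that the first k users (in name order) serve as initial centroids. All names are hypothetical.

```python
def distance(votes_a, votes_b):
    """Mean absolute vote difference over commonly voted resources (inf if none)."""
    common = votes_a.keys() & votes_b.keys()
    if not common:
        return float("inf")
    return sum(abs(votes_a[r] - votes_b[r]) for r in common) / len(common)


def closest(votes, centroids):
    """Index of the community whose centroid is closest to the given votes."""
    return min(range(len(centroids)), key=lambda k: distance(votes, centroids[k]))


def mean_votes(members):
    """Centroid of a community: mean vote per resource over its members."""
    sums, counts = {}, {}
    for votes in members:
        for r, v in votes.items():
            sums[r] = sums.get(r, 0.0) + v
            counts[r] = counts.get(r, 0) + 1
    return {r: sums[r] / counts[r] for r in sums}


def k_means_like(users, k, iterations=10):
    """Alternate assignment and centroid update for a fixed number of iterations."""
    names = sorted(users)
    centroids = [dict(users[name]) for name in names[:k]]
    assignment = {}
    for _ in range(iterations):
        assignment = {name: closest(users[name], centroids) for name in names}
        for c in range(k):
            members = [users[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = mean_votes(members)
    return assignment, centroids


# Toy votes on a 1-5 scale (assumption).
users = {
    "u1": {"r1": 5, "r2": 1},
    "u2": {"r1": 1, "r2": 5},
    "u3": {"r1": 4, "r2": 2},
    "u4": {"r1": 2, "r2": 5},
}
assignment, centroids = k_means_like(users, k=2)
new_user_community = closest({"r1": 5}, centroids)
```

The last line shows how a new user is attached to the closest community with the same distance, which mirrors the assignment step described above.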

4.2 Probability Computation

Given the community $c_k$ of user $U$ and his/her history $h_i$, the computation of the probability of a resource $R_i$ relies on three sub-models based on the language models presented in Section 3.

The first sub-model computes the probability $P_n(R_i \mid h_i, c_k)$ by exploiting the probabilities of resource sequences of the n-grams model. The second sub-model is the distance-dependent trigger model; it computes the probability $P_t(R_i \mid h_i, c_k)$. The last sub-model is devoted to resources out of the training corpus: an a priori probability $P_0(R_i \mid c_k)$ is assigned to each resource $R_i \in R$.

The resulting model, which can be viewed as a community-based Statistical Grammar of Usage, computes the linear combination of the three previously described sub-models:

$$P(R_i \mid h_i, c_k) = \lambda_1 P_n(R_i \mid h_i, c_k) + \lambda_2 P_t(R_i \mid h_i, c_k) + \lambda_3 P_0(R_i \mid c_k) \tag{8}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ sum up to 1 and are optimized with the EM algorithm on a development corpus.
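The linear combination of equation (8) and the estimation of its weights can be sketched as follows. This is the standard EM update for interpolation weights, given here as an illustration; it assumes that, for each event of the development corpus, the probability assigned by each sub-model to the resource actually consulted is already available. The names and the toy values are hypothetical.

```python
def interpolate(weights, component_probs):
    """Linear combination of the sub-model probabilities, as in equation (8)."""
    return sum(w * p for w, p in zip(weights, component_probs))


def em_weights(dev_probs, iterations=50):
    """Estimate interpolation weights on development data with EM.

    dev_probs: one tuple per development event, holding the probability that
    each sub-model assigned to the resource actually consulted.
    """
    k = len(dev_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for probs in dev_probs:
            mix = interpolate(weights, probs)
            if mix == 0.0:
                continue
            # E-step: posterior responsibility of each sub-model for this event.
            for m in range(k):
                expected[m] += weights[m] * probs[m] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: the new weights are the normalized responsibilities.
        weights = [e / total for e in expected]
    return weights


# Toy development data (assumption): n-grams, trigger and a priori probabilities.
dev_probs = [(0.5, 0.1, 0.01), (0.4, 0.2, 0.01), (0.0, 0.3, 0.01)]
weights = em_weights(dev_probs)
score = interpolate(weights, (0.2, 0.1, 0.01))
```

The weights stay non-negative and sum to 1 at every iteration, so the interpolated value remains a probability whenever the three sub-models are themselves normalized.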

Thus, given a user $U$ and his/her history $h_i$, we first have to determine the community $c_k$ he/she belongs to. The probability of any available resource is then computed given the SGU learned for this community. The N most likely resources are selected.
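The final selection step can be sketched in a few lines, assuming the SGU probabilities of the candidate resources are held in a dictionary keyed by resource identifier. The name `top_n` and the scores are hypothetical.

```python
import heapq


def top_n(scores, n):
    """Return the n resource identifiers with the highest probability, best first."""
    return heapq.nlargest(n, scores, key=scores.get)


# Toy SGU output for one user and history (assumption).
scores = {"r1": 0.1, "r2": 0.6, "r3": 0.3}
suggestions = top_n(scores, 2)
```

Using a heap-based selection avoids sorting the whole set of resources when N is small compared with the number of available resources.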

5 CONCLUSION AND PERSPECTIVES

This paper describes a new web browsing assistant, based on usage and natural language processing. This approach avoids the difficult task of content, structure or format indexing and facilitates the management of heterogeneous resources. Similarities between SLM and web browsing are put forward; the integration of usual models from the statistical language modeling domain is therefore investigated. The resulting model is a Statistical Grammar of Usage (SGU). As a single SGU may be inefficient, it has to be personalized. To tackle the sparsity of data, users are preclassified into communities. Community-based SGU are then proposed. Moreover, a new model is introduced, managing variable distance-dependent triggers.

This new model is a ﬁrst contribution to increase

quality of prediction of resources in web brows-

ing. A second contribution consists in the design

of community-based SGU, predicting the sequen-

tiality of resources during navigation. Moreover, a

community-based SGU builds an a posteriori struc-

ture of navigation based on the subjective but reliable

measure of pertinence of a resource for a user. Con-

sequently it performs a personalized indexing of re-

sources, based on usage analysis.

Collaborative ﬁltering techniques used to build

communities and triggers used to suggest resources

have both proved their efﬁciency in their respective

domain. A ﬁrst perspective is the validation of the

community-based SGU in terms of quality of predic-

tions in web browsing. This evaluation can be per-

formed by measuring the perplexity of the model. A

second perspective is the use of the community-based

SGU to compute a personalized classiﬁcation of re-

sources, depending not only on topics but also on

user’s preferences and context.
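The perplexity mentioned above can be computed from the probabilities the model assigns to the resources actually consulted in a test sequence. The sketch below uses the usual definition (exponential of the average negative log-probability); the function name is hypothetical.

```python
import math


def perplexity(probabilities):
    """Perplexity over a test sequence, given the model probability of each observed resource."""
    if any(p <= 0.0 for p in probabilities):
        # A zero probability on an observed event makes the perplexity infinite.
        return float("inf")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# A model that always hesitates uniformly between four resources.
uniform_case = perplexity([0.25, 0.25, 0.25, 0.25])
```

A lower perplexity means that the model is, on average, less surprised by the observed consultations; a model choosing uniformly among four resources has a perplexity of 4.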

REFERENCES

Abramson, N. (1963). Information Theory and Coding. McGraw-Hill, New York.

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press, New York.

Brun, A., Smaïli, K., and Haton, J. (2002). Contribution to topic identification by using word similarity. In ICSLP2002.

Castagnos, S. and Boyer, A. (2006a). A client/server user-based collaborative filtering algorithm model and implementation. In Proceedings of ECAI 2006, Italy.

Castagnos, S. and Boyer, A. (2006b). Frac+: A distributed collaborative filtering model for client/server architectures. In WEBIST 2006, Portugal.

Chan, P. (1999). A non-invasive learning approach to building web user profiles. In KDD 1999 - Workshop on Web Usage Analysis and User Profiling, USA.

Herlocker, J., Konstan, J., Terveen, L., and Riedl, J. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53.

Jelinek, F. and Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In Wk. on Pattern Recognition in Practice, pages 381–397.

Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing: an Introduction to Natural Language Processing. Prentice-Hall.

Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187–228.

Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here.

Smaïli, K., Brun, A., Zitouni, I., and Haton, J. (1999). Automatic and manual clustering for large vocabulary speech recognition: A comparative study. In Eurospeech'99, Hungary.
