COMPLEX USER BEHAVIORAL NETWORKS AT ENTERPRISE INFORMATION SYSTEMS

Peter Géczy, Noriaki Izumi, Shotaro Akaho and Kôiti Hasida

National Institute of Advanced Industrial Science and Technology (AIST)

Keywords:

Complex networks, web behavior, behavior segmentation, navigation space, knowledge workers, enterprise

systems, information services, data mining.

Abstract:

We analyze human behavior on a large-scale enterprise information system. Employing a novel framework that efficiently captures complex spatiotemporal dimensions of human dynamics in electronic spaces, we present findings about knowledge workers’ behavior on an enterprise intranet portal. The browsing behavior of knowledge workers resembles a complex network with significant concentration on navigational starters. The common browsing strategy relies on knowledge of the starting navigation point and recollection of the traversal pathway to the target. The complex traversal network topology has a small number of behavioral hubs that concentrate and disseminate the browsing pathways. The human browsing network topology, however, does not match the link topology of the web environment. Knowledge workers generally underutilize the available resources, have focused interests, and exhibit little exploratory behavior.

1 INTRODUCTION

Elucidation of human dynamics in electronic environments is of central importance in personalization technologies (Baraglia and Silvestri, 2007), recommender systems (Adomavicius and Tuzhilin, 2005), and collaborative filtering engines (Jin et al., 2006). The corporate sector has been exploring customer web behavior primarily for commercial purposes (Park and Fader, 2004), (Moe, 2003) and search ranking (Agichtein et al., 2006). Little attention has been devoted to the study of user behavior in enterprise internal information environments. This study presents some of the few available results on knowledge worker behavior on a large enterprise intranet portal.

It has been reported that individual human actions in web environments follow non-Poisson statistics characterized by long tails (Dezso et al., 2006), (Vazquez et al., 2006). The long tail attributes of human dynamics (Barabasi, 2005) are equivalent to those observed in complex networks (Newman, 2003), (Newman et al., 2005), (Caldarelli, 2007). A common property of complex networks is that the vertex connectivities follow a long tail distribution. A long-tailed power law has been detected in the temporal characteristics of human information access on the web (Dezso et al., 2006). Similar results have been reported from workload studies of search engines and server systems (Bedue et al., 2006), (Schroeder and Harchol-Balter, 2006). The long tails of human interactions have been modeled by power-law distributions (Vazquez et al., 2006), (Vazquez, 2005), lognormal and Pareto distributions (Downey, 2005), or the Zipf distribution (Leskovec et al., 2005).

This work focuses on frequency rather than temporal characteristics of human dynamics in electronic environments, and targets traversal networks of knowledge worker intranet browsing behavior. Applying a novel analytic and exploratory framework, we present behavioral findings on starters, attractors, singletons, and behavioral hubs.

2 CONCEPT PRESENTATION

User browsing interactions in web environments are reasonably represented by the clickstream sequences. The clickstream sequences of page transitions are segmented into sessions and subsequences. The sessions outline tasks of various complexities, undertaken by the users, that are further divided into the subtasks represented by the subsequences. Segmentation is done according to the users’ temporal activity characteristics. Consider the sequence of the form $\{(p_i, d_i)\}_i$, where $p_i$ denotes the visited page $\mathrm{URL}_i$ and $d_i$ denotes a delay between the consecutive views $p_i \to p_{i+1}$. User browsing activity $\{(p_i, d_i)\}_i$ is divided into subelements according to the periods of inactivity $d_i$ satisfying certain criteria.

Géczy P., Izumi N., Akaho S. and Hasida K. (2008). COMPLEX USER BEHAVIORAL NETWORKS AT ENTERPRISE INFORMATION SYSTEMS. In Proceedings of the Tenth International Conference on Enterprise Information Systems - HCI, pages 233-239. DOI: 10.5220/0001700502330239. SciTePress.

Definition 1. (Session, Subsequence, Train)

Let $\{(p_i, d_i)\}_i$ be a sequence of pages $p_i$ with delays $d_i$ between consecutive transitions $p_i \to p_{i+1}$.

Browsing session is a sequence $B = \{(p_i, d_i)\}_i$ where each $d_i \le T_B$. Length of the browsing session is $|B|$. Browsing session is often referred to simply as a session.

Subsequence of an individual browsing session $B$ is a sequence $S = \{(p_i, dp_i)\}_i$ where each delay $dp_i \le T_S$, and $\{(p_i, dp_i)\}_i \subset B$. The length of subsequence is $|S|$.

A browsing session $B = \{(S_i, ds_i)\}_i$ thus consists of a train of subsequences $S_i$ separated by inactivity delays $ds_i$.

An important issue is determining the appropriate values of $T_B$ and $T_S$ that segment the user activity into sessions and subsequences. Earlier research (Catledge and Pitkow, 1995) indicated that student browsing sessions last on average 25.5 minutes. However, we adopt the average maximum attention span of 1 hour as the value for $T_B$. If the user’s browsing activity was followed by a period of inactivity greater than 1 hour, it is considered a single session, and the following activity comprises the next session.

The value of $T_S$ is determined dynamically and computed as the average delay in a browsing session: $T_S = \frac{1}{N}\sum_{i=1}^{N} d_i$, where $N$ is the number of delays in the session. If the delays between page views are short, it is useful to bound the value of $T_S$ from below. This is preferable in environments with frame-based and/or script generated pages where numerous logs are recorded in a rapid transition. Since our situation contained both cases, we adjusted the value of $T_S$ by bounding it from below by 30 seconds:

$$T_S = \max\left(30, \frac{1}{N}\sum_{i=1}^{N} d_i\right). \qquad (1)$$
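The segmentation rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(page, delay)` tuple layout with delays in seconds, and the choice to exclude a session's closing delay from the average are assumptions made for illustration only.

```python
from typing import List, Tuple

# (page URL, delay in seconds until the next page view); layout is an assumption
Click = Tuple[str, float]

SESSION_TIMEOUT = 3600.0    # session threshold: 1 hour, as adopted in the text
MIN_SUBSEQ_TIMEOUT = 30.0   # lower bound of 30 seconds from Eq. (1)


def cut_after_long_delays(clicks: List[Click], threshold: float) -> List[List[Click]]:
    """Close the current segment after every delay greater than the threshold."""
    segments, current = [], []
    for page, delay in clicks:
        current.append((page, delay))
        if delay > threshold:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def split_sessions(clicks: List[Click]) -> List[List[Click]]:
    """Segment a click stream into sessions using the fixed 1 hour threshold."""
    return cut_after_long_delays(clicks, SESSION_TIMEOUT)


def subsequence_timeout(session: List[Click]) -> float:
    """Eq. (1): the larger of 30 seconds and the average delay in the session.

    Assumption: the delay attached to the last click is the gap to the next
    session, so it is left out of the average.
    """
    inner = [delay for _, delay in session[:-1]]
    mean = sum(inner) / len(inner) if inner else 0.0
    return max(MIN_SUBSEQ_TIMEOUT, mean)


def split_subsequences(session: List[Click]) -> List[List[Click]]:
    """Segment one session into its train of subsequences."""
    return cut_after_long_delays(session, subsequence_timeout(session))
```

Because the subsequence threshold is recomputed for each session, a session of rapid script-generated requests falls back to the 30 second floor, while a slow reading session gets a proportionally larger threshold.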

Using these primitives we deﬁne navigation space

and subspace as follows.

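Definition 2. (Navigation Space and Subspace)

Navigation space is a triplet $\mathcal{G} = (\mathcal{P}, \mathcal{B}, \mathcal{S})$ where $\mathcal{P}$ is a set of points (e.g. URLs), $\mathcal{B}$ is a set of browsing sessions, and $\mathcal{S}$ is a set of subsequences.

Navigation subspace of $\mathcal{G}$ is a space $\mathcal{A} = (\mathcal{D}, \mathcal{H}, \mathcal{K})$ where $\mathcal{D} \subseteq \mathcal{P}$, $\mathcal{H} \subseteq \mathcal{B}$, and $\mathcal{K} \subseteq \mathcal{S}$; denoted as $\mathcal{A} \subseteq \mathcal{G}$.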

Separation of subspaces within a navigation space

reﬂects the nature of detected or deﬁned sequences.

For example, a human navigation space consists of

human generated sequences, and a machine naviga-

tion space may contain only the machine generated

sequences. Different spaces may have distinctly dif-

ferent characteristics.

An important aspect of human browsing behavior is identifying the starting and attracting points in the navigation space, as well as the single user actions.

Definition 3. (Starter, Attractor, Singleton)

Let $\mathcal{G} = (\mathcal{P}, \mathcal{B}, \mathcal{S})$ be a navigation space, $B = \{(S_i, ds_i)\}_i$, $B \in \mathcal{B}$, be a browsing session, and $S = \{(p_i, dp_i)\}_i$, $S \in \mathcal{S}$, be a subsequence.

Starter is the first point of a subsequence or session with length greater than 1, that is, $p_1 \in \mathcal{P}$ such that there exist $B \in \mathcal{B}$ or $S \in \mathcal{S}$ where $|B| > 1$ or $|S| > 1$ and $(p_1, d_1) \in B$ or $(p_1, dp_1) \in S$.

Attractor is the last point of a subsequence or session with length greater than 1, that is, $p_{|B|} \in \mathcal{P}$ or $p_{|S|} \in \mathcal{P}$ such that there exist $B \in \mathcal{B}$ or $S \in \mathcal{S}$ where $|B| > 1$ or $|S| > 1$ and $(p_{|B|}, d_{|B|}) \in B$ or $(p_{|S|}, dp_{|S|}) \in S$.

Singleton is a point $p \in \mathcal{P}$ such that there exist $B \in \mathcal{B}$ or $S \in \mathcal{S}$ where $|B| = 1$ or $|S| = 1$ and $(p, d) \in B$ or $(p, dp) \in S$.

The starters refer to the initial navigation points of

users, whereas the attractors denote the users’ targets.

The singletons relate to the single user actions such

as use of hotlists (e.g. history or bookmarks) (Thakor

et al., 2004).
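As an illustration of Definition 3, the following Python sketch counts starters, attractors, and singletons over a collection of already segmented sequences. The function name and the representation of each sequence as a plain list of page URLs are assumptions for this example, not part of the original framework.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def classify_points(sequences: Iterable[List[str]]) -> Tuple[Counter, Counter, Counter]:
    """Count how often each page acts as a starter, attractor, or singleton.

    Each element of `sequences` is the ordered list of page URLs of one
    session or subsequence (hypothetical input layout).
    """
    starters, attractors, singletons = Counter(), Counter(), Counter()
    for pages in sequences:
        if not pages:
            continue
        if len(pages) == 1:
            # length 1: a single user action, e.g. a bookmark access
            singletons[pages[0]] += 1
        else:
            # length > 1: first point is the starter, last point the attractor
            starters[pages[0]] += 1
            attractors[pages[-1]] += 1
    return starters, attractors, singletons
```

The sizes of the returned counters (`len(starters)` and so on) correspond to the numbers of unique starters, attractors, and singletons, while the counts themselves give the occurrence frequencies.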

Page traversal network may contain points that are

occasionally accessed and also points concentrating

trafﬁc—hubs. Hubs have larger incoming and out-

going spectrum of navigational choices. To quantify

a variety of navigational pathways that lead into and

out of a point, we deﬁne the in and out degrees.

Definition 4. (In and Out Degrees)

Let $p_i \in \mathcal{P}$ be a point in a navigation space $\mathcal{G} = (\mathcal{P}, \mathcal{B}, \mathcal{S})$ such that there exists $B \in \mathcal{B}$ where $|B| > 1$ and $(p_i, d_i) \in B$.

In degree of a point $p_i$ is the cardinality of a set of all preceding points $p_{i-1}$ in sessions; $p_{i-1} \to p_i$, denoted as:

$$\mathrm{In}(p_i) = |\{p_{i-1} \mid (p_{i-1}, d_{i-1}) \in B \wedge (p_i, d_i) \in B\}|.$$

Out degree of a point $p_i$ is the cardinality of a set of all following points $p_{i+1}$ in sessions; $p_i \to p_{i+1}$, denoted as:

$$\mathrm{Out}(p_i) = |\{p_{i+1} \mid (p_{i+1}, d_{i+1}) \in B \wedge (p_i, d_i) \in B\}|.$$

The in degree of a point reflects the variety of choices from which the users access it. The point’s out degree represents the spectrum of branches from it that users utilize. Note that the defined in and out degrees delineate browsing behavior characteristics rather than the number of links pointing to and out of a given point. Some pathways might not be exploited by the users, or users may choose to utilize hotlists at a given browsing stage. The human browsing behavior hubs in the navigation space may differ from the link hubs.
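A minimal Python sketch of the behavioral in and out degrees is given below. It assumes sessions are available as ordered lists of page URLs; the function name and input layout are illustrative and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def in_out_degrees(sessions: Iterable[List[str]]) -> Dict[str, Tuple[int, int]]:
    """Return {page: (in_degree, out_degree)} from observed transitions.

    The in degree is the number of distinct pages seen immediately before a
    page, the out degree the number of distinct pages seen immediately after
    it. Sessions of length 1 contribute no transitions.
    """
    predecessors = defaultdict(set)
    successors = defaultdict(set)
    for pages in sessions:
        for prev_page, next_page in zip(pages, pages[1:]):
            successors[prev_page].add(next_page)
            predecessors[next_page].add(prev_page)
    points = set(predecessors) | set(successors)
    return {p: (len(predecessors[p]), len(successors[p])) for p in points}
```

Because the sets are built from traversed transitions only, a page with many hyperlinks that users never follow still receives a low out degree, which is the distinction between behavioral hubs and link hubs drawn in the text.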

ICEIS 2008 - International Conference on Enterprise Information Systems


3 INFORMATION SYSTEM CASE

STUDY

The information system investigated in this study is the large-scale intranet portal of The National Institute of Advanced Industrial Science and Technology. The core comprises six servers connected to the high-speed backbone in a load-balanced configuration. Accessibility is provided via wide-ranging connectivity options (from high-speed optical to wireless) accommodating several platforms (up to mobile devices). The portal provides an extensive range of web services and documents vital to the organization (Table 1). The rich intranet services support business processes for management, accounting and administration, research cooperation with industry and other institutes, and resource localization, as well as bulletin boards and networking within the organization. The institute has a number of branches throughout the country, thus several services and resources are distributed. The visible web space exceeded 1 GB, and the deep web space was substantially larger, but difficult to estimate due to the decentralized architecture and varying back-end data.

Table 1: Case study data information.

Data Volume                    ∼60 GB
Average Daily Volume           ∼54 MB
Number of Servers              6
Number of Log Files            6814
Average File Size              ∼9 MB
Time Period                    3/2005 - 4/2006
Log Records                    315 005 952
Clean Log Records              126 483 295
Unique IP Addresses            22 077
Services                       855
Unique URLs                    3 015 848
    Scripts                    2 855 549
    HTML Documents             35 532
    PDF Documents              33 305
    DOC Documents              4 385
    Others                     87 077
Sessions                       3 454 243
Unique Sessions                2 704 067
Subsequences                   7 335 577
Unique Subsequences            3 547 170
Valid Subsequences             3 156 310
Unique Valid Subsequences      1 644 848
Users                          ∼10 000

The majority of the enterprise portal users were skilled knowledge workers. Significant traffic on the portal resulted in a large web log data pool. The traffic was both human and machine generated, thus the data required cleaning. The data preparation, processing, filtering, and segmentation into sessions and subsequences are described in (Géczy et al., 2007). The initial data cleaning eliminated most of the machine generated traffic; however, further filtering was needed after subsequence extraction. Notably, the data cleaning and filtering reduced the number of log records by 59.85%, and the number of unique subsequences by 53.6% (from 3 547 170 unique subsequences to 1 644 848 unique valid ones; see Table 1).
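The reduction figures can be rechecked from the counts in Table 1 with a short Python sketch. The helper name is hypothetical; the four input numbers are taken directly from the table.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage of items removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Log records versus clean log records (Table 1)
log_reduction = reduction_percent(315_005_952, 126_483_295)

# Unique subsequences versus unique valid subsequences (Table 1)
subsequence_reduction = reduction_percent(3_547_170, 1_644_848)

print(f"log records removed: {log_reduction:.2f}%")
print(f"unique subsequences removed: {subsequence_reduction:.1f}%")
```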

4 BROWSING BEHAVIOR

ANALYSIS

By analyzing the point characteristics we infer several

relevant observations. The point characteristics of a

navigation space highlight the initial and the terminal

targets of knowledge worker activities, and also the

single-action behaviors. Analysis demonstrates the

applicability and usefulness of the approach.

It is evident that the knowledge worker navigation space is substantially smaller, with respect to the essential navigation points, than the observed complete navigation space. The unique valid sets of starters (115 770), attractors (288 075), and singletons (57 894) are very small in comparison to the set of unique URLs (3 015 848) in the navigation space (see Table 1 and Table 2). The largest set, unique valid attractors, is only 9.55% of unique URLs. Unique valid starters and singletons represent only approximately 3.84% and 1.92% of unique URLs, respectively.

The browsing behavior of knowledge workers resembles a complex network. The topology of the knowledge worker navigation space clearly corresponds to a complex network. A characteristic feature of complex networks is a long-tailed distribution of the in and out degrees of the nodes. Histograms of in and out degrees of starters and attractors distinctly display long-tail characteristics, with a small number of high-frequency elements gradually progressing to a large number of low-frequency elements (Figures 1 and 2). The network of starting navigation points and the network of users’ targets are both complex networks. Certain points in the navigation space concentrate the human web traffic and serve as hubs.
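One simple way to inspect a degree list for long-tail characteristics is to bin the degrees on a logarithmic scale, similar in spirit to the logarithmic x-axis of the histograms. The sketch below is an illustrative assumption rather than the procedure used to produce the figures; the function name and decade-wide bins are chosen for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def decade_histogram(degrees: Iterable[int]) -> Dict[int, int]:
    """Count points per decade: bin k holds degrees in [10**k, 10**(k+1)).

    The decade of a positive integer is its number of decimal digits minus
    one, which avoids floating point logarithms at exact powers of ten.
    """
    bins: Counter = Counter()
    for degree in degrees:
        if degree < 1:
            continue  # a point without observed transitions has no degree to bin
        bins[len(str(int(degree))) - 1] += 1
    return dict(sorted(bins.items()))
```

For a long-tailed degree list, the low decades hold most of the points while the highest decades hold only a handful of hubs.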

Knowledge workers’ browsing behavior concentrates on the navigational starters. Starters are the major concentration points of the users’ complex navigational network. They are the main hubs. There are approximately one hundred primary starter hubs and three hundred primary attractor hubs. These one hundred primary starters constitute 0.086% of unique valid starters, and the three hundred primary attractors account for 0.1% of unique valid attractors. Thus the ratio between the primary starter and attractor hubs is approximately one to three. A similar ratio, roughly one to two and a half, holds between the numbers of unique valid starters (115 770) and attractors (288 075); see Table 2.

Table 2: Statistics for starters, attractors, and singletons.

                Starters     Attractors   Singletons
Total           7 335 577    7 335 577    1 326 954
Valid           2 392 541    2 392 541      763 769
Filtered        4 943 936    4 943 936      563 185
Unique            187 452    1 540 093       58 036
Unique Valid      115 770      288 075       57 894

Figure 1: Histograms and quantiles of starter: a) in degrees,

b) out degrees. Right y-axis contains a quantile scale. X-

axis is in a logarithmic scale.

The initial navigation points primarily dissemi-

nate the knowledge worker browsing pathways. The

starters disperse the navigation more than the attrac-

tors. This is evident from the quantiﬁcation of the

in and out degrees of the major starters and attrac-

tors. In and out degrees of starters range from one to

over twenty thousand. Range of attractor in degrees

(1 to about 6800) and out degrees (1 to about 3400) is

approximately three to six times lower, respectively.

Top ten starters (approximately 0.0086% of unique

valid starters) have in and out degrees ranging from

ﬁve thousand to over twenty thousand (Figure 1).

Compound in and out degrees of top thirty starter

hubs (approximately 0.026% of unique valid starters)

represented approximately 20% of total starter in and

out degrees.

Knowledge workers are more behaviorally diverse in reaching their targets than in proceeding to the starting points of the following sub-tasks. The attractors’ in degree range is two times greater than the out degree range (refer to Figure 2). Thus the users employ approximately two times more arriving pathways to the targets than departing ones. Only approximately the top twenty attractors have in and out degrees greater than one thousand. Discrepancies between their in degrees are greater than between their out degrees.

The variability of arriving and departing pathways to and from starters is relatively balanced. Both in and out degrees of starters extend to approximately 20 000 (Figure 1). The in and out degree ranges of starters are significantly greater than the attractor ranges (see Figures 1 and 2). Hence the users have a richer traversal repertoire when reaching and leaving the initial navigation points than when reaching and leaving the targets.

Knowledge workers utilized a small spectrum of starting navigation points and targeted a relatively small number of resources during their browsing. The set of unique valid starters (115 770), i.e. the initial navigation points of knowledge workers’ (sub-)goals, was approximately 3.84% of total navigation points (see Tables 1 and 2). Although the set of unique valid attractors (288 075), i.e. (sub-)goal targets, was approximately two and a half times larger than the set of initial navigation points, it is still a relatively minor portion (approximately 9.55% of unique URLs). Knowledge workers initiated their browsing experiences from a small number of navigation points and aimed at relatively few resources.

Figure 2: Histograms and quantiles of attractor: a) in degrees, b) out degrees. Right y-axis contains a quantile scale. X-axis is in a logarithmic scale.

Few resources were perceived as valuable enough to be bookmarked. The number of unique single user actions was minuscule. Single actions, such as use of hotlists (Thakor et al., 2004), followed by delays greater than 1 hour are represented by the singletons. Unique valid singletons (57 894) accounted for only 1.92% of navigation points (see Tables 1 and 2). The number of singletons is approximately two times lower than the number of starters and almost five times lower than the number of attractors (Table 2). If only a small number of starters and/or attractors were perceived as useful, there is a possibility that they were bookmarked and accessed directly in future browsing experiences.

Knowledge workers had focused interests and exhibited minuscule exploratory behavior. A narrow spectrum of starters, attractors, and singletons was frequently used. The histograms and quantile characteristics of starters, attractors, and singletons (see Figure 3) indicate that the higher frequency of occurrences is concentrated in a relatively small number of elements. Approximately ten starters and singletons, and fifty attractors were very frequent. About one hundred starters and singletons, and one thousand attractors were relatively frequent. The quantile analysis in Figure 3 reveals that ten starters (0.0086% of unique valid starters) and singletons (0.017% of unique valid singletons), and fifty frequent attractors (0.017% of unique valid attractors) accounted for about 20% of total occurrences. One hundred starters (0.086% of unique valid starters) and one thousand attractors (0.35% of unique valid attractors) constituted about 45% and 48% of total occurrences, respectively. Analogously, one hundred twenty singletons (0.21% of unique valid singletons) compounded to about 37% of total occurrences.

Figure 3: Histograms and quantiles: a) starters, b) attractors, and c) singletons. Right y-axis contains a quantile scale. X-axis is in a logarithmic scale.
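The quantile statements above amount to asking what share of all occurrences is covered by the k most frequent elements. A minimal Python sketch of that computation follows; the function name and the mapping from page to occurrence count are assumptions for illustration, and the paper's own quantile procedure may differ in detail.

```python
from collections import Counter
from typing import Mapping


def top_k_share(counts: Mapping[str, int], k: int) -> float:
    """Fraction of total occurrences covered by the k most frequent elements."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in Counter(counts).most_common(k))
    return top / total
```

Applied to starter counts, for example, `top_k_share(starter_counts, 100)` would give the share of occurrences attributable to the one hundred most frequent starters, where `starter_counts` is a hypothetical mapping of the kind described.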

Knowledge workers were generally more familiar with the starting navigation points than with the targets. A smaller number of starters repeats substantially more frequently than a comparable number of attractors. That is, the users knew where to start and were familiar with the navigational path to the target (instead of just utilizing shortcuts such as bookmarks). In and out degrees of frequent starters are also significantly higher than those of attractors (see Figures 1 and 2). The frequent starters have in and out degrees between 5000 and 20 000, whereas the frequent attractor in degrees are between 1000 and 6800, and out degrees between 1000 and 3400.

Complex networks of knowledge worker browsing

behavior differ from the web topology constituted by

links. Hubs in the web topology are the pages with

large number of incoming and outgoing links. Behav-

ioral hubs are the navigation points that have large in

and out degrees — resulting from the user traversal

patterns. It has been discovered that the behavioral

hubs in the knowledge worker navigation space did

not substantially match the link hubs. High out de-

grees of behavioral hubs (reaching almost 7000) also

signiﬁcantly exceed the number of links on the served

pages at any given time.

5 CONCLUSIONS

We introduced a novel analytic framework for exploration and modeling of human browsing behavior in electronic environments. It utilizes a temporal segmentation of browsing activity. The framework was applied to browsing behavior analysis of the knowledge workers on a large enterprise information system. Several behavioral features have been revealed. Knowledge worker browsing behavior concentrated on the navigational starters. The users remembered the starting point and recalled the navigational path to the target. The knowledge workers effectively utilized only a small amount of available resources. A large number of resources were accessed only occasionally.

Topology of knowledge worker traversal path-

ways resembles complex networks. However, the be-

havioral complex network differs from the hypertext

link network. The traversal hubs do not identically

correspond to the link hubs. Signiﬁcant long tail char-

acteristics of the essential navigation points have been

exposed both in terms of frequencies as well as in and

out degrees.

REFERENCES

Adomavicius, G. and Tuzhilin, A. (2005). Toward the

next generation of recommender systems: A survey

of the state-of-the-art and possible extensions. IEEE

Transactions on Knowledge and Data Engineering,

17:734–749.

Agichtein, E., Brill, E., and Dumais, S. (2006). Improv-

ing web search ranking by incorporating user behav-

ior information. In Proceedings of The 29th SIGIR,

pp. 19–26, Seattle, Washington, USA.

Barabasi, A.-L. (2005). The origin of bursts and heavy tails

in human dynamics. Nature, 435:207–211.

Baraglia, R. and Silvestri, F. (2007). Dynamic personaliza-

tion of web sites without user intervention. Commu-

nications of the ACM, 50:63–67.

Bedue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A.,

and Ziviani, N. (2006). Modeling performance-driven

workload characterization of web search systems. In

Proceedings of CIKM, pp. 842–843, Arlington, USA.

Caldarelli, G. (2007). Scale-Free Networks: Complex Webs

in Nature and Technology. Oxford University Press,

Cambridge, UK.

Catledge, L. and Pitkow, J. (1995). Characterizing browsing

strategies in the world wide web. Computer Networks

and ISDN Systems, 27:1065–1073.

Dezso, Z., Almaas, E., Lukacs, A., Racz, B., Szakadat, I.,

and Barabasi, A.-L. (2006). Dynamics of information

access on the web. Physical Review, E73:066132(6).

Downey, A. (2005). Lognormal and pareto distributions in

the internet. Computer Communications, 28:790–801.

Géczy, P., Akaho, S., Izumi, N., and Hasida, K. (2007). Usability analysis framework based on behavioral segmentation. In Psaila, G. and Wagner, R., Eds., Electronic Commerce and Web Technologies, pp. 35–45, Springer-Verlag, Heidelberg.

Jin, R., Si, L., and Zhai, C. (2006). A study of mixture mod-

els for collaborative ﬁltering. Information Retrieval,

9:357–382.

Leskovec, J., Kleinberg, J., and Faloutsos, C. (2005).

Graphs over time: Densiﬁcation laws, shrinking di-

ameters and possible explanations. In Proceedings of

KDD, pp. 177–187, Chicago, Illinois, USA.

Moe, W. (2003). Buying, searching, or browsing: Differen-

tiating between online shoppers using in-store naviga-

tional clickstream. Journal of Consumer Psychology,

13:29–39.


Newman, M. (2003). The structure and function of complex

networks. SIAM Review, 45:167–256.

Newman, M., Barabasi, A.-L., and Watts, D. (2005).

The Structure and Dynamics of Complex Networks.

Princeton University Press, Princeton, N.J.

Park, Y.-H. and Fader, P. (2004). Modeling browsing behavior at multiple websites. Marketing Science, 23:280–303.

Schroeder, B. and Harchol-Balter, M. (2006). Web servers

under overload: How scheduling can help. ACM

Transactions on Internet Technology, 6:20–52.

Thakor, M., Borsuk, W., and Kalamas, M. (2004). Hotlists

and web browsing behavior–an empirical investiga-

tion. Journal of Business Research, 57:776–786.

Vazquez, A. (2005). Exact results for the barabasi

model of human dynamics. Physical Review Letters,

95:248701(6).

Vazquez, A., Oliveira, J., Dezso, Z., Goh, K.-I., Kondor,

I., and Barabasi, A.-L. (2006). Modeling bursts and

heavy tails in human dynamics. Physical Review,

E73:036127(19).
