Access Prediction for Knowledge Workers in Enterprise Data Repositories

Chetan Verma, Michael Hart, Sandeep Bhatkar, Aleatha Parker-Wood and Sujit Dey

Electrical and Computer Engineering, University of California San Diego, San Diego, CA, U.S.A.

Symantec Research Labs, Mountain View, CA, U.S.A.

Keywords:

Information Retrieval, Machine Learning, Enterprise, File Systems.

Abstract:

The data which knowledge workers need to conduct their work is stored across an increasing number of

repositories and grows annually at a signiﬁcant rate. It is therefore unreasonable to expect that knowledge

workers can efﬁciently search and identify what they need across a myriad of locations where upwards of

hundreds of thousands of items can be created daily. This paper describes a system which can observe user

activity and train models to predict which items a user will access in order to help knowledge workers discover

content. We speciﬁcally investigate network ﬁle systems and determine how well we can predict future access

to newly created or modiﬁed content. Utilizing ﬁle metadata to construct access prediction models, we show

how the performance of these models can be improved for shares demonstrating high collaboration among their

users. Experiments on eight enterprise shares reveal that models based on file metadata can achieve F scores

upwards of 99%. Furthermore, on average, collaboration aware models can correctly predict nearly half of

new ﬁle accesses by users while ensuring a precision of 75%, thus validating that the proposed system can be

utilized to help knowledge workers discover new or modiﬁed content.

1 INTRODUCTION

Enterprise knowledge workers are inundated with

new options for conducting their work with the

rise of Enterprise Social Networks (Leonardi et al.,

2013) and cloud based applications (Salesforce, 2015;

Ofﬁce365, 2015) alongside traditional technologies

such as email, source control repositories, network

ﬁle servers, and ofﬁce software suites. Enterprises

are also embracing new computing devices such

as mobile devices and tablets in addition to exist-

ing personal computers, laptops, workstations and

servers. The amount of enterprise data grows signif-

icantly each year: studies estimate that unstructured

data grows annually by 40-50% (Gantz and Reinsel,

2012). The fragmentation in the tools and devices

used to work and the sheer growth of data places in-

creasingly unrealistic demands on knowledge work-

ers to keep up with the inﬂux of data. In fact, it has

been reported that 65% of users have felt at times

overwhelmed by the amount of incoming data (IDG

Enterprise, 2014).

This paper presents a system that utilizes machine

learning and natural language processing to automate

the discovery of important new or modiﬁed content

and identify which subset of users will likely use or

beneﬁt from it. The system is designed for ﬁle servers

and is evaluated with activity collected over eight net-

work ﬁle servers from an enterprise customer. Enter-

prises use ﬁle servers for a myriad of purposes includ-

ing storing application data, back up, enabling collab-

oration, and hosting personal home directories. The

proposed system will support a wide range of applications,

such as recommender systems or server cache

management systems, by providing predictions about

what data will likely be accessed in the near future.

The system bases its predictions on user activity

and content metadata. We track content accessed by

users over a speciﬁed training interval. Data (i.e. ac-

cess to a particular piece of content) are represented

by a set of features that include path components

(e.g., parent and ancestral directories), keywords in

the path, and extension. Each datum represents an in-

stance in training our model. Combining this training

data with the ﬁles speciﬁcally accessed by the user,

this system builds personalized models to predict fu-

ture ﬁle accesses. While traditional approaches for

ﬁle access prediction such as (Yeh et al., 2001a; Yeh

et al., 2001b) cannot be applied to recommend new

ﬁles, the proposed user model based approach is gen-


Verma C., Hart M., Bhatkar S., Parker-Wood A. and Dey S..

Access Prediction for Knowledge Workers in Enterprise Data Repositories.

DOI: 10.5220/0005374901500161

In Proceedings of the 17th International Conference on Enterprise Information Systems (ICEIS-2015), pages 150-161

ISBN: 978-989-758-096-3

© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

eralizable enough to be applied even to content that

has not been accessed before, or is newly created.

Additionally, analysis of ﬁle activity yields the in-

sight that very regular patterns of collaboration occur.

The paper demonstrates a method in which an indi-

vidual’s prediction precision and recall can be greatly

improved by incorporating the predictions of all other

user models.

This work makes the following contributions:

• Observations about the nature of enterprise ﬁle ac-

tivity

• A system that analyzes ﬁle activity and meta-

data and applies machine learning and natural lan-

guage processing to provide predictions

• A strategy to combine the personalized models of

multiple ﬁle users to improve the predictions of

individual users

The paper is organized as follows. Section 2

provides observations about ﬁle activity and pre-

processing. Section 3 details the feature space and

Section 4 describes the construction of metadata and

collaborative ﬁltering-based user models. Section 5

presents an evaluation of the user models and Sec-

tion 6 discusses the contributions and characteristics

of the features used in the models, and scalability and

deployment related aspects. Section 7 identiﬁes re-

lated work followed by a discussion on directions for

future work in Section 8. Section 9 concludes.

2 DATA

For this work, the system focuses on network ﬁle

servers from corporate enterprises with signiﬁcant

user collaboration. We select network ﬁle servers

based on social network analysis and use an afﬁnity

function where an edge connects two users if the users

have accessed at least one ﬁle in common. Collabora-

tion is measured by the triangle count, the number of

triangles formed by sets of three users mutually con-

nected to each other. The ﬁle servers are selected from

the 90th percentile based on the triangle count. The

normalized triangle count is calculated by averaging

triangles over all ﬁles in a share. For the purposes

of this paper, eight ﬁle servers are selected to evalu-

ate the system. Table 1 compares different statistics

from each of the eight servers. Note that the statis-

tics are calculated before removal of scripted activity

as described below. Removing scripted activities is

an important step since our intent is to model the ﬁle

access patterns of users for applications such as ﬁle

recommendation, and not necessarily to model the ﬁle

access patterns of automated processes.
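The collaboration measure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `(user, file)` log format are assumptions. The normalization divides the triangle count by the number of distinct files, which matches the ratio of the two triangle columns in Table 1 (for example, 710K triangles over 746 files gives roughly 951 for share D).

```python
from collections import defaultdict
from itertools import combinations


def build_affinity_graph(access_log):
    """Connect two users if they accessed at least one file in common.

    access_log is an iterable of (user, file) pairs (assumed format).
    Returns the adjacency map and the number of distinct files.
    """
    users_by_file = defaultdict(set)
    for user, path in access_log:
        users_by_file[path].add(user)
    neighbours = defaultdict(set)
    for users in users_by_file.values():
        for a, b in combinations(sorted(users), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours), len(users_by_file)


def triangle_count(neighbours):
    """Count sets of three mutually connected users, each counted once."""
    triangles = 0
    for a, adjacent in neighbours.items():
        for b in adjacent:
            if b > a:
                # Only count the ordering a < b < c to avoid duplicates.
                triangles += sum(1 for c in adjacent & neighbours[b] if c > b)
    return triangles


def normalized_triangle_count(access_log):
    """Triangle count averaged over the files in the share."""
    neighbours, n_files = build_affinity_graph(access_log)
    return triangle_count(neighbours) / n_files if n_files else 0.0
```

The pairwise expansion per file is quadratic in the number of users sharing that file, so a production system would likely use a graph library or sampling for very popular files.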

2.1 Detecting Scripted Activities

Our data suggests that users access ﬁles in at least

two modalities. Normal ﬁle access activity for users

typically consists of a small number of ﬁle accesses

in a short period of time, such as an hour. The other mode is the access of a large number of files, which manifests as a sudden burst of activity. To remove such scripted activities, we record the number of activities performed by every user in each one-hour session, and label the sessions having an exceptionally large number of activities as scripted. In order to

obtain an appropriate threshold for this purpose, we

utilize Tukey’s outlier factor (Wang et al., 2011) as

shown in Algorithm 1. A (user, hour) tuple is ﬂagged

as scripted if the number of activities corresponding

to the tuple exceeds the threshold calculated as

Q3+ k × IQR. (1)

Here Q3 is the third quartile and IQR is the interquar-

tile range of the number of activities of (user, hour)

tuples. We empirically set k to 5. Since our focus

is on removing scripted activities, if the threshold de-

termined using Tukey’s outlier factor is less than 1000

activities per user per hour, we use 1000 as the thresh-

old. That is, if a user performs more than 1000 ﬁle ac-

tivities per hour, we label his/her ﬁle activities in that

hour as scripted. Based on such an approach, if an abnormally high number of file activities are performed from a user's account in an hour, it can be reasonably expected that at least a large fraction of them were not performed by the user directly but were the result of automated or scripted activity. The activities that are determined to be scripted are removed from the file activity log over which models are trained and evaluated. Table 1 provides the proportion of file activities remaining in each share after scripted activities are removed. From this point, we only provide results on the shares after scripted activity is removed.

Algorithm 1: Detecting scripted activities.
Input: Num-activities(u, h) tuples ∀ user u in share, ∀ hour h in dataset; k = 5
Q1 = first quartile of Num-activities(u, h) ∀ u, h
Q3 = third quartile of Num-activities(u, h) ∀ u, h
IQR = Q3 − Q1
Output: Threshold = Q3 + k × IQR
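The thresholding rule of Eq. 1 and Algorithm 1, including the floor of 1000 activities per user per hour, can be sketched as below. The event format `(user, hour, path)`, the function names, and the choice of the "inclusive" quartile method are assumptions made for illustration; the paper does not state which quartile estimator was used.

```python
from collections import Counter
from statistics import quantiles


def scripted_threshold(counts, k=5, floor=1000):
    """Return max(Q3 + k * IQR, floor) over per-(user, hour) activity counts."""
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    return max(q3 + k * (q3 - q1), floor)


def remove_scripted(events, k=5, floor=1000):
    """Drop every event that falls in a (user, hour) session flagged as scripted.

    events is a list of (user, hour, path) tuples (assumed format).
    """
    per_session = Counter((user, hour) for user, hour, _ in events)
    threshold = scripted_threshold(list(per_session.values()), k, floor)
    return [e for e in events if per_session[(e[0], e[1])] <= threshold]
```

Setting `floor=0` recovers the plain Tukey threshold, which can be useful when inspecting how often the floor, rather than the quartiles, determines the cut-off.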

2.2 Metadata Tokenization

File metadata in enterprise environments does not

share consistent capitalization or delimitation. For

example, in the Directory Services, a group such as

AccessPredictionforKnowledgeWorkersinEnterpriseDataRepositories

151

Table 1: Statistics of shares used for evaluation. Degree of collaboration in shares is measured using normalized triangle counts as outlined in Section 2. The types of events (create, read, write, delete) offer useful insight into the use of each share. For example, share D observes high "write" workload and may be used as a repository for logs.

Share  Sample   Users  Files      Total file  % after  Triangle  Normalized  Create%  Read%  Write%  Delete%
       period          operated   operations  burst    count     triangle
       (days)                                 removal            count
A      123      992    36,009     11M         99.9     280M      8K          2.3      92.7   2.8     2.2
B      122      464    1,309      136K        99.9     4M        3K          0.7      38.9   60.0    0.4
C      122      160    1,044,779  3M          9.3      50K       <0.1        0.9      96.8   1.6     0.6
D      121      183    746        11K         99.8     710K      951         5.4      88.4   5.9     2.6
E      66       1,288  99,733     292K        16.3     263M      3K          15.6     50.2   17.3    16.9
F      66       937    6,911      4M          100.0    3K        0.4         0.2      99.5   0.2     0.2
G      66       198    334        14K         100.0    1M        3K          0.2      98.7   0.7     0.4
H      57       398    133,006    4M          93.6     6M        45          15.3     57.8   10.0    16.8

Table 2: Minimum, median (Q2), maximum, first (Q1) and third (Q3) quartile of number of unique files operated upon weekly.

Share  Min  Q1  Q2  Q3   Max
A      1    13  17  41   585
B      1    2   3   5    65
C      1    4   17  54   504,692
D      1    2   3   5    65
E      1    1   1   1    15,478
F      1    1   1   2    1,182
G      1    3   6   10   307
H      1    6   17  101  29,604

Table 3: Popular file extensions in different shares that users access. Share D sees large workload on log files and Windows Performance Monitor (.pma) data files, corresponding with observation in Table 1.

Share  Most common extensions (% file activities)
A      tmp (78)  dat (19)  gnt (2)     pm (< 1)       docx (< 1)
B      pdf (16)  log (12)  pma (11)    bak (10)       txt (8)
C      pdf (85)  doc (4)   xls (3)     tif (1)        msg (1)
D      xls (87)  xlsx (6)  htm (3)     pdf (1)        dat (1)
E      pdf (92)  doc (2)   deploy (1)  resources (1)  vb (1)
F      txt (70)  stk (19)  xls (10)    exe (< 1)      log (< 1)
G      cat (38)  bat (18)  lnk (8)     dll (8)        mpr (7)
H      xls (35)  ret (25)  rpt (4)     unv (3)        pdf (3)

ABC administrators could be recorded as “ABC ad-

mins”, but referred to in a directory name as “AB-

CADMINS”. Therefore, extracting meaningful enti-

ties from metadata requires tokenization that is not

only sensitive to natural language delimiters (e.g.

whitespace), but also the likely concatenation of enti-

ties in alphanumeric substrings.

This system employs a heuristic based approach to

more traditional Natural Language Processing mor-

phological extraction (Bybee, 1985). The system

ﬁrst tries to develop a list of organizational speciﬁc

entities (which may be unique to only this organiza-

tion) by analyzing a deﬁnitive entity source. A direc-

tory service such as Active Directory (Active Direc-

tory, 2015) is an example of such a source, which we

use for training our system. Entities could be mined

from group names, lines of businesses, and other user

and group metadata. A ground truth set of entities is

blindly constructed by splitting on a set of hard de-

limiters. In our case, since the organization resides

primarily in the United States, we used whitespace

and non-alphanumeric characters to split entries. We

compute a frequency dictionary for all entities and de-

note D as the total number of entities extracted. This

frequency dictionary is denoted as the internal dictio-

nary.

In addition to an organizational frequency dictio-

nary, the system will also leverage a general word

frequency dictionary. Not all entities in metadata

could come from the authoritative source mentioned

in the previous paragraph. A frequency dictionary

computed from the frequency of terms in general us-

age, such as Corpus of Contemporary American En-

glish (coca, 2008), will provide information about the

likely tokenization a user would also arrive at. This

frequency dictionary is denoted as the external dictio-

nary.

Tokenization of metadata leverages dynamic pro-

gramming to address the possible concatenation of

multiple entities. The algorithm starts by ﬁrst split-

ting the metadata on hard delimiters, such as whites-

pace. For each substring, the algorithm applies an-

other split if the substring matches a known regular

expression for concatenating entities into one contin-

uous alphanumeric sequence. In this system, we con-

sider variations of CamelCase (CamelCase, 2015).

After splitting on hard delimiters and regular expres-

sions, we apply dynamic programming to determine

if the substring should be split into two or more to-

kens. The algorithm iterates by ﬁnding the optimal

tokenization of each preﬁx, starting with the preﬁx of

length 1. A split is scored by multiplying the optimal

solution for the left side (e.g. preﬁx) and the score for

the right side. The score of the right side is a linear in-

terpolation of the internal and external frequency dic-

tionaries:

$$\beta \, f_{\text{internal}}(\text{term}) + (1 - \beta) \, f_{\text{external}}(\text{term}). \qquad (2)$$

Empirical results show a β of 0.9 works well in prac-

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

152

tice. A penalty factor for either dictionary is applied

when a term is not found. The penalty factor used in this paper is $D \cdot 2^{\text{len}(\text{term})}$. Note that this approach favors tokenization with internal entities and fewer tokens.
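The dynamic programming step can be sketched as follows, under several assumptions that the text does not settle: both dictionaries map lowercase terms to relative frequencies, the right-side score is the weighted sum suggested by "linear interpolation", and a term missing from a dictionary is scored as $1 / (D \cdot 2^{\text{len}(\text{term})})$ for that dictionary. The function names are hypothetical.

```python
def term_score(term, internal, external, num_entities, beta=0.9):
    """Interpolated score of one candidate token.

    A term missing from a dictionary receives a small length-dependent
    score in its place (assumed reading of the penalty factor).
    """
    penalty = 1.0 / (num_entities * 2 ** len(term))
    f_int = internal.get(term, penalty)
    f_ext = external.get(term, penalty)
    return beta * f_int + (1 - beta) * f_ext


def tokenize(substring, internal, external, num_entities, beta=0.9):
    """Split one alphanumeric substring into its best-scoring tokens."""
    s = substring.lower()
    n = len(s)
    best = [1.0] + [0.0] * n   # best[i]: score of the best split of s[:i]
    back = [0] * (n + 1)       # back[i]: start index of the last token in s[:i]
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start] * term_score(
                s[start:end], internal, external, num_entities, beta)
            if score > best[end]:
                best[end], back[end] = score, start
    tokens = []
    end = n
    while end > 0:
        tokens.append(s[back[end]:end])
        end = back[end]
    return tokens[::-1]
```

Because every token score is below 1 and the scores are multiplied, splits with fewer tokens are favored, and the weight of 0.9 on the internal dictionary favors organization-specific entities. For long substrings, summing log scores instead of multiplying would avoid floating-point underflow.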

3 FEATURES

As mentioned in Section 1, the proposed system is

trained over primarily three types of features. For

each ﬁle f, we show below how these features are

calculated.

1. Folder Features. Typically, the files in a file system are organized into folders and sub-folders, with the intent of placing together files that are expected to be used together or are related to the same task. Our aim is to capture the folder that a file belongs to, and to capture the proximity between folders with respect to the file system hierarchy without having to explicitly define a folder-to-folder proximity metric. Let the folders in the file system be represented by $F$, where $F(i)$ represents the $i$-th folder. The folder features of a file $f$ are represented as the vector $X_{f,F}$, where the cardinality of $X_{f,F}$ is the number of folders in the file system, i.e., $|F|$. The $i$-th element of $X_{f,F}$, i.e., $X_{f,F}(i)$, is 1 if $F(i)$ lies in the path of file $f$, and 0 otherwise. For example, if $f$ is '\folder1\folder2\filename' then $X_{f,F}$ would be [1, 1, 0, 0, 0, ...], where the first index of $X_{f,F}$ corresponds to the folder '\folder1\' and the second index corresponds to the folder '\folder1\folder2\'.

2. Token Features. In addition to folder organization, the nomenclature of files and folders also provides useful insights into file content and categorization. In order to capture this, we tokenize the file path including the file name, and construct a vocabulary based on the popular tokens (keywords). Each file is then represented as a bag of words based on the constructed vocabulary, and based on the tokens present in its path name. Specifically, if $T$ is the set of tokens in the constructed vocabulary, then the token features of file $f$ are represented as the vector $X_{f,T}$, where the cardinality of $X_{f,T}$ is equal to $|T|$. The $i$-th element of $X_{f,T}$, i.e., $X_{f,T}(i)$, is equal to the number of times the $i$-th token is present in the path of $f$. Section 2.2 details the tokenization strategy that addresses concatenation of entities, something quite common in enterprise metadata.

3. Extension Features. In order to understand users' affinities towards certain types of files, we record the file extension as a categorical feature. To utilize this in our models, we construct a vocabulary based on popular file extensions in the share, $E$. We then represent the extension of a file $f$ by a binary vector $X_{f,E}$, where the $i$-th value of $X_{f,E}$, i.e., $X_{f,E}(i)$, is 1 if $f$ has the $i$-th extension, and is 0 otherwise. Cardinality of $X_{f,E}$ is equal to $|E|$.

For a file $f$, the metadata feature vector $X_f$ is obtained by concatenating the above three types of features. Specifically, for file $f$, the metadata feature vector is obtained as

$$X_f = [X_{f,F}, X_{f,T}, X_{f,E}]. \qquad (3)$$
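The three feature groups and their concatenation in Eq. 3 can be sketched as below. The function names are hypothetical, and the token splitter here is a simple stand-in that splits on non-alphanumeric characters; it is not the dictionary-based tokenizer of Section 2.2. Folders are assumed to be stored as path prefixes ending in a backslash.

```python
import re


def folder_features(path, folders):
    """1 for every known folder that is an ancestor of the file, else 0."""
    return [1 if path.startswith(folder) else 0 for folder in folders]


def token_features(path, vocabulary):
    """Bag-of-words counts of vocabulary tokens in the full path."""
    # Stand-in tokenizer: split on any non-alphanumeric character.
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", path.lower()) if t]
    return [tokens.count(token) for token in vocabulary]


def extension_features(path, extensions):
    """One-hot encoding of the file extension over the popular extensions."""
    name = path.rsplit("\\", 1)[-1]
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return [1 if ext == e else 0 for e in extensions]


def metadata_features(path, folders, vocabulary, extensions):
    """Concatenate folder, token and extension features as in Eq. 3."""
    return (folder_features(path, folders)
            + token_features(path, vocabulary)
            + extension_features(path, extensions))
```

Since most entries are zero for any single file, a sparse representation would be the natural choice once the number of folders and tokens grows.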

4 MODELING

In order to model users’ ﬁle access patterns, we deﬁne

a training period to train the models, and a testing pe-

riod to evaluate. We follow a personalized modeling

approach where we train one model for each user. For

evaluation, we select 30 users based on the number of

ﬁle activities of all users in a share. Details on the se-

lection of evaluation users are provided in Section 5.1.

All ﬁles that were operated on during the training pe-

riod by at least one user in the share are the training

instances. For user u, the training label of a ﬁle is 1 if

u accessed the ﬁle during training period, 0 otherwise.

Testing instances are the ﬁles that are operated upon

in testing period, after removing the ﬁles that were

observed in the training period. This ensures that the

testing ﬁles are new relative to the training ﬁles. The

testing labels are determined in the same fashion as

for training. Note that we only focus on training and

testing over ﬁle read events. The total set of training

instances is the same across all users, but the labels

can differ. The same holds for testing. The overall

approach is described in Figure 1(a). We approach

the modeling of ﬁle access patterns and its evaluation

as a classiﬁcation problem and utilize the features de-

tailed in Section 3.
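The construction of training and testing instances and their per-user labels can be sketched as follows. The `(user, path)` event format and the function names are assumptions; events are taken to be read events that already have scripted activity removed.

```python
def build_instances(train_events, test_events):
    """Return (training files, testing files) shared by all user models.

    Testing files exclude anything already observed during training,
    so every test instance is new relative to the training set.
    """
    train_files = sorted({path for _, path in train_events})
    test_files = sorted({path for _, path in test_events} - set(train_files))
    return train_files, test_files


def labels_for(user, files, events):
    """Label a file 1 if the user accessed it in the given period, else 0."""
    accessed = {path for u, path in events if u == user}
    return [1 if f in accessed else 0 for f in files]
```

The instance lists are computed once per share, while `labels_for` is called once per user, which mirrors the statement that instances are shared across users but labels differ.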

4.1 Collaborative Filtering Aware Modeling

As discussed in Section 2, ﬁle accesses typically

demonstrate a high degree of collaboration among

users as evidenced by the triangle count. The features

deﬁned in Section 3 only capture metadata attributes

of ﬁles. The models trained on these features can

be improved by utilizing the predictions from models

of other users in the same share. With this motiva-

tion, we describe how the system augments the per-

AccessPredictionforKnowledgeWorkersinEnterpriseDataRepositories

153

Figure 1: Overall approach of the proposed system. a)

shows the training of user models based on only metadata

features (metadata models). The ﬁles accessed by users are

represented in terms of their metadata features, followed by

training of the metadata model. b) shows how individual

metadata user models are applied on validation ﬁles to train

collaborative ﬁltering (CF) aware models (Section 4.1).

sonalized user models with additional information to

achieve collaborative ﬁltering.

Figure 1(b) shows the modiﬁed approach to make

the trained models aware of the collaboration among

users. We obtain validation instances from the train-

ing instances. There are several ways to do so. We ex-

perimented with sampling validation instances from

training instances for different sampling rates. The

best performance, however, was observed when the

validation set was kept the same as the training set.

On the other hand, the testing set, as required, is com-

pletely independent of the training or validation set.

Let U represent the set of users in a share. The

models labeled with “Metadata models” in Figure 1

represent the | U | personalized classiﬁcation models

based on the metadata features. Each of the meta-

data models is trained over the training instances,

and applied on the validation instances. For a file $f$ among the validation instances, the predicted labels from metadata models are concatenated to form $P_{f,U}$, where the $j$-th value, i.e., $P_{f,U}(j)$, is equal to the predicted label of the validation instance $f$ by the metadata model for the $j$-th user in the share. These predicted labels are concatenated with the metadata feature vector of the validation instances to form the feature vector for a second layer of models. Specifically,

$$X'_f = [X_f, P_{f,U}], \qquad (4)$$

and the second layer of models are binary classification models trained with $X'_f$ as the feature vector for each validation file $f$. These collaborative filtering aware models are represented as "CF aware" models in Figure 1(b).

During the testing phase, the predicted label for

a given user u on a test ﬁle f is obtained as follows.

First the metadata models of all the users in U are ap-

plied on f to obtain their predicted labels. These la-

bels are then concatenated with the metadata features

of f as shown in Eq. 4. Finally, the predicted label for

a user is obtained by applying her CF aware model on

the concatenated feature vector of f.
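The two-layer procedure can be sketched as below, with the validation set equal to the training set as reported above. The learner is passed in as a callable `fit(X, y)` that returns a prediction function; this interface and all names are assumptions for illustration. The nearest-neighbour learner is only a toy stand-in so the sketch runs without dependencies; the paper's models are linear SVMs.

```python
def train_cf_aware(features, labels_by_user, fit):
    """Train metadata models, then CF aware models on stacked features.

    features: list of metadata vectors, one per training file.
    labels_by_user: dict mapping user -> 0/1 labels aligned with features.
    fit: callable (X, y) -> predict, where predict maps one vector to 0/1.
    """
    users = sorted(labels_by_user)
    # First layer: one personalized metadata model per user.
    metadata_models = {u: fit(features, labels_by_user[u]) for u in users}

    def stack(x):
        # Eq. 4: metadata features followed by every user's predicted label.
        return list(x) + [metadata_models[u](x) for u in users]

    # Second layer: validation instances are the training instances.
    stacked = [stack(x) for x in features]
    cf_models = {u: fit(stacked, labels_by_user[u]) for u in users}

    def predict(user, x):
        return cf_models[user](stack(x))

    return predict


def fit_nearest_neighbour(X, y):
    """Toy stand-in learner: label of the closest training vector."""
    rows, labels = [list(r) for r in X], list(y)

    def predict(x):
        distances = [sum(a != b for a, b in zip(row, x)) for row in rows]
        return labels[distances.index(min(distances))]

    return predict
```

No user-to-user similarity is computed anywhere: the second-layer learner decides from the data how much weight each other user's predicted label deserves.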

Constructing CF aware user models based on the

above approach has two key advantages. First, CF

aware models can leverage collaboration by factoring

the predicted labels of other users’ metadata models.

It should be noted that this approach does not require

explicitly deﬁning a similarity metric between users

or their access patterns, and yet enables the model to

improve its predictive performance. For example, if the users $u_1$ and $u_2$ have similar behavior, it can be expected that the validation files for which the metadata model of $u_1$ predicts a positive label are also likely to be accessed by $u_2$. Now consider a third user $u_3$ whose behavior is very different from that of $u_1$ and $u_2$. If the metadata model of $u_3$ predicts a positive label for a validation file $f$, the likelihood of the user $u_1$ or $u_2$ accessing $f$ would automatically decrease. The CF model can leverage such learned knowledge of similar and dissimilar access patterns to improve its correctness.

The second advantage of our CF aware modeling

is that it does not suffer from the cold start problem

that the traditional collaborative ﬁltering systems suf-

fer from. To make recommendations to the user u,

these systems ﬁrst identify other users who share sim-

ilar preferences with u, and then propose items which

were favored by the other users but not seen by u.

Such systems fail to make a recommendation for a completely new item. Our approach gets around the cold start problem by utilizing the predicted labels of users in a share along with the metadata features. In Section 5.6, we show how the CF aware models improve the classification performance over the metadata models. With a higher degree of collaboration, we expect to observe higher gains from the CF aware model. Our results strongly corroborate this expectation.

5 EVALUATION

In this section, we describe the evaluation procedure

for our modeling approach. In particular, we pro-

vide performance results over eight shares (network

ﬁle servers) with varying time duration and separa-

tion between the training and the testing periods. We

ﬁrst describe the procedure for selecting a subset of

users for our experiments.

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

154

5.1 Selecting Users for Evaluation

A ﬁle recommendation system is useful for active

users only. Therefore, we select a subset of users in a

share based on the number of their ﬁle activities. We

rank the users in a share in increasing order of the number of their file activities. For the purpose of evaluation, we then randomly sample 30 users from those users whose activity numbers are above the third quartile. In an actual deployment, all the active users must be considered.
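The selection of evaluation users can be sketched as follows. The `(user, path)` event format, the function name, the fixed random seed, and the "inclusive" quartile method are assumptions; the paper does not specify them.

```python
import random
from collections import Counter
from statistics import quantiles


def select_evaluation_users(events, sample_size=30, seed=0):
    """Randomly sample users whose activity count exceeds the third quartile."""
    activity = Counter(user for user, _ in events)
    q3 = quantiles(list(activity.values()), n=4, method="inclusive")[2]
    active = sorted(u for u, count in activity.items() if count > q3)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(active, min(sample_size, len(active)))
```

If fewer than `sample_size` users lie above the third quartile, all of them are returned.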

5.2 Evaluation Metrics

For a user u and a testing ﬁle f, the true label is 1 if u

accessed f in the testing period, and 0 otherwise. The

model for u is used to predict the label of the testing

file. Let $F_{true+,u}$ be the set of test files that are actually accessed by $u$ in the testing period, and $F_{pred+,u}$ the set of test files that are predicted to be accessed by $u$. We use precision and recall, which are commonly used metrics to evaluate classification tasks. For our problem, the precision and recall are

$$\frac{|F_{true+,u} \cap F_{pred+,u}|}{|F_{pred+,u}|} \quad \text{and} \quad \frac{|F_{true+,u} \cap F_{pred+,u}|}{|F_{true+,u}|}$$

respectively. For a file recommendation application, while having high recall is definitely useful, having high precision is essential for usability. If most of the recommendations (i.e., $F_{pred+,u}$) are wrong, the end user will simply ignore the recommendations. Considering this, we use the following metrics for the evaluation.

• F-score. F-score is calculated as the harmonic

mean of precision and recall and thus provides a

balanced picture of the overall predictions. We

use the F-score averaged across the evaluation

users (AF) as one of the metrics to discuss the re-

sults.

• Recall@75P. We also evaluate the ﬁle access

modeling by keeping the precision ﬁxed to a high

value. Using the conﬁdence score for each pre-

diction, we reduce the number of positive predic-

tions until the precision is 75%, and use the recall

at this precision as a performance metric. Thus

recall at 75% precision provides the fraction of a

user’s actual ﬁle accesses that a model was able

to correctly predict while ensuring that no more than 25% of the model's positive predictions are absent from the user's activities. AR@75P is its averaged value across all the evaluation users.
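Both metrics can be sketched for a single user as below; averaging the returned values over the evaluation users gives AF and AR@75P. The function names and the `(file, confidence)` input format are assumptions. Recall@75P is computed here by ranking the positive predictions by confidence and keeping the largest top-ranked subset whose precision is at least 75%, which is one reading of "reduce the number of positive predictions until the precision is 75%".

```python
def precision_recall_f(true_files, predicted_files):
    """Precision, recall and F-score for one user's test files (sets)."""
    hits = len(true_files & predicted_files)
    precision = hits / len(predicted_files) if predicted_files else 0.0
    recall = hits / len(true_files) if true_files else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def recall_at_precision(scored_predictions, true_files, target=0.75):
    """Recall at the largest confidence-ranked cut whose precision >= target.

    scored_predictions is a list of (file, confidence) positive predictions.
    Returns 0.0 if no cut reaches the target precision.
    """
    if not true_files:
        return 0.0
    ranked = sorted(scored_predictions, key=lambda fc: fc[1], reverse=True)
    best_recall = 0.0
    hits = 0
    for k, (path, _) in enumerate(ranked, start=1):
        hits += path in true_files
        if hits / k >= target:
            # Hits never decrease, so a later qualifying cut has higher recall.
            best_recall = hits / len(true_files)
    return best_recall
```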

5.3 Varying Training and Testing Periods

We perform evaluation for several combinations of

training and testing periods, with varying time dura-

tion and separation between them. For convenience,

we divide the entire duration of each share into 7

equal size slices as shown in Figure 2. For our evalu-

ation, we experiment with varying lengths of training

periods. For this, we ﬁx the last 2 slices for testing,

and use the ﬁrst 5 slices for training, while ensuring

that the testing period starts right after the training pe-

riod (See Figure 2(a)). Similarly, we also experiment

with varying lengths of testing periods as shown in

Figure 2(b).
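The varying training periods can be sketched as follows, assuming that each training period ends where the fixed testing period begins and that index 1 denotes the longest period. The function name and the numeric time representation are hypothetical. The varying testing periods are not sketched because the text does not describe their exact boundaries.

```python
def training_periods(start, end, n_slices=7, n_test=2):
    """Split [start, end) into equal slices; the last n_test form the test period.

    Returns (training periods ordered longest first, test period), each
    as a (begin, end) pair in the same units as start and end.
    """
    width = (end - start) / n_slices
    bounds = [start + i * width for i in range(n_slices + 1)]
    test_begin = bounds[n_slices - n_test]
    test = (test_begin, bounds[n_slices])
    # Every training period ends right where the test period starts.
    trains = [(bounds[i], test_begin) for i in range(n_slices - n_test)]
    return trains, test
```

For a share sampled over 70 days, this yields training windows of 50, 40, 30, 20 and 10 days, all followed by the same 20-day test window.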

Figure 2: Splitting dataset into various training and testing periods. (a) Vary training periods: fixed testing period with training periods Train 1, Train 2, ..., Train 5. (b) Vary testing periods: fixed training period with testing periods Test 1, Test 2, ..., Test 5.

5.4 Selecting Classiﬁcation Model

We experimented with different classiﬁcation models.

Table 4 compares the performance of a few models

using the metadata features for share A. It provides

model effectiveness in terms of Avg AF score, which

is the average of F-score across 30 evaluation users,

and across different training and testing periods, as

outlined in Section 5.3. The regularization coefficient C for SVM models is obtained by logarithmic grid search over $\{10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}\}$. The gamma parameter for polynomial kernel SVMs is obtained in a similar manner. The best parameters as obtained by three-fold cross-validation are finally used for training. Also, L2 regularization is used.

The table also provides the total model training

time across all training periods and evaluation users.

The time is measured as wall-clock time on a 32-core, 2.6 GHz machine with 64 GB of memory. The system is implemented using scikit-learn (scikit-learn, 2015), a Python library for machine learning. Our implementation uses multiprocessing to speed up the overall training.

AccessPredictionforKnowledgeWorkersinEnterpriseDataRepositories

155

Figure 3: AF for the metadata models with the fixed testing and varying training periods. (Plot: x-axis, training period index 1 to 5; y-axis, AF (%).)

Figure 4: AR@75P for the metadata models with the fixed testing and varying training periods. (Plot: x-axis, training period index 1 to 5; y-axis, AR@75P (%).)

Based on the results, we pick Linear SVM for our

modeling because it provides the best trade-off be-

tween effectiveness and training time. Moreover, Lin-

ear SVM also provides learned feature weights, which

are very useful for understanding the signiﬁcance of

features (see Section 6).

Table 4: Performance comparison between different machine learning algorithms.

Metric             Linear SVM  Polynomial SVM  Multinomial  Decision
                               degree 2        Naive Bayes  Tree
Avg AF             80.7        83.0            25.0         77.6
Train time (mins)  31          236             6.2          78

5.5 Metadata Modeling

Figures 3 and 4 respectively show how AF and

AR@75P for metadata models are affected by vary-

ing length of training periods. The longest training

period, the one corresponding to index 1, leads to

the best performance for most shares. We observe

moderate degradation in performance as the training

window shrinks. This suggests that we need a sufﬁ-

ciently long training window for better performance.

It should be noted that the variations are observed

over different durations of the datasets, ranging from

57 to 123 days (Table 1). It can be expected that

Figure 5: AF for the metadata models with the fixed training and varying test periods. (Plot: x-axis, testing period index 1 to 5; y-axis, AF (%).)

Figure 6: AR@75P for the metadata models with the fixed training and varying test periods. (Plot: x-axis, testing period index 1 to 5; y-axis, AR@75P (%).)

as the window of the training period gets longer, after

a point the model performance would decrease. This

is because the model may give more importance to

access patterns that are outdated with respect to the

content that the user is recently accessing. We do

not provide a recommended or optimal training pe-

riod because that would depend on several factors in-

cluding the number of users, their workload, and the

rate of change of access patterns. Nonetheless, in Section 6.6, we discuss the potential to learn such configuration parameters based on online model evaluation.

Figures 5 and 6 show similar results but for dif-

ferent testing periods. These results show that beyond

testing periods with indices 1 and 2, which are small

(Table 2) and thus potentially noisy, the model performance drops mildly as the length of separation between training and testing periods increases. The mild drop shows that our models sustain their predictive ability over time.

Table 5 summarizes the metadata model results

across all the above combinations using the average

and the maximum values of AF and AR@75P. AF

averaged across all training and test periods is provided as Avg AF. Max AF shows the best AF achieved, which is indicative of the realistic performance of a properly tuned file recommender system. The high

AF and AR@75P values seen for most shares demon-

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

156

Table 5: Performance summary of the metadata models. Performance numbers are averaged over 5 iterations with random initialization of the Linear SVM model training. Numbers are listed along with the standard deviations.

Share  Avg AF    Max AF    Avg AR@75P  Max AR@75P  % TP others
A      80.7±0.0  91.1±0.0  77.6±0.0    93.9±0.0    20.4
B      46.4±0.0  51.2±0.0  34.9±0.0    44.2±0.0    73.1
C      23.1±0.0  24.2±0.0  11.4±0.0    12.4±0.0    87.7
D      30.7±0.0  36.3±0.0  19.0±0.0    24.7±0.0    82.8
E      26.9±0.0  35.0±0.0  17.9±0.0    24.1±0.0    66.0
F      81.5±0.0  84.3±0.0  82.9±0.0    84.6±0.0    9.8
G      44.8±0.0  51.3±0.0  50.6±0.0    58.3±0.0    99.9
H      47.6±0.0  50.2±0.0  49.9±0.0    53.0±0.0    74.4

strate practicality of our metadata models. The sub-

stantial variation in the performance across different

shares reﬂects differences in their characteristics such

as rate of activity, rate of change in user preferences,

and collaboration.

The last column of Table 5 shows that, for most shares, the majority of the files correctly recommended to a user were not created by that user in the testing period. This is desirable because recommending a newly created file to the user who created it is futile.

5.6 Collaborative Filtering Aware Modeling

Table 6 provides a summary of the performance of

our CF aware models. On comparing Table 6 with

Table 5, we observe that CF aware models provide

substantial performance improvement over metadata

models for most shares. As discussed in Section 4.1, the CF aware models are expected to benefit shares with a high degree of collaboration between users. This is confirmed by the fact that shares B, D, E, and G, which show the most improvement, are among the top five shares in terms of normalized triangle counts (Table 1). Although share A is also among the top five shares, the performance of its metadata models is already too high to show significant improvement.
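As a worked check of this comparison, the following minimal sketch computes the per-share gain in Avg AF when moving from the metadata models (Table 5) to the CF aware models (Table 6). The values are copied from the two tables; the variable and function names are illustrative only and are not part of the evaluated system.

```python
# Avg AF (%) per share, copied from Table 5 (metadata) and Table 6 (CF aware).
AVG_AF_METADATA = {"A": 80.7, "B": 46.4, "C": 23.1, "D": 30.7,
                   "E": 26.9, "F": 81.5, "G": 44.8, "H": 47.6}
AVG_AF_CF_AWARE = {"A": 80.7, "B": 48.6, "C": 23.5, "D": 32.3,
                   "E": 27.6, "F": 81.5, "G": 55.5, "H": 47.6}


def avg_af_gain(metadata, cf_aware):
    """Return {share: gain in percentage points}, rounded to one decimal."""
    return {s: round(cf_aware[s] - metadata[s], 1) for s in metadata}


gains = avg_af_gain(AVG_AF_METADATA, AVG_AF_CF_AWARE)
# Shares ordered from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)
top_four = set(ranked[:4])

for share in ranked:
    print(f"share {share}: {gains[share]:+.1f} points")
```

Under this measure, shares G, B, D, and E show the largest gains, matching the shares named in the text, while the Avg AF of shares A, F, and H is unchanged.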

In the next section, we discuss the contribution of

different types of features to the performance results.

We also discuss scalability issues, and considerations

for real world deployment.

6 DISCUSSION

Table 6: Performance summary of the CF aware models. Performance numbers are averaged over 5 iterations with random initialization of the Linear SVM model training. Numbers are listed along with the standard deviations.

Share   Avg AF      Max AF       Avg AR@75P   Max AR@75P   % TP others
A       80.7±0.0    100.0±0.0    78.5±0.0     100.0±0.0    21.2
B       48.6±0.0    77.1±0.0     32.2±0.6     62.1±0.0     73.1
C       23.5±0.0    36.0±0.0     10.1±0.0     16.0±0.0     87.5
D       32.3±0.0    47.5±0.0     21.5±0.0     38.3±0.0     83.3
E       27.6±0.0    41.1±0.0     25.3±0.0     100.0±0.0    65.7
F       81.5±0.0    87.9±0.0     83.3±0.0     89.2±0.0     9.8
G       55.5±0.1    76.2±0.0     58.0±0.2     89.9±0.4     96.6
H       47.6±0.0    57.2±0.0     49.6±0.0     66.5±0.0     75.4

Given the features in our model, which were most significant in the trained models? To perform this analysis, for each user, we train a Linear SVM for the CF model (Section 4.1) on each of the six training periods outlined in Section 5.3. The significance of a feature with respect to a trained user model can be obtained from the absolute weight given to the feature in the model. For each user model, we select the ten most significant features. Table 7 shows the proportions of the different feature types among the top features per user model, aggregated across evaluation users, shares, and training periods. Below, we provide insights about user file activity based on how the models leveraged each feature type.

Table 7: Analysis of features with respect to feature types. These numbers are obtained by aggregating the weights per feature in collaborative filtering aware user models for different training periods as described in Figure 2 and for the eight shares used for evaluation.

Feature type   Percentage of feature type in top 10 model features   Features of the type (% of all features)
Folder         10.4%                                                 82.0%
Token          47.9%                                                 14.8%
Extension      6.3%                                                  0.1%
User           35.4%                                                 3.1%
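The ranking step of this analysis can be sketched as follows, assuming each trained user model exposes one weight per feature (as a linear SVM does) and that a type label is known for every feature. The function name, the toy weights, and the type labels are hypothetical; aggregating over users, shares, and training periods would amount to summing the per-model counts.

```python
from collections import Counter


def top_feature_type_shares(weights, feature_types, k=10):
    """Share of each feature type among the k largest-|weight| features.

    weights       -- per-feature weights of one trained linear model
    feature_types -- type label for each feature, same order as weights
    """
    # Rank feature indices by absolute weight, largest first.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    top = ranked[:k]
    counts = Counter(feature_types[i] for i in top)
    return {ftype: n / len(top) for ftype, n in counts.items()}


# Toy model with six features (weights are made up for illustration).
weights = [0.9, -1.4, 0.05, 0.3, -0.02, 0.7]
types = ["token", "extension", "folder", "user", "folder", "token"]
shares = top_feature_type_shares(weights, types, k=3)
print(shares)
```

In this toy example the three largest absolute weights belong to one extension feature and two token features, so folder and user features do not appear among the top features at all.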

6.1 Folder Feature Analysis

Despite the fact that folder features accounted for

more than 80% of the feature space, only 10.4% of

top features were drawn from this category by the per-

sonalized models. Folders within three levels of the

root account for more than 80% of top ten folder fea-

tures. This makes intuitive sense because folders that

are farther from the root are intrinsically sparser, and

our models apply regularization, which discourages

applying signiﬁcant weights to sparser features when

more frequent and predictive features are present. Despite their proximity to the root, these “shallow” folders still wield significant predictive power. Interestingly, we found that no file in our test set was an immediate descendant (i.e., at a folder distance of zero) of any folder in the top ten folder features. Files in the test set were at least one folder away from the folders constituting the folder features. In fact, the data shows that more than a third of the files active in the testing period were three or more folders below a folder feature in the top ten features. Despite the distance, the ancestral folder still provides considerable predictive value.
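To make the notion of folder distance concrete, the sketch below counts how many folders lie between a folder feature and a file, so that a file stored directly in the folder has distance zero. The function name and the example paths are hypothetical, and POSIX-style paths are assumed purely for illustration.

```python
from pathlib import PurePosixPath


def folder_distance(file_path, folder):
    """Number of intermediate folders between `folder` and the file.

    Returns 0 when the file sits directly in `folder`, and None when
    `folder` is not an ancestor of the file.
    """
    file_parent = PurePosixPath(file_path).parent
    try:
        relative = file_parent.relative_to(PurePosixPath(folder))
    except ValueError:
        return None  # the file is not located under this folder
    return len(relative.parts)


print(folder_distance("/share/eng/report.doc", "/share/eng"))
print(folder_distance("/share/eng/proj/2015/q1/report.doc", "/share/eng"))
print(folder_distance("/share/hr/policy.doc", "/share/eng"))
```

With this definition, the observation above corresponds to test files having a distance of at least one from every top ten folder feature, and more than a third of them having a distance of three or more.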

6.2 Token Features

To preserve the privacy of individuals, groups, and organizations, we can only discuss trends observed in the tokens that were highly influential in classification. We noticed that tokens originated from the applications used to generate the data: they referred either to the application name or to application-generated unique prefixes or suffixes in folder or file paths. Additionally, paths contained the names of groups within the organization, which likely helps members of a group identify which subtrees of the file system hierarchy contain data integral to their role. Lastly, tokens referring to the month, year, and content type were important in many models. As expected, the timestamps of file activity aligned with the month and year referred to in the path. This circumstantially corroborates the importance of temporal information in our model.

6.3 Extension Features

Analyzing the sign of the weights for extension features yields an interesting observation: more than 85% of the extension-feature weights in the top ten features were negative. Extensions can yield insight into the type of content and/or the application that generated it (which may be used to infer the role or function of users). A predominantly negative weight indicates that the user is quite unlikely to use the application that generated such a file, which could imply something about the nature of their role, e.g., what it is not. It is possible that, for a model that applies regularization, extensions that are strongly anticorrelated with the user's activity serve as strong indicators and contribute substantially to minimizing the penalty attributed to the model. This would explain why, even though this feature category accounts for only 0.1% of all features, it still accounts for 6.3% of the top ten features.
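The degree to which a feature type is over- or under-represented among the top features can be quantified with a simple ratio of the two columns of Table 7. The sketch below uses the percentages from that table; the ratio itself is an illustrative measure introduced here, not one reported in the evaluation.

```python
# Percentages copied from Table 7:
# (share of top ten features, share of all features).
TABLE_7 = {
    "Folder": (10.4, 82.0),
    "Token": (47.9, 14.8),
    "Extension": (6.3, 0.1),
    "User": (35.4, 3.1),
}


def representation_ratio(top_share, overall_share):
    """Ratio > 1 means the type is over-represented among top features."""
    return top_share / overall_share


ratios = {name: representation_ratio(*pair) for name, pair in TABLE_7.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f}x")
```

By this measure, extension features are over-represented by a factor of about 63, user features by about 11, and token features by about 3, whereas folder features are under-represented (a ratio of roughly 0.13).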

6.4 User Feature Analysis

Providing the model with the likelihood that another user will access a file allows the classification model to achieve collaborative filtering by accounting for other users' preferences. Notably, 35.4% of the top ten features for the personalized models are the probabilities of a user accessing the file, even though these account for only 3.1% of total features. The weights for user features where the user is not the same as the user for whom the personalized model is being built were negative 32% of the time and positive 68% of the time. A negative weight on another user's file access likelihood is an interesting signal that the two users have different file access patterns, and we should not expect a significant intersection in the files they access. A positive weight, on the other hand, indicates that the user for whom the model is trained and the other user have a significant interest in the same type of files. Interestingly, particular users appeared as top ten features for many different user models in the same share. In fact, we observed that the same user appeared as a top ten feature in five of the thirty models 20% of the time, and as a top ten feature in ten of the thirty models 9% of the time. This suggests that there may be particular users whose file access patterns serve as exemplars for how users access resources.
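The way user features enter a personalized model can be illustrated with a minimal sketch in which the predicted access likelihoods of the users in a share are appended to a file's metadata features. The function name, the fixed user ordering, and the default of 0.0 for users without a prediction are assumptions made for this example; the exact construction used by the system is the one described in Section 4.1.

```python
def cf_aware_features(metadata_features, access_probs, user_order):
    """Concatenate metadata features with per-user access likelihoods.

    metadata_features -- feature values derived from the file's metadata
    access_probs      -- {user: predicted probability of accessing the file}
    user_order        -- fixed ordering of users, so that every file maps
                         each user to the same feature position
    """
    # Assumption: a user without a prediction contributes 0.0.
    user_features = [access_probs.get(user, 0.0) for user in user_order]
    return list(metadata_features) + user_features


users = ["alice", "bob", "carol"]  # hypothetical users of one share
vector = cf_aware_features([1, 0, 1], {"alice": 0.9, "bob": 0.1}, users)
print(vector)
```

A classifier trained on such vectors can then assign a positive or negative weight to each user position, which is the quantity analyzed above.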

6.5 Scalability

As files are generated or modified on a share, our system needs to apply the personalized model of each user to make predictions. Therefore, a high rate of file operations or a large number of users will adversely affect the scalability of our system. We can address these factors by optimizing the testing time, as discussed in Section 8. Moreover, a recommendation system may not be essential for all file servers. For instance, it would not make much sense to deploy our system on a networked home directory or on a backup server. Likewise, it may not make much sense to provide file recommendations to low-volume file users; for them, it is better to rely on enterprise search when they do need to find information. It may be prudent to train a model only for the users that are determined to be sufficiently active in a share. Furthermore, for shares that do not demonstrate a high degree of user collaboration, training and using CF aware models may not be advisable, since they are computationally much more expensive than metadata based models.
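One simple way to restrict training to sufficiently active users is to threshold on the number of accesses observed in the training period. The sketch below assumes an access log of (user, path) pairs and a hypothetical threshold; what counts as sufficient activity is a deployment decision that is not fixed here.

```python
from collections import Counter


def active_users(access_log, min_accesses):
    """Users with at least `min_accesses` logged file accesses.

    access_log   -- iterable of (user, path) pairs from the training period
    min_accesses -- hypothetical activity threshold chosen by the operator
    """
    counts = Counter(user for user, _path in access_log)
    return sorted(user for user, n in counts.items() if n >= min_accesses)


log = [("alice", "/eng/a.doc"), ("alice", "/eng/b.doc"), ("bob", "/eng/a.doc")]
selected = active_users(log, min_accesses=2)
print(selected)
```

Since every new or modified file must be scored once per trained model, reducing the set of modeled users reduces the prediction cost proportionally.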

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

158

6.6 Considerations for Real World Deployment

In this section, we discuss the considerations for de-

ploying the proposed system in an actual enterprise

environment.

In this paper, we validate our system against file activities that occurred in the past. In an actual deployment, by contrast, the system can be evaluated online. This will help in monitoring workload characteristics and in measuring effectiveness more accurately and continuously. The resulting measurements can be used as feedback to tune the models and make them adaptive. For instance, parameters such as the length of the training window and the frequency of retraining can be tuned based on this feedback.

The precision reported by our evaluation provides a lower bound on the precision that may be observed in an actual deployment. To understand this, consider a user who is recommended a file that he/she was not aware of and who ends up accessing the file as a result. In a deployment this would be a true positive, whereas our current evaluation counts it as a false positive.
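The lower-bound argument can be illustrated with a small worked example. The file names and access sets below are made up, and the sketch assumes that showing a recommendation can only add accesses and never suppresses one that would have happened anyway.

```python
def precision(recommended, accessed):
    """Fraction of recommended files that the user actually accessed."""
    recommended = set(recommended)
    if not recommended:
        return 0.0
    return len(recommended & set(accessed)) / len(recommended)


recommended = {"f1", "f2", "f3", "f4"}

# Offline evaluation: only accesses found in the historical log count.
historical_accesses = {"f1", "f2", "f3"}
offline = precision(recommended, historical_accesses)

# Deployment: the user discovers f4 through the recommendation and opens it.
deployed_accesses = historical_accesses | {"f4"}
online = precision(recommended, deployed_accesses)

print(offline, online)
```

Here the offline precision is 0.75, while the precision observed in deployment rises to 1.0 because the previously unknown file becomes a true positive.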

Lastly, since a recommendation system can inter-

act with users, it may be possible to obtain subjective

evaluations of the recommendations such as recom-

mendation quality.

7 RELATED WORK

Prior work on modeling file access patterns has mostly focused on performance enhancement of storage systems, e.g., reducing I/O latency by prefetching (Amer et al., 2002; Xia et al., 2008; Kroeger and Long, 2001; Yeh et al., 2002; Yeh et al., 2001a; Yeh et al., 2001b; Whittle et al., 2003; Paris et al., 2003). These systems make predictions only for existing files, whereas our approach can also make predictions for newly created files. Moreover, our focus is on recommendation rather than caching.

The approach by Song et al. (Song et al., 2014) is

closer to our work since it aims at assisting knowledge

workers by recommending ﬁles and actions. It uses

a data mining technique to ﬁrst group similar ﬁles

into abstract tasks, and then mines frequent sequences

of abstract tasks into workﬂows. It then makes rec-

ommendations by identifying the workﬂow that best

matches the current ﬁle usage pattern of the user.

While the technique attempts to generalize beyond

exact ﬁle matching, it cannot provide recommenda-

tions for new ﬁles. In contrast, we train personalized

machine learning models that provide recommenda-

tions even for new ﬁles. For evaluation of our ap-

proach, we use only those test ﬁles that are new with

respect to the training ﬁles. The ability to recommend

new content is important in order to connect knowl-

edge workers with new data which is being generated

at a tremendous rate (Gantz and Reinsel, 2012).

Unlike previous work, our models use much richer information, including the filename, path, file system hierarchy, extensions, and collaborative filtering. As a result, our work can also supplement existing Data Governance systems with predictive capabilities. While our approach is not ideal for

ﬁle caching in performance sensitive applications, it

could be effective in cloud services to reduce network

latency by caching ﬁles on client-facing web servers

or directly on clients. It could also be useful for sce-

narios with intermittent connectivity, such as choos-

ing ﬁles to cache on mobile devices.

The personalized model based file recommendation proposed in our paper is a content-based recommendation system. Compared to traditional collaborative filtering based recommender systems (Linden et al., 2003), our approach does not suffer from the cold start problem, i.e., the inability to recommend a new item (file). It should be noted, however, that unlike traditional collaborative filtering techniques, we do not use actual access information. Rather, we predict the access likelihood of a user for a particular test item and combine it with metadata features. This enables us to circumvent the cold start problem while still benefiting from collaborative filtering.

Finally, advanced machine learning models such as factorization machines (Rendle, 2010), deep neural networks (Salakhutdinov et al., 2007; Hinton et al., 2006), and topic models (Nagori and Aghila, 2011) can also be employed for modeling file access predictions. To a large extent, these techniques are complementary and can contribute to making our models more effective. Nevertheless, we approach the problem as a classification problem and show reasonable effectiveness even with a simple Linear SVM-based model. Our focus is more on the domain specific application, with the goal of extracting meaningful features from file metadata and user activities.

8 FUTURE WORK

There are several directions in which the proposed

system can be extended to improve both its efﬁciency

and efﬁcacy, and to make it applicable to new and

emerging scenarios.

Optimization techniques can be developed to make model testing much faster by judiciously selecting the user models that need to be applied to a new file. Such techniques may be able to trade off model correctness for testing speed in some scenarios.

The metadata features show a high degree of sparsity as a result of how they are constructed. A file typically has very few keywords in its path, and thus most of its token features are zero. Similarly, very few of its folder features, and at most one of its extension features, are non-zero. While sparsity can be helpful for training user models (Ngiam et al., 2011), the large dimensionality of the data may negatively affect the performance of the models. It is possible that the correctness and speed of the proposed system can be further improved by capturing the interdependence between different features through dimensionality reduction techniques such as Principal Component Analysis (Jolliffe, 2005; Van der Maaten et al., 2009). For example, the folder features demonstrate substantial interdependence and redundancy, and techniques to transform them to a suitable space may be explored.
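The sparsity described above means that a file can be represented by the small set of features that are non-zero for it, and a linear model can be evaluated by touching only those features. The following sketch illustrates this with hypothetical feature names, weights, and a zero bias; it is not the representation used in the reported experiments.

```python
def linear_score(active_features, weights, bias=0.0):
    """Score a file from its non-zero binary features only.

    active_features -- names of the features that are 1 for this file
    weights         -- {feature name: weight} of one trained linear model;
                       features absent from the model contribute nothing
    """
    return bias + sum(weights.get(name, 0.0) for name in active_features)


# Hypothetical model weights and the active features of one new file.
model_weights = {"folder:/eng": 0.8, "token:report": 0.5, "ext:tmp": -1.2}
file_features = {"folder:/eng", "token:report", "ext:docx"}
score = linear_score(file_features, model_weights)
print(score)
```

The cost of scoring is proportional to the handful of active features of the file rather than to the full dimensionality of the feature space.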

Modeling the file metadata and user features in the context of the temporal nature of file accesses could also be a direction for further work. For example, giving more importance to recent events while training user models may accommodate shifts in user interests, leading to improved performance. Identifying and modeling repetitive activity may also be informative, since users may be interested in similar tasks at fixed time intervals, such as on the same day each week. As mentioned in Section 6.6, deployment of the proposed system in an enterprise environment offers an online framework to evaluate the trained models. Online model training or update techniques can be developed that utilize the model evaluation information to improve the trained models by adapting them to new access patterns or newly observed features. For example, consider a scenario where a trained model is seen to perform poorly because most of the recent activity for a user is confined to a recently created folder that was not part of the folder features in the trained model. Such information can be derived from the online evaluation and used to update the features of the trained model and adapt it to reflect the updated access patterns.
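One possible way to give more importance to recent events is to weight each training example by an exponentially decaying function of its age. The sketch below is only an illustration of that idea: the exponential form and the 30-day half-life are assumptions, not choices made or evaluated in this paper.

```python
def recency_weight(age_days, half_life_days=30.0):
    """Weight of a training event that occurred `age_days` ago.

    The weight halves every `half_life_days` (a hypothetical setting).
    """
    return 0.5 ** (age_days / half_life_days)


# Ages (in days) of some past file access events, most recent first.
event_ages = [0, 7, 30, 90]
sample_weights = [recency_weight(age) for age in event_ages]
print(sample_weights)
```

Weights of this kind could be supplied as per-example weights to a learner that supports them, so that older access patterns influence the trained model less than recent ones.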

In addition to training personalized user models, insight into the directed preferences of users may be useful for recommending content. For example, if it is determined that user u_i often accesses documents created by user u_j, then a recent modification by u_j may be useful information for u_i and can be used as an indicator to recommend relevant content.
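Directed preferences of this kind could be estimated by counting, for every ordered pair of users, how often the first accesses files created by the second. The sketch below assumes an access log of (user, path) pairs and a mapping from each path to its creator; both data structures and all names are hypothetical.

```python
from collections import Counter


def directed_preferences(access_log, creators):
    """Count accesses by one user to files created by another user.

    access_log -- iterable of (accessor, path) pairs
    creators   -- {path: user who created the file}
    Returns a Counter keyed by (accessor, creator).
    """
    counts = Counter()
    for accessor, path in access_log:
        creator = creators.get(path)
        # Skip files with unknown creators and users' own files.
        if creator is not None and creator != accessor:
            counts[(accessor, creator)] += 1
    return counts


creators = {"/a.doc": "u2", "/b.doc": "u2", "/c.doc": "u1"}
log = [("u1", "/a.doc"), ("u1", "/b.doc"), ("u2", "/a.doc"), ("u1", "/c.doc")]
prefs = directed_preferences(log, creators)
print(prefs.most_common(1))
```

A high count for a pair (u_i, u_j) would then suggest that new or modified content from u_j is a candidate recommendation for u_i.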

Lastly, ﬁle access prediction offers interesting

possibilities for applications such as information se-

curity, by offering new measures of access improba-

bility.

9 CONCLUSION

This paper presents a system that provides file recommendations to help knowledge workers process increasing volumes of data. The system utilizes natural language processing to derive usable information from file metadata, and machine learning to train personalized user models that have good predictive value, even for files that have not been observed in the past. Through extensive experiments on real world data, we demonstrate the feasibility of the system to offer high quality recommendations, which is reflected particularly in the significant recall at high precision across eight shares. We also show that for shares exhibiting a high degree of collaboration among their users, the predictions from different user models can be combined to improve the performance of an individual user's model. We observe that the trained models have high temporal longevity and experience only moderate performance degradation for short training periods. Since the system requires training personalized models for each user under consideration, it should be applied only to shares and users that display sufficient activity and are determined to be of interest.

REFERENCES

Active Directory (2015). Active directory. http://msdn.microsoft.com/en-us/library/bb742424.aspx.

Amer, A., Long, D. D. E., Paris, J.-F., and Burns, R. C.

(2002). File access prediction with adjustable accu-

racy. In International Performance Conference on

Computers and Communication (IPCCC).

Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empir-

ical analysis of predictive algorithms for collaborative

ﬁltering. In Conference on Uncertainty in artiﬁcial

intelligence.

Bybee, J. L. (1985). Morphology: A study of the relation be-

tween meaning and form, volume 9. John Benjamins

Publishing.

CamelCase (2015). Capitalization styles. http://msdn.microsoft.com/en-us/library/x2dbyw72%28v=vs.71%29.aspx.

coca (2008). The corpus of contemporary american english:

450 million words, 1990-present. Available online at

http://corpus.byu.edu/coca/.

Gantz, J. and Reinsel, D. (2012). The digital universe in

2020: Big data, bigger digital shadows, and biggest

growth in the far east. In IDC iView: IDC Analyze the

Future.

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

160

Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast

learning algorithm for deep belief nets. Neural com-

putation, 18(7):1527–1554.

IDG Enterprise (2014). Big data survey.

Jolliffe, I. (2005). Principal component analysis. Wiley

Online Library.

Kroeger, T. and Long, D. D. E. (2001). Design and imple-

mentation of a predictive ﬁle prefetching algorithm.

In USENIX Annual Technical Conference, pages 105–

118.

Leonardi, P. M., Huysman, M., and Steinﬁeld, C. (2013).

Enterprise social media: Deﬁnition, history, and

prospects for the study of social technologies in or-

ganizations. In Journal of Computer-Mediated Com-

munication.

Linden, G., Smith, B., and York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. Internet Computing, 7(1):76–80.

Nagori, R. and Aghila, G. (2011). LDA based integrated

document recommendation model for e-learning sys-

tems. In International Conference on Emerging

Trends in Networks and Computer Communications

(ETNCC).

Ngiam, J., Chen, Z., Bhaskar, S. A., Koh, P. W., and Ng,

A. Y. (2011). Sparse ﬁltering. In Advances in Neural

Information Processing Systems, pages 1125–1133.

Office365 (2015). Microsoft office 365. http://en.wikipedia.org/wiki/Office_365.

Ovsjanikov, M. and Chen, Y. (2010). Topic modeling for

personalized recommendation of volatile items. In

The European Conference on Machine Learning and

Principles and Practice of Knowledge Discovery in

Databases.

Paris, J.-F., Amer, A., and Long, D. D. E. (2003). A stochas-

tic approach to ﬁle access prediction. In International

Workshop on Storage Network Architecture and Par-

allel I/Os (SNAPI).

Rendle, S. (2010). Factorization machines. In IEEE Inter-

national Conference on Data Mining (ICDM).

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Re-

stricted boltzmann machines for collaborative ﬁlter-

ing. In ACM International Conference on Machine

Learning.

Salesforce (2015). Salesforce.com. http://www.salesforce.com/.

scikit-learn (2015). scikit-learn Machine Learning in

Python. http://scikit-learn.org/.

Song, Q., Kawabata, T., Ito, F., Watanabe, Y., and Yokota,

H. (2014). File and task abstraction in task workﬂow

patterns for ﬁle recommendation using ﬁle-access log.

In IEICE Transactions on Information and Systems.

Van der Maaten, L. J., Postma, E. O., and van den Herik,

H. J. (2009). Dimensionality reduction: A compara-

tive review. Journal of Machine Learning Research,

10(1-41):66–71.

Wang, C., Viswanathan, K., Choudur, L., Talwar, V., Sat-

terﬁeld, W., and Schwan, K. (2011). Statistical tech-

niques for online anomaly detection in data centers.

In IFIP/IEEE International Symposium on Integrated

Network Management, pages 385–392.

Whittle, G. A. S., Paris, J.-F., Amer, A., Long, D. D. E.,

and Burns, R. (2003). Using multiple predictors to

improve the accuracy of ﬁle access predictions. In

International Conference on Massive Storage Systems

and Technology (MSST), pages 230–240.

Xia, P., Feng, D., Jiang, H., Tian, L., and Wang, F. (2008). Farmer: A novel approach to file access correlation mining and evaluation reference model for optimizing peta-scale file systems performance. In The International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC).

Yeh, T., Long, D. D. E., and Brandt, S. A. (2001a). Per-

forming ﬁle prediction with a program-based succes-

sor model. In Modeling, Analysis and Simulation

of Computer and Telecommunication Systems (MAS-

COTS).

Yeh, T., Long, D. D. E., and Brandt, S. A. (2001b). Using

program and user information to improve ﬁle predic-

tion performance. In International Symposium on Per-

formance Analysis of Systems and Software (ISPASS).

Yeh, T., Long, D. D. E., and Brandt, S. A. (2002). Increas-

ing predictive accuracy by prefetching multiple pro-

gram and user speciﬁc ﬁles. In Annual International

Symposium on High Performance Computing Systems

and Application (HPCS).

AccessPredictionforKnowledgeWorkersinEnterpriseDataRepositories

161