ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH

Searching Large Intranets Effectively using Virtual Terms

Jan Kasprzak, Michal Brandejs, Mat

Cuhel and Toma

s Ob

siva

Faculty of Informatics, Masaryk University, Brno, Czech Republic

Keywords:

Full-text search, Intranet, Security, Access rights, Document permissions, Document-level security.

Abstract:

One of the toughest problems to solve when deploying an enterprise-wide full-text search system is to handle

the access rights of the documents and intranet web pages correctly and effectively. Post-processing the

results of general-purpose full-text search engine (ﬁltering out the documents inaccessible to the user who sent

the query) can be an expensive operation, especially in large collections of documents. We discuss various

approaches to this problem and propose a novel method which employs virtual tokens for encoding the access

rights directly into the search index. We then evaluate this approach in an intranet system with several millions

of documents and a complex set of access rights and access rules.

1 INTRODUCTION

Many enterprises have put effort in developing web-

based authenticated systems, covering many aspects

of internal processes, called intranet systems. Some

intranets provide not only access to various struc-

tured pieces of information (such as list of members

of some work group, or a list of occupied time slots of

a meeting room), but also unstructured text documents

(such as study materials for the university courses,

whitepapers, internal product documentation, etc.).

It is often difﬁcult for users to navigate inside

a large web-based intranet system. Full-text search

can provide an alternative (or even primary) means

of navigation (Zhu et al., 2007). The task of search-

ing the intranet system effectively has many prob-

lems which are completely different to the problems

of searching the publicly-available world-wide-web

(Hawking, 2004).

An important problem is to be able to follow the

access rights of the web pages and documents effec-

tively: the intranet systems often contain documents

and pages, which are not permitted to be read by ev-

ery authenticated user, so it is not possible to allow

them to be discovered using a full-text search sys-

tem. In this paper we propose a solution to the ac-

cess rights problem in the enterprise full-text search,

and we evaluate this approach on Masaryk University

Information System (IS MU, 2010), a large intranet

system for study administration.

The rest of this paper is organized as follows:

in Section 2 we discuss related work, in Section 3

we present an overview of general-purpose full-text

search engines, especially with respect to our pro-

posed method. In Section 4 we discuss access rights

management. The main contribution of this paper is

in Section 5, where we propose a novel method of

incorporating security features to the full-text search

system. We further evaluate this method on a real

world intranet system with tens of thousands users

and several milion documents and pages in Section

6. Finally, the Section 7 contains conclusions.

2 RELATED WORK

Several authors have attempted to deal with issues

concerning full-text search in intranet systems, where

the access rights to documents have to be taken into

account.

In (Bailey et al., 2006), the two architectures of

enterprise systems are presented. In the ﬁrst one the

search engine is not involved in security policies or

access rights: users communicate with a repository,

which forwards the query to the search engine, and

then ﬁlters out the documents inaccessible to the user.

In the second approach the search engine handles the

access control by querying an external service such

as LDAP. The cost of security checks in the second

approach grows linearly with the number of results

to check, leading to query times of 100 seconds when

Kasprzak J., Brandejs M.,

Cuhel M. and Obšiva

c T. (2010).

ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH - Searching Large Intranets Effectively using Virtual Terms.

In Proceedings of the 12th International Conference on Enterprise Information Systems - Databases and Information Systems Integration, pages 32-39

DOI: 10.5220/0002896900320039

 SciTePress

the number of candidate results reaches 1 million doc-

uments, even when caching the access control data.

In (Hurley, 2009), the author realizes that ﬁltering

out the results in the post-processing phase may have

negative impact on the search performance. A differ-

ent approach is presented, where the users’ creden-

tials are calculated for every user when the document

is being indexed. A per-user view of the index is cre-

ated, and subsequently used for evaluation of user’s

queries. To be able to maintain consistent copy of

access control lists, quick check to the directory ser-

vice is performed before each query. Two implemen-

tations are presented: PrivaSearch1, where each user

has speciﬁc view of the corpus, and PrivaSearch2,

where the membership in groups is also reﬂected. As

a result of this method each user is provided with a

separate index. Three types of document distributions

are discussed: disjoint, overlapping, and hierarchical.

The implementation is tested on data sets of 2,000 to

20,000 documents. For larger data sets, the author

suggests to distribute the index. We consider this ap-

proach (per-user indexes) completely feasible only for

a limited range of systems where users have access to

disjoint sets of documents (e.g. e-mail messages).

Google Search Appliance (GSA, 2009) is as a

widely known commercial black box solution for in-

tranet or website full-text search. It performs on-line

access control checks for each document from the re-

sult set. Standardized SAML (SAML, 2005) identity

provider (or specialized one for batch processing) has

to be present or implemented on the customer side.

The post-processing approaches have limitations

on large scale systems. For example, when there is

a thousand of higher-ranked documents matching the

query, but inaccessible to a given user, the system will

need to check the access rights to all of those docu-

ments before ﬁnally discovering a lower-ranked doc-

ument accessible to that user.

3 FULL TEXT SEARCH ENGINES

The full-text search systems are often constructed as a

software, which maintains the index of the documents

to be searched, and executes queries against this in-

dex. We refer to (Zobel and Moffat, 2006) for possi-

ble approaches for creating and using the index.

3.1 Index

For the method we propose, it is sufﬁcient to view

the full-text search engine as a maintainer and user of

an inverted index, consisting of lexicon of all words

from the set of documents, and mapping of each of

those words to the list of document IDs, listing the

documents which contain that word.

These lists are often sorted by the document ID,

allowing to store them differentially, and compressed

using e.g. Elias delta encoding (Elias, 1975). They

also might be accompanied by weights of the given

word in a particular document.

For evaluating multi-word proximity and phrase

queries, it may be useful to allow fast forward search-

ing of these lists, which can be done e.g. by accom-

panying larger lists with skip-lists for semi-direct ac-

cess.

The index also contains other data structures,

which are not relevant to this paper (forward index,

lists of word positions, etc.).

3.2 Query Format

The query itself is usually entered by user as a se-

quence of words, meaning “ﬁnd the documents which

contain all of those words, preferably near to each

other” (i.e. the implicit AND/NEAR operator). Many

search engines allow also the NOT operator, and some

of them support also the OR operator. The support for

exact phrase searches is common as well.

The OR operator—even when not available di-

rectly to the end user—is often used internally

for handling lemmatization, inﬂections, diacritics,

acronyms, and so on. For example, the query “Citro

cars” can be internally transformed as follows:

(citro

en OR citroen) AND (cars OR car)

For our purposes, it will be sufﬁcient when the

search engine can efﬁciently handle the queries in the

above format—i.e. the logical conjunction (AND) of

the logical disjunctions (OR). More precisely formu-

lated, the expected query format is the following:

(token

1,1

OR token

1,2

OR token

1,3

OR ···)

AND (token

2,1

OR token

2,2

OR token

2,3

OR ···)

AND (token

3,1

OR token

3,2

OR token

3,3

OR ···)

. (1)

4 ACCESS RIGHTS

The availability of a document (or, more generally,

object) to some user (i.e. subject) is often described by

a matrix with per-subject rows, per-object columns,

and cells containing the actual permissions settings

(a subset of “can read”, “can write”, “can delete”,

etc.). For the full-text search, only the read permis-

sion is signiﬁcant. Note, however, that some systems

ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH - Searching Large Intranets Effectively using Virtual Terms

(e.g. LDAP) can have a separate “can read” and “can

search” levels of access.

Storing the matrix as a whole is often impractical,

so the systems (be it web-based intranets or operating

systems) often store the parts of the matrix either as

rows, often called capabilities (e.g.: “user root can

read every ﬁle”, or “the teacher can read all of the

study materials of his own courses”), or as columns,

called access control lists (ACLs) in the Chapter 5 of

(Anderson, 2008)—e.g. “this document can be read

by members of group staff”, or “this document can

be read by all present and past students of the course

‘UNIX’”.

Some systems store the access rights information

explicitly (e.g. ﬁlesystems), while other systems have

some of the implicit rules hardcoded inside the system

(this is the case of many enterprise systems).

To reduce the size of the access rights matrix

(stored either as capabilities or ACLs), subjects are

often organized into groups (e.g. group staff, or “stu-

dents of the course ‘UNIX’” from the above exam-

ples).

For the approach we propose in Section 5 it is suf-

ﬁcient to be able to evaluate the following two lists:

• the list of groups a given user belongs to,

• and the list of groups which can read a given doc-

ument.

As a last resort, if no smarter method of grouping

is available, it is possible to put every user in his own

separate group.

5 PROPOSED METHOD

The high-level view of the proposed system architec-

ture can be seen in Figure 1. Some components are

present in general intranet systems as well:

SQL database stores structured data of the intranet

system. It is often encapsulated by a middleware

layer.

Document Storage. Large static text and multimedia

ﬁles are usually stored outside the database.

HTTP Front End. This is the system which the

user’s browser communicates with. It handles au-

thentication, parsing the HTTP requests, and com-

puting the resulting pages.

The following components are added for the pur-

pose of full-text search:

Crawler/indexer downloads the static documents

from the document storage, and either down-

loads the dynamically generated pages from the

HTTP server, or constructs them from data in the

database the same way as the HTTP server does.

The documents are then parsed and added to the

index.

Search Index. Contains the data structures described

in Section 3.1.

Search Server. Receives the pre-parsed queries from

the HTTP server, executes them using the pre-

computed index, and sends the results back to the

HTTP server, which then formats them to the user.

Note that the search server does not communicate

with the SQL database, so the pre-parsed query can

be executed using the search index as the only source

of information.

SQL database

Search indexSearch server

User

HTTP front end

Document storage

Crawler/indexer

Figure 1: Architecture of an intranet search system.

5.1 Virtual Tokens

We propose to enhance the search engine with sup-

port for what we call virtual tokens: virtual token is a

special word, added to the inverted index, as if it was

a part of the indexed document itself. The difference

between words an virtual tokens is the following:

• The virtual token has no position within the docu-

ment. It is thus excluded from either the proximity

or exact phrase searches.

• It has no weight, so the search results cannot be

sorted using the weight of a virtual token. Another

(non-virtual) token has to be present in a query.

• The virtual token in one document cannot be an

ordinary word in another document. This is easily

satisﬁed by adding a character which can never be

a part of a word. We use the colon (:) for this

purpose.

The above requirements allow us to handle the

virtual tokens inside the inverted index differently:

the list of document IDs for a virtual token does not

ICEIS 2010 - 12th International Conference on Enterprise Information Systems

need to include anything beyond the (differentially-

encoded and delta-compressed) document IDs them-

selves. No pointer to the list of positions and no

weight of a word in a given document is needed.

5.2 Access Rights Encoding

The access rights (be it ACLs or capabilities) are then

named and encoded as groups of users (e.g. “teachers

of the ‘UNIX’ course” or “students of the ‘Informat-

ics’ study programme”).

The permissions of a given document are repre-

sented as a set of groups which have the read ac-

cess to the given document. This set can be then

encoded into the inverted index as virtual tokens—

derived for example from the name or the num-

ber of a given group (e.g. “p:teachers UNIX”, or

“p:programme Informatics”; the “p:” preﬁx we re-

serve for virtual tokens describing the access rights).

For each group, members of which can read a given

document, we add a virtual token with the group iden-

tiﬁcation for this document to the inverted index.

When the document should be accessible with-

out authentication (i.e. when it is world-readable), we

add a reserved virtual token “p:noauth” to it. When

the document should be available to any authenticated

user, we add another virtual token “p:auth” to it.

5.3 Query Execution

The user’s query is parsed by the HTTP server to the

format (1). To limit the search results to the doc-

uments which the user has permission to read, the

HTTP server computes the list of all groups the user

belongs to (for non-authenticated user, the list con-

sists of the single group “noauth”). It then modi-

ﬁes the query by adding another OR-list the following

way:

The preparsed query

AND (p:group

OR p:group

OR ···) (2)

With this part added, the search server returns only

those documents matching the user’s query, which are

also readable by this user.

5.4 Expected Performance

The proposed system has three places, in which the

overall performance is affected by adding the access

rights:

Indexing the Document. This part of the system

runs in a batch mode, so the interactive latency

of the system is not affected. As for overall per-

formance, we expect the evaluation of the docu-

ment’s access rights to have negligible cost rela-

tive to e.g. tokenizing the text or converting the

document from formats like PDF or DOC.

Evaluating the Per-user Group List. This can be

either very fast or relatively expensive operation,

depending on the format of the source data. For

example, fetching the list of groups from the

UNIX /etc/group table or from LDAP can be

fast, while computing the list from the implicit

intranet data can be expensive (for example, for

study materials, a separate group for each course

and possibly semester has to be created, and the

list of courses the user has ever enrolled in should

be evaluated). Should this operation be expen-

sive, we can cache the resulting list of groups for

a given user for some time (a day, or possibly an

hour). We can even precompute the list on back-

ground after displaying the full-text search form

in order to improve the latency.

Executing the Qquery. The impact in this part can

potentially be very high: the search server needs

to process many and/or long additional lists of

document IDs for many virtual tokens. However,

with some optimizations, the search can theoreti-

cally be even faster than without the access rights.

For example, when the search server processes the

OR-list with the lowest number of matching doc-

uments ﬁrst, executing the query for a very fre-

quent word can be faster for users which can read

only a small number of documents. On the other

hand, when the query is executed on behalf of user

who can access large portion of the available doc-

ument base, we can optimize the query execution

by adding skip-lists to the large lists of document

IDs. The impact is further lowered by simpliﬁed

lists for virtual tokens, as described in Section 5.1.

The main advantage of this system is that it does

not require the access rights of the individual doc-

uments to be evaluated in the query results post-

processing. This helps especially when the search re-

sults contain lots of documents which are highly rel-

evant, but inaccessible to a given user. With our ap-

proach, we are able to ﬁlter out the inaccessible docu-

ments during the query execution phase the same way

multi-word queries work.

5.5 Drawbacks and Limitations

The system we have just described—like many other

computing systems—contains a tradeoff between the

efﬁciency and accuracy. We will try to identify and

ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH - Searching Large Intranets Effectively using Virtual Terms

describe the possible drawbacks of using such a sys-

tem:

The Permissions are “Cached” in the Index. This

means that the system can sometimes allow the

document to be found even after the access for

a given user has been revoked. From our point

of view the problem is not so critical: for urgent

cases (such as when the user can no longer

be trusted) the common approach is to revoke

the access to the intranet system as a whole,

so the full-text search system is also no longer

accessible.

For revoking the permissions for a given group,

the situation is the same as when removing a part

of the document in a new version of the docu-

ment: we do not expect the immediate reindex-

ing the document when modifying its contents, so

we should not expect the immediate revoking of

the rights as well. In our experience, it is feasi-

ble to invoke the reindexing of the modiﬁed doc-

uments frequently (e.g. every minute), so the de-

lay of propagation of modiﬁed access rights can

be kept relatively low. Also note that in this case

we are trying to deny a read access to something

which could have been readable for a long time

now.

Negative Access Rights. Some systems of access

rights can use the group membership also to lower

the permissions for a given user. For example, the

classical UNIX access rights allow the ﬁle to have

permissions set to rw----r--, which means that

the ﬁle is readable and writable by its owner, and

readable by everybody except the members of the

group the ﬁle belongs to. This can be mapped to

our approach only by a new virtual token/group

“all users except members of the group XY”.

Tree-like Rights are not Straightforward. It is also

possible for documents to be organized in a direc-

tory tree, in which not only the permissions of the

document itself are evaluated, but also the permis-

sions of the directories in the path from the doc-

ument to the root of the tree. The problem arises

when the subdirectory or a ﬁle has permissions,

which are not a subset of the permissions of the

upper directory. For example, the directory can be

readable by a group “students of the programme

‘Informatics’”, and the document in it can be read-

able by a group “students who have enrolled in the

course ‘UNIX’”. Since the ﬁrst group is not nec-

essarily a superset of the second one, we should

effectively provide the access only to the users

which are members of both of the above groups.

This also can be solved by creating another vir-

tual token/group “students of the programme ‘In-

formatics’, who also have enrolled in the course

‘UNIX’”. This approach requires additional work

when evaluating the list of groups for a given user,

but it can be implemented.

Despite the above drawbacks, we believe that our

approach is usable for many intranet systems, as it can

be faster than on-line veriﬁcation of the access rights.

5.6 Other Uses of Virtual Tokens

Virtual tokens as described in Section 5.1 can be used

for other purposes beyond access rights: the docu-

ment date can be encoded into virtual tokens, and then

user can specify the time range. For example, the fol-

lowing query can select the time span of 13 months

starting with January 2009:

The preparsed query AND (d:2009 OR d:201001)

It can also be used for improving the search engine

relevance: we can add weight to the documents con-

taining a given virtual token instead of ﬁltering them

out when the required virtual token is not present. It

is then possible to give preference to the study mate-

rials of courses attended by student, even when other

courses may have their study materials publicly ac-

cessible.

Advanced search forms can specify additional

search criteria like increased weights for documents

in a given subtree. Another possible use of virtual

tokens is in search log evaluation (e.g. ﬁnding terms

most searched by students of the ‘UNIX’ course).

6 CASE STUDY

We have tested our approach on a real-world intranet

system with relatively complicated system of access

rights.

6.1 About IS MU

The Masaryk University Information System (IS MU,

2010) is a study administration and e-learning system.

It has been developed since 1999. More than 130,000

users—present and past students and employees of

the university—can access the system. As of January

2010, the system is accessed by more than 30,000

unique users daily, and handles more than 2,000,000

authenticated HTTPS requests per day. The document

storage contains over 20,000,000 objects (documents

in various formats, images, multimedia data, theses,

e-mails, etc.).

ICEIS 2010 - 12th International Conference on Enterprise Information Systems

6.2 Access Rights System

The system of access rights in IS MU consists of the

explicit access rights, which can be set by the owner

of the document (e.g. “this document can be read by

the students of the course “UNIX’”), and implicit ac-

cess rights, which are mostly hardcoded in the appli-

cations or PL/SQL procedures. This includes things

like “the study materials of a given course are irrevo-

cably readable by all teachers of that course”, or “the

permissions of the discussion forum posts cannot be

set at all, the permissions are derived from the top-

level node (representing the forum itself)”, etc. There

are several types of explicit rules (based on the course,

study programme, workgroup membership of an em-

ployee, faculty, alumni status, type of study, etc.).

6.3 Testing Platform

We have implemented the access rights handling

method as described in Section 5 into the IS MU full-

text search system. The searcher and indexer run on a

single server SGI Altix XE (a rebranded SuperMicro)

with dual Xeon E5472 CPUs at 3.0 GHz (8 cores to-

tal), 64 GB RAM. The system runs 64-bit version of

Fedora Linux.

The tokenization of documents (including evalu-

ating the access rights) has been implemented in Perl,

creating the partial inverted index, merging indexes,

and searching were implemented in plain C with Judy

arrays (Judy, 2010) for storing associative data such

as lexicon. The HTTP front-end applications are writ-

ten in Perl: pre-parsing the queries including evalua-

tion and caching the per-user list of groups runs there.

The index is segmented into 8 parts of similar size

(per-document split), and during the query execution

each CPU core works on top of its own part of index.

The partial results are then merged together.

6.4 Users

We have decided to test the system on a subset of

users, who have used the system during October 2009

(October is one of the busiest months of a year for

the system, because of the start of semester). There

are 51,022 such users, which form 768,442 different

groups. Each user is in at least two groups (“all au-

thenticated users”, and a per-user group). The median

is 178 groups per user, mean is 210, and the maxi-

mum is 9,942 groups per user. The histogram of the

number of users with a particular number of groups

(without the long tail of users which are members of

more than 800 groups) is in Figure 2.

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

0 100 200 300 400 500 600 700 800

Number of users

Number of groups

Users and Groups

Figure 2: Number of users with a given number of groups.

The median of the time for evaluating the per-

user list of groups was 0.028 s, mean 0.048 s, and

maximum was 1.415 s (we have took the minimum

from ﬁve runs). The tests were run on a production

database without exclusive access to the computing

resources.

6.5 Documents

As a testing set we have used 1,370,200 documents

from various agendas, both static (e.g. theses and

study materials), and dynamically generated (e.g. per-

sonal home pages). The documents are accessible by

60,493 different groups (this number is much smaller

than the total number of groups for all users, be-

cause e.g. some courses contain no study materials

at all). The largest number of documents a single

group can access is 1,129,995. Another important

groups are “non-authenticated users” which is the

third with 398,391 documents, and “all authenticated

users” pseudo-group (14th, 107,520 documents).

Further optimizations of the group memberships

in IS MU are possible, but are outside of the scope of

this article.

The performance of the indexer depends highly on

the document size. For example, for theses, the sys-

tem can tokenize 9 documents per second on average,

while for the rest of the testing set (including the study

materials) the speed is around 80 documents per sec-

ond. The tokenizing and generating the inverted index

of the whole document set takes slightly more than 6

hours. This time does not include converting various

foreign formats (DOC, PDF) to plain text.

The resulting index occupies about 18.3 GB. The

data relevant for our test (lexicons and document lists

for all words) occupy 2.7 GB. The biggest part of

the index is forward index for generating document

snippets (7.8 GB) and word position lists for phrase

searches (7.7 GB).

The overhead of adding the virtual tokens is in

the lexicon, where it adds 60,493 more words to the

16,868,473 words which were already present, and in

the lists of document IDs for those words, where it

costs additional 8,449,607 list entries to the already

ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH - Searching Large Intranets Effectively using Virtual Terms

existing 932,424,699 entries. We can say the index

size overhead is less than 1 %.

6.6 Search Performance

In this section we evaluate the performance of the

search server as described in Figure 1. One of the

most computing-intensive parts of the query execu-

tion is decoding the lists of document IDs. We have

measured that with our implementation, a single core

of the search server can decode about 55 millions

of differentially-encoded delta-compressed numbers

(with up to 64-bit size) per second.

We have tried to measure the pre-parsed query

execution times for various single-word queries with

and without applying the access rights. We do not in-

clude multi-word and phrase queries here for brevity.

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

100 1000 10000 100000 1e+06 1e+07

Time [seconds]

Word frequency [docs]

Query Execution Time

root

noauth (1)

1st quartile (93)

median (178)

3rd quartile (295)

last percentile (1811)

maximum (9942)

Figure 3: Query execution times for various users and terms

of a given frequency.

Another difference from the production use is that

the search server did not generate the text snippets of

the resulting documents, as this part is not relevant to

the measurement of the access rights overhead.

We have chosen a set of words such that the words

are present in approximately power-of-two number of

documents, and are English words, because of the lan-

guage of this paper. The most frequent word is ‘a’,

which is present in 1,221,642 documents. The other

words in descending frequency are 7, on, role, both,

party, speech, warning, movies, broadest, peptides,

initiators, prefect, realigning, prescript, and ﬁnally

noncyclic, which is present in 32 documents.

We have measured the query execution time for

the above words for a sample set of users. Firstly—

as a baseline—we have the search without applying

the access rights at all (superuser access), then a non-

authenticated search (i.e. limited to the group “non-

authenticated requests”), and then as users with 93,

178, 295, 1811, and 9942 groups (ﬁrst quartile, me-

dian, third quartile, last percentile, and maximum;

from the users with similar number of groups the ones

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0.009

0.01

100 1000 10000 100000 1e+06

Time [seconds]

Word frequency [docs]

Query Execution Time

root

noauth (1)

1st quartile (93)

median (178)

3rd quartile (295)

last percentile (1811)

Figure 4: Detailed view on the query execution times.

with maximum number of accesses during October

2009 have been chosen, in order to simulate the real-

world performance for more active users). We took

the lowest time from 10 runs for each query. Note

that generating the per-user group lists, as described

and measured in Section 6.4, is not included in these

times.

The Figure 3 shows the execution times of queries.

Figure 4 shows the same graph with limited range of

the vertical axis to highlight the details. Finally the

Figure 5 shows the overhead relative to the search

without applying the access rights.

-100

-50

100

150

200

250

300

350

400

450

100 1000 10000 100000 1e+06 1e+07

Overhead [%]

Word frequency [docs]

Access Rights Overhead

root (0)

noauth (1)

1st quartile (93)

median (178)

3rd quartile (295)

last percentile (1811)

maximum (9942)

Figure 5: The overhead of following the access rights.

6.7 Discussion of the Results

The cost of generating per-user group lists in IS MU

is non-trivial and introduces a signiﬁcant latency. We

have therefore added caching of the lists for a day to

the production system. However, the system is not

typical because of large number of groups and com-

plicated evaluation of group membership.

The measurements of pre-parsed query execution

times show visible overhead of evaluating the access

rights. We can see that the overhead stays within

200 % over the baseline for every user except the last

percentile. And even then, the maximum measured

overhead is 400 % more than the baseline of searching

ICEIS 2010 - 12th International Conference on Enterprise Information Systems

without the access rights. For frequent words, how-

ever, there is a negative overhead of the access rights

evaluation, because of restricting the documents to

evaluate to those accessible by the user. The reason

of such behaviour is that decoding of long list of doc-

ument IDs is fast enough (and it can be made even

faster using skip-lists), and what does matter is how

many times the decoding has to be interrupted in or-

der to look at another list (for another word/virtual

token).

We expect the overhead will be even lower for

phrase and multi-word queries, which are prevalent:

their cost is already much higher because of evaluat-

ing the positions of words, and the access rights eval-

uation time would remain about the same.

7 CONCLUSIONS

We have proposed a novel method for handling the ac-

cess rights within an intranet full-text search engine

by extending a general-purpose search engine with

additional support for virtual tokens, which can be

processed more efﬁciently than ordinary words during

the query execution. This extension can be used for

many other purposes apart from access rights encod-

ing (such as social search, date/time limiting, subtree

limiting, per-agenda search, etc.).

Many systems of the access rights can be trans-

lated to the search engine data as virtual tokens repre-

senting groups of users or individual users. There are

some cases where this representation is not straight-

forward, such as a set complement of the group, or

intersection of two arbitrary groups.

We have implemented the proposed system, in-

dexing static documents and dynamically-generated

content of a large intranet system with a complex sys-

tem of access rights and rules. The measurements

proved that the overhead of evaluating the access

rights for a given data set is under 200 % for 99 %

of users, and at most 400 % for the worst case.

The main drawback of the proposed method is that

caching of the access rights of the documents is an in-

tegral part of the system, so the permissions cannot be

revoked except by reindexing the document again or

marking it as deleted. The caching of per-user access

rights is, on the other hand, purely optional.

Further improvements are possible: the most

promising one is to use skip-lists for large lists of doc-

ument IDs.

ACKNOWLEDGEMENTS

The authors would like to thank Pavel

Smerk and

Mirka Kram

arekov

a for careful proofreading of this

paper.

REFERENCES

Anderson, R. J. (2008). Security Engineering: A Guide

to Building Dependable Distributed Systems. Wiley

Publishing.

Bailey, P., Hawking, D., and Matson, B. (2006). Secure

search in enterprise webs: tradeoffs in efﬁcient im-

plementation for document level security. In CIKM

’06: Proceedings of the 15th ACM international con-

ference on Information and knowledge management,

pages 493–502, New York, NY, USA. ACM.

Elias, P. (1975). Universal codeword sets and representa-

tions of the integers. IEEE Trans. Inform. Theory,

pages 194–203.

GSA (2009). Google Search Appliance.

http://code.google.com/ intl/en/apis/searchappliance/

documentation/.

Hawking, D. (2004). Challenges in enterprise search.

In ADC ’04: Proceedings of the 15th Australasian

database conference, pages 15–24, Darlinghurst, Aus-

tralia, Australia. Australian Computer Society, Inc.

Hurley, J. (2009). Preventing information leakage in the

search engine. Master’s thesis, University of Tromsø.

IS MU (1999–2010). Masaryk University Information Sys-

tem. http://is.muni.cz/.

Judy (2002–2010). Judy Arrays Web Page.

http://judy.sourceforge.net/.

SAML (2005). Security Assertion Markup Language.

http://docs.oasis-open.org/security/saml/ /v2.0/saml-

core-2.0-os.pdf.

Zhu, H., Raghavan, S., Vaithyanathan, S., and L

oser, A.

(2007). Navigating the intranet with high precision.

In WWW ’07: Proceedings of the 16th international

conference on World Wide Web, pages 491–500, New

York, NY, USA. ACM.

Zobel, J. and Moffat, A. (2006). Inverted ﬁles for text search

engines. ACM Comput. Surv., 38(2):6.

ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH - Searching Large Intranets Effectively using Virtual Terms