the number of candidate results reaches 1 million doc-
uments, even when caching the access control data.
In (Hurley, 2009), the author observes that filtering
out results in the post-processing phase may have a
negative impact on search performance. A different
approach is presented, in which the credentials of
every user are calculated when a document is being
indexed. A per-user view of the index is created and
subsequently used to evaluate that user's queries. To
keep the copy of the access control lists consistent,
a quick check against the directory service is
performed before each query. Two implementations are
presented: PrivaSearch1, where each user has a specific
view of the corpus, and PrivaSearch2, where group
membership is also reflected. As a result of this
method, each user is provided with a separate index.
Three types of document distributions are discussed:
disjoint, overlapping, and hierarchical. The
implementation is tested on data sets of 2,000 to
20,000 documents. For larger data sets, the author
suggests distributing the index. We consider this
approach (per-user indexes) fully feasible only for
a limited range of systems where users have access to
disjoint sets of documents (e.g. e-mail messages).
Google Search Appliance (GSA, 2009) is a widely
known commercial black-box solution for intranet or
website full-text search. It performs on-line access
control checks for each document in the result set.
A standardized SAML (SAML, 2005) identity provider
(or a specialized one for batch processing) has to be
present or implemented on the customer side.
The post-processing approaches have limitations
in large-scale systems. For example, when a thousand
higher-ranked documents match the query but are
inaccessible to a given user, the system needs to
check the access rights to all of those documents
before finally discovering a lower-ranked document
accessible to that user.
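The cost described above can be illustrated with a minimal Python sketch. The function `post_filter`, the `can_read` callback, and the document IDs are hypothetical and only model the scenario; they do not reproduce any of the cited systems.

```python
from typing import Callable, Iterable, List, Tuple


def post_filter(
    ranked: Iterable[str],
    can_read: Callable[[str], bool],
    limit: int,
) -> Tuple[List[str], int]:
    """Return up to `limit` readable documents and the number of access checks."""
    hits: List[str] = []
    checks = 0
    for doc_id in ranked:
        if len(hits) >= limit:
            break
        checks += 1  # one access control check per candidate document
        if can_read(doc_id):
            hits.append(doc_id)
    return hits, checks


# Hypothetical ranking: 1,000 inaccessible documents ranked above one accessible one.
ranked = [f"secret-{i}" for i in range(1000)] + ["public-0"]
hits, checks = post_filter(ranked, lambda d: d.startswith("public"), limit=1)
print(hits, checks)
```

In this toy ranking, a single visible result costs 1,001 access checks, which is the behaviour the paragraph above warns about.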
3 FULL TEXT SEARCH ENGINES
Full-text search systems are often built as software
that maintains an index of the documents to be
searched and executes queries against this index. We
refer to (Zobel and Moffat, 2006) for possible
approaches to creating and using the index.
3.1 Index
For the method we propose, it is sufficient to view
the full-text search engine as a maintainer and user of
an inverted index, consisting of a lexicon of all words
from the set of documents and a mapping of each of
those words to a list of document IDs, i.e. the
documents which contain that word.
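As a minimal sketch of this view of the index, the following Python code builds a lexicon that maps each word to a sorted list of document IDs. The tokenization (lowercasing, `\w+`) and the sample documents are assumptions made for illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map every word to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical three-document corpus.
docs = {1: "red cars", 2: "blue cars and red bikes", 3: "blue bikes"}
index = build_inverted_index(docs)
print(index["cars"])  # IDs of documents containing "cars"
```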
These lists are often sorted by document ID, which
allows them to be stored differentially (as gaps
between consecutive IDs) and compressed using e.g.
Elias delta encoding (Elias, 1975). They may also be
accompanied by the weights of the given word in each
particular document.
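The differential storage can be sketched as follows, assuming strictly increasing document IDs starting at 1, so that every gap is a positive integer. The bit stream is kept as a Python string for readability; a real index would pack the bits. All function names are illustrative.

```python
from itertools import accumulate
from typing import List


def to_gaps(ids: List[int]) -> List[int]:
    """Replace each ID by its difference from the previous ID (first gap = first ID)."""
    return [cur - prev for prev, cur in zip([0] + ids[:-1], ids)]


def elias_delta(n: int) -> str:
    """Elias delta code of n >= 1 as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias delta codes are defined for n >= 1")
    n_bits = n.bit_length()            # length of n in binary
    len_bits = n_bits.bit_length()     # length of that length in binary
    gamma = "0" * (len_bits - 1) + format(n_bits, "b")  # gamma code of n_bits
    return gamma + format(n, "b")[1:]  # n without its leading 1 bit


def decode_stream(bits: str) -> List[int]:
    """Decode a concatenation of Elias delta codes."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        n_bits = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        values.append(int("1" + bits[pos:pos + n_bits - 1], 2))
        pos += n_bits - 1
    return values


ids = [3, 7, 8, 20]                      # hypothetical posting list
encoded = "".join(elias_delta(g) for g in to_gaps(ids))
decoded = list(accumulate(decode_stream(encoded)))  # prefix sums restore the IDs
print(to_gaps(ids), encoded, decoded)
```

Small gaps receive short codes, which is why sorting the lists by document ID pays off for frequent words.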
For evaluating multi-word proximity and phrase
queries, it may be useful to allow fast forward search-
ing of these lists, which can be done e.g. by accom-
panying larger lists with skip-lists for semi-direct ac-
cess.
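A minimal sketch of such forward searching is given below. It assumes one skip entry for every block of roughly sqrt(n) document IDs, which is a common heuristic rather than something prescribed here, and it demonstrates the seek operation on a plain list intersection instead of a full proximity or phrase query. The class and function names are hypothetical.

```python
import math
from typing import List


class PostingList:
    """Sorted document IDs plus a sparse table of skip entries."""

    def __init__(self, ids: List[int]):
        self.ids = ids
        self.step = max(1, int(math.sqrt(len(ids))))
        self.skips = ids[::self.step]  # first ID of every block

    def seek(self, pos: int, target: int) -> int:
        """Smallest index >= pos whose ID is >= target (len(ids) if none)."""
        block = pos // self.step
        # Skip whole blocks while the next block still starts at or below target.
        while block + 1 < len(self.skips) and self.skips[block + 1] <= target:
            block += 1
        pos = max(pos, block * self.step)
        # Finish with a short linear scan inside the block.
        while pos < len(self.ids) and self.ids[pos] < target:
            pos += 1
        return pos


def intersect(a: PostingList, b: PostingList) -> List[int]:
    """IDs present in both lists, using seek() to jump over non-matching runs."""
    i = j = 0
    out: List[int] = []
    while i < len(a.ids) and j < len(b.ids):
        x, y = a.ids[i], b.ids[j]
        if x == y:
            out.append(x)
            i += 1
            j += 1
        elif x < y:
            i = a.seek(i, y)
        else:
            j = b.seek(j, x)
    return out


long_list = PostingList(list(range(0, 100, 3)))
short_list = PostingList([9, 10, 30, 98, 99])
common = intersect(long_list, short_list)
print(common)
```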
The index also contains other data structures,
which are not relevant to this paper (forward index,
lists of word positions, etc.).
3.2 Query Format
The query itself is usually entered by the user as a
sequence of words, meaning “find the documents which
contain all of those words, preferably near to each
other” (i.e. the implicit AND/NEAR operator). Many
search engines also allow the NOT operator, and some
of them also support the OR operator. Support for
exact phrase searches is common as well.
The OR operator, even when not available directly
to the end user, is often used internally for handling
lemmatization, inflections, diacritics, acronyms, and
so on. For example, the query “Citroën cars” can be
internally transformed as follows:
(citroën OR citroen) AND (cars OR car)
For our purposes, it is sufficient if the search
engine can efficiently handle queries in the above
format, i.e. a logical conjunction (AND) of logical
disjunctions (OR). More precisely, the expected query
format is the following:
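\begin{equation}
\begin{aligned}
             & (\mathit{token}_{1,1} \text{ OR } \mathit{token}_{1,2} \text{ OR } \mathit{token}_{1,3} \text{ OR } \cdots) \\
\text{AND }  & (\mathit{token}_{2,1} \text{ OR } \mathit{token}_{2,2} \text{ OR } \mathit{token}_{2,3} \text{ OR } \cdots) \\
\text{AND }  & (\mathit{token}_{3,1} \text{ OR } \mathit{token}_{3,2} \text{ OR } \mathit{token}_{3,3} \text{ OR } \cdots) \\
             & \;\vdots
\end{aligned}
\tag{1}
\end{equation}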
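A query in format (1) can be evaluated against an inverted index roughly as sketched below: each OR group becomes a union of posting lists, and the groups are combined by intersection. The function `evaluate_cnf`, the nested-list query representation, and the sample index are assumptions made for illustration; a production engine would merge sorted lists instead of building sets.

```python
from typing import Dict, List, Optional, Set


def evaluate_cnf(index: Dict[str, List[int]], query: List[List[str]]) -> List[int]:
    """Evaluate an AND of OR groups; each inner list is one OR group of tokens."""
    result: Optional[Set[int]] = None
    for group in query:
        union: Set[int] = set()
        for token in group:
            union.update(index.get(token, []))  # OR: union of posting lists
        result = union if result is None else result & union  # AND: intersection
        if not result:
            break  # no document can satisfy the remaining groups
    return sorted(result or [])


# Hypothetical index for the "Citroën cars" example.
index = {
    "citroën": [1],
    "citroen": [2, 3],
    "cars": [1, 2],
    "car": [3, 4],
}
query = [["citroën", "citroen"], ["cars", "car"]]
matches = evaluate_cnf(index, query)
print(matches)
```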
4 ACCESS RIGHTS
The availability of a document (or, more generally,
an object) to some user (i.e. a subject) is often
described by a matrix with per-subject rows,
per-object columns, and cells containing the actual
permission settings (a subset of “can read”, “can
write”, “can delete”, etc.). For full-text search,
only the read permission is significant. Note,
however, that some systems
ACCESS RIGHTS IN ENTERPRISE FULL-TEXT SEARCH - Searching Large Intranets Effectively using Virtual Terms