4.1.1 Information Need
The first job of the À Propos agent starts whenever a
user is writing or reading a document. Determining
the information need is aided by the Observers,
software agents that monitor the user's activity in
different applications such as Internet browsers,
word processors and email clients. The Free Search
Observer monitors explicit searches for information
in the À Propos search window. Observers collect
the paragraph the user is currently writing or the text
displayed on screen if the user is reading a
document. In a later stage the À Propos agent uses
this context to estimate the user's information need
by formulating appropriate queries.
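The context collection step for a writing user can be illustrated with a minimal sketch. It assumes that an Observer has access to the document text and a cursor offset, and that paragraphs are separated by blank lines; the function name `current_paragraph` is hypothetical and not part of À Propos.

```python
def current_paragraph(text: str, cursor: int) -> str:
    """Return the blank-line-delimited paragraph that contains the cursor.

    Assumption: paragraphs are separated by an empty line ("\n\n").
    """
    # Clamp the cursor so out-of-range offsets do not break the search.
    cursor = max(0, min(cursor, len(text)))
    # Look backwards for the previous paragraph break.
    start = text.rfind("\n\n", 0, cursor)
    start = 0 if start == -1 else start + 2
    # Look forwards for the next paragraph break.
    end = text.find("\n\n", cursor)
    end = len(text) if end == -1 else end
    return text[start:end].strip()


doc = "First para.\n\nSecond para being written.\n\nThird."
print(current_paragraph(doc, doc.index("being")))
```

For a reading user, the same interface could simply return the text currently displayed on screen.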
Currently, there is almost no difference between
context extractions for different Observers. We plan
to extend the query component by experimenting
with different context sizes and considering different
context extraction constraints for each type of
Observer. In addition, the frequency with which
context is extracted and subsequent queries are
launched should be tuned to the stage in the writing
process. Another factor we wish to investigate is
whether using the personal and group search
profiles in the extraction stage is beneficial.
4.1.2 Search
Searching for documents relevant to the user's
information need is the second step in À Propos. The
context extracted in the previous stage is sent to the
À Propos server and distributed to the search
engines in the form of specific queries. Distilling an
appropriate query from the user's context is done by
spotting key terms and phrases in the context using
domain-specific taxonomies and heuristics for
ranking query terms. The extracted terms are then
used to generate queries, enabling À Propos to
perform searches without user-formulated queries.
This approach is similar to the query-free
approaches to news search (Henzinger et al., 2003)
and expert diagnostics (Hart and Graham, 1997). To
prevent overloading the search engines with too
many, irrelevant, or redundant search requests,
each information source has separate Gatekeepers
and Filters.
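Spotting and ranking key terms in the context might look like the following sketch. It assumes that a taxonomy can be represented as a mapping from terms or phrases to weights, and that the ranking heuristic is simply frequency times weight; both are illustrative assumptions, as are the names `normalize` and `extract_query_terms`.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def extract_query_terms(context, taxonomy, max_terms=3):
    """Return up to max_terms taxonomy terms found in the context.

    taxonomy: dict mapping a term or phrase to a weight (assumed format).
    Terms are ranked by occurrence count times weight, ties broken alphabetically.
    """
    text = normalize(context)
    scores = {}
    for term, weight in taxonomy.items():
        norm = normalize(term)
        if not norm:
            continue  # skip terms that contain no alphanumeric characters
        pattern = r"\b" + re.escape(norm) + r"\b"
        hits = len(re.findall(pattern, text))
        if hits:
            scores[term] = hits * weight
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


taxonomy = {"information retrieval": 2.0, "query expansion": 1.5, "recommender": 1.0}
context = ("Information retrieval systems often rely on query expansion. "
           "Evaluation in information retrieval is hard.")
print(extract_query_terms(context, taxonomy))
```

The returned terms would then be combined into a query for each search engine, so the user never has to formulate one.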
Gatekeepers generate queries for their respective
information sources and determine whether or not to
execute the queries suggested by the Observers. For
each information source, the Gatekeepers use a
domain-specific taxonomy of which a certain
number of query terms need to match for the query
to be executed. In addition, the Gatekeepers compare
new queries to queries that were submitted recently
to prevent redundancy. Filters guard the relevance
of the documents by removing data that is not
relevant for inclusion in a query (e.g. function words
such as ‘and’, ‘the’, and ‘of’ are suppressed). The
Filters also transform a query into the format
required by each information source.
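The two Gatekeeper checks and the function-word Filter can be sketched as follows. The sketch assumes that a taxonomy is a flat set of terms, that a query is a list of terms, and that "recently submitted" means the last `history_size` executed queries; the class and parameter names are hypothetical.

```python
from collections import deque

# A small illustrative stop list; a real Filter would use a fuller one.
FUNCTION_WORDS = {"and", "the", "of", "a", "an", "in", "to"}


def strip_function_words(terms):
    """Filter step: suppress function words before a query is built."""
    return [t for t in terms if t.lower() not in FUNCTION_WORDS]


class Gatekeeper:
    """Decides whether a suggested query is executed for one information source."""

    def __init__(self, taxonomy, min_matches=2, history_size=50):
        self.taxonomy = {t.lower() for t in taxonomy}
        self.min_matches = min_matches
        # Only the most recent queries are remembered.
        self.recent = deque(maxlen=history_size)

    def should_execute(self, terms):
        key = frozenset(t.lower() for t in terms)
        # Check 1: enough query terms must match the domain taxonomy.
        if sum(1 for t in key if t in self.taxonomy) < self.min_matches:
            return False
        # Check 2: the same term set must not have been submitted recently.
        if key in self.recent:
            return False
        self.recent.append(key)
        return True


gk = Gatekeeper({"retrieval", "ranking", "query"}, min_matches=2)
terms = strip_function_words(["the", "retrieval", "of", "ranking"])
print(gk.should_execute(terms))  # first submission
print(gk.should_execute(terms))  # identical repeat
```

Comparing queries as unordered term sets is one simple notion of redundancy; a softer overlap measure would also catch near-duplicates.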
The information sources can be divided into four
types. External documents come from typical electronic
document stores such as Scopus®, ACM, and Springer,
but also from the Internet (Google Scholar, CiteSeer).
Company-internal document stores include Intranet
databases and other in-company information
resources such as patent databases and technical
reports. The group documents cover all the work
done by the workgroup the user is a part of. Finally,
the user's personal document collection, consisting
of self-authored documents and other downloaded
papers, is another information source.
We plan to extend this component by using
personal and group profiles to enhance the
performance of the Filters and the Gatekeepers.
These profiles could also be used for expanding
queries that are submitted to public search engines.
4.1.3 Combining and Filtering
After a query is submitted to different search
engines, the different sets of search results need to
be combined and filtered to present a single list of
recommendations to the user. The first step in this
third stage is deduplication, a well-researched
problem in distributed information retrieval (e.g.
Callan et al., 1995). Bibliographic screening
techniques similar to those developed in the CiteSeer
project (Giles et al., 1998) are used for deduplication
in À Propos. The next step is filtering the results.
Depending on the stage of the writing process,
documents that the user has seen before may have to
be filtered out. In the end, only documents whose
ranking scores exceed a strict relevancy threshold
are recommended to the user.
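The combining and filtering stage can be sketched as below. Duplicates are detected here by a normalized title and publication year, which is a deliberate simplification and not the CiteSeer-style bibliographic screening itself. The sketch also assumes that the scores of the different engines are comparable (for example, normalized to the range 0 to 1); `Result`, `dedup_key` and `combine_and_filter` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    year: int
    score: float  # assumed comparable across search engines


def dedup_key(r):
    """Normalized title plus year, used to recognize the same document."""
    return (" ".join(re.findall(r"[a-z0-9]+", r.title.lower())), r.year)


def combine_and_filter(result_lists, seen_keys=frozenset(), threshold=0.5):
    """Merge result lists, drop duplicates and seen documents, apply the threshold."""
    best = {}
    for results in result_lists:
        for r in results:
            k = dedup_key(r)
            # Keep the highest-scoring copy of each duplicate.
            if k not in best or r.score > best[k].score:
                best[k] = r
    kept = [r for k, r in best.items()
            if k not in seen_keys and r.score > threshold]
    return sorted(kept, key=lambda r: r.score, reverse=True)


engine_a = [Result("Query-Free News Search", 2003, 0.9), Result("Old Paper", 1990, 0.2)]
engine_b = [Result("Query-free news search.", 2003, 0.7), Result("Seen Paper", 2001, 0.8)]
seen = {dedup_key(Result("Seen Paper", 2001, 0.0))}
for r in combine_and_filter([engine_a, engine_b], seen_keys=seen):
    print(r.title, r.score)
```

Whether `seen_keys` is applied could itself depend on the stage of the writing process, as described above.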
We plan to test the influence of personal and
group profiles on this filtering step. Filtering and
re-ranking the list of results depend on these search
profiles and on other characteristics of the user's
workgroup. The search profiles could also be used to
re-rank the results, giving preference to documents
that match better with the user's personal profile.
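Profile-based re-ranking could, for instance, interpolate the original engine score with the similarity between a document and the user's profile. The sketch below assumes bag-of-words profiles, cosine similarity and a mixing weight `alpha`; none of these choices is prescribed by À Propos, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(results, profile, alpha=0.5):
    """Re-rank (text, score) pairs by mixing engine score and profile similarity."""
    scored = [(alpha * score + (1 - alpha) * cosine(tokens(text), profile), text)
              for text, score in results]
    return [text for _, text in sorted(scored, key=lambda p: p[0], reverse=True)]


profile = tokens("speech recognition acoustic models speech")
results = [("database indexing methods", 0.6),
           ("speech recognition with neural acoustic models", 0.5)]
print(rerank(results, profile))
```

With `alpha=1.0` the original ranking is kept unchanged, which gives a natural baseline for testing the influence of the profiles.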
4.1.4 Personalization
The relevance of suggested documents is strongly
affected by the topic of the user's current text and by
the user's research interests. Personalization is
handled in À Propos by generating and applying
search profiles based on a collection of documents
previously written by the workgroup. À Propos
distinguishes between
individual user profiles and the workgroup profile.
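The distinction between the two kinds of profile can be illustrated with a minimal sketch that builds term-frequency profiles from previously written documents. It assumes that documents are available as plain text grouped by author and that a profile is a simple term-count vector; the function name `build_profiles` is hypothetical.

```python
import re
from collections import Counter


def build_profiles(docs_by_author):
    """Build one profile per user and one aggregated workgroup profile.

    docs_by_author: dict mapping an author name to a list of document texts.
    Returns (individual_profiles, group_profile) as term-count Counters.
    """
    individual = {}
    for author, docs in docs_by_author.items():
        counts = Counter()
        for doc in docs:
            counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
        individual[author] = counts
    # The workgroup profile is the sum of its members' profiles.
    group = Counter()
    for counts in individual.values():
        group.update(counts)
    return individual, group


docs = {"ann": ["speech recognition", "speech synthesis"],
        "bob": ["image retrieval"]}
individual, group = build_profiles(docs)
print(individual["ann"].most_common(1), group["retrieval"])
```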
WHAT A PROACTIVE RECOMMENDATION SYSTEM NEEDS - Relevance, Non-Intrusiveness, and a New
Long-Term Memory