contributes differently to the semantics of the whole
text. Some words, the keywords, more or less specify
the topic, whereas others are just fillers, such as the stopwords
"as", "so", "for" etc. These can be filtered out as
they do not contribute to the semantics of the text. To
find out which of the remaining words are keywords,
we calculate the TF ∗ IDF vector (Salton & McGill
1983) for a document. Our calculation is based on a
text corpus consisting of several million international
websites. The terms with the highest TF ∗ IDF values
are then fed into a standard search engine and the
resulting resources are analyzed for relevance.
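As an illustration, the following sketch (in Python) shows how the terms with the highest TF ∗ IDF values could be selected for a single document; the corpus statistics are represented by a hypothetical mapping doc_freq (term to document frequency) and a total document count num_docs, standing in for the statistics gathered from the website corpus mentioned above.

import math
import re
from collections import Counter

# Illustrative stopword list; the real filter list is larger.
STOPWORDS = {"as", "so", "for", "the", "a", "an", "of", "and", "to", "in"}

def top_keywords(text, doc_freq, num_docs, k=5):
    """Return the k terms of the document with the highest TF * IDF values.

    doc_freq: term -> number of corpus documents containing the term
    num_docs: total number of documents in the corpus
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    tf = Counter(tokens)  # term frequencies in the document
    scores = {}
    for term, freq in tf.items():
        # terms that are rare in the corpus receive a high IDF
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        scores[term] = freq * idf
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

The returned terms would then be submitted as a query to a standard search engine.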
Obviously, the resulting documents are not necessarily
relevant, as this approach suffers from the very
same problems as users of a search engine do: there
is an increasing percentage of Web spam that is not
related to the topic in question but merely contains the
keywords (Henzinger, Motwani & Silverstein 2002).
Furthermore, we cannot be sure of having extracted the
right keywords; that is, any automated technique for
extracting keywords from a text will be error-prone
to some extent. Thus, in the worst case, the search
engine returns a list of resources that might not be
relevant. The same holds for any link the Focused
Crawler follows, so there is no disadvantage in
applying this technique besides a slightly increased
number of downloads, as the Focused Crawler can
already discard irrelevant documents. The benefit, however,
is that if there are relevant documents among the search
engine results, we are able to find them even if they
are not well connected to already crawled resources.
4.4 Expert Identification
Once relevant documents have been identified, the
third phase (Figure 1) can be entered: extracting ex-
pert information from these documents.
4.4.1 Name Extraction from Text
To identify all names in a text, advanced techniques
would be needed because the text to be analyzed is
usually natural language. As natural language follows
complex grammars and is highly ambiguous, full understanding
of such a text cannot be achieved in general.
A full understanding, on the other hand, would
be required to reliably identify persons in a text.
However, Palmer & Day (1997) as well as Mikheev,
Moens & Grover (1999) showed that two simplifications
reduce the effort in both implementation and
runtime while still producing good results:
• Searching for known names, based on a name
database. Many names can be identified this way
if a few simple points are considered: abbreviations
such as "J. Smith" are common, and the order of
surname and forename may be inverted ("Smith, John").
• Searching for phrases indicating that a name is
mentioned close to this phrase. Some examples are
"according to J. Smith", "Mr. Smith", "Smith says"
etc. In this way, a single forename or a single
surname can also be identified that would otherwise be
ignored because many names are also used as terms
in other contexts. For example, "April" is a common
forename as well as the name of a month. A sketch
of both heuristics is given after this list.
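A minimal sketch of both heuristics, assuming a small illustrative name database and indicator-phrase list (both hypothetical; EXPOSE uses a larger database and rule set), could look like this:

import re

# Hypothetical, illustrative data only.
KNOWN_NAMES = {("John", "Smith"), ("April", "Jones")}
INDICATORS = ["according to", "Mr.", "Mrs.", "Dr."]

def find_names(text):
    found = set()
    # Heuristic 1: known names, also in abbreviated ("J. Smith")
    # or inverted ("Smith, John") form.
    for forename, surname in KNOWN_NAMES:
        patterns = [
            rf"\b{forename}\s+{surname}\b",
            rf"\b{forename[0]}\.\s*{surname}\b",
            rf"\b{surname},\s*{forename}\b",
        ]
        if any(re.search(p, text) for p in patterns):
            found.add(f"{forename} {surname}")
    # Heuristic 2: indicator phrases followed by a capitalized token,
    # which may reveal a single fore- or surname ("Mr. Smith").
    for phrase in INDICATORS:
        for m in re.finditer(re.escape(phrase) + r"\s+([A-Z][a-z]+)", text):
            found.add(m.group(1))
    # Indicator phrase following the name ("Smith says").
    for m in re.finditer(r"\b([A-Z][a-z]+)\s+says\b", text):
        found.add(m.group(1))
    return found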
Using these techniques in EXPOSE already led to
quite good results. As we wanted to show the feasibil-
ity of our approach, we did not yet focus on optimiz-
ing the name recognition. However, in future work,
we will use more advanced techniques, e.g. from the
Named Entity Recognition domain.
All names identified in this way form the input for the
next step. We identified a set of four roles in which a
person can be named in a text: (1) the person is the author
of the text, (2) there is a discussion the person
is (actively or passively) involved in, (3) the person
is referred to, (4) the person is mentioned although
he or she is not related to the topic at all. In cases 1-3, the
assumption that the named person is an expert is at
least potentially right. In the latter case, however, this
assumption is likely to be wrong. This shows that
experts cannot be identified simply from the occurrence
of names in a single topic-related text. In the next
step we therefore have to find out which of the named
persons really are experts in the domain in question.
Two problems have to be tackled here: (1) The occurrence
of a name is not recognized although a person
has been named in the text. (2) One or more terms
in the text are assumed to name a person while in fact
they do not (e.g. "… for fuel cell manufacturers in the
U.S. Smith denotes that …" may produce "U.S. Smith",
although "U.S." is the last term in sentence A and refers
to the United States of America, whereas sentence B
starts with text referring to a person named Smith, but
not to "U.S. Smith"). While (1) can be addressed by extending
the name database or the associative rules, (2) is
more complex and will be discussed in the following.
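To make problem (2) concrete, the following sketch shows how a naive matcher that treats any run of capitalized tokens as a candidate name produces exactly the spurious "U.S. Smith" candidate described above; it illustrates the problem only, not our solution.

import re

# Made-up example spanning the sentence boundary from the text above.
text = "... for fuel cell manufacturers in the U.S. Smith denotes that ..."

# Naive heuristic: any run of capitalized tokens (including abbreviations
# such as "U.S.") is treated as one candidate name.
token = r"(?:[A-Z][a-z]+|[A-Z]\.(?:[A-Z]\.)*)"
candidates = re.findall(rf"{token}(?:\s+{token})*", text)
print(candidates)  # ['U.S. Smith'] -- the spurious candidate described above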
4.4.2 Expertise Rating
As the extraction of expertise from a single text requires
a good understanding of the text, which in general
we do not have, our approach is mostly based on
statistical properties that are extracted from a set of
relevant documents. We identified four criteria from
which we derive the expertise of a person. To this end,
we evaluate each of these criteria and compile
an overall rating by normalizing and summing
the individual results.
The first quantity is simply how often a person is
named in any relevant document. The more often a
person occurs in texts related to the topic, the more
likely it is that this person has expertise on this topic.
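A minimal sketch of this aggregation could look as follows, with the occurrence count as the first criterion and the remaining criteria assumed to be supplied as further scoring functions (the function names are hypothetical):

from collections import Counter

def occurrence_count(person, documents):
    """First criterion: how often the person is named across the relevant documents."""
    return sum(doc.count(person) for doc in documents)

def normalize(scores):
    """Scale a {person: value} mapping to the range [0, 1]."""
    top = max(scores.values(), default=0) or 1
    return {person: value / top for person, value in scores.items()}

def overall_rating(persons, documents, criteria):
    """Normalize each criterion's scores and sum them up per person."""
    total = Counter()
    for criterion in criteria:
        scores = {p: criterion(p, documents) for p in persons}
        total.update(normalize(scores))
    return dict(total)

# Usage: the remaining three criteria would be plugged in alongside
# occurrence_count once they are computed, e.g.
# rating = overall_rating(persons, documents, [occurrence_count, ...])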