tion where it is possible to understand what the module must do and how.
4.2 Extraction Process
The application starts by looking for the author’s information in a specific digital library (in our work, PubMed), which provides pre-evaluated information needed as a starting point. Next, a specialized search engine is used (in our work, Google Scholar) and, after that, a generic search engine (in this case, Google). Each activity relies on support tools. SAXON (footnote 5) is used to generate the RDF/XML file with the author’s model. For the activities that extract data from the specialized and generic search engines, the Web-Harvest tool (footnote 6) is used.
In the case of PubMed, after an author name is supplied, it is possible to obtain an XML file with the results found for this author. The resulting XML file contains information related to the author’s publications: title, co-authors of the paper, author’s organization and keywords. This descriptive data is expressed using the MeSH (Medical Subject Headings) vocabulary (footnote 7).
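As an illustration of the fields listed above, the following minimal sketch reads title, co-authors, organization and MeSH descriptors from a PubMed-style XML file. The sample record is hand-written and simplified, and the element names (`ArticleTitle`, `Author`, `Affiliation`, `DescriptorName`) are assumed here to follow the PubMed XML layout; the function name `parse_publications` is hypothetical and is not part of the tool described in this paper.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample in the style of a PubMed XML result (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName>
            <ForeName>Jane</ForeName>
            <AffiliationInfo>
              <Affiliation>Example University</Affiliation>
            </AffiliationInfo>
          </Author>
          <Author>
            <LastName>Roe</LastName>
            <ForeName>Richard</ForeName>
          </Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Alzheimer Disease</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Aged</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_publications(xml_text):
    """Return one dict per publication with title, authors, affiliations and MeSH terms."""
    root = ET.fromstring(xml_text.strip())
    publications = []
    for article in root.iter("PubmedArticle"):
        # Build "ForeName LastName" for every listed author (co-authors included).
        authors = [
            f"{a.findtext('ForeName', default='')} {a.findtext('LastName', default='')}".strip()
            for a in article.iter("Author")
        ]
        publications.append({
            "title": article.findtext(".//ArticleTitle", default=""),
            "authors": authors,
            "affiliations": [aff.text for aff in article.iter("Affiliation")],
            "mesh": [d.text for d in article.iter("DescriptorName")],
        })
    return publications
```

Calling `parse_publications(SAMPLE_XML)` yields a list with a single dictionary whose `mesh` entry holds the descriptors that later serve as keywords.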
Next, using the title of each publication, the tool retrieves its number of citations from Google Scholar. To retrieve any information that is present neither in PubMed nor in Google Scholar, the tool uses Google as a generic Web search engine. The strategy differs according to the data being retrieved. For example, for the author’s e-mail, when it is not present in the publication, the strategy consists of the following steps.
First, a search is run on a Web search engine (Google) using: the name of the author; a set of keywords related to his/her publications (the three most frequent MeSH descriptors); the author’s institution name; and indicators of e-mail presence (strings such as e-mail, contact, etc.). This approach increases precision by reducing problems with homonyms.
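The query construction in this first step can be sketched as follows. This is only an illustration: the function names, the use of quoted phrases and the `OR` group for the e-mail indicators are assumptions, since the exact query syntax is not specified here.

```python
from collections import Counter

# Indicator strings named in the text; the list is an assumption beyond these two.
EMAIL_HINTS = ("e-mail", "contact")


def top_descriptors(mesh_per_publication, k=3):
    """Return the k most frequent MeSH descriptors over all of the author's publications."""
    counts = Counter(d for descriptors in mesh_per_publication for d in descriptors)
    return [descriptor for descriptor, _ in counts.most_common(k)]


def build_email_query(author, institution, mesh_per_publication):
    """Combine author name, institution, top descriptors and e-mail indicators."""
    parts = [f'"{author}"', f'"{institution}"']
    parts += [f'"{k}"' for k in top_descriptors(mesh_per_publication)]
    # Hypothetical syntax: any one of the indicator strings may match.
    parts.append("(" + " OR ".join(f'"{h}"' for h in EMAIL_HINTS) + ")")
    return " ".join(parts)
```

Restricting the query to the author’s most frequent descriptors and institution is what narrows the results to the intended person rather than to a homonym.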
Because strings like e-mail and contact are included in the query, the author’s e-mail is shown in the page summary (snippet) generated by Google, so it is possible to extract an e-mail from Google’s results page without accessing and processing the Web pages that contain this information. This strategy represents a performance gain. Next, strings that represent e-mails (author@xxx.xx) are extracted from each page.
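A minimal sketch of this extraction step is given below, assuming the result snippets are already available as plain strings. The regular expression is a deliberately simple approximation of the author@xxx.xx shape and is not the pattern used by the tool; `extract_emails` is a hypothetical name.

```python
import re

# Simplified pattern: local part, "@", then at least two dot-separated domain labels.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(snippets):
    """Return the distinct e-mail strings found in the snippets, in order of appearance."""
    found = []
    for snippet in snippets:
        for match in EMAIL_PATTERN.findall(snippet):
            email = match.lower()  # treat addresses as case-insensitive
            if email not in found:
                found.append(email)
    return found
```

Each candidate address returned this way still has to be checked against the author’s name, since a snippet may contain the e-mail of a co-author or a webmaster.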
Finally, for each e-mail found, a new search is made using Google. Basically, the tool retrieves the number of Web pages that contain the e-mail and the number of Web pages that contain both the e-mail and the author’s name. These values are used to calculate a rate (1).
r = (nea * 100) / ne          (1)
where ne is the number of times an e-mail was found by the search engine and nea is the number of times the e-mail was found together with the author’s name. The e-mail with the highest rate is considered the author’s e-mail. The underlying assumption is that an e-mail that co-occurs with the author’s name has a higher probability of being the author’s real e-mail.
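The rate in (1) and the selection of the e-mail with the highest rate can be sketched as below. The hit counts in the usage note are invented for illustration, the function names are hypothetical, and returning 0 when ne is zero is an assumption, since that case is not discussed here.

```python
def email_rate(nea, ne):
    """Rate (1): percentage of pages containing the e-mail that also contain the author's name."""
    if ne == 0:
        return 0.0  # assumption: an e-mail that is never found gets the lowest rate
    return nea * 100 / ne


def pick_author_email(hit_counts):
    """hit_counts maps each candidate e-mail to a pair (ne, nea); return the best candidate."""
    return max(
        hit_counts,
        key=lambda email: email_rate(hit_counts[email][1], hit_counts[email][0]),
    )
```

For example, with the invented counts `{"a@x.org": (40, 30), "b@x.org": (200, 20)}`, the first address has r = 75 and the second r = 10, so `a@x.org` is selected even though it appears on fewer pages overall.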
To discover the author’s home page, a query with the author’s name, the author’s organization and a set of keywords related to his/her publications is used. The process is similar to e-mail retrieval. In the future, this process could be improved by considering other techniques, as in Xi and Fox (2002).
One important point is that an inexperienced user cannot evaluate some information about an author, such as the number of citations, because the user does not know whether a specific number of citations is high or low. Thus, the idea is to show users some additional information to facilitate the evaluation process. To provide this information, some strategies were defined. The practical value of these strategies needs to be evaluated in the near future, especially in terms of computational cost.
Concerning the number of citations, the proposal is to give the average number of citations in the same area. To obtain this information, we use the Google Scholar search engine. The strategy consists of retrieving documents from Scholar using the keywords related to a specific publication of the author. After that, the author’s publications are located in this set (positionScholar, section 3). At the same time, the average number of citations of this set is calculated (avgCitations, section 3).
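A minimal sketch of this computation follows, assuming the Scholar results are already available as a ranked list of (title, citations) pairs. Interpreting positionScholar as the 1-based rank of the publication in that list, and matching publications by case-insensitive title, are assumptions made for illustration; `citation_context` is a hypothetical name.

```python
def citation_context(results, publication_title):
    """Return (position, average citations) for one publication within a result set.

    results: ranked list of (title, citation_count) pairs for the keyword query.
    position is the 1-based rank of the publication, or None if it is not in the set.
    """
    average = sum(count for _, count in results) / len(results) if results else 0.0
    wanted = publication_title.strip().lower()
    position = None
    for rank, (title, _) in enumerate(results, start=1):
        if title.strip().lower() == wanted:
            position = rank
            break
    return position, average
```

With the invented set `[("Paper A", 120), ("Paper B", 30), ("Paper C", 0)]`, a lookup for "Paper B" gives position 2 and an average of 50 citations, which tells the user that 30 citations is below the average for that topic.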
Similarly, using Google Scholar’s information it is possible to show the author’s h-index. However, this information must be explained to users. One possibility is to compare the author’s h-index with that of other authors (e.g. authors whom the user has looked up before).
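For reference, the h-index is the largest h such that the author has h publications with at least h citations each. The sketch below computes it from a list of citation counts and adds one possible way to compare it with previously viewed authors; the comparison function `share_below` is a hypothetical illustration, not a strategy defined in this paper.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, count in enumerate(ranked, start=1):
        if count >= i:
            h = i
        else:
            break
    return h


def share_below(h, previous_h_values):
    """Fraction of previously viewed authors whose h-index is lower than h."""
    if not previous_h_values:
        return None  # nothing to compare with yet
    return sum(1 for p in previous_h_values if p < h) / len(previous_h_values)
```

An author with citation counts 10, 8, 5, 4 and 3 has an h-index of 4; if the user previously looked up authors with h-indexes 2, 4, 7 and 3, half of them fall below this author.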
5 SCENARIO
This scenario shows how information extracted from the Web can be used in a partial implementation of the "Oh, yeah?" button (see section 2.1).
Initially, a user accesses a Web page about Alzheimer’s disease, which shows the site author’s name.
The user, who is interested in an evaluation of the
WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies