which returns the number of documents on
www.loria.fr containing at least one link to
documents on www.irisa.fr. For us, this number is the
weight of the edge from www.loria.fr to www.irisa.fr.
We had
to construct 6 320 queries in this way. Of course, the
construction and submission of queries, storing of
results, and the graph creation were automated. (The
figure of the Web graph with 393 edges is available
at http://home.zcu.cz/~dalfia/papers/France.svg.)
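As an illustration, below is a minimal sketch (not our actual code) of how the 6 320 pairwise queries and the weighted Web graph could be automated; the query template and the search_engine_count() helper are hypothetical placeholders for the real search engine interface.

    from itertools import permutations

    def search_engine_count(query):
        """Hypothetical helper: submit the query and return the reported hit count."""
        raise NotImplementedError

    def build_weighted_graph(sites):
        """Build a weighted edge list over all ordered pairs of the 80 sites."""
        edges = {}                                # (source, target) -> weight
        for src, dst in permutations(sites, 2):   # 80 * 79 = 6 320 ordered pairs
            query = f"site:{src} link:{dst}"      # placeholder query template
            count = search_engine_count(query)
            if count > 0:
                edges[(src, dst)] = count         # weight = number of citing documents
        return edges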
The drawbacks of relying solely upon search
engines are discussed a great deal in (Thelwall,
2003). The problem lies primarily in the
"instability" of the results: results obtained one day
may differ from those obtained on another day.
Another disadvantage is that the results are not
transparent. We do not know which document
formats are taken into account, how duplicate
documents are treated, etc.
2.1.1 Results and Discussion
We applied three ranking methods to the Web graph
of the 80 selected sites. First, we computed the
in-degrees of the nodes in the citation graph,
disregarding edge weights (i.e. each edge has a
weight of one). Then, we computed HITS authority
scores for the graph nodes and, finally, we generated
PageRanks (HostRanks, in fact) for all of the nodes.
The results are shown in Tables 1 and 2. The sites
are sorted by in-links (citations), i.e. by the total
number of links to each site from the other sites in
the set (with some limitations imposed by the search
engine). The first place belongs to
www-futurs.inria.fr, although its positions under the
other methods are much worse. We suppose the
reason is very strong support from one particular
site. (After inspecting the Web graph, we can see
that it is www.lifl.fr.) The following sites rank
highly under all three methods: www-sop.inria.fr,
www.loria.fr, www.lri.fr. We can safely consider
them authoritative.
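For illustration, the three rankings can be reproduced on such a weighted graph along the following lines (a sketch using the networkx library, which is not necessarily what our original computation used):

    import networkx as nx

    def rank_sites(edges):
        """Compute in-degree, HITS authority and PageRank scores for the site graph."""
        g = nx.DiGraph()
        g.add_weighted_edges_from((s, t, w) for (s, t), w in edges.items())

        in_links = dict(g.in_degree())      # citation counts, each edge counted once
        hubs, authorities = nx.hits(g)      # HITS authority scores
        pagerank = nx.pagerank(g)           # PageRank (HostRank when nodes are hosts)
        return in_links, authorities, pagerank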
Of course, the number of in-links often depends on
the number of documents on the target site. These
counts vary greatly due to the different sizes of the
hosting institutions, the existence of server aliases,
preferences for various document formats, dynamic
page generation, etc. One way of tackling this
problem is to normalize the number of citations. For
instance, it is
possible to divide the number of citations by the
number of documents on a particular site or by the
number of staff of the corresponding institution
(Thelwall, 2003).
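A small sketch of this normalization, assuming the per-site divisor (document count or staff size) has been gathered separately:

    def normalized_citations(citations, site_size):
        """Divide each site's citation count by its document or staff count."""
        return {site: count / site_size[site] for site, count in citations.items()}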
The phase of finding significant institutions
enables us to reduce the set of Web sites that we are
going to analyze in the next stage. For example, we
might discard the last eight sites in Table 2, i.e. the
least important sites. However, our case study
(French academic computer science Web sites) yields
a data set small enough that no reduction is
necessary. Measuring the quality of academic
institutions with webometric tools is justified in
(Thelwall, 2003), where Web-based rankings
correlated with official rankings.
2.2 Authoritative Researchers
In addition to studying links in a collection of
computer science Web sites, we were also interested
in the documents themselves found on these Web
sites. Thus, we downloaded potential research
papers from the sites in question. In practice, that
meant collecting PDF and PostScript files because
most research publications publicly accessible on the
Web are in these two formats. First, we had to
preprocess the downloaded corpus. We unpacked
archives and converted the retrieved files to plain
text via external utilities. At that point, we had
about 45 thousand potential research papers. We
discarded duplicates and examined the remaining
documents, using a simple rule to categorize them:
if a document contained some kind of references
section, it was considered a paper. In this way, we
obtained some 16 000 papers in the end, i.e. over
thirty thousand documents did not look like research
articles.
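A rough sketch of this categorization rule, assuming the documents have already been converted to plain text (the exact heading variants matched here are our guess):

    import re

    REFERENCE_HEADING = re.compile(
        r"^\s*(references|bibliography|bibliographie|références)\b",
        re.IGNORECASE | re.MULTILINE)

    def looks_like_paper(text):
        """A document counts as a research paper if it contains a references section."""
        return bool(REFERENCE_HEADING.search(text))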
2.2.1 Information Extraction
The next task is to extract from the papers the
information needed for citation analysis, i.e. names
of authors, titles of papers, etc. We employ the same
methodology based on Hidden Markov Models
(HMMs) as in (Seymore, 1999).
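As a rough illustration of this approach (not our actual implementation), the sketch below performs Viterbi decoding of field labels for a sequence of words; the set of states and the trained start, transition and emission probabilities are assumed to be given and smoothed so that no probability is zero.

    import math

    def viterbi(words, states, start_p, trans_p, emit_p):
        """Return the most likely field label (state) for each word."""
        # best[i][s]: log-probability of the best path ending in state s at word i
        best = [{s: math.log(start_p[s]) + math.log(emit_p[s](words[0])) for s in states}]
        back = [{}]
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for s in states:
                prev, score = max(
                    ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                    key=lambda x: x[1])
                best[i][s] = score + math.log(emit_p[s](words[i]))
                back[i][s] = prev
        # Backtrack from the best final state.
        state = max(best[-1], key=best[-1].get)
        path = [state]
        for i in range(len(words) - 1, 0, -1):
            state = back[i][state]
            path.append(state)
        return list(reversed(path))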
We had to construct a graph with authors
(identified by surnames and initials of their first and
middle names) as nodes and citations in publications
as edges. The final graph (without duplicate edges
and self-citations) had almost 86 000 nodes and
about 477 000 edges. Strictly speaking, when we talk
about surnames, we mean words identified as
surnames. Of course, many of these words were not
surnames (they were incorrectly classified) or they
were foreign surnames which we did not wish to
consider. From the citation graph with “surnames”
as graph nodes we determined the most authoritative
French authors using the three different ranking
methods. (French surnames were identified
manually.) See Table 3 for details.
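A sketch of this graph construction, assuming each parsed paper yields its own authors and the authors of every reference it cites (as the surname-plus-initials keys produced by the extraction step; networkx is again used only for illustration):

    import networkx as nx

    def build_author_graph(papers):
        """Create a citation graph over author keys, without self-citations or duplicate edges."""
        g = nx.DiGraph()
        for paper in papers:
            for citing in paper["authors"]:
                for reference in paper["references"]:
                    for cited in reference["authors"]:
                        if citing != cited:            # drop self-citations
                            g.add_edge(citing, cited)  # a DiGraph keeps at most one such edge
        return g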