Table 2: Quality of the example site mappings and quality of the comparison between drafts and example sites.

domain    precision    recall    f-measure    removed topics detected    added topics detected
hotel     1.00         0.43      0.60         0.54                       0.52
surf      0.33         0.42      0.37         0.25                       0.26
school    0.47         0.41      0.44         0.33                       0.88
the similarity measures. On all three domains, giving
high weights to text similarity resulted in mappings
with high scores. In the hotel domain URL similar-
ity also appeared to be effective. Increasing the mini-
mum similarity parameter (α) meant that we required
mapped pages to be more similar, so precision increased but recall decreased. Thus, with this parameter we can effectively balance the quality of the topics we find against the number of topics found. When the SiteGuide system is used by
a real user, it obviously cannot use a gold standard to
find the optimal parameter settings. Fortunately, we
can estimate roughly how we should choose the pa-
rameter values by looking at the resulting mappings
as explained in Hollink et al. (2008).
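To make the effect of α concrete, the following sketch (not SiteGuide's actual code; all page names, similarity scores and gold-standard pairs are invented for illustration) computes precision and recall of a small page mapping for several values of the minimum similarity:

candidate_pairs = {            # (page on site A, page on site B) -> similarity
    ("a/rooms", "b/facilities"): 0.82,
    ("a/contact", "b/contact"): 0.74,
    ("a/news", "b/prices"): 0.31,
    ("a/booking", "b/reserve"): 0.45,
}
gold_pairs = {("a/rooms", "b/facilities"),
              ("a/contact", "b/contact"),
              ("a/booking", "b/reserve")}

for alpha in (0.2, 0.5, 0.8):
    mapped = {pair for pair, sim in candidate_pairs.items() if sim >= alpha}
    correct = mapped & gold_pairs
    precision = len(correct) / len(mapped) if mapped else 0.0
    recall = len(correct) / len(gold_pairs)
    # Higher alpha keeps only the most similar pairs: precision rises, recall drops.
    print(f"alpha={alpha}: precision={precision:.2f}, recall={recall:.2f}")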
The scores of the example site models generated
with optimal parameter values are shown in Table 2.
The table shows the scores for the situation in which
all frequent topics that SiteGuide has found are shown
to the user. When many topics have been found we
can choose to show only topics with a similarity score
above some threshold. In general, this improves pre-
cision, but reduces recall.
Next, we evaluated SiteGuide in the critiquing
scenario. We performed a series of experiments in
which the 5 sites one by one played the role of the
draft site and the remaining 4 sites were example
sites. In each run we removed all pages about one of
the gold standard topics from the draft site and used
SiteGuide to compare the corrupted draft to the ex-
amples. We counted how many of the removed top-
ics were identified by SiteGuide as topics that were
missing in the draft. Similarly, we added pages to
the draft that were not relevant in the domain. Again,
SiteGuide compared the corrupted draft to the exam-
ples. We counted how many of the added topics were
marked as topics that occurred only on the draft site
and not on any of the example sites. The results are
given in Table 2.
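The bookkeeping of these experiments can be summarised as a leave-one-out loop. The sketch below is only an outline under assumed data structures; compare_draft_to_examples is a hypothetical placeholder for SiteGuide's actual comparison step, and the "added topics" experiment follows the same pattern with pages added instead of removed.

def compare_draft_to_examples(draft_pages, example_sites):
    """Hypothetical hook for SiteGuide's comparison step; should return the
    topics reported as missing from the draft."""
    return set()  # placeholder: plug in the real system here

def removed_topics_detected(sites, gold_topics):
    """sites: list of page sets (one per site); gold_topics: dict topic -> page set."""
    detected = runs = 0
    for i, draft in enumerate(sites):
        examples = sites[:i] + sites[i + 1:]          # the remaining sites act as examples
        for topic, topic_pages in gold_topics.items():
            corrupted_draft = draft - topic_pages     # remove all pages on this topic
            reported_missing = compare_draft_to_examples(corrupted_draft, examples)
            detected += topic in reported_missing
            runs += 1
    return detected / runs if runs else 0.0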
Table 2 shows that SiteGuide is able to discover
many of the topics that the sites have in common, but
also misses a number of topics. Inspection of the cre-
ated mappings demonstrates that many of the discov-
ered topics can indeed lead to useful recommenda-
tions to the user. We give a few examples. In the
school domain SiteGuide created a page cluster that contained, for each site, the pages listing term dates. It also correctly found that 4 out of 5 sites provided a list
of staff members. In the surfing domain, a cluster was
created that represented pages where members could
leave messages (forums). The hotel site mapping con-
tained a cluster with pages about the facilities in the
hotel rooms. The clusters can also be relevant for the
critiquing scenario: for example, if the owner of the fifth school site were to use SiteGuide, he would learn that his site is the only one without a staff list.
Some topics that the sites had in common were not
found, because the terms did not match. For instance,
two school sites provided information about school
uniforms, but on one site these were called ‘uni-
form’ and on the other ‘school dress’. This example
illustrates the limitations of the term-based approach.
In the future, we will extend SiteGuide with WordNet
(Fellbaum, 1998), which will enable it to recognize
semantically related terms.
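As a rough illustration of this planned extension (not part of the current system), semantically related terms could for instance be detected with NLTK's WordNet interface; the 0.3 threshold below is an arbitrary value chosen for the example, and the WordNet data must be installed (nltk.download('wordnet')).

from nltk.corpus import wordnet as wn

def related(term_a, term_b, threshold=0.3):
    """True if some pair of noun synsets of the two terms is close in WordNet."""
    synsets_a = wn.synsets(term_a, pos=wn.NOUN)
    synsets_b = wn.synsets(term_b, pos=wn.NOUN)
    scores = [a.path_similarity(b) for a in synsets_a for b in synsets_b]
    scores = [s for s in scores if s is not None]
    return bool(scores) and max(scores) >= threshold

print(related("uniform", "dress"))   # clothing senses of both terms are close
print(related("uniform", "forum"))   # no closely related senses expected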
4.2 Evaluation of the Topic Descriptions
Until now we counted how many of the generated top-
ics had at least 50% of their pages in common with a
gold standard topic. However, there is no guarantee
that the statements that SiteGuide outputs about these topics are interpreted correctly by users of the SiteGuide system. Generating understandable de-
scriptions is not trivial, as most topics consist of only
a few pages. On the other hand, it may happen that
a description of a topic with less than 50% overlap
with a gold standard topic is still recognizable to humans. Therefore, below we evaluate how SiteGuide's
end output is interpreted by human users and whether
the interpretations correspond to the gold standards.
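For reference, the 50%-overlap criterion used so far can be stated compactly; the page identifiers in this sketch are toy values.

def matches_gold(topic_pages, gold_topics, min_overlap=0.5):
    # A generated topic counts as correct when at least min_overlap of its
    # pages also occur in a single gold-standard topic.
    return any(len(topic_pages & gold) / len(topic_pages) >= min_overlap
               for gold in gold_topics)

gold_topics = [{"p1", "p2", "p3"}, {"p7", "p8"}]
print(matches_gold({"p1", "p2", "p9"}, gold_topics))   # True: 2 of 3 pages overlap
print(matches_gold({"p4", "p5", "p9"}, gold_topics))   # False: no overlap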
We used SiteGuide to create output about the ex-
ample site models generated with the same optimal
parameter values as in the previous section. Since we
only wanted to evaluate how well the topics could be
interpreted by a user, we did not output the structural
features. We restricted the output for a topic to up to
10 content keywords and up to 3 phrases for page ti-
tles, URLs and anchor texts. We also displayed for
each topic a link to the example page.
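To give an impression of what the evaluators saw, the output restriction could be implemented roughly as follows; the field names and the sample topic are assumptions for illustration, not SiteGuide's actual data structures.

def format_topic(topic):
    # At most 10 content keywords and at most 3 phrases each for titles,
    # URLs and anchor texts, plus a link to one example page.
    return "\n".join([
        "keywords: " + ", ".join(topic["keywords"][:10]),
        "titles:   " + " | ".join(topic["titles"][:3]),
        "urls:     " + " | ".join(topic["urls"][:3]),
        "anchors:  " + " | ".join(topic["anchors"][:3]),
        "example page: " + topic["example_page"],
    ])

sample_topic = {
    "keywords": ["staff", "teachers", "head", "office", "contact"],
    "titles": ["Staff list", "Our staff"],
    "urls": ["/about/staff.html", "/staff"],
    "anchors": ["staff", "meet the team"],
    "example_page": "http://www.example-school.org/about/staff.html",
}
print(format_topic(sample_topic))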
Output was generated for each of the 34 frequent
topics identified in the three domains. We asked 5
evaluators to interpret the 34 topics and to write
a short description of what they thought each topic
was about. We required the descriptions to be of the
same length as the gold standard descriptions (10-30
words). None of the evaluators were domain experts
or expert web site builders. It took the evaluators on
average one minute to describe a topic, including typ-
ing the description. By comparison, finding the topics