categorizing the latter we perform two consecutive
tasks: Firstly, we describe how to automatically ex-
tract segments within instances of different hypertext
types – this is done in Section 2.1. Secondly, we clas-
sify all extracted segments according to types of hy-
pertext segments as recurrent building blocks of the
hypertext types under consideration. This is described
in Section 2.2.
2.1 Hypertext Type Segmentation
Our general notion of the segments of a website is
based on the actual visual depiction of a hypertext.
When focusing on structure, abstracting from layout,
the logical document structure is the central reference
point to be considered. When focusing on layout, the
stylesheet language specifying the presentation of a
hypertext unit has to be interpreted. We infer that vi-
sually separable sections correlate with sections on
the content level. According to this approach, segment
borders are marked by visually emphasized text
phrases (e.g., headings set off by their font size),
by image intersections or, most often, by visual
spaces with no textual or graphical
presentation. This step of segmentation is called
Segment Cutting. The code of a web document
includes both style- and structure-related elements, so
both can be considered. This is necessary because
identifying headers alone is insufficient: the visual
depiction can be coded by tags or by CSS code (e.g. class
or font-size values). Therefore, we need to process
the logical document structure of a web document in
conjunction with all its document internal and exter-
nal layout information. In the next step, we search
for the most prominent segment separation features
occurring in the input document. These predefined
features (e.g. div, h1, h2, a, font-size that exceeds a
certain threshold) are explored as indicators of con-
tent section boundaries. As a result of this Segment
Cutting step we often get segments which are too
small. This relates, for example, to navigation items
and headings which are segmented as single sections.
To overcome this problem we perform a second
step called Segment Re-Connecting, which merges
small segments with their subsequent segments.
A segment is considered too small
if its size is below a specified threshold. As a result
we obtain a set of segments for each document, which
is then used as input to segment categorization.
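The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it cuts only at the tag examples named in the text (div, h1, h2, a) and leaves out the font-size and CSS cues, which would require stylesheet interpretation. The names `cut_segments` and `reconnect_segments`, the size measure (characters) and the threshold `min_chars=40` are assumptions made for this example; the source does not state how segment size is measured.

```python
from html.parser import HTMLParser

# Assumed boundary tags, taken from the examples given in the text.
BOUNDARY_TAGS = {"div", "h1", "h2", "a"}
SKIPPED_TAGS = {"script", "style"}


class SegmentCutter(HTMLParser):
    """Segment Cutting: open a new segment at every boundary tag."""

    def __init__(self):
        super().__init__()
        self.segments = [[]]  # each segment is a list of text chunks
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BOUNDARY_TAGS and self.segments[-1]:
            self.segments.append([])

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.segments[-1].append(text)


def cut_segments(html):
    """Return the text of each cut segment, in document order."""
    cutter = SegmentCutter()
    cutter.feed(html)
    cutter.close()
    return [" ".join(chunks) for chunks in cutter.segments if chunks]


def reconnect_segments(segments, min_chars=40):
    """Segment Re-Connecting: merge too-small segments into the next one."""
    merged, pending = [], ""
    for segment in segments:
        candidate = f"{pending} {segment}".strip()
        if len(candidate) < min_chars:
            pending = candidate  # still too small, carry it forward
        else:
            merged.append(candidate)
            pending = ""
    if pending:  # a trailing small segment has no successor
        if merged:
            merged[-1] = f"{merged[-1]} {pending}"
        else:
            merged.append(pending)
    return merged
```

With this sketch, a heading such as "Contact" is first cut off as a segment of its own and then re-connected to the section text that follows it.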
2.2 Hypertext Segment Type Classifier
Our two-level approach to web genre categorization
determines the genre of a website or page by classifying
its segments. Since we are able to partition a
hypertext into its content-related segments (see Section
2.1), hypertext segment type categorization
can be performed next. Categorization is
done by means of a Support Vector Machine (SVM),
which is a popular technique for data categorization.
More specifically, we utilize SVM-Light (Joachims,
1997). Since an SVM produces a model which pre-
dicts the class label based upon the given instance fea-
tures, feature selection is an important part in training
a segment type classifier. We explore three classes
of segment features: frequencies of HTML tags,
frequencies of tokens and segment structure-related
numerical features. Firstly, the frequency of HTML
tags is computed over all tags occurring within the
segment; script code and comments are removed
beforehand. Secondly, as tokens we include
only stemmed nouns, verbs, adjectives, adverbs,
numerals, punctuation marks and named entities.
Entities are split into subcategories such as email,
proper names, location and country entities. Thirdly,
numerical characteristics are extracted by computing
the standard deviation of section, paragraph and
sentence lengths of segments. We argue that, e.g., the
sentence lengths of a contact section differ from those of
a project information section.
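The three feature classes can be illustrated with the following sketch. It is a simplification under stated assumptions, not the authors' pipeline: tokens are lowercased word and punctuation strings without stemming, part-of-speech filtering or named entity recognition, and only the standard deviation of sentence lengths is computed (section and paragraph lengths are omitted). The names `extract_features` and `to_svmlight_line` and the `tag:`/`tok:`/`num:` prefixes are hypothetical. The output line follows the sparse SVM-Light input format (target, then feature:value pairs with ascending feature numbers); since SVM-Light is a binary classifier, the label is assumed to be +1 or -1 for one segment type against the rest.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev


class TagCounter(HTMLParser):
    """Counts HTML tags and collects visible text; script code is dropped."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        else:
            self.tags[tag] += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script:
            self.text.append(data)
    # Comments are ignored because handle_comment is not overridden.


def extract_features(segment_html):
    """Return a sparse feature mapping for one segment."""
    parser = TagCounter()
    parser.feed(segment_html)
    parser.close()
    text = " ".join(" ".join(parser.text).split())

    features = Counter()
    for tag, count in parser.tags.items():
        features[f"tag:{tag}"] = count
    # Assumption: plain regex tokens stand in for stemmed, POS-filtered tokens.
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        features[f"tok:{token}"] += 1
    # Naive sentence split at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    lengths = [len(s.split()) for s in sentences]
    features["num:sentence_length_std"] = pstdev(lengths) if lengths else 0.0
    return features


def to_svmlight_line(label, features, index):
    """Format one example as '<label> <id>:<value> ...' with ascending ids."""
    pairs = []
    for name, value in features.items():
        if value:  # sparse format: zero-valued features are left out
            pairs.append((index.setdefault(name, len(index) + 1), value))
    body = " ".join(f"{i}:{v:g}" for i, v in sorted(pairs))
    return f"{label} {body}".rstrip()
```

The `index` dictionary maps feature names to feature numbers and has to be shared across all training and test examples so that the numbering stays consistent.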
3 EXPERIMENT
For evaluating our approach we conducted a catego-
rization experiment by focusing on three hypertext
types: conference websites, personal academic web-
sites and academic project websites. Previous studies
focused on distinguishing thematically clearly sepa-
rated web genres such as web shops and web logs,
listings and search pages. In contrast to this, we
deal with hypertext types which are closely related
based on their common thematic background. The
reason is to deal with a more realistic scenario of web
genre categorization. Although there is much effort
on building reference corpora of web genre catego-
rization (Rehm et al., 2008), these data are still out of
reach so that we needed to compile our own training
corpora. The training corpus was built by
three volunteers, who downloaded 50 German web pages
per hypertext type. Because of our two-level approach,
each of these 150 pages was manually segmented
in terms of their genre-related sections (e.g. contact,
research, call for papers). That is, for each of the
segment types distinguished in this study monomor-
phic segments have been identified as typical exam-
ples in order to learn classifiers which can detect these
types of segments even in polymorphic web pages.
WEBIST 2009 - 5th International Conference on Web Information Systems and Technologies