A TWO-LEVEL APPROACH TO WEB GENRE CLASSIFICATION
Ulli Waltinger, Alexander Mehler and Armin Wegner
Text Technology, Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany
Keywords:
Hypertext types, Web genre classification, Web structure mining, Two-level classifier.
Abstract:
This paper presents a two-level approach to the categorization of web pages. In contrast to related approaches,
the model additionally explores and categorizes functionally and thematically demarcated segments of the
hypertext types to be categorized. By classifying these segments, conclusions can be drawn about the type of
the corresponding compound web document.
1 INTRODUCTION
Hypertext categorization is mainly performed as function
learning irrespective of relational structure learning,
that is, of genre-related structures internal to single
sites and pages. Consequently, approaches to hypertext
categorization mostly explore simple text features
(Karlgren and Cutting, 1994; Kessler et al.,
1997) or additionally include structural features (Lee
and Myaeng, 2002; Lee and Myaeng, 2004; Lim
et al., 2005; Santini et al., 2006). Machine learning
techniques which incorporate thematic and structural
features of web pages have also been applied success-
fully (Joachims et al., 2001; Eissen and Stein, 2004;
Lindemann and Littig, 2006). A basic premise of
these approaches is that web genres are manifested by
thematic and structural features either on the level of a
single page or of a site as a whole. In contrast to this,
Mehler et al. (2005) and Mehler (2007) refer to polymorphism
as an aspect of informational uncertainty, according to which
hypertext units are compound manifestations
of web genres so that their categorization goes hand
in hand with their genre-related segmentation. In this
sense, the segmentation of recurrent structural units of
a given hypertext unit precedes its genre-related categorization.
Mehler et al. (2007) analyze various notions
of informational uncertainty in support of this
two-level approach to hypertext categorization. The
present paper follows this line of research. That is,
we firstly focus on hypertext segment types as a reference
point of hypertext segmentation in order to
secondly use the learnt segmentation of a hypertext
unit as a reference point of its categorization.
2 METHODOLOGY
In order to implement the two-level model of hypertext
categorization we proceed as follows. We expect
that differences between hypertext types correlate with
differences between their segment types. This leads to the
assumption that, once the genre-related segments of a
hypertext unit have been detected, we can draw conclusions
about its genre. However, because of polyfunctionality
this is not a trivial task: On the one hand, a personal
academic home page can be detected by identifying
a segment of type publications which solely
consists of references all of which contain the same
author name. On the other hand, a contact segment
is common to instances of different web genres, e.g.
conference websites and project websites. Moreover,
segments vary in terms of their location within web-
sites of the same genre. Based on this observation a
web page of a website as an instance of a web genre
is called polymorphic if it contains at least two seg-
ments of different types (Mehler et al., 2007). Such
pages are problematic input to hypertext categoriza-
tion as their polymorphic segments are responsive to
different types. If, in contrast, a web page includes
only segments of the same type, it is called
monomorphic. Such pages allow the classical
apparatus of hypertext categorization to be applied, as their
monomorphism guarantees separability compared to
other monomorphic pages. As a matter of fact, poly-
morphism is the predominant case and therefore in-
terferes with applying the classical apparatus (Mehler
et al., 2007). In order to successfully distinguish (i)
polymorphic from (ii) monomorphic pages and in or-
der to successfully segment the former while directly
Waltinger U., Mehler A. and Wegner A.
A TWO-LEVEL APPROACH TO WEB GENRE CLASSIFICATION.
DOI: 10.5220/0001834806800683
In Proceedings of the Fifth International Conference on Web Information Systems and Technologies (WEBIST 2009), page
ISBN: 978-989-8111-81-4
Copyright © 2009 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
categorizing the latter we perform two consecutive
tasks: Firstly, we describe how to automatically ex-
tract segments within instances of different hypertext
types – this is done in Section 2.1. Secondly, we clas-
sify all extracted segments according to types of hy-
pertext segments as recurrent building blocks of the
hypertext types under consideration. This is described
in Section 2.2.
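The distinction between polymorphic and monomorphic pages can be stated compactly. The following Python sketch assumes that a page is already given as a list of segment type labels; the function names are hypothetical and serve only to illustrate the definition given above.

```python
from typing import Sequence


def is_polymorphic(segment_types: Sequence[str]) -> bool:
    """A page is polymorphic if it contains at least two segments of different types."""
    return len(set(segment_types)) >= 2


def is_monomorphic(segment_types: Sequence[str]) -> bool:
    """A page is monomorphic if all of its segments share the same type."""
    return len(set(segment_types)) == 1
```

Under this reading, a page consisting of a contact and a publications segment is polymorphic, whereas a page consisting only of publications segments is monomorphic.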
2.1 Hypertext Type Segmentation
Our general notion of the segment types of a website is
based on the actual visual depiction of a hypertext. When
focusing on structure, abstracting from layout,
the logical document structure is the central reference
point to be considered. When focusing on layout, the
stylesheet language specifying the presentation of a
hypertext unit has to be interpreted. We assume that
visually separable sections correlate with sections on
the content level. According to this approach, segment
borders are taken to be marked by emphasized text
phrases (e.g., headings expressed through the font
size), by image intersections or, as is most often the
case, by visual spaces with no textual or graphical
presentation. This step of segmentation is called
Segment Cutting. The code of a web document in-
cludes style- and structure-related elements so that
both can be considered. The reason is that identify-
ing headers only is insufficient since the visual depic-
tion can be coded by tags or by CSS code (e.g. class-
or font-size values). Therefore, we need to process
the logical document structure of a web document in
conjunction with all its document internal and exter-
nal layout information. In the next step, we search
for the most prominent segment separation features
occurring in the input document. These predefined
features (e.g. div, h1, h2, a, or a font-size that exceeds a
certain threshold) are explored as indicators of content
section boundaries. As a result of this Segment
Cutting step we often get segments which are too
small. This relates, for example, to navigation items
and headings which are segmented as single sections.
In order to overcome this problem we perform a sec-
ond step called Segment Re-Connecting which amal-
gamates small segments with their subsequent seg-
ments. A segment is considered as being too small
if its size is below a specified threshold. As a result,
we obtain a set of segments for each document, which
is then used as input to segment categorization.
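The two steps described above, Segment Cutting and Segment Re-Connecting, can be sketched as follows. This is a minimal illustration and not the implementation used in the experiment: it only inspects inline style attributes (external stylesheets are ignored), it measures segment size in characters, and the two threshold values as well as all class and function names are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

BOUNDARY_TAGS = {"div", "h1", "h2", "a"}  # separation features named in the text
FONT_SIZE_THRESHOLD_PX = 18               # assumed value, not taken from the paper
MIN_SEGMENT_CHARS = 40                    # assumed value, not taken from the paper


class SegmentCutter(HTMLParser):
    """Starts a new segment whenever a separation feature is encountered."""

    def __init__(self):
        super().__init__()
        self.segments = [[]]
        self._skip = 0  # nesting depth inside script/style elements

    def _is_boundary(self, tag, attrs):
        if tag in BOUNDARY_TAGS:
            return True
        style = dict(attrs).get("style") or ""
        match = re.search(r"font-size\s*:\s*(\d+(?:\.\d+)?)px", style)
        return bool(match) and float(match.group(1)) > FONT_SIZE_THRESHOLD_PX

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif self._is_boundary(tag, attrs) and self.segments[-1]:
            self.segments.append([])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.segments[-1].append(text)


def cut_segments(html):
    """Segment Cutting: split the page text at the separation features."""
    parser = SegmentCutter()
    parser.feed(html)
    parser.close()
    return [" ".join(parts) for parts in parser.segments if parts]


def reconnect_segments(segments, min_chars=MIN_SEGMENT_CHARS):
    """Segment Re-Connecting: merge too-small segments into their successors."""
    merged, carry = [], ""
    for segment in segments:
        current = f"{carry} {segment}".strip()
        if len(current) < min_chars:
            carry = current  # too small: amalgamate with the subsequent segment
        else:
            merged.append(current)
            carry = ""
    if carry:  # a trailing small segment has no successor and is kept as is
        merged.append(carry)
    return merged
```

With these assumptions, a heading such as "Contact" is first cut off as a segment of its own and is then re-connected with the text block that follows it.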
2.2 Hypertext Segment Type Classifier
Our approach to two-level web genre categorization
is to determine the genre of a website or page by classifying
its segments. Since we are able to partition a
hypertext into its content-related segments (see Section 2.1),
hypertext segment type categorization can be performed next.
Categorization is done by means of a Support Vector Machine (SVM),
which is a popular technique for data categorization.
More specifically, we utilize SVM-Light (Joachims,
1997). Since an SVM produces a model which predicts
the class label based upon the given instance features,
feature selection is an important part of training
a segment type classifier. We explore three classes
of segment features: the frequency of HTML tags, the
frequency of tokens and segment structure-related
numerical features. Firstly, the frequency of HTML tags
is computed over all tags occurring within the segment,
except script code and comments, which are removed.
Secondly, regarding tokens, we included
only stemmed nouns, verbs, adjectives, adverbs, numerals,
punctuation marks and named entities. Entities
are split into subcategories such as email,
proper names, location and country entities. Thirdly,
numerical characteristics are extracted by computing
the standard deviation of section, paragraph and sentence
lengths of segments. We argue that, for example, the
sentence lengths of a contact section differ from those of
a project information section.
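The three feature classes can be illustrated by a simplified extraction routine. The sketch below is not the feature extractor used in the experiment: it omits stemming, part-of-speech filtering and named entity recognition, treats every text node as one paragraph, and splits sentences with a naive regular expression. The feature name prefixes (tag:, tok:, num:) and the function name are assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev


class _SegmentReader(HTMLParser):
    """Collects tag counts and text nodes; script code and comments are dropped."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.paragraphs = []
        self._skip = 0  # nesting depth inside script elements

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._skip += 1
        else:
            self.tags[tag] += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.paragraphs.append(data.strip())
    # Comments are ignored because handle_comment is not overridden.


def _length_sd(units):
    """Population standard deviation of unit lengths, measured in words."""
    lengths = [len(unit.split()) for unit in units]
    return pstdev(lengths) if lengths else 0.0


def extract_features(segment_html):
    reader = _SegmentReader()
    reader.feed(segment_html)
    reader.close()
    text = " ".join(reader.paragraphs)

    features = {}
    # (1) Frequency of HTML tags.
    for tag, count in reader.tags.items():
        features[f"tag:{tag}"] = float(count)
    # (2) Frequency of tokens (words and punctuation marks, lower-cased).
    for token, count in Counter(re.findall(r"\w+|[^\w\s]", text.lower())).items():
        features[f"tok:{token}"] = float(count)
    # (3) Structure-related numerical features.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    features["num:sentence_len_sd"] = _length_sd(sentences)
    features["num:paragraph_len_sd"] = _length_sd(reader.paragraphs)
    return features
```

The resulting dictionary would still have to be mapped to numeric feature indices before it can be passed to an SVM implementation.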
3 EXPERIMENT
For evaluating our approach we conducted a catego-
rization experiment by focusing on three hypertext
types: conference websites, personal academic web-
sites and academic project websites. Previous studies
focused on distinguishing thematically clearly sepa-
rated web genres such as web shops and web logs,
listings and search pages. In contrast to this, we
deal with hypertext types which are closely related
based on their common thematic background. The
reason is to deal with a more realistic scenario of web
genre categorization. Although there is much effort
on building reference corpora of web genre catego-
rization (Rehm et al., 2008), these data are still out of
reach so that we needed to compile our own training
corpora. The training corpus was built by
three volunteers who downloaded 50 German web pages
per hypertext type. Because of our two-level approach,
each of these 150 pages was manually segmented
in terms of their genre-related sections (e.g. contact,
research, call for papers). That is, for each of the
segment types distinguished in this study monomor-
phic segments have been identified as typical exam-
ples in order to learn classifiers which can detect these
types of segments even in polymorphic web pages.
As a result 1,250 segments have been manually typed
and used for training our SVM-based segment clas-
sifiers. All annotated segments with their associated
segment labels were used for feature selection. We
also performed a feature selection procedure based on
the GSS coefficient (Galavotti et al., 2000). However,
feature selection neither improved nor degraded
our categorization. For each hypertext type, we trained
one SVM in which the corresponding segment types are
trained against each other. For evaluation purposes, we used the
Leave-One-Out F-Measure calculated by the SVM-Light
implementation. In order to determine the hypertext
type of a compound web document we developed
a weighted finite-state transducer. In doing so,
we argue that each web genre is represented by a
corresponding document grammar which can in turn be
represented by a weighted directed graph. The grammar is
determined by the transition probabilities accumulated
over its segments. For the experiment on an overall
categorization scenario, we randomly chose 60
websites from the annotated corpus (20 for each
hypertext type), using both polymorphic and monomorphic
websites.
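The idea of representing each web genre by a weighted directed graph over segment types can be sketched as follows. This is a simplified stand-in for the weighted finite-state transducer and not its actual implementation: it assumes that each genre graph is estimated as first-order transition counts from sequences of segment labels, that unseen transitions are handled by additive smoothing, and that a page is assigned to the genre whose graph yields the highest accumulated log-probability. All function names and the smoothing constant are assumptions made for this example.

```python
import math
from collections import defaultdict

START, END = "<s>", "</s>"


def train_genre_graph(sequences, smoothing=1.0):
    """Estimates a weighted directed graph from segment type sequences of one genre."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = {END}
    for sequence in sequences:
        path = [START, *sequence, END]
        vocab.update(sequence)
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1.0
    return {"counts": counts, "vocab": vocab, "smoothing": smoothing}


def log_score(graph, sequence):
    """Accumulates smoothed log transition probabilities along the sequence."""
    counts, vocab, k = graph["counts"], graph["vocab"], graph["smoothing"]
    path = [START, *sequence, END]
    total = 0.0
    for src, dst in zip(path, path[1:]):
        row = counts.get(src, {})
        # One extra slot reserves probability mass for unseen segment types.
        denominator = sum(row.values()) + k * (len(vocab) + 1)
        total += math.log((row.get(dst, 0.0) + k) / denominator)
    return total


def classify(graphs, sequence):
    """Returns the genre whose graph assigns the highest score to the sequence."""
    return max(graphs, key=lambda genre: log_score(graphs[genre], sequence))
```

In such a setting, the input sequence would be the list of segment types predicted by the first-level classifiers for a given page or website.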
4 RESULTS
Tables 1–3 show the results of our first-level categorization
with a focus on hypertext segment types.
Table 4 shows the results of the second-level categorization
with a focus on web genres or hypertext
types. The first-level categorization shows that we
clearly outperform the corresponding baseline scenario.
However, our F-measure values are
around .65 for all three hypertext types and thus
far from more desirable values above .9. As
a consequence, the second-level categorization
also results in an F-measure value of about
.625, in conjunction with remarkably balanced recall
and precision values, which likewise outperforms the
corresponding baseline, albeit to a lesser degree.
Let us look at related experiments in order to interpret
these results and to shed light on the range of
F-measure values obtained in our experiment. Santini
(2006) and Eissen and Stein (2004), for example,
report an accuracy of .67 (Naive Bayes classifier)
and an F-measure value of .89 (SVM classifier),
respectively. Although these related experiments in
web genre categorization cannot be directly compared
to ours, their results seem to be better. However,
a closer look at the experiments conducted
reveals experimental differences
which put this statement into perspective:
Santini (2006) and Eissen and Stein (2004) both deal with
Table 1: Evaluation Results: Conference Sites.
Classes (11) Recall Precision F-Measure
about .578 .703 .634
accommodation .680 .700 .690
call .350 .389 .368
committees .609 .609 .609
contact .581 .720 .643
disclaimer .706 .667 .686
organizer .455 .417 .435
program .692 .838 .758
registration .729 .771 .749
sightseeing .708 .739 .723
sponsors .542 .650 .591
Average .603 .655 .626
Baseline .200
Table 2: Evaluation Results: Personal Sites.
Classes (6) Recall Precision F-Measure
contact .947 .857 .899
links .583 .636 .608
personal .661 .709 .684
publications .795 .720 .756
research .485 .800 .604
teaching .581 .643 .610
Average .675 .728 .694
Baseline .280
web genres which, because of their thematic and functional
divergence, are obviously more separable than
the ones considered here. Web genres such as
Weblogs, FAQs, Search Engine Pages or Listings
are thematically and functionally more divergent
than project pages and conference websites,
which both stem from academia. Thus,
we expect that there is a larger divergence of lexical
and other features between the genres considered in
their experiments than between the genres explored in
our experiment.
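The F-measure values in the tables can be related to the reported recall and precision values by a short calculation. The sketch below assumes that the F-measure is the balanced harmonic mean of recall and precision and that the Average row is the unweighted mean over the classes; under these assumptions it reproduces the F-measure column of Table 4. The function and variable names are chosen for this example only.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) per hypertext type, taken from Table 4.
table4 = {
    "conference": (0.640, 0.640),
    "personal": (0.618, 0.627),
    "project": (0.620, 0.608),
}

per_class = {genre: round(f_measure(r, p), 3) for genre, (r, p) in table4.items()}
macro_f = round(sum(per_class.values()) / len(per_class), 3)

print(per_class)  # {'conference': 0.64, 'personal': 0.622, 'project': 0.614}
print(macro_f)    # 0.625
```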
5 CONCLUSIONS
This paper presented a model of two-level categoriza-
tion of web genres. In contrast to related approaches
the model presented here additionally explores and
categorizes functionally and thematically demarcated
segments of the hypertext types to be categorized. By
classifying these segments conclusions can be drawn
about the type of the corresponding compound web
document. Our research provides results in support
of solving this task and, thus, goes beyond the narrow
focus of classical approaches to functional hypertext
categorization.
Table 3: Evaluation Results: Project Sites.
Classes (9) Recall Precision F-Measure
contact .823 .869 .849
events .525 .636 .575
framework .447 .568 .500
links .471 .421 .444
news .539 .560 .549
objectives .603 .734 .662
project .799 .789 .794
publications .761 .761 .761
staff .500 .807 .617
Average .608 .683 .639
Baseline .240
Table 4: Evaluation Results: Hypertext Type Classification.
Classes (3) Recall Precision F-Measure
conference .640 .640 .640
personal .618 .627 .622
project .620 .608 .614
Average .626 .625 .625
Baseline .428
ACKNOWLEDGEMENTS
Financial support of the German Research Founda-
tion (DFG) through the Research Group 437, Project
A4 and Topic-Oriented Peer-to-Peer Agents in Digital
Libraries (LIS) at Bielefeld University is gratefully
acknowledged.
REFERENCES
Eissen, S. M. Z. and Stein, B. (2004). Genre classification
of web pages: User study and feasibility analysis. In
Biundo, S., Frühwirth, T., and Palm, G., editors, Advances
in Artificial Intelligence, pages 256–269. Springer.
Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experi-
ments on the use of feature selection and negative evi-
dence in automated text categorization. In ECDL ’00:
Proc. of the 4th European Conf. on Res. and Adv. Tech.
for DL, pages 59–68, London, UK. Springer-Verlag.
Joachims, T. (1997). Text categorization with support vec-
tor machines: Learning with many relevant features.
Technical report.
Joachims, T., Cristianini, N., and Shawe-Taylor, J. (2001).
Composite kernels for hypertext categorisation. In
Proc. of the 11th Int. Conf. on Machine Learning,
pages 250–257. Morgan Kaufmann.
Karlgren, J. and Cutting, D. (1994). Recognizing text gen-
res with simple metrics using discriminant analysis.
In Proc. of the 15th Conf. on CL, pages 1071–1075.
ACL.
Kessler, B., Nunberg, G., and Schütze, H. (1997).
Automatic detection of text genre. pages 32–38.
Lee, Y.-B. and Myaeng, S. H. (2002). Text genre classifi-
cation with genre-revealing and subject-revealing fea-
tures. In SIGIR ’02: Proc. of the 25th Int. ACM SIGIR,
pages 145–150, New York, NY, USA. ACM.
Lee, Y.-B. and Myaeng, S. H. (2004). Automatic iden-
tification of text genres and their roles in subject-
based categorization. In HICSS ’04: Proc. of the 37th
HICSS’04, page 40100.2, Washington, DC, USA.
IEEE Computer Society.
Lim, C., Lee, K., and Kim, G. (2005). Automatic genre
detection of web documents. In Su K., Tsujii J., Lee
J., Kwong O. Y., NLP, Berlin. Springer.
Lindemann, C. and Littig, L. (2006). Coarse-grained clas-
sification of web sites by their structural properties. In
Proc. of the 8th ACM - WIDM’06, pages 35–42, New
York, NY, USA. ACM Press.
Mehler, A. (2007). Structure formation in the web. Toward
a graph-theoretical model of hypertext types. In Witt,
A. and Metzing, D., editors, Linguistic Modelling of
Information and Markup Languages. Springer, Dor-
drecht.
Mehler, A., Gleim, R., and Dehmer, M. (2005). To-
wards structure-sensitive hypertext categorization. In
Proc. of the 29th Annual Conf. of the GCS, Universität
Magdeburg, March 9-11, 2005, Berlin/New
York. Springer.
Mehler, A., Gleim, R., and Wegner, A. (2007). Structural
uncertainty of hypertext types. An empirical study. In
Proc. of Towards Genre-Enabled Search Engines: The
Impact of NLP, September, 30, 2007, Borovets, Bul-
garia, pages 13–19.
Rehm, G., Santini, M., Mehler, A., et al. (2008).
Towards a reference corpus of web genres for the
evaluation of genre identification systems. In Proc. of the
6th LREC 2008, Marrakech (Morocco).
Santini, M. (2006). Identifying genres of web pages. In
Proc. of TALN 2006.
Santini, M., Power, R., and Evans, R. (2006). Implementing
a characterization of genre for automatic genre identi-
fication of web pages. In Proc. of the COLING/ACL,
pages 699–706, Morristown, NJ, USA. ACL.