ONLINE WEB GENRE CLASSIFICATION, IS IT DOABLE?

Hoda Badesh, James Blustein and Anwar Alhenshiri

Faculty of Computer Science, Dalhousie University, 6050 University Ave., Halifax, NS, Canada

Keywords: Web genre, Classification, Clustering, Retrieval.

Abstract: This paper investigates the feasibility and effectiveness of online clustering of Web search results by genre.

Although there are several research studies that have investigated the accuracy of classifying Web pages by

genres, research has focused only on off-line clustering and classification due to the large number of

documents on the Web. This research intends to investigate the feasibility of creating sets of Web pages to

represent main genres on the Web. Each genre, as identified in the work of Santini (2006), will be

represented by a set of Web pages. Web search results will be compared to those sets and classified

accordingly. Search results will be grouped according to the set of genre

representatives to which they are most similar. The resulting clusters of Web search results will be rendered to the user. A user study will

be conducted to examine the validity and accuracy of online clustering based on Web genres.

1 INTRODUCTION

The Web is growing in size and variety of content

made available by almost everyone. According to

Teevan (2008), information mismatching and

overloading are two significant problems that arise when

search engines gather and present information,

and both may decrease the performance of a search

engine. Improving the effectiveness and

performance of Web search engines has been

investigated in several directions. Clustering is one

concept that has been investigated with the aim of

improving the performance and effectiveness in

Web retrieval (Alhenshiri et al., 2010).

In Web information retrieval, clustering is meant

for grouping similar documents (Manning et al.,

2008). Clustering is usually intended to provide

overviews of information categories (topics) in the

result set. Hence, efficient subtopic retrieval is

anticipated with the use of clustering in Web search

results presentations (Carpineto et al., 2009).

Clustering can also decrease the need for scrolling

over multiple pages of results and motivate users to

look beyond the first few hits, resulting in more

effective and efficient user performance. In Web

information retrieval, clustering has been

investigated in several prototypes (Alhenshiri et al.,

2010; Turetken and Sharda, 2005).

Clustering has also been implemented in

conventional search engines such as Clusty

(www.clusty.com), Gceel (www.Gceel.com), and

Google (in their “see similar” feature and Google

Wonder Wheel). Although the performance of users

with row presentations of Web documents is

comparable to their performance with clustering-

based presentations, user preference usually comes

in favour of clustering-based methods (Carpineto et

al., 2009). In addition, there are indications that

clustering can even be more effective (Turetken and

Sharda, 2005). Even though topic-based clustering

might be favoured over row presentations of search

results, there has been little consideration of the

concept of genre-based clustering.

According to Marina Santini’s research

(http://sites.google.com/site/marinasantiniacademicsite/),

‘Documents can be classified into topical and

non-topical text categories, that I call descriptors.

Examples of topical descriptors

are topic, content, subject matter or domain.

Examples of non-topical descriptors are genre,

vulgarisation, or layout structure (e.g. tables or

lists)’. The online non-topical classification which

relies on using Web page genres as discriminators

among Web pages is the goal of this research. The

attempt is to show the possibility of performing

genre-based clustering during query time. In

addition, the research intends to investigate the

effectiveness of this approach compared to topical

clustering.

Badesh H., Blustein J. and Alhenshiri A..

ONLINE WEB GENRE CLASSIFICATION, IS IT DOABLE?.

DOI: 10.5220/0003314202780281

In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST-2011), pages 278-281

ISBN: 978-989-8425-51-5

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The remainder of the paper is divided as follows.

Section 2 illustrates the research rationale. Section 3

explains the process of creating the Web genre

representatives. Section 4 explains the intended

improvements and discusses the expected outcomes

of the research. Section 5 concludes the paper.

2 RESEARCH RATIONALE

In the literature, there has not been a unified clear

definition of the concept of Web page genre. Genres

are perceived as groups or classes of documents that

have certain features in common such as content,

structure, and functionality. Research has shown that

high levels of accuracy in genre-based classification

have been achieved with small and static datasets

(Mason et al., 2009). Since the Web is constantly evolving, it

becomes difficult to uniquely identify all kinds of

genre that the Web may contain. The growing

number of social sites and community content on the

Web yields the continuous emergence of new Web

genres (Santini, 2006).

Creating a fine-grained palette of Web page

genres is a challenging problem. Santini (2006) used

a set of 25 Web-sampled pages and 23 different

genre labels in an attempt to investigate the user

perception of Web genres and evaluate the

evolutionary nature of Web genres. The results of

the study showed that Web genres can be grouped

into three categories: easy, ambiguous, and difficult.

The outcome of the study indicates that Web genres

hold the properties of hybridism and individualism.

Researchers have attempted to categorize Web

genres using different techniques among which are

textual features (n-gram, word-gram, bag of words,

part of speech, etc.) and visual features of Web pages

(Levering et al., 2008). Machine learning techniques

have been heavily investigated in the role of genre-

based classification with constant and relatively

small data sets.

Web page genre clustering is grouping Web

pages similar in content and structure into clusters of

genres (Santini, 2006). Most of the clustering

approaches in the literature are topic-based, which

takes the content of Web pages as the sole factor for

measuring similarity. Genres extend the idea of

content to include the structure and

functionality of the document. In addition to

text-based clustering of documents, Levering et

al. (2008) used HTML level features such as tags

and hypertext components to improve the accuracy

of clustering in a sample of online e-commerce

documents. In addition, visual features have been

investigated including the distribution of

components on the Web page as a distinguishing

factor of page genres.

Machine learning techniques have been utilized

for clustering and classifying genres for small

datasets such as in the work of Mason et al. (2009).

In their research, an n-gram based technique was

shown to provide high accuracy in genre

classification. An SVM-based classifier and a rule-based

classifier were compared by Stubbe et al. (2007) to

investigate the effect of implicit and explicit user

feedback on incremental genre classification.

Levering et al. (2008) used a binary SVM classifier

to evaluate the use of different features in genre

classification. Accordingly, research has reached a

satisfactory level of accuracy in off-line

classification and clustering of Web documents.

Investigating the possibility of classifying and

clustering Web search results online during query

time has yet to be examined.

3 CREATING WEB PAGE GENRE REPRESENTATIVE SETS

This research is aimed at creating groups of Web

pages to which Web search results will be compared

online during query time. These sets of Web pages

are called Web genre representatives. Web search

results will be compared and classified according to

their similarities to the representative sets. The

resulting clusters of Web search results will be

rendered to the user. The classification will use the

content of the document as well as the structure

which will be represented by surface features

(mainly the type of tags and their distribution on the

document).

The research has already started by selecting the

work of Santini (2006) to identify the main genres

on the Web. Although the 23 genres

identified by Santini (2006) may not be

exhaustive, these genres are sufficient for investigating

the possibility of identifying Web genres among

search results during query time. Hence, improving

the process of Web information retrieval is

anticipated. Due to high similarities among the

genres identified in the work of Santini (2006), the

number of genres in the research presented in this

paper was reduced to 20. Those genres are shown in

Table 1.

Following the process of genre identification, the

research intends to create a set of representative

Web documents for each genre. This will take place

by downloading Web pages that belong to different

genres and which were used in Santini (2006). The

key concept in this research is, however, deciding on

the number of Web pages that permits: first, a

satisfactory level of accuracy, and second, a

satisfactory classification time during query

answering. To determine adequate numbers of Web

pages for each genre, a repetitive process is needed.

Table 1: Types of Genres on the Web (taken from the

work of Santini (2006)).

No.  Genre                                           Comments
1    e-shop
2    personal home page
3    front page
4    search page
5    corporate home page OR organization home page   merged
6    FAQs Web page
7    splash screen
8    net ad
9    email OR mailing list                           merged
10   sitemap
11   hotlist
12   academic personal home page
13   about page
14   blog OR clog                                    merged
15   search by multiple fields
16   online form
17   newsletter
18   howto page
19   online tutorial
20   magazine cover

The process will start by giving each genre a

limited number of pages to represent the genre itself.

The pages will be selected from two different

sources. The first comprises the KI-04 and the SPIRIT

collections (http://www.itri.brighton.ac.uk/~Marina.Santini/).

The second source is fundamentally

different and will use the 25 pages used in Santini

(2006). For each page, pages of the same category

will be downloaded and a collection of Web pages

will be created for each genre. The size of the

collection in the second approach will be as close as

possible to that of the KI-04 and the SPIRIT

collections.

The next step in the research will take a subset

from each genre-related group of pages and use it in

online classification. Classification will happen by

assigning Web search results to the closest genre

represented by any of the subsets. The similarity will

be computed between each Web page in the search

results and the subset (cluster) of pages that

represent a particular genre. Testing will take into

account single-link, complete-link, and centroid

approaches in measuring the closeness of a Web

page to each cluster. The similarity will be based on

the content (cosine similarity) and the structure of

pages, i.e. surface features as in the work of Santini

and Sharoff (2009).
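The comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the function names (`cosine`, `vectorize`, `closeness`, `classify`), the whitespace tokenizer, the raw term-frequency weighting, and the toy genre sets are all hypothetical, and only the content-based (cosine) part of the similarity is covered.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vectorize(text):
    """Bag-of-words term frequencies (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def closeness(page, representatives, linkage="centroid"):
    """Similarity of one page vector to a set of representative vectors."""
    if linkage == "single":      # similarity to the most similar representative
        return max(cosine(page, r) for r in representatives)
    if linkage == "complete":    # similarity to the least similar representative
        return min(cosine(page, r) for r in representatives)
    if linkage == "centroid":    # similarity to the summed vector; cosine ignores
        centroid = Counter()     # scale, so the sum stands in for the mean
        for r in representatives:
            centroid.update(r)
        return cosine(page, centroid)
    raise ValueError(f"unknown linkage: {linkage}")


def classify(page, genre_sets, linkage="centroid"):
    """Return the genre label whose representative set is closest to the page."""
    return max(genre_sets, key=lambda g: closeness(page, genre_sets[g], linkage))


# Toy representative sets, invented for illustration only.
genre_sets = {
    "e-shop": [vectorize("add to cart price checkout"),
               vectorize("buy now price shipping")],
    "FAQs Web page": [vectorize("question answer how do i"),
                      vectorize("frequently asked question answer")],
}
result = vectorize("price of shipping and checkout")
label = classify(result, genre_sets, linkage="single")
```

In terms of similarity rather than distance, single-link takes the maximum and complete-link the minimum over the representatives, so the three options differ only in how the set is summarized, which makes them easy to compare on the same data.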

Each document will be assigned a genre label

based on how similar it is to the set of documents

that represent a genre. The surface features that will

be used in measuring the similarity of Web

documents will include the type, number, and

distribution of tags on the Web page. This approach

has been shown to be very effective in genre-based

classification (Santini and Sharoff, 2009).
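As a rough sketch of how such surface features might be extracted, the Python fragment below counts the opening tags of a page with the standard-library `html.parser` module. It is an illustration only: the names `TagCounter` and `surface_features` are hypothetical, and "distribution" is interpreted here as the relative frequency of each tag type, which is an assumption and only one possible reading of the features described above.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Collects the sequence of opening tags seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already converted to lowercase.
        self.tags.append(tag)


def surface_features(html):
    """Return (total tag count, relative frequency of each tag type)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.tags)
    total = sum(counts.values())
    distribution = {tag: n / total for tag, n in counts.items()} if total else {}
    return total, distribution


# Toy page, invented for illustration only.
page = "<html><body><form><input><input><a href='#'>help</a></form></body></html>"
total, distribution = surface_features(page)
```

The resulting dictionary can be treated as a sparse vector and compared with the same similarity measure used for content, which keeps the structural and textual comparisons on a common footing.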

The process of assigning Web pages to genres

will be examined after each classification during the

testing stage. The accuracy of the classification of

Web pages will be measured. At each iteration, the

number of pages that represent a genre will be

increased and the accuracy will be tested. When the

experiment reaches a number of pages in each

representative set that satisfies high accuracy and

high satisfaction with the resulting clusters of Web

pages, the process stops. This holds for every genre.

The result of this entire process (shown in Figure 1)

is a representative set for each of the 20 main genres, to be used for online

classification and clustering of Web search results.
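The iterative procedure above can be sketched as a simple loop. The Python fragment below is a simplified illustration, not the planned experimental setup: `grow_representative_sets`, `overlap_classifier`, the `step` parameter, and the toy data are hypothetical; all genres share one set size here, whereas the process described above is applied per genre; and only accuracy is used as the stopping criterion, since user satisfaction cannot be computed automatically.

```python
def grow_representative_sets(candidates, labelled_results, classify,
                             target_accuracy, step=1):
    """Grow each genre's representative set until accuracy reaches the target.

    candidates:       dict mapping genre -> list of candidate pages, in order
    labelled_results: list of (page, true_genre) pairs used for testing
    classify:         function (page, genre_sets) -> predicted genre
    """
    size = step
    max_size = max(len(pages) for pages in candidates.values())
    while True:
        # Use the first `size` candidates of every genre as its representatives.
        genre_sets = {g: pages[:size] for g, pages in candidates.items()}
        correct = sum(classify(p, genre_sets) == g for p, g in labelled_results)
        accuracy = correct / len(labelled_results)
        # Stop when accurate enough or when no candidates are left to add.
        if accuracy >= target_accuracy or size >= max_size:
            return genre_sets, size, accuracy
        size += step


def overlap_classifier(page, genre_sets):
    """Toy classifier: pick the genre sharing the most words with the page."""
    words = set(page.split())

    def score(genre):
        return max(len(words & set(rep.split())) for rep in genre_sets[genre])

    return max(genre_sets, key=score)


# Toy candidates and labelled test pages, invented for illustration only.
candidates = {
    "e-shop": ["cart", "cart price", "cart price checkout"],
    "sitemap": ["index", "index links", "index links sections"],
}
labelled_results = [
    ("price checkout", "e-shop"),
    ("links sections", "sitemap"),
    ("cart", "e-shop"),
    ("index", "sitemap"),
]
genre_sets, size, accuracy = grow_representative_sets(
    candidates, labelled_results, overlap_classifier, target_accuracy=1.0
)
```

Because classification time grows with the number of representatives compared per result, returning the smallest set size that meets the accuracy target reflects the trade-off between accuracy and query-time cost discussed in this section.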

Figure 1: Research Approach.

4 DISCUSSION AND FUTURE WORK

From Santini (2006), twenty genres have been

selected for the purpose of this research. These

genres will be used in categorizing Web search

results. The process of creating a set of Web pages

that represents each of the genres is the next step in

the research. The process will result in clusters (sets)

of Web pages that resemble the content, the

structure, and the functionality of the corresponding

genres. Each set will go through several refinements

before it is considered a genre

representative. These refinements will be aimed at

minimizing the time required for achieving the

classification and clustering processes during query

time. In addition, the refinements will aim at

providing satisfactory levels of accuracy in the

classification.

After selecting Web genre representatives, the

research will aim at conducting a user study in

which the accuracy of genre-based classification will

be further investigated. The user engagement with

genre-based clustering as well as the effectiveness of

this approach will be investigated in the study. The

study will show the extent to which users will be

satisfied with genre-based clustering compared to

topical clustering and row presentations of Web

search results. Further research may be aimed at

more profound analysis of Web page genres to

include other subgenres.

5 CONCLUSIONS

Taking into consideration that Web genres may yield

more effective classification of Web documents

(Rosso, 2005), this research aims at investigating the

feasibility of classifying Web search results by

genres. The ultimate goal is to provide more

effective search results to the user. The remaining

stages of the research will involve creating Web

genre representatives of Web pages for the purpose

of classification. In addition, the clustering of Web

search results by genres will be investigated in a user

study that compares genre-based clustering to

topical clustering.

REFERENCES

Alhenshiri, A., Brooks, S., Watters, C., Shepherd, M.,

2010. Augmenting the Visual Presentation of Web

Search Results. In Proceedings of the 5th International

Conference on Digital Information Management,

Thunder Bay, ON, Canada, (to appear).

Carpineto, C., Osiński, S., Romano, G., Weiss, D., 2009. A

Survey of Web Clustering Engines. ACM Computing

Surveys, vol. 41, issue 3, Article No. 17.

Levering, R., Cutler, M., Yu, L., 2008. Using Visual

Features for Fine-Grained Genre Classification of Web

Pages. In Proceedings of the 41st Annual Hawaii

International Conference on System Sciences, Hawaii,

USA, 131.

Manning, C. D., Raghavan, P., Schütze, H., 2008.

Introduction to Information Retrieval. Cambridge

University Press.

Mason, J. E., Shepherd, M., Duffy, J., 2009. An N-Gram

Based Approach to Automatically Identifying Web

Page Genre. HICSS 2009: 1-10.

Rosso, A. M., 2005. What type of page is this?: Genre as

Web Descriptor. In Proceedings of the 5th

ACM/IEEE-CS Joint Conference on Digital Libraries,

Denver, CO, USA, 398.

Stubbe, A., Ringlstetter, C., Zheng, T., Goebel, R., 2007.

Incremental Genre Classification. In Proceeding of

Colloquium held in conjunction with Corpus

Linguistics, Birmingham, UK.

Santini, M., 2006. Interpreting Genre Evolution on the

Web. In EACL 2006 Workshop: NEW TEXT - Wikis

and blogs and other dynamic text sources, Trento, 32-

40.

Santini, M., Sharoff, S., 2009. Web Genre Benchmark

Under Construction. Journal for Language

Technology and Computational Linguistics (JLCL).

Volume 25, Number 1- Special Issue: Automatic

Genre Identification: Issues, and Prospects.

Teevan, J., 2008. How People Recall, Recognize, and

Reuse Search Results. ACM Transactions on

Information Systems, vol. 26, issue 4. Article No. 19.

Turetken, O., Sharda, R., 2005. Clustering-based Visual

Interfaces for Presentation of Web Search Results: An

Empirical Investigation. Information Systems Frontiers,

7(3), 273-297.
