redundant blocks, except the main block, if two
blocks are very similar to each other, they are
considered as redundant blocks, and hence are not
the main/informative blocks.
The work presented in this paper adopts and
modifies the CST approach proposed by Li & Yang
(2009). Unlike the original approach (Li & Yang,
2009), our approach employs the following strategy:
1. Employing an HTML cleaner before processing
pages.
2. Using only text for calculating block significance.
3. Utilizing both the depth and the number of a node's
children as the attenuation quotient factor.
4. Employing two steps of main block detection. In
the first step, the modified CST is used to detect the
primary and secondary main block markers of posts
from the same domain. In the second step, the
main block is detected by filtering the post using the
markers, with the modified CST as the
second alternative.
3 EXTENDED CST APPROACH
The CST approach (Li & Yang, 2009) consists of
two steps: (1) building the CST of an HTML page,
and (2) measuring the importance of the nodes, which
represent nested blocks, and selecting the block
with the highest importance value as the main block
of the page. The CST contains two types of nodes:
HTML item nodes and content nodes. The
former are generated from <body>, <div>, or
<table> tags and represent blocks, while the latter
represent the actual content of the page and are
generated from text, <image>, or <link> tags.
Starting from the deepest node, the importance of
an HTML item node is calculated from the
length of the text and the number of links and images
contained in the HTML item. The importance of an
HTML item node N is recursively defined as (Li &
Yang, 2009):
$I_N = \alpha_N \cdot \left( H^{C}_N + T^{C}_N \right)$    (1)
where $\alpha_N$ is the attenuation quotient of the node
and the sum in parentheses is the importance of the
children of the node N: $H^{C}_N$ is the importance of
the HTML item nodes which are the children of node N,
and $T^{C}_N$ is the importance of the content nodes
which are the children of the node N, calculated as the
weighted sum of the size of the text, the number of
images, and the number of links. The experiment
reported by Li & Yang (2009) showed that the
optimal weights are 1.0, 0.75, and 40 for text, links,
and images, respectively. For a more detailed
explanation, the reader should refer to the original
article (Li & Yang, 2009).
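The recursion in (1) can be sketched in Python as follows. This is a minimal illustration and not the implementation of Li & Yang (2009): it reads (1) as the attenuation quotient multiplied by the summed importance of the child item nodes and child content nodes. The names ItemNode, ContentNode, content_importance, and importance are hypothetical, and the attenuation quotient is passed in as a function so that the sketch does not commit to a particular definition. Only the weights 1.0, 0.75, and 40 are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Weights reported as optimal by Li & Yang (2009) for text, links, and images.
W_TEXT, W_LINK, W_IMAGE = 1.0, 0.75, 40.0


@dataclass
class ContentNode:
    """Hypothetical content node: text size plus link and image counts."""
    text_length: int = 0
    num_links: int = 0
    num_images: int = 0


@dataclass
class ItemNode:
    """Hypothetical HTML item node (a block) with nested blocks and content."""
    tag: str
    items: List["ItemNode"] = field(default_factory=list)
    contents: List[ContentNode] = field(default_factory=list)


def content_importance(c: ContentNode) -> float:
    """Weighted sum of text size, number of links, and number of images."""
    return W_TEXT * c.text_length + W_LINK * c.num_links + W_IMAGE * c.num_images


def importance(node: ItemNode, attenuation: Callable[[ItemNode], float]) -> float:
    """Attenuation quotient times the total importance of the node's children."""
    item_part = sum(importance(child, attenuation) for child in node.items)
    content_part = sum(content_importance(c) for c in node.contents)
    return attenuation(node) * (item_part + content_part)


# Usage: a <div> holding 100 characters of text, 2 links, and 1 image,
# nested in a <body>; a constant quotient of 0.5 is used purely for illustration.
div = ItemNode("div", contents=[ContentNode(text_length=100, num_links=2, num_images=1)])
body = ItemNode("body", items=[div])
div_importance = importance(div, lambda n: 0.5)
body_importance = importance(body, lambda n: 0.5)
```

With a constant quotient below 1, the enclosing <body> scores lower than the <div> that actually holds the content, which is the effect the attenuation quotient is meant to produce.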
Based on the CST approach, we propose two
improvements: the modified CST approach (mCST)
and the extended mCST (E-mCST). Our
modification of the CST approach (the mCST
approach) includes the following:
(1) We observe that in a blog post, a block is
represented not only by <body>, <table>, and
<div>, but also by other tags such as <id> and
<span>. For simplicity, our implementation
treats any tag other than the text, image, and link
tags as an HTML item tag.
(2) In our research case, we need to extract
the textual information from blog posts. Moreover,
we observe that the main block of a blog post usually
contains very few links and images. Hence, we
propose to use text as the only factor for calculating
the importance of a content node.
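Modifications (1) and (2) can be illustrated with a short Python sketch built on the standard html.parser module. It is only an approximation under stated assumptions: the input is assumed to be already cleaned and well nested, link and image tags are mapped to <a> and <img>, text inside a link is counted toward the enclosing block, and the names ItemNode, TreeBuilder, build_tree, and text_importance are hypothetical. The attenuation quotient is again supplied by the caller.

```python
from html.parser import HTMLParser

# Link and image tags are content, not blocks (assumed mapping to <a> and <img>).
CONTENT_TAGS = {"a", "img"}
# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ItemNode:
    """Hypothetical item node: any tag other than text, image, or link tags."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []    # nested item nodes
        self.text_length = 0  # characters of text directly inside this block


class TreeBuilder(HTMLParser):
    """Builds a tree of item nodes from cleaned, well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = ItemNode("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS or tag in VOID_TAGS:
            return  # does not open a block
        node = ItemNode(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS or tag in VOID_TAGS:
            return
        # Only pop when the closing tag matches the open block.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text is the only factor that contributes to importance.
        self.stack[-1].text_length += len(data.strip())


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_importance(node, attenuation):
    """Text-only importance: own text plus children's importance, attenuated."""
    below = sum(text_importance(child, attenuation) for child in node.children)
    return attenuation(node) * (node.text_length + below)


root = build_tree(
    "<body><div><p>Hello world</p><a href='#'>link</a></div><span>x</span></body>"
)
```

In this example the <p> and <span> tags become item nodes, while the <a> tag does not open a block of its own.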
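(3) In the original CST approach (Li & Yang, 2009),
the attenuation quotient $\alpha_N$ is a piecewise
function of $D_N$ with a threshold at 10: it equals 1
on one side of the threshold and $\log(D_N)$ on the
other (see Li & Yang, 2009, for the exact conditions),    (2)
where $D_N$ is the depth of the node N. The definition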
does not distinguish between a wide subtree and a
narrow subtree. Figure 2 illustrates the situation. For
clarity, circles represent item nodes and rounded
rectangles represent content nodes.
As shown in Figure 2, a node with a narrow
subtree contains denser content than one
with a wider subtree; therefore, the former should
suffer less attenuation than the latter. Hence,
considering that the attenuation quotient should be
influenced not only by the depth of the node but also
by the width of the subtree of the node, we propose a
new definition of the attenuation quotient as:
$\alpha_N = \frac{1.0}{\log(D_N + 10)} \cdot \frac{1.0}{\log(C_N + 10)}$    (3)
where $D_N$ is the depth of the node N, and $C_N$ is the
number of the children of the node N.
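As a rough numerical check, the proposed quotient in (3) can be sketched in Python. The sketch reads (3) as 1.0 / Log(D_N + 10) multiplied by 1.0 / Log(C_N + 10) and assumes base-10 logarithms. Both the base and the function name attenuation are assumptions made for this illustration.

```python
import math


def attenuation(depth: int, num_children: int) -> float:
    """Proposed quotient, assuming base-10 logs: 1/log(D+10) * 1/log(C+10)."""
    return (1.0 / math.log10(depth + 10)) * (1.0 / math.log10(num_children + 10))


# Two nodes with the same depth: the one with fewer children (narrow subtree)
# keeps a larger share of its children's importance than the wide one.
narrow = attenuation(depth=3, num_children=2)
wide = attenuation(depth=3, num_children=20)
```

Under these assumptions the quotient never exceeds 1 and shrinks as either the depth or the number of children grows, so a narrow subtree is attenuated less than a wide one with the same depth.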
In the E-mCST approach, we also exploit
the fact that posts from the same
domain tend to have a similar block layout, which is
represented by similar HTML item nodes. Thus, for
posts from the same domain of address, we identify
the HTML tags and their attributes that are
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval