of characters (tokens) is labeled as the main content
and is provided to the user. BTE (Finn et al., 2001)
and DSC (Pinto et al., 2002) are two of the state-of-
the-art algorithms in this category. Block-based main
content extraction algorithms, e.g. boilerplate detection
using shallow text features (Kohlschütter et al.,
2010), divide an HTML file into a number of blocks,
and then look for those blocks which contain the main
content. Therefore, the output of these algorithms
consists of blocks which probably contain
the main content. Line-based algorithms such
as CETR (Weninger et al., 2010), Density (Moreno
et al., 2009), and DANAg (Mohammadzadeh et al.,
2013), consider each HTML file as a continuous sequence
of lines. Following their respective logic,
they select those lines of the file which are expected
to contain the main content. Then, the main
content is extracted from the selected lines and
provided to the user. Most of the main content extraction
algorithms benefit from some simple pre-processing
methods which remove all JavaScript code, CSS
code, and comments from an HTML file (Weninger
et al., 2010; Moreno et al., 2009; Mohammadzadeh
et al., 2013; Gottron, 2008). There are two major reasons
for removing these elements: (a) they do not directly
contribute to the main text content and (b) they do not
necessarily affect the content of the HTML document at
the same position where they are located in the source
code. In addition, some algorithms (Mohammadzadeh
et al., 2013; Weninger et al., 2010) normalize the length
of the lines and thus render the approach independent
of the actual line format of the source code.
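The common pre-processing step described above can be sketched as follows. This is a minimal illustration and not the implementation used by any of the cited algorithms; the function name strip_non_content and the regular expressions are assumptions made for this example, and they presume reasonably well-formed HTML.

```python
import re

# Illustrative patterns only; irregular real-world HTML may need a full parser.
_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.I | re.S)
_STYLE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.I | re.S)
_COMMENT = re.compile(r"<!--.*?-->", re.S)


def strip_non_content(html: str) -> str:
    """Remove script elements, style elements, and comments from an HTML string."""
    for pattern in (_SCRIPT, _STYLE, _COMMENT):
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x=1;</script></head>"
        "<body><!-- nav --><p>Main text</p></body></html>"
    )
    print(strip_non_content(page))
```

The remaining markup is what a main content extraction algorithm would then analyse.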
3 PRE-PROCESSING METHODS
In this section, the pre-processing methods are
explained in detail. Hereafter, these methods
are referred to as Filter 1, Filter 2, and Filter
3 for simplicity. In this contribution, the
presented pre-processing methods are combined
only with one of the line-based algorithms, namely
DANAg (Mohammadzadeh et al., 2013).
3.1 Filter 1
Algorithm 1 shows the simple logic used in Filter 1.
One just needs to remove all the
existing hyperlinks in an HTML file, which is done at
line 4 of this algorithm. The only disadvantage of this
pre-processing method is that removing the hyperlinks
also removes their anchor texts. As a result,
hyperlinks located in the main content are lost, and
their anchor texts, which should appear in the main
content, no longer exist in the final extracted
content. Consequently, the application of Filter 1
will reduce either the accuracy or the
recall (Gottron, 2007). In the ACCB algorithm
(Gottron, 2008), an adapted version of CCB,
all the anchor tags are removed from the HTML files
during the pre-processing stage, i.e. it applies Filter 1.
Algorithm 1: Filter 1.
1: Hyper = {h_1, h_2, ..., h_n}
2: i = 1
3: while i <= n do
4:     h_i.remove()
5:     i = i + 1
6: end while
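As an illustration of Algorithm 1, the following minimal sketch removes every anchor element together with its anchor text. It is not the authors' implementation: the function name filter1 and the regular expression are assumptions made for this example, and the pattern presumes non-nested anchors whose attribute values contain no ">" character.

```python
import re

# Matches a whole anchor element: opening tag, anchor text, closing tag.
# The optional group requires whitespace after "<a", so tags such as <abbr> are left alone.
_ANCHOR = re.compile(r"<a(?:\s[^>]*)?>.*?</a\s*>", re.I | re.S)


def filter1(html: str) -> str:
    """Remove all hyperlinks, including their anchor texts (cf. Algorithm 1, line 4)."""
    return _ANCHOR.sub("", html)


if __name__ == "__main__":
    snippet = '<p>Read <a href="http://www.spiegel.de/">Spiegel</a> today</p>'
    print(filter1(snippet))
```

Note that the anchor text "Spiegel" disappears from the output, which is exactly the loss of main content text discussed above.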
3.2 Filter 2
The idea behind Filter 2, which is shown in Algorithm 2,
is that all attributes of each anchor
tag are removed. As the pseudocode in Algorithm 2
shows, after Filter 2 an anchor tag contains
only its anchor text:
<a>anchor text</a>
An advantage of Filter 2 over Filter 1 is that some
anchor texts belonging to anchor tags located
in the main content area can still be extracted.
In other words, the recall (Gottron, 2007)
obtained by applying Filter 2 would
be greater than that obtained with Filter 1.
Algorithm 2: Filter 2.
1: Hyper = {h_1, h_2, ..., h_n}
2: i = 1
3: while i <= n do
4:     for each attribute of h_i do
5:         h_i.remove(attribute)
6:     end for
7:     i = i + 1
8: end while
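Algorithm 2 can be illustrated with a similar minimal sketch, which replaces every opening anchor tag by a bare <a> and leaves the anchor text untouched. Again, this is not the authors' implementation: the function name filter2 and the regular expression are assumptions made for this example, and attribute values are presumed not to contain a ">" character.

```python
import re

# Matches an opening anchor tag that carries at least one attribute.
_ANCHOR_OPEN = re.compile(r"<a\s[^>]*>", re.I)


def filter2(html: str) -> str:
    """Strip all attributes from anchor tags while keeping their anchor texts."""
    return _ANCHOR_OPEN.sub("<a>", html)


if __name__ == "__main__":
    snippet = '<a href="http://www.spiegel.de/" class="nav">Spiegel</a>'
    print(filter2(snippet))
```

In contrast to the Filter 1 sketch, the anchor text survives, so it remains available to the subsequent extraction step.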
3.3 Filter 3
In the third pre-processing method, called Filter 3, all
the HTML hyperlinks are normalized. In other words,
the purpose of this method is to normalize the ratio of
content and code characters representing the hyper-
links. Filter 3 is addressed in the AdDANAg (Mo-
hammadzadeh et al., 2012) algorithm.
For simplicity and better comprehension,
the underlying approach of Filter 3
is described using a typical example. In the
following HTML code, the only attribute is
href="http://www.spiegel.de/".
WEBIST 2014 - International Conference on Web Information Systems and Technologies