and ACCB mentioned in Section 2 of the present paper
are related to this contribution. It is interesting
to note that Moreno et al. (2009) used the Gottron
data sets and compared their approach, LICEA, with
CCB and ACCB. They showed that LICEA led to
better results than CCB and ACCB. Since, as mentioned
above, our results are superior to the LICEA results,
this suggests that a direct comparison of our approach
with CCB and ACCB would also have come out in our
favor.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a two-phase R2L main
content extraction algorithm. Results show that the
R2L algorithm produces satisfactory MC with F1-
measure > 0.92. Our algorithm has some advantages:
• It is independent of the DOM tree and of the HTML
format; therefore, it still works when the HTML
contains errors.
• It does not require a parser. Many previous MCE
methods build a DOM tree or rely on HTML tags for
their purpose, and parsing is time-consuming.
• Fine-tuning the parameter P for each web site
increases the F1-measure.
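The per-site tuning of P mentioned in the last point could be carried out as a simple search over candidate values. The following Python sketch is only an illustration: the function name `best_p`, the caller-supplied callables `extract` and `score`, and the candidate range 1 to 20 are assumptions and are not taken from the paper.

```python
from statistics import mean


def best_p(pages, extract, score, candidates=range(1, 21)):
    """Return (P, average score) for the candidate P with the highest average.

    pages      -- list of (html, gold_main_content) pairs for one web site
    extract    -- hypothetical callable extract(html, p) -> extracted content
    score      -- hypothetical callable score(extracted, gold) -> float (e.g. F1)
    candidates -- assumed search range for P; not specified by the paper
    """
    averages = {
        p: mean([score(extract(html, p), gold) for html, gold in pages])
        for p in candidates
    }
    # On ties, max() keeps the first (smallest) P because candidates ascend.
    p_star = max(averages, key=averages.get)
    return p_star, averages[p_star]


# Toy stand-ins so that the sketch runs; they are NOT the R2L algorithm.
toy_pages = [("abcdef", "abc"), ("xyzuvw", "xyz")]
toy_extract = lambda html, p: html[:p]
toy_score = lambda extracted, gold: 1.0 if extracted == gold else 0.0

print(best_p(toy_pages, toy_extract, toy_score))
```

In practice, `extract` would wrap the extraction algorithm and `score` the evaluation measure, with the pages of one web site as input.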
One limitation of the R2L algorithm is that it does
not reach an F1-measure of 1, which suggests that
there are words in the MCA containing non-R2L
characters. If the main content area contains only
R2L characters, the R2L algorithm yields an
F1-measure of exactly 1. Otherwise, the F1-measure
is less than 1, depending on the percentage of
non-R2L characters in the main content area. To
overcome this problem, we plan in future work to pass
the complete HTML lines that make up the MC to an
HTML parser. The output of the parser will then be
exactly the main content.
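The effect of non-R2L words on the F1-measure can be illustrated with a small Python sketch. It rests on assumptions that are not taken from the paper: the F1-measure is computed here over word tokens as a multiset overlap, a word counts as R2L when all of its letters have the Unicode bidirectional class R or AL, and the names `is_r2l_word` and `f1_measure` are hypothetical.

```python
import unicodedata
from collections import Counter


def is_r2l_word(word):
    """True if the word has letters and all of them are bidi class R or AL."""
    letters = [ch for ch in word if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.bidirectional(ch) in ("R", "AL") for ch in letters
    )


def f1_measure(extracted, gold):
    """Token-level F1 between two word lists (assumed evaluation scheme)."""
    overlap = sum((Counter(extracted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(extracted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy main content: four Persian words and one Latin-script word.
gold = ["این", "یک", "متن", "است", "HTML"]
# An extractor that keeps only R2L words misses the Latin-script word.
extracted = [w for w in gold if is_r2l_word(w)]

print(round(f1_measure(extracted, gold), 4))  # 0.8889
```

With one non-R2L word out of five, precision stays at 1 while recall drops to 0.8, so the F1-measure falls below 1 in proportion to the missed content.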
ACKNOWLEDGEMENTS
We would like to thank Dr. Norbert Heidenbluth for
helping us to prepare figures and diagrams. We would
also like to thank Dr. Koen Deschacht and the
University of K.U.LEUVEN for providing us with the
Content Extraction Software. Finally, we would like to
thank Dr. Ljubow Rutzen-Loesevitz for editing this
paper.
Table 3: Average results for the R2L algorithm and for
LICEA as reported in (Moreno et al., 2009).

Web site                   R2L (P = 8)   R2L (best P)   LICEA
BBC-Farsi                  0.9906        0.9906         0.7755
Hamshahri                  0.9909        0.9909         0.9406
Jame Jam                   0.9769        0.9872         0.9085
Ahram                      0.9295        0.983          0.9342
Reuters                    0.9356        0.9708         0.9221
Embassy of Germany, Iran   0.9536        0.9715         0.8919
BBC-Urdu                   0.9564        0.9972         0.9495
BBC-Pashto                 0.9745        0.9745         0.9403
BBC-Arabic                 0.987         0.987          0.302
Wiki                       0.283         0.3852         0.7121
REFERENCES
Debnath, S., Mitra, P., and Giles, C. L. (2005). Identi-
fying content blocks from web documents. In Lec-
ture Notes in Computer Science, pages 285–293, NY,
USA. Springer.
Finn, A., Kushmerick, N., and Smyth, B. (2001). Fact or
fiction: Content classification for digital libraries. In
Proceedings of the Second DELOS Network of Excel-
lence Workshop on Personalisation and Recommender
Systems in Digital Libraries, Dublin, Ireland.
Gottron, T. (2007). Evaluating content extraction on HTML
documents. In Proceedings of the 2nd International
Conference on Internet Technologies and Applications,
pages 123–132, University of Wales, UK.
Gottron, T. (2008). Content code blurring: A new approach
to content extraction. In 19th International Workshop
on Database and Expert Systems Applications, pages
29–33, Turin, Italy.
Gottron, T. (2009). An evolutionary approach to automatically
optimize web content extraction. In Proceedings
of the 17th International Conference Intelligent
Information Systems, pages 331–343, Kraków, Poland.
Gupta, S., Kaiser, G., Neistadt, D., and Grimm, P. (2003).
DOM-based content extraction of HTML documents. In
Proceedings of the 12th International Conference on
World Wide Web, pages 207–214, New York, USA.
ACM.
Mantratzis, C., Orgun, M., and Cassidy, S. (2005).
Separating XHTML content from navigation clutter using
DOM-structure block analysis. In Proceedings of the
Sixteenth ACM Conference on Hypertext and Hypermedia,
pages 145–147, New York, USA. ACM.
Moreno, J. A., Deschacht, K., and Moens, M.-F. (2009).
Language independent content extraction from web
pages. In Proceedings of the 9th Dutch-Belgian
Information Retrieval Workshop, pages 50–55, The Netherlands.
Pinto, D., Branstein, M., Coleman, R., Croft, W. B., and
King, M. (2002). QuASM: a system for question
answering using semi-structured data. In Proceedings
of the 2nd ACM/IEEE-CS Joint Conference on Digital
Libraries, pages 46–55, New York, USA. ACM.
USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES