could recognize the main content correctly, because
normal web pages are not fully structured.
5 CONCLUSIONS AND FUTURE WORK
In this paper, we proposed DANAg, an extended and,
in particular, language-independent version of DANA
that is considerably effective. The results show that
DANAg determines the main content with high accuracy
on many collected data sets. With an average
F1-measure above 0.90 on the test corpora used in this
paper, it outperforms the state-of-the-art methods in
MCE.
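For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-measure for a single document can be computed. It assumes, purely for illustration, that the extracted text and the gold-standard main content are compared as word sequences through their longest common subsequence; the function names `lcs_length` and `f1_measure` are hypothetical and are not taken from DANAg.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1_measure(extracted, gold):
    """F1 of an extracted text against a gold-standard text.

    Assumption: both texts are split on whitespace and the overlap is
    the LCS length of the two token sequences.
    """
    e, g = extracted.split(), gold.split()
    if not e or not g:
        return 0.0
    overlap = lcs_length(e, g)
    if overlap == 0:
        return 0.0
    precision = overlap / len(e)  # share of extracted tokens that are correct
    recall = overlap / len(g)     # share of gold tokens that were recovered
    return 2 * precision * recall / (precision + recall)
```

An average over a corpus, as reported above, would then be the mean of the per-document values.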
In future work, we will try to extend DANAg with
machine learning methods that group the several areas
of an HTML file contributing to the main content of a
web page. This would make it possible to discard the
parameter setting for gaps between main content blocks
and to overcome the problem observed on certain
documents in the evaluation corpora.
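To illustrate what a gap parameter of this kind does, here is a small sketch under stated assumptions. Content blocks are represented as hypothetical `(start, end)` line ranges, and two blocks are merged when at most `max_gap` lines separate them. The function `merge_blocks` and this representation are illustrative only and do not describe DANAg's actual implementation.

```python
def merge_blocks(blocks, max_gap):
    """Merge (start, end) line ranges separated by at most max_gap lines.

    Assumption: ranges are inclusive line numbers within one HTML file.
    """
    merged = []
    for start, end in sorted(blocks):
        # Number of lines strictly between the previous block and this one.
        if merged and start - merged[-1][1] - 1 <= max_gap:
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged
```

A fixed threshold such as `max_gap` has to be chosen by hand, and a value that suits one page layout may split or over-merge content on another, which is the kind of tuning a learned grouping could avoid.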
EXTRACTING THE MAIN CONTENT OF WEB DOCUMENTS BASED ON A NAIVE SMOOTHING METHOD