Table 3: The Average F1 Scores of AdDANAg based on the corpus in Table 1.

Method   BBC    Economist  Zdf    Golem  Heise  Repubblica  Spiegel  Telepolis  Yahoo  Wikipedia  Manual  Slashdot
Plain    0.595  0.613      0.514  0.502  0.575  0.704       0.549    0.906      0.582  0.823      0.371   0.106
LQF      0.826  0.720      0.578  0.806  0.787  0.816       0.775    0.910      0.670  0.752      0.381   0.127
Crunch   0.756  0.815      0.772  0.837  0.810  0.887       0.706    0.859      0.738  0.725      0.382   0.123
DSC      0.937  0.881      0.847  0.958  0.877  0.925       0.902    0.902      0.780  0.594      0.403   0.252
TCCB     0.914  0.903      0.745  0.947  0.821  0.918       0.910    0.913      0.758  0.660      0.404   0.269
CCB      0.923  0.914      0.929  0.935  0.841  0.964       0.858    0.908      0.742  0.403      0.420   0.160
ACCB     0.924  0.890      0.929  0.959  0.916  0.968       0.861    0.908      0.732  0.682      0.419   0.177
Density  0.575  0.874      0.708  0.873  0.906  0.344       0.761    0.804      0.886  0.708      0.354   0.362
DANAg    0.924  0.900      0.912  0.979  0.945  0.970       0.949    0.932      0.952  0.646      0.401   0.209
AdDANAg  0.922  0.900      0.911  0.994  0.931  0.970       0.951    0.932      0.950  0.840      0.404   0.236

Table 4: The Average F1 Scores of AdDANAg based on the corpus in Table 2.

Method   Al Ahram  BBC Arabic  BBC Pashto  BBC Persian  BBC Urdu  Embassy  Hamshahri  Jame Jam  Reuters  Wikipedia
ACCB-40  0.871     0.826       0.859       0.892        0.948     0.784    0.842      0.840     0.900    0.736
BTE      0.853     0.496       0.854       0.589        0.961     0.810    0.480      0.791     0.889    0.817
DSC      0.871     0.885       0.840       0.950        0.896     0.824    0.948      0.914     0.851    0.747
FE       0.809     0.060       0.165       0.063        0.002     0.017    0.225      0.027     0.241    0.225
KFE      0.690     0.717       0.835       0.748        0.750     0.762    0.678      0.783     0.825    0.624
LQF-25   0.788     0.780       0.844       0.841        0.957     0.860    0.765      0.737     0.870    0.773
LQF-50   0.785     0.777       0.837       0.828        0.954     0.856    0.767      0.724     0.870    0.772
LQF-75   0.773     0.773       0.837       0.819        0.954     0.852    0.756      0.724     0.870    0.750
TCCB-18  0.886     0.826       0.912       0.925        0.990     0.887    0.871      0.929     0.959    0.814
TCCB-25  0.874     0.861       0.909       0.927        0.992     0.883    0.888      0.924     0.958    0.814
Density  0.879     0.202       0.908       0.741        0.958     0.882    0.920      0.907     0.934    0.665
DANAg    0.949     0.986       0.944       0.995        0.999     0.917    0.991      0.966     0.945    0.699
AdDANAg  0.949     0.985       0.944       0.996        0.999     0.917    0.991      0.973     0.945    0.852
WEBIST 2012 - 8th International Conference on Web Information Systems and Technologies