(k-nearest neighbours) and by two decision-tree-based
methods (J48, which is based on C4.5, and the
RandomTree method). The classification accuracy
for this experiment is shown in Table 3.
All three methods exceed 90% accuracy on the
whole dataset of pages from five different sites. If
classification is restricted to a single web site
(CNN.com), the accuracy is higher, ranging from
94.9% to 97.5%.
Table 3: Results of visual block classification.

Method    All web sites    One site only
J48       92.4%            97.5%
k-NN      90.5%            94.9%
R_Tree    91.1%            96.0%
These results indicate that visual block
classification can be a good foundation for
successful two-phase classification.
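
As an illustration of the first phase, the following is a minimal sketch of a k-nearest-neighbours classifier for visual blocks. It is not the implementation used in the experiments: the three features (font size, link density, text length), all sample values and the function names are hypothetical and serve only to show the mechanism of voting among the closest labelled blocks.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training blocks closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test blocks whose predicted category matches the true one."""
    hits = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return hits / len(test)


# Hypothetical features: (font size in pt, link density 0..1, text length / 1000)
TRAIN = [
    ((24.0, 0.00, 0.05), "heading"),
    ((22.0, 0.10, 0.08), "heading"),
    ((26.0, 0.00, 0.04), "heading"),
    ((12.0, 0.05, 1.20), "main text"),
    ((11.0, 0.10, 0.90), "main text"),
    ((12.0, 0.00, 1.50), "main text"),
    ((10.0, 0.95, 0.20), "links"),
    ((10.0, 0.90, 0.15), "links"),
    ((11.0, 1.00, 0.30), "links"),
]
TEST = [
    ((25.0, 0.00, 0.06), "heading"),
    ((12.0, 0.08, 1.10), "main text"),
    ((10.0, 0.92, 0.25), "links"),
]

print(accuracy(TRAIN, TEST))
```

The sketch uses unscaled Euclidean distance, so features with a large numeric range dominate the vote; a practical setup would normally normalise the features first.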
For the second phase, the best results were obtained
with a Bayes Net classifier, a tree-based classifier
(Functional Trees, FT) and the SMO method
(Sequential Minimal Optimization). The output of
the J48 algorithm from Phase I is used as the input
for Phase II. The coefficients for the individual
content visual block categories are set as follows:
1 for links, 2 for main text and 5 for headings.
Detailed experiments with different coefficient
settings, and a detailed description of the ways in
which the weights can be modified, are presented in
(Bartik, 2010).
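
The effect of these coefficients can be sketched as follows. The snippet assumes, for illustration only, that each occurrence of a term contributes the coefficient of the block in which it appears; the exact weighting scheme is the one described in (Bartik, 2010), and the function name and the sample page are hypothetical.

```python
from collections import Counter

# Coefficients quoted in the text for the three content block categories.
COEFFICIENTS = {"links": 1, "main text": 2, "heading": 5}


def weighted_term_counts(blocks):
    """Sum block coefficients per term over (category, text) pairs.

    Assumption: every occurrence of a term adds the coefficient of its block.
    """
    weights = Counter()
    for category, text in blocks:
        coefficient = COEFFICIENTS[category]
        for term in text.lower().split():
            weights[term] += coefficient
    return weights


# Hypothetical page whose blocks were already labelled in Phase I.
page = [
    ("heading", "Election results"),
    ("main text", "The election results were announced today"),
    ("links", "Sports Weather election"),
]
weights = weighted_term_counts(page)
print(weights["election"], weights["results"], weights["sports"])
```

Under this assumption a term that occurs once in a heading outweighs the same term occurring twice in the main text, which is the intended bias towards visually prominent blocks.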
The classification accuracy is compared with that
of classical text-based classification (extracted text
weighted by TF-IDF). Again, the results on the
dataset of pages from one web site and on the dataset
from all five web sites are compared.
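
For reference, the text-based baseline weighting can be sketched as below. This is one common TF-IDF variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the text does not state which variant was used in the experiments, so the formula, the function name and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a non-empty token list)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical tokenised pages.
docs = [
    ["election", "results", "election"],
    ["football", "results"],
    ["weather", "forecast"],
]
vectors = tf_idf(docs)
print(vectors[0])
```

A term that appears in every document receives weight 0 in this variant, so only terms that discriminate between pages contribute to the vectors.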
The results in Table 4 lead to the conclusion that
using modified weights improves web page
classification accuracy, even though not all visual
blocks are classified correctly in the first phase.
Table 4: Results of the whole two-phase categorization.

Method       Text based    All sites (2-phase)    One site (2-phase)
Bayes Net    90.6%         93.3%                  94.7%
FT           88.6%         91.8%                  92.9%
SMO          88.0%         91.9%                  93.9%
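
The size of the improvement can be read off Table 4 with a short calculation. The sketch below assumes that the text-based column is the baseline for the all-sites setting; the table does not state which dataset that column refers to, so the comparison is indicative only. The dictionary keys and the function name are illustrative.

```python
# Accuracy values (in %) copied from Table 4.
TABLE_4 = {
    "Bayes Net": {"text": 90.6, "all_sites": 93.3, "one_site": 94.7},
    "FT": {"text": 88.6, "all_sites": 91.8, "one_site": 92.9},
    "SMO": {"text": 88.0, "all_sites": 91.9, "one_site": 93.9},
}


def gain_over_text(row, column):
    """Difference in percentage points between a two-phase column and the text baseline."""
    return round(row[column] - row["text"], 1)


# Assumption: the text-based baseline is comparable to the all-sites column.
GAINS = {method: gain_over_text(row, "all_sites") for method, row in TABLE_4.items()}

for method, gain in GAINS.items():
    print(f"{method}: +{gain} percentage points")
```

Under this assumption the gain lies between 2.7 and 3.9 percentage points, with the largest gain for SMO.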
4 CONCLUSIONS
We have presented a method for two-phase
classification of web pages. The basic idea is that the
extracted text is represented by modified term
weights. The weights are modified according to the
importance of the visual block in which the term
occurs. The category of each visual block is
obtained by classification in the first phase.
The second phase classifies whole web pages
using vectors of the modified weights.
The experiments with data from news sites
showed that the accuracy of two-phase classification
exceeds 90%.
In future research, the results of visual block
classification and the modified term weights could
be applied to other web mining problems, for
example finding similar documents within a
dataset, or to information retrieval tasks.
ACKNOWLEDGEMENTS
This research was supported by the Research Plan
No. MSM0021630528 – “Security-Oriented
Research in Information Technology” and by the
BUT FIT grant No. FIT-10-S-2.