7 CONCLUSION
We proposed a web page classification method for
building a high-quality homepage collection, using a
three-way classifier composed of recall-assured and
precision-assured component classifiers. The proposed
method clearly improves not only the basic
performance in F-measure but also the performance
of the precision/recall-assured classifiers. We also
showed that it effectively reduces the number of
pages that need manual assessment to satisfy a given
required quality.
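The three-way output can be illustrated with a minimal sketch. It assumes that each component classifier returns a real-valued score and that a page is accepted when its score reaches a threshold; the function names, the default thresholds, and the order in which the two decisions are applied are hypothetical choices made for illustration, not details taken from the method itself.

```python
from collections import Counter
from typing import Iterable, Tuple

ASSURED_POSITIVE = "assured positive"
UNCERTAIN = "uncertain"
ASSURED_NEGATIVE = "assured negative"


def three_way_label(recall_score: float,
                    precision_score: float,
                    recall_threshold: float = 0.0,
                    precision_threshold: float = 0.0) -> str:
    """Combine two component decisions into one of three output classes.

    Assumption: the recall-assured classifier is tuned so that it rarely
    rejects a true homepage, so its rejections are treated as assured
    negatives. The precision-assured classifier is tuned so that it rarely
    accepts a non-homepage, so its acceptances are treated as assured
    positives. Everything else is left for manual assessment.
    """
    if recall_score < recall_threshold:
        return ASSURED_NEGATIVE
    if precision_score >= precision_threshold:
        return ASSURED_POSITIVE
    return UNCERTAIN


def summarize(scored_pages: Iterable[Tuple[float, float]]) -> Counter:
    """Count how many pages fall into each of the three output classes."""
    return Counter(three_way_label(r, p) for r, p in scored_pages)


if __name__ == "__main__":
    # Hypothetical (recall_score, precision_score) pairs for three pages.
    pages = [(-1.2, -0.8), (0.4, -0.3), (0.9, 1.1)]
    print(summarize(pages))
```

Only the pages labeled "uncertain" would need manual assessment, so shrinking that class is what reduces the manual workload.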
The current classification performance is still far
below what can be achieved manually. In the future,
we will study a technique to estimate the likelihood
that surrounding pages are component pages
and incorporate it into the current method.
We also need to investigate the processing cost,
because high-performance classifiers
require rather complex feature processing that might
make it impractical to deal with the enormous size of
the web. We have partly tackled this problem with
the rough filtering method (Wang
and Oyama, 2006), and we expect to
overcome it by extending that approach.
ACKNOWLEDGEMENTS
This study was partially supported by a Grant-in-Aid
for Scientific Research B (No. 18300037) from the
Japan Society for the Promotion of Science (JSPS).
We used the NW100G-01 document data set with
permission from the National Institute of Informatics
(NII). We would like to thank Professors Akiko
Aizawa and Atsuhiro Takasu of NII for their
valuable advice.
REFERENCES
Aizawa, A., Oyama, K., 2005. A fast linkage detection
scheme for multi-source information integration. In
International Workshop on Challenges in Web
Information Retrieval and Integration (WIRI 2005),
Tokyo, Japan.
CiNii. http://ci.nii.ac.jp/.
Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock,
D. M., Flake, G. W., 2002. Using web structure for
classifying and describing web pages. In 11th
International World Wide Web Conference, Honolulu,
Hawaii, USA.
Cover, T. M., Thomas, J. A., 1991. Elements of
information theory. Wiley.
Kan, M.-Y., 2004. Web page categorization without the
web page. In 13th World Wide Web Conference
(WWW2004), New York, NY, USA, May 17-22.
Kan, M.-Y., Thi, H.O.N., 2005. Fast webpage
classification using URL features. In CIKM’05,
Bremen, Germany.
Masada, T., Takasu, A., Adachi, J., 2005. Improving web
search performance with hyperlink information. IPSJ
Transactions on Databases, Vol.46, No.8.
Sun, A., Lim, E.-P., Ng, W.-K., 2002. Web classification
using support vector machine. In 4th international
workshop on web information and data management.
ACM Press, McLean, Virginia, USA.
Sun, A., Lim, E.-P., 2003. Web unit mining: finding and
classifying subgraphs of web pages. In International
Conference on Information and Knowledge
Management (CIKM2003), New Orleans, Louisiana,
USA.
Sun, J., Zhang, B., Chen, Z., Lu, Y., Shi, C., Ma, W.,
2004. GE-CKO: A method to optimize composite
kernels for web page classification. In 2004
IEEE/WIC/ACM International Conference on Web
Intelligence (WI2004), Beijing, China.
Wang, Y., Oyama, K., 2006. Combining page group
structure and content for roughly filtering researchers’
homepages with high recall. IPSJ Transactions on
Databases, Vol.47, No. SIG 8 (TOD 30).
Yang, Y., Slattery, S., Ghani, R., 2002. A study of
approaches to hypertext categorization. Journal of
Intelligent Information Systems, volume 18. Kluwer
Academic Press.
Table 4: Estimated page numbers of classification output from a 100GB web page corpus.

Required quality      baseline                                              o-i-e-1_tag_real                                      Reduction ratio
(precision / recall)  assured positive  uncertain (N_b)   assured negative  assured positive  uncertain (N_p)   assured negative  (N_p / N_b)
99.5% / 98%           3,800             461,832           1,618,988         9,206             358,207           1,717,187         77.6%
99% / 95%             6,163             274,524           1,803,913         11,251            156,782           1,916,567         57.1%
98% / 90%             11,116            155,418           1,918,066         15,503            81,157            1,987,940         52.2%
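
The reduction ratio in Table 4 is the number of uncertain pages under the proposed classifier (N_p) divided by the number under the baseline (N_b). As a quick check, the short sketch below recomputes it from the uncertain-page counts given in the table; the variable and function names are illustrative only.

```python
# Uncertain-page counts from Table 4: (required quality, N_b, N_p).
ROWS = [
    ("99.5% / 98%", 461_832, 358_207),
    ("99% / 95%", 274_524, 156_782),
    ("98% / 90%", 155_418, 81_157),
]


def reduction_ratio(n_p: int, n_b: int) -> float:
    """Fraction of the baseline manual-assessment workload that remains."""
    return n_p / n_b


if __name__ == "__main__":
    for quality, n_b, n_p in ROWS:
        ratio = reduction_ratio(n_p, n_b)
        print(f"{quality}: N_p / N_b = {100 * ratio:.1f}%")
```

A lower ratio means fewer pages are left for manual assessment relative to the baseline, so the relative saving is largest at the loosest quality requirement (98% / 90%).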
WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY
HOMEPAGE COLLECTION