Table 3: Experiment results.

Object relevant web page filtering

                 W      W+B    W+O    W+B+O
en   Rec. (%)    95.2   93.7   85.2   83.9
     Prec. (%)   79.4   81.2   82.5   89.9
cn   Rec. (%)    91.4   90.9   83.1   79.5
     Prec. (%)   75.1   78.3   80.0   88.8
jp   Rec. (%)    94.6   93.7   86.3   81.5
     Prec. (%)   80.2   81.5   81.3   89.2
Ave  Rec. (%)    93.7   92.8   84.9   81.6
     Prec. (%)   78.2   80.3   81.3   89.3

Product identification without refinement step (entry page / detailed profile pages)

                 W           W+B         W+O         W+B+O
en   Rec. (%)    76.5/79.3   77.3/79.7   80.2/81.1   78.3/82.5
     Prec. (%)   72.6/80.9   75.2/81.3   81.1/82.7   83.1/83.2
cn   Rec. (%)    74.3/78.8   72.9/80.2   75.6/80.3   73.2/84.8
     Prec. (%)   71.1/77.5   75.4/77.9   79.9/81.2   81.0/81.9
jp   Rec. (%)    79.4/76.7   74.2/77.3   75.0/79.2   75.5/80.6
     Prec. (%)   72.9/79.1   75.8/79.7   82.5/82.3   84.8/83.5
Ave  Rec. (%)    76.7/78.3   74.8/79.1   76.9/80.2   75.7/82.6
     Prec. (%)   72.2/79.2   75.5/79.6   81.2/82.1   83.0/82.9

Product identification with refinement step (entry page / detailed profile pages)

                 W           W+B         W+O         W+B+O
en   Rec. (%)    78.2/82.7   79.9/83.1   80.5/85.7   82.5/86.3
     Prec. (%)   74.2/83.2   75.9/85.7   81.9/87.0   87.3/88.5
cn   Rec. (%)    76.1/80.2   76.3/81.5   76.9/83.4   77.3/85.3
     Prec. (%)   75.9/78.0   75.8/79.1   80.4/81.6   82.1/83.6
jp   Rec. (%)    80.1/77.2   75.3/79.5   79.7/85.0   81.1/90.5
     Prec. (%)   73.2/80.3   77.0/82.1   85.7/82.8   86.8/84.0
Ave  Rec. (%)    78.1/80.0   77.2/81.4   79.0/84.7   80.3/87.4
     Prec. (%)   74.4/80.5   76.2/82.3   82.7/83.8   85.4/85.4

W: filtering with white wordlist of path-query; B: filtering with black wordlist of path-query; O: filtering with ontological wordlist of content-query
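The recall and precision values in Table 3 presumably follow the usual set-based definitions. The following is a minimal sketch of that computation, not the evaluation code used for the experiments; the function name and the page identifiers are hypothetical and serve only as an illustration.

```python
def recall_precision(predicted, relevant):
    """Return (recall, precision) in percent for two collections of item ids."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)  # items that are both returned and correct
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    precision = 100.0 * hits / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical identifiers: a hand-labelled gold set and the pages kept by a filter.
gold = {"p1", "p2", "p3", "p4"}
kept = {"p1", "p2", "p3", "x7", "x9"}
rec, prec = recall_precision(kept, gold)
print(f"recall = {rec:.1f}%, precision = {prec:.1f}%")
```

Under these definitions, a stricter filter typically returns fewer pages, which can raise precision while lowering recall.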
Further experiments were conducted to verify the
effect of the new web page clustering method
adopted in Section 3.2. Figure 8 compares the final
product identification results obtained with the two
web page clustering methods. The proposed
clustering algorithm performs significantly better
than clustering based on the identified HLs. The
figure also shows the impact of the refinement step:
on average, it improves both recall and precision
by about 3-8%.
Figure 8: Comparison of product identification results.
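The data behind Figure 8 is not tabulated in the text. As a related rough check, the averaged rows of Table 3 can be differenced to see how much the refinement step changes recall and precision for each wordlist configuration. The sketch below does only that arithmetic; the variable and function names are illustrative, and the results are percentage-point differences, which need not coincide with the figures read from Figure 8.

```python
# "Ave" rows of Table 3 as (entry page, detailed profile pages), in percent.
CONFIGS = ["W", "W+B", "W+O", "W+B+O"]

AVE = {
    "recall": {
        "without": [(76.7, 78.3), (74.8, 79.1), (76.9, 80.2), (75.7, 82.6)],
        "with":    [(78.1, 80.0), (77.2, 81.4), (79.0, 84.7), (80.3, 87.4)],
    },
    "precision": {
        "without": [(72.2, 79.2), (75.5, 79.6), (81.2, 82.1), (83.0, 82.9)],
        "with":    [(74.4, 80.5), (76.2, 82.3), (82.7, 83.8), (85.4, 85.4)],
    },
}


def refinement_gain(metric):
    """Percentage-point change due to the refinement step, per configuration."""
    rows = zip(CONFIGS, AVE[metric]["without"], AVE[metric]["with"])
    return {
        name: tuple(round(after - before, 1) for before, after in zip(wo, wi))
        for name, wo, wi in rows
    }


for metric in ("recall", "precision"):
    for name, (entry, detail) in refinement_gain(metric).items():
        print(f"{metric:9s} {name:6s} entry {entry:+.1f}  detail {detail:+.1f}")
```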
5 CONCLUSIONS
Most existing solutions for web data extraction
assume that each given page contains several data
records. This makes them inapplicable to the
problem of website-level data extraction, in which
the relevant information about an object is
distributed sparsely across the website. This paper
proposes a novel approach to this new problem. It
exploits HLs, which organize the web pages of a
website, as a novel resource both for finding
object-relevant web pages and for object-centred
web page clustering. The experimental results
verify the usability of the proposed approach. The
method also has some limitations. The major one is
that the object-relevant keywords need to be set
manually. Collecting these keywords
(semi-)automatically is left for future work.
WEBIST 2009 - 5th International Conference on Web Information Systems and Technologies