CRAWLING DEEP WEB CONTENT THROUGH QUERY FORMS
Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng and Xiao Liu
MOE KLINNS Lab and SKLMS Lab, Xi'an Jiaotong University, Xi'an 710049, China
Keywords: Deep Web, Deep Web Surfacing, Minimum Executable Pattern, Adaptive Query.
Abstract: This paper proposes the concept of the Minimum Executable Pattern (MEP), and then presents a MEP
generation method and a MEP-based Deep Web adaptive query method. The query method extends the query
interface from a single textbox to a MEP set, and generates a locally optimal query by choosing a MEP and a
keyword vector of that MEP. Our method alleviates, to a certain extent, the "data islands" problem that
results from the deficiencies of current methods. Experimental results on six real-world Deep Web sites show
that our method outperforms existing methods in terms of query capability and applicability.
1 INTRODUCTION
There is an enormous amount of information buried
in the Deep Web, and its quantity and quality far
exceed those of the "surface web" that traditional
search engines can reach (Michael, 2001). However,
such information cannot be obtained from static
HTML pages; it is exposed only in dynamic pages
generated in response to queries submitted through
web forms. Owing to the enormous volume of Deep
Web information and the heterogeneity of query
forms, effective Deep Web crawling is a complex
and difficult problem.
Deep Web crawling aims to harvest as many data
records as possible at an affordable cost (Barbosa,
2004); its key problem is how to generate proper
queries. To date, a series of studies on Deep Web
querying has been carried out, and two types of
query methods have been proposed: prior
knowledge-based methods and non-prior-knowledge
methods.
The prior knowledge-based query methods
construct a knowledge base beforehand and generate
queries under the guidance of the prior knowledge.
(Raghavan, 2001) proposed a task-specific Deep
Web crawler and a corresponding query method
based on a Label Value Set (LVS) table; the LVS
table serves as prior knowledge for passing values to
query forms. (Alvarez, 2007) put forward a query
method based on domain definitions, which
increased the accuracy of filling out query forms.
Such methods automate deep crawling to a great
extent (Barbosa, 2005); however, they have two
deficiencies. First, they perform well only when
there is sufficient prior knowledge, while for query
forms that have few control elements (such as a
single text box), their performance may be
unsatisfactory. Second, each query is submitted by
filling out a whole form, which reduces the
efficiency of Deep Web crawling.
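The LVS-style form filling described above can be sketched as a lookup from form labels into a table of known values. The code below is an illustrative sketch only, not the crawler of (Raghavan, 2001); the table contents and function names are our own assumptions.

```python
# Illustrative sketch of LVS-based form filling: each form field's
# label is matched against a Label Value Set (LVS) table that encodes
# prior knowledge. Labels absent from the table get no value, which is
# exactly the failure mode for sparse forms (e.g. a single text box).

# Hypothetical prior knowledge: label -> candidate values.
lvs_table = {
    "author": ["Knuth", "Dijkstra"],
    "title": ["algorithms", "networks"],
}

def fill_form(form_labels, lvs):
    """Assign one candidate value per recognized label; None when the
    knowledge base has no entry for that label."""
    assignment = {}
    for label in form_labels:
        values = lvs.get(label.lower())
        assignment[label] = values[0] if values else None
    return assignment

print(fill_form(["Author", "Year"], lvs_table))
# {'Author': 'Knuth', 'Year': None} -- "Year" is unknown to the LVS
```

Note how the unknown "Year" field illustrates the first deficiency: without sufficient prior knowledge the form cannot be filled out.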
The non-prior-knowledge methods overcome the
above deficiencies. These methods generate new
candidate query keywords by analyzing the data
records returned from previous queries, so the query
process does not rely on prior knowledge. Barbosa
et al. first introduced this idea and presented a query
selection method that generates the next query from
the most frequent keywords in the previously
returned records (Barbosa, 2004). However, issuing
queries with the most frequent keywords does not
guarantee that more new records will be returned
from the Deep Web database.
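The frequency-based selection of (Barbosa, 2004) can be sketched as follows; this is a minimal illustration under our own simplifying assumptions (whitespace tokenization, records as plain strings), not the original implementation.

```python
# Sketch of frequency-based query selection: the next query keyword is
# the most frequent term in the records fetched so far that has not
# already been issued as a query.
from collections import Counter

def next_query(returned_records, issued_queries):
    """Pick the most frequent unseen keyword from fetched records."""
    counts = Counter()
    for record in returned_records:
        counts.update(record.lower().split())
    for word, _ in counts.most_common():
        if word not in issued_queries:
            return word
    return None  # vocabulary exhausted

records = ["deep web crawling", "web query forms", "web data records"]
print(next_query(records, {"web"}))
# -> 'deep' ("web" is skipped; ties broken by first occurrence)
```

The weakness noted in the text is visible here: a frequent keyword like "deep" may match mostly records that were already harvested, yielding few new ones.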
(Ntoulas, 2005) proposed a greedy query selection
method based on the expected harvest rate: candidate
query keywords are generated from the records
obtained so far, their expected harvest rates are
calculated, and the keyword with the maximum
expected harvest rate is selected for the next query.
(Wu P, 2006) modeled each web database as a
distinct attribute-value graph; under this theoretical
framework, the problem of optimal query selection
was transformed into finding a Weighted Minimum
Dominating Set in the corresponding attribute-value
graph, and a greedy link-based query selection
method was proposed to approximate the optimal
solution.
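The greedy idea shared by these harvest-rate methods can be sketched as a set-cover-style selection. The sketch below is ours and makes a strong simplifying assumption: it treats each keyword's matching record set as known, whereas a real crawler (e.g. Ntoulas, 2005) must estimate the expected harvest from document frequencies.

```python
# Greedy selection sketch: among candidate keywords, pick the one
# expected to add the most *new* (uncovered) records. Assumes the
# keyword -> record-id mapping is known, which in practice must be
# estimated rather than observed.

def greedy_select(candidates, covered):
    """candidates: keyword -> set of record ids it matches.
    covered: record ids already harvested.
    Returns the keyword with the largest marginal gain, or None."""
    best, best_gain = None, 0
    for kw, records in candidates.items():
        gain = len(records - covered)  # new records this query adds
        if gain > best_gain:
            best, best_gain = kw, gain
    return best

candidates = {
    "web": {1, 2, 3},
    "query": {3, 4, 5, 6},
    "form": {1, 6},
}
covered = {1, 2, 3}  # records harvested by earlier queries
print(greedy_select(candidates, covered))
# -> 'query' (it adds records {4, 5, 6}; "web" adds nothing)
```

Iterating this selection until no keyword adds new records is the greedy approximation both lines of work build on.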