extent is a challenging task. Much research on
automatic deep Web discovery has been done
(Madhavan et al. 2007; Hong, He, and Bell 2009; He,
Hong, and Bell 2009; An et al. 2008). He et al. (2003;
2005; 2007) presented an interface layout expression
called IEXP, based on the layout of labels and query
control elements in the query interface, and
developed an extraction tool called WISE-IExtractor
to obtain the query interface schema. Liu et al. (2010)
proposed an approach that primarily uses the
visual features of deep Web pages to perform
deep Web data extraction, including data record
extraction and data item extraction. He et al. (2004)
built a statistical model that applies positive and
negative correlation mining to the co-occurrence
frequencies and patterns of attribute names in order
to discover hidden complex matching patterns. Fan et al.
(2011) proposed a query transformation based on
predicate templates. Our contributions are summarized
as follows:
• A domain ontology is built under the guidance of
domain specialists. To guarantee its correctness,
the ontology is built manually. While it is in use,
it is updated and improved automatically and
continuously.
• Schema extraction from deep Web query
interfaces is a key step in mining hidden Web
resources. A novel approach that extracts the
interface schema from the deep Web based on domain
ontology is presented, in which a new representation
of query interface attributes is proposed.
• In addition, when querying several interfaces
separately, it is hard to avoid submitting the same
attribute information repeatedly, which wastes
considerable time and effort. Therefore, in order to
visit multiple databases at the same time and obtain
useful information by comparing and screening
the returned results, it is of great significance for
efficiency to match attributes across different
interfaces and integrate them into a unified query
interface. Most traditional integration methods for
deep Web interfaces obtain mapping relations from
extensive statistics. They not only require a large
number of samples, but also ignore some semantic
relations between attributes. A two-phase pattern
matching approach based on ontology is defined to
match form attributes against knowledge in the
ontology. The matching results are used to update the
ontology and simplify subsequent matching. An
integrated query interface is generated according to
the obtained mapping relations and schema
information.
• Query transformation, where the query requests
are submitted to multiple destination query interfaces
selected for a specific domain, is a critical component
of a deep Web integration system. This paper presents
a type-driven minimum superset query
transformation based on ontology. Predicate
templates with constraints and four types of predicate
processors are proposed for the query transformation,
so that one query in the integrated form is
automatically transformed into multiple queries in the
destination query forms.
• Because data records extracted from multiple
databases differ in style, post-processing is
an important step for organizing, simplifying, and
converting these data records into a unified structure.
In building the domain ontology, BIM and
RSEM are adopted to obtain the feature vector of
domain books. In extracting information,
Web pages are first parsed using an HTML parser. Then
the HTML tree is obtained after removing information
that users are not interested in, such as ads and
navigation. Finally, the relevant data blocks
in the HTML tree are extracted, ranked, and stored.
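The parsing and noise-removal steps of the last item can be sketched as follows. This is a minimal Python illustration using only the standard-library html.parser module, not the implementation used in this work. The noise tag list, the class/id tokens treated as ads or navigation, and the names BlockExtractor and extract_blocks are assumptions introduced here, and the markup is assumed to be well formed. Ranking and storage are omitted because their criteria are not specified above.

```python
from html.parser import HTMLParser

# Hypothetical heuristics: the text does not say how ads or navigation are detected.
NOISE_TAGS = {"nav", "aside", "footer", "script", "style"}
NOISE_TOKENS = {"ad", "ads", "advert", "banner", "nav", "navigation", "sidebar"}
BLOCK_TAGS = {"p", "div", "li", "tr", "section", "article"}
# Void elements never get an end tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockExtractor(HTMLParser):
    """Collects the text of block elements, skipping noisy subtrees."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0  # > 0 while inside a noisy subtree

    def _is_noise(self, tag, attrs):
        if tag in NOISE_TAGS:
            return True
        attr = dict(attrs)
        tokens = (attr.get("class") or "").lower().split()
        tokens.append((attr.get("id") or "").lower())
        return any(token in NOISE_TOKENS for token in tokens)

    def _flush(self):
        # Normalize whitespace and store the buffered text as one block.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth += 1      # nested element inside a noisy subtree
        elif self._is_noise(tag, attrs):
            self._flush()
            self._skip_depth = 1       # entering a noisy subtree
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1      # assumes balanced tags inside the subtree
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def extract_blocks(html):
    """Return the text blocks of `html` in document order, without noise."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.blocks
```

Matching whole class/id tokens rather than substrings keeps elements such as "header" or "loading" from being mistaken for ads. A real system would likely need site-specific rules or a more tolerant parser for malformed pages.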
2 EXTRACTION OF INTERFACE SCHEMA
An essential step in discovering information in the
deep Web is interface schema extraction, which
identifies the common characteristics of query
interfaces from two perspectives: the internal code
and the visual units.
Based on the controls in the query interface, the texts
associated with those controls, and the layout
characteristics, the tags located between <form> and
</form> are divided into three kinds. The definitions
of these tags are as follows:
Definition 1. A Control Tag (CT) receives the query
value entered by the user and is mainly
described with <input></input> and
<select></select>. A CT is formally represented
as a tuple
CT = {name, type, id, value, location}
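As an illustration of Definition 1, the following Python sketch collects the CT tuples of a form with the standard-library html.parser module. The class and function names are hypothetical. The location field is modeled simply as the document-order index of the control inside the form, which is an assumption. The definition may intend a layout position instead.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class ControlTag:
    name: Optional[str]
    type: str
    id: Optional[str]
    value: Optional[str]
    location: int  # assumption: index of the control within the form


class ControlTagParser(HTMLParser):
    """Collects <input> and <select> elements found between <form> and </form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.controls: List[ControlTag] = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
            return
        if not self.in_form or tag not in ("input", "select"):
            return
        attr = dict(attrs)
        if tag == "input":
            # HTML treats a missing type attribute on <input> as "text".
            control_type = (attr.get("type") or "text").lower()
        else:
            control_type = "select"
        self.controls.append(
            ControlTag(
                name=attr.get("name"),
                type=control_type,
                id=attr.get("id"),
                value=attr.get("value"),
                location=len(self.controls),
            )
        )

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_control_tags(html: str) -> List[ControlTag]:
    parser = ControlTagParser()
    parser.feed(html)
    parser.close()
    return parser.controls
```

For a form containing a text box named "title" followed by a drop-down named "fmt", extract_control_tags returns two ControlTag records in that order. Controls outside any form are ignored.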