Following that, our approach analyses the generated web results pages and focuses on the tables extracted from them, because deep web sources frequently return their structured information in tables (Wang, 2003).
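For illustration only, the following minimal sketch shows one possible way to collect the rows of the HTML tables found in a results page, using only the Python standard library; the helper names (TableCollector, extract_tables) are hypothetical and do not correspond to the actual extraction algorithm of our approach.

```python
# Illustrative sketch only: collect the cell texts of every <table>
# in a results page with the standard html.parser module.
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Gathers the rows of each table found in a web results page."""

    def __init__(self):
        super().__init__()
        self.tables = []      # list of tables, each a list of rows
        self._row = None      # row currently being filled
        self._cell = None     # text buffer of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None and self._cell is not None:
            self._row.append(" ".join(t for t in self._cell if t))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(results_page_html):
    """Return the tables of a results page as lists of rows of cell texts."""
    collector = TableCollector()
    collector.feed(results_page_html)
    return collector.tables
```

For example, extract_tables(page_html) would return the rows of each table in the page, which can then be analysed to infer the structure of the results.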
All the information extracted by our approach is used to build the XML schema describing the analysed deep web source. We chose an XML schema to describe the deep web source content because it is the most suitable way to describe web content.
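As a minimal illustration, the sketch below builds the kind of XML schema fragment that can describe a single form field together with the values observed for it; the helper field_schema and the exact element layout are assumptions, not the schema vocabulary actually produced by our approach.

```python
# Hypothetical illustration: describe one form field as an xs:element
# restricted to the values observed in the form (an enumeration constraint).
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def field_schema(field_name, allowed_values):
    """Build an xs:element restricted to the values observed for a form field."""
    element = ET.Element(f"{{{XS}}}element", name=field_name)
    simple = ET.SubElement(element, f"{{{XS}}}simpleType")
    restriction = ET.SubElement(simple, f"{{{XS}}}restriction", base="xs:string")
    for value in allowed_values:
        ET.SubElement(restriction, f"{{{XS}}}enumeration", value=value)
    return element


if __name__ == "__main__":
    # e.g. a "country" select field whose options were read from the form
    node = field_schema("country", ["France", "Germany", "Spain"])
    print(ET.tostring(node, encoding="unicode"))
```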
All our proposed algorithms have a low complexity of O(n), except for algorithms one and three, whose complexity is O(n²), which remains acceptable.
To summarize, our approach uses all the accessible elements of the deep web source, together with its accumulated private knowledge about the deep web, to maximize the chance of extracting the most descriptive XML schema. Compared with the crawling and surfacing approaches, our approach only needs to extract the schema description rather than all the data of the deep web source, which reduces the number of queries generated during our query processing step. Moreover, contrary to the form integration approach, which responds to the final user query by sending it to all the deep web sources through its global form, a mediation system based on our extracted XML schema description and its associated constraints can identify precisely the deep web sources that can respond to the final user query.
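The following minimal sketch, under assumed data structures, illustrates how a mediator could use such extracted constraints to keep only the sources able to answer a query; the functions can_answer and select_sources and the example schemas are hypothetical.

```python
# Minimal sketch: each source schema is reduced here to a dict mapping
# field names to the set of accepted values (None meaning free text).


def can_answer(source_schema, user_query):
    """True if every query field exists in the schema and every requested
    value satisfies the field's extracted constraint."""
    for field, value in user_query.items():
        if field not in source_schema:
            return False
        allowed = source_schema[field]
        if allowed is not None and value not in allowed:
            return False
    return True


def select_sources(schemas, user_query):
    """Keep only the deep web sources able to respond to the query."""
    return [name for name, schema in schemas.items() if can_answer(schema, user_query)]


if __name__ == "__main__":
    schemas = {
        "cars-eu": {"make": {"Renault", "Peugeot"}, "year": None},
        "cars-us": {"make": {"Ford", "Tesla"}, "year": None},
    }
    print(select_sources(schemas, {"make": "Renault", "year": "2015"}))
    # -> ['cars-eu']
```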
The XML schema extracted by our approach to describe a deep web source is therefore a promising solution for integrating the deep web into a mediation system.
4 CONCLUSION AND FUTURE WORK
Our objective is to implement a web mediation system, and in this paper we have presented our automatic approach for extracting the XML schema needed to describe a selected deep web source.
Our approach applies a static and a dynamic analysis to the HTML forms that give access to the deep web sources. During this process, it uses its private knowledge about the deep web domains to optimize and enrich the XML schema extraction. During the dynamic step, the web results pages generated by our queries are also analysed in order to identify the constraints of the form fields. All the elements extracted during the static and dynamic analyses of the HTML forms are used to build the XML schema describing the associated deep web source.
The extracted XML schema is the key information needed to integrate the associated deep web source into a web mediation system able to respond to the end user query directly in the web environment.
Currently, we are working on several issues to improve our approach, in parallel with the ongoing experiments on the DWSpyder tool. Firstly, we are working on reducing the number of generated queries and on optimizing the content of the identification tables. To this end, we plan to score the values of the identification tables that generate interesting web results. These scores will then be used to limit the number of queries generated to analyse the deep web sources associated with each domain name.
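A purely hypothetical sketch of one possible scoring scheme is given below; the functions score_values and top_values, and the use of the number of result rows as the score, are assumptions and not the method finally retained.

```python
# Hypothetical sketch: score each identification-table value by the results
# it produced, and keep only the best values for probing the next source.
from collections import defaultdict


def score_values(probe_log):
    """probe_log: iterable of (value, nb_result_rows) pairs observed so far."""
    scores = defaultdict(int)
    for value, nb_rows in probe_log:
        scores[value] += nb_rows      # more result rows -> more useful value
    return scores


def top_values(probe_log, k=5):
    """Keep the k values most likely to generate interesting results pages."""
    scores = score_values(probe_log)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```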
We are also working on the analysis of unstructured elements, such as the free text of the web results pages, in order to carry out an exhaustive analysis of all the elements generated by our queries and not only the HTML tables.
REFERENCES
Bing, L. 2007. Web Data Mining. Springer.
Bergman, M. K. 2001. The Deep Web: Surfacing Hidden Value. The Journal of Electronic Publishing, Vol. 7.
Chang, K. C.-C., He, B., Li, C., Patel, M., Zhang, Z. 2004. Structured Databases on the Web: Observations and Implications. ACM SIGMOD Record, Vol. 33, n. 3, 61-70.
Chang, K. C.-C., He, B., Zhang, Z. 2005. Toward large scale integration: Building a MetaQuerier over databases on the web. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR), 44-55.
Doan, A., Halevy, A., Ives, Z. 2012. Principles of Data Integration. Elsevier.
Furche, T., Gottlob, G., Grasso, G., Guo, X., Orsi, G., Schallhart, C. 2011. Real understanding of real estate forms. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics, Article No. 13.
Furche, T., Gottlob, G., Grasso, G., Guo, X., Orsi, G., Schallhart, C. 2013. The Ontological Key: Automatically Understanding and Integrating Forms to Access the Deep Web. VLDB Journal, Volume 22, Issue 5, 615-640. DOI = http://doi.acm.org/10.1007/s00778-013-0323-0.
He, H., Meng, W., Yu, C., Wu, Z. 2005. WISE-Integrator: A System for extracting and integrating complex web search interfaces of deep web. In Proceedings of the 31st VLDB Conference, p. 1314.
He, H., Meng, W., Yu, C., Wu, Z. 2005. Constructing interface schemas for search interfaces of web