hazardous because it requires no guidance to mark
these cases as failures. In the latter cases, however,
the extracted data must be rejected manually.
Using automatic extraction with existing domain
knowledge, 85% of the extracted product attributes
were correct and 10% were bogus data. On average, 23 of
27 available product attributes were correctly
extracted and one false positive was mined.
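To make the per-product averages concrete, the following minimal sketch derives recall and precision from the reported counts. It assumes that the 23 correct attributes and the single false positive refer to the same average product page; the function name `extraction_metrics` is illustrative and not part of the described system.

```python
def extraction_metrics(correct, available, false_positives):
    """Return (recall, precision) for one product page.

    correct         -- attributes extracted correctly
    available       -- attributes actually present on the page
    false_positives -- extracted items that are not real attributes
    """
    extracted = correct + false_positives
    recall = correct / available if available else 0.0
    precision = correct / extracted if extracted else 0.0
    return recall, precision


# Average counts reported in the evaluation (assumed to describe one page).
recall, precision = extraction_metrics(correct=23, available=27, false_positives=1)
print(f"recall={recall:.3f} precision={precision:.3f}")
# prints "recall=0.852 precision=0.958"
```

Under this reading, the recall of roughly 0.85 matches the reported share of correctly extracted attributes.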
Overall, the information extraction component
showed that the approach is feasible. Assuming that the
algorithms are included in an information platform
used by consumers, users are expected to provide
extraction hints to the system in a wiki-like form.
After some running time and intensive collection
of domain knowledge, the extraction success should
increase further, so that information extraction by
crawling becomes necessary in only very few cases.
6 CONCLUSIONS
In this paper we presented algorithms for locating
and extracting product information from websites
given only a product name and
its producer’s name. While the retrieval algorithm
was developed from scratch, the extraction
algorithm extends the previous work presented in
Section 2, leveraging in particular the special
characteristics of product detail pages. The
evaluation showed the feasibility of the approaches.
Both the retrieval and the extraction component
generated better results when supplied with
domain knowledge used for bootstrapping. Thus,
future research will focus on improving the system’s
learning component to automatically create
extensive domain knowledge at runtime.
Currently, additional algorithms are being
developed for mapping the extracted specification
keys to a central terminology and converting the
corresponding values to standard formats. This
would enable product comparisons at runtime.
Evaluations will examine the success of these
algorithms. Another direction of future research
is the automatic extension of the product
specification terminology, which is represented by an
ontology. Such an extension is expected to improve
the mapping algorithm’s evaluation results significantly.
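One way such a mapping and conversion step might look is sketched below. The synonym table, the unit factors, and the names `CENTRAL_TERMS`, `UNIT_FACTORS`, and `normalize_attribute` are hypothetical and chosen for illustration only; the algorithms under development and the ontology itself are not specified here.

```python
import re

# Hypothetical synonym table: extracted key (lowercased) -> central term.
CENTRAL_TERMS = {
    "weight": "weight",
    "net weight": "weight",
    "gewicht": "weight",
    "width": "width",
    "breite": "width",
}

# Hypothetical standard unit and conversion factors per central term.
UNIT_FACTORS = {
    "weight": ("kg", {"kg": 1.0, "g": 0.001, "lb": 0.45359237}),
    "width": ("mm", {"mm": 1.0, "cm": 10.0, "m": 1000.0}),
}

# A number (dot or comma as decimal separator) followed by a unit.
VALUE_PATTERN = re.compile(r"^\s*([0-9]+(?:[.,][0-9]+)?)\s*([a-zA-Z]+)\s*$")


def normalize_attribute(key, value):
    """Map an extracted key/value pair to (term, number, standard_unit).

    Returns None if the key is unknown, the value cannot be parsed,
    or the unit has no conversion factor.
    """
    term = CENTRAL_TERMS.get(key.strip().lower())
    if term is None:
        return None
    match = VALUE_PATTERN.match(value)
    if match is None:
        return None
    number = float(match.group(1).replace(",", "."))
    unit = match.group(2).lower()
    standard_unit, factors = UNIT_FACTORS[term]
    if unit not in factors:
        return None
    return term, round(number * factors[unit], 6), standard_unit


print(normalize_attribute("Net Weight", "1,5 kg"))
print(normalize_attribute("Breite", "30 cm"))
```

Once two products are normalized to the same term and unit, their values can be compared directly. Pairs that cannot be mapped return `None` and could be candidates for extending the terminology.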
Integrating this paper’s algorithms and the
described future extensions into a federated
consumer product information system would enable
users to create a comprehensive view of products of
interest and to compare those products effectively,
while requiring only a fraction of today’s effort for
gathering product information from the information
providers. In the same manner, the algorithms may be
integrated into enterprise product information
systems and online shopping systems, easing and
accelerating the process of implementing product
specifications.
REFERENCES
Arasu, A. and Garcia-Molina, H. (2003). Extracting
Structured Data from Web Pages. In SIGMOD
International Conference on Management of Data.
San Diego, CA, USA 10-12 June 2003. ACM Press:
New York.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M.
and Etzioni, O. (2007). Open Information Extraction
from the Web. In IJCAI 20th International Joint
Conference on Artificial Intelligence. Hyderabad,
India 9-12 January 2007. Morgan Kaufmann
Publishers Inc.: San Francisco.
Califf, M. E. and Mooney, R. J. (1997). Relational
Learning of Pattern-Match Rules for Information
Extraction. In ACL SIGNLL Meeting of the ACL
Special Interest Group in Natural Language Learning.
Madrid, Spain July 1997. T. M. Ellison: Madrid.
Chang, C.-H. and Lui, S.-C. (2001). IEPAD: Information
Extraction based on Pattern Discovery. In IW3C2 10th
International Conference on the World Wide Web.
Hong Kong, China 1-5 May 2001. ACM Press: New
York.
Crescenzi, V., Mecca, G. and Merialdo, P. (2001).
Roadrunner: Towards Automatic Data Extraction from
Large Web Sites. In VLDB Endowment 27th
International Conference on Very Large Data Bases.
Rome, Italy 11-14 September 2001. Morgan
Kaufmann Publishers Inc.: San Francisco.
Freitag, D. (1998). Information Extraction from HTML:
Application of a General Machine Learning Approach.
In AAAI 15th National Conference on Artificial
Intelligence. Madison, WI, USA 26-30 July 1998.
AAAI Press: Menlo Park.
Hsu, C.-N. and Dung, M.-T. (1998). Generating Finite-
State Transducers for Semi-Structured Data Extraction
from the Web. Journal of Information Systems, 23(8),
pp.521-538.
Kushmerick, N., Weld, D. S. and Doorenbos, R. (1997).
Wrapper Induction for Information Extraction. In
IJCAI 15th International Joint Conference on Artificial
Intelligence. Nagoya, Japan 23-29 August 1997.
Morgan Kaufmann Publishers Inc.: San Francisco.
Laender, A. H. F., Ribeiro-Neto, B. and da Silva, A. S.
(2002). DEByE - Data Extraction by Example. Data
and Knowledge Engineering, 40(2), pp.121-154.
Liu, B. (2007). Web Data Mining: Exploring Hyperlinks,
Contents, and Usage Data. Springer: Heidelberg.