traction and screen scraping has been outlined, while
the main approaches to it have been summarized. The
lack of an integrated framework for scraping data
from the web has been identified as a problem, and
this paper presents a framework that tries to fill this
gap.
An RDF model for web scraping has been defined at
the semantic scraping level. The objective of this RDF
model should not be confused with that of RDFa.
RDFa defines a format for marking up HTML elements
so that an RDF graph can be extracted from them. Our
model complements RDFa by allowing RDF graphs to
refer to data that is present in HTML fragments of an
unannotated HTML document. This enables an open
framework for web scraping. The task of building an
RDF graph out of a web document has been shown.
With this, a semantic screen scraper has been developed.
The semantic screen scraper produces RDF graphs
out of web documents and RDF-defined extractors,
which offer interoperable scraping information.
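As an illustration of the idea summarized above, the following minimal sketch maps fragments of an unannotated document to RDF-style triples. It is not the scraper developed in this work: the page, the base URI, the `scrape` function and the dictionary-based extractor are hypothetical stand-ins (the framework expresses extractors in RDF), and the choice of the SIOC and Dublin Core vocabularies is an assumption made only for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with no semantic annotations (no RDFa).
PAGE = """
<html><body>
  <div class="post"><h2>First post</h2><span class="author">alice</span></div>
  <div class="post"><h2>Second post</h2><span class="author">bob</span></div>
</body></html>
"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical extractor: a fragment selector, the RDF type assigned to each
# matched fragment, and one selector per property. A plain dict stands in for
# the RDF-defined extractor described in the text.
EXTRACTOR = {
    "type": "http://rdfs.org/sioc/ns#Post",
    "selector": ".//div[@class='post']",
    "properties": {
        "http://purl.org/dc/elements/1.1/title": "h2",
        "http://purl.org/dc/elements/1.1/creator": "span[@class='author']",
    },
}


def scrape(page, extractor, base="http://example.org/page"):
    """Return (subject, predicate, object) triples for each matched fragment."""
    root = ET.fromstring(page.strip())
    triples = []
    for i, fragment in enumerate(root.findall(extractor["selector"])):
        # Each HTML fragment is treated as a resource with its own URI.
        subject = f"{base}#fragment{i}"
        triples.append((subject, RDF_TYPE, extractor["type"]))
        for predicate, selector in extractor["properties"].items():
            node = fragment.find(selector)
            if node is not None and node.text:
                triples.append((subject, predicate, node.text.strip()))
    return triples


triples = scrape(PAGE, EXTRACTOR)
for triple in triples:
    print(triple)
```

Because the extractor is kept as data separate from the scraping code, the same function could be reused with a different mapping for another site, which is the kind of interoperability the RDF-defined extractors aim at.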
Future work involves experimenting with the automatic
construction of mappings out of incomplete
ones or unstructured HTML resources. While some
of the approaches to information extraction deal with
wrapper induction or vision-based approaches, the
modelling of web page fragments as web resources
changes the paradigm behind this task. Approaches
such as machine learning or graph analysis can be
combined and applied to this different scenario.
ACKNOWLEDGEMENTS
This research project is funded by the European
Commission under the R&D project OMELETTE
(FP7-ICT-2009-5), and by the Spanish Government
under the R&D projects Contenidos a la Carta
(TSI-020501-2008-114) and T2C2
(TIN2008-06739-C04-01).
ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence