Authors: Lorenzo Blanco; Valter Crescenzi and Paolo Merialdo

Affiliation: Università Roma Tre, Italy

Keyword(s): Information Extraction, Wrapper Induction, Web mining.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Collaboration and e-Services; Databases in Web Applications; e-Business; Enterprise Information Systems; Internet Technology; Knowledge Discovery and Information Retrieval; Knowledge Engineering and Ontology Development; Knowledge-Based Systems; Semantic Web; Soft Computing; Symbolic Systems; Web Information Systems and Technologies; Web Mining

Abstract: Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts which retrieve data from a back-end database and embed them into HTML templates. Based on this observation, several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of data from large web sites, but currently their input sample pages have to be collected manually. To scale the data extraction process, this task should be automated as well. We present techniques for automatically gathering structurally similar pages from large web sites. We have developed an algorithm that takes as input one sample page and crawls the site to find pages similar in structure to the given page. The collected pages can feed an automatic wrapper generator to extract data. Experiments conducted over real-life web sites gave us encouraging results.
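As a rough illustration of what "similar in structure" could mean in practice, the sketch below compares two HTML pages by the set of root-to-element tag paths they contain and scores the overlap with a Jaccard ratio. This is an assumption made for illustration only, not the algorithm described in the paper: the names `tag_paths` and `structural_similarity` are hypothetical, and the abstract does not specify the actual similarity measure or the crawling strategy.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/table"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; stray closing
        # tags with no matching open element are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Hypothetical helper: the set of tag paths occurring in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard overlap of tag-path sets (an assumed measure, not the paper's)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under this assumed measure, two pages generated from the same template score high even when their data differ, because repeated rows add no new paths. A crawler built on it might keep a linked page only when its score against the sample page exceeds some threshold, which would have to be tuned per site.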

CC BY-NC-ND 4.0

Paper citation in several formats:
Blanco, L.; Crescenzi, V. and Merialdo, P. (2005). EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP. In Proceedings of the First International Conference on Web Information Systems and Technologies - WEBIST; ISBN 972-8865-20-1; ISSN 2184-3252, SciTePress, pages 247-254. DOI: 10.5220/0001234202470254

@conference{webist05,
author={Lorenzo Blanco and Valter Crescenzi and Paolo Merialdo},
title={EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - WEBIST},
year={2005},
pages={247-254},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001234202470254},
isbn={972-8865-20-1},
issn={2184-3252},
}

TY - CONF

JO - Proceedings of the First International Conference on Web Information Systems and Technologies - WEBIST
TI - EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP
SN - 972-8865-20-1
IS - 2184-3252
AU - Blanco, L.
AU - Crescenzi, V.
AU - Merialdo, P.
PY - 2005
SP - 247
EP - 254
DO - 10.5220/0001234202470254
PB - SciTePress