Authors: Daniel Nikovski (1); Alan Esenther (1) and Akihiro Baba (2)

Affiliations: (1) Mitsubishi Electric Research Laboratories, United States; (2) Mitsubishi Electric Corporation, Japan

ISBN: 978-989-8111-84-5

Keyword(s): Service oriented architectures, System integration, Information extraction, Web mining.

Related Ontology Subjects/Areas/Topics: Coupling and Integrating Heterogeneous Data Sources ; Databases and Information Systems Integration ; Enterprise Information Systems ; Legacy Systems ; Web Databases

Abstract: We propose two methods for constructing automated programs for extraction of information from a class of web pages that are very common and of high practical significance: variable-length lists of records with identical structure. Whereas most existing methods would require multiple example instances of the target web page in order to be able to construct extraction rules, our algorithms require only a single example instance. The first method analyzes the document object model (DOM) tree of the web page to identify repeatable structure that includes all of the specified data fields of interest. The second method provides an interactive way of discovering the list node of the DOM tree by visualizing the correspondence between portions of XPath expressions and visual elements in the web page. Both methods construct extraction rules in the form of XPath expressions, facilitating ease of deployment and integration with other information systems.
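
Illustrative sketch (not the authors' algorithm): the abstract does not give implementation details, so the following Python fragment only illustrates the general idea of deriving XPath extraction rules for a variable-length list from a single example record. The sample page, the field names, and the helpers `common_prefix`, `generalize`, and `extract` are all hypothetical. It assumes that the user supplies the XPaths of at least two fields of one example record, that those paths first diverge below the record node, and that the records are sibling elements with the same tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical single example page: a list of records with identical structure.
PAGE = """
<html><body>
  <h1>Stock</h1>
  <table>
    <tr><td>Widget</td><td>4.50</td></tr>
    <tr><td>Gadget</td><td>12.00</td></tr>
    <tr><td>Gizmo</td><td>7.25</td></tr>
  </table>
</body></html>
"""

def common_prefix(paths):
    """Return the longest run of leading steps shared by all step lists."""
    prefix = []
    for steps in zip(*paths):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return prefix

def generalize(field_paths):
    """Turn XPaths of fields in ONE example record into a list rule plus
    per-field rules relative to each record (assumption: >= 2 fields)."""
    split = [p.split("/") for p in field_paths.values()]
    record = common_prefix(split)  # steps leading to the example record node
    # Drop the positional predicate on the record step so that the rule
    # matches every sibling record instead of only the example one.
    list_steps = record[:-1] + [record[-1].split("[")[0]]
    rel = {name: "/".join(p.split("/")[len(record):])
           for name, p in field_paths.items()}
    return "/".join(list_steps), rel

def extract(root, list_xpath, rel):
    """Apply the list rule, then the relative field rules, to a parsed page."""
    return [{name: rec.findtext(path) for name, path in rel.items()}
            for rec in root.findall(list_xpath)]

# Fields of interest, marked on the first record only.
example_fields = {
    "name": "./body/table/tr[1]/td[1]",
    "price": "./body/table/tr[1]/td[2]",
}

root = ET.fromstring(PAGE.strip())
list_xpath, rel = generalize(example_fields)
records = extract(root, list_xpath, rel)
print(list_xpath)
for row in records:
    print(row)
```

Because the derived rules are ordinary XPath expressions, they could be applied unchanged to a later version of the page with more or fewer rows, which is the property the abstract highlights for deployment. The sketch relies on the limited XPath subset supported by Python's `xml.etree.ElementTree` and on well-formed markup; real web pages would typically need an HTML-tolerant parser.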

CC BY-NC-ND 4.0

Paper citation in several formats:
Nikovski D.; Esenther A. and Baba A. (2009). SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS. In Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 3: ICEIS, ISBN 978-989-8111-84-5, pages 261-266. DOI: 10.5220/0001858402610266

@conference{iceis09,
author={Daniel Nikovski and Alan Esenther and Akihiro Baba},
title={SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS},
booktitle={Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 3: ICEIS},
year={2009},
pages={261-266},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001858402610266},
isbn={978-989-8111-84-5},
}

TY - CONF

JO - Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 3: ICEIS
TI - SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS
SN - 978-989-8111-84-5
AU - Nikovski, D.
AU - Esenther, A.
AU - Baba, A.
PY - 2009
SP - 261
EP - 266
DO - 10.5220/0001858402610266
