Authors:
Jun Wang 1 and Kanji Uchino 2
Affiliations:
1 Fujitsu R&D Center Co., Ltd., China; 2 Fujitsu Laboratories, Ltd., Japan
Keyword(s):
RSS, Metadata, Information Extraction, Knowledge Management
Related Ontology Subjects/Areas/Topics:
Biomedical Engineering; Data Engineering; Enterprise Information Systems; Health Information Systems; Information Systems Analysis and Specification; Internet Technology; Knowledge Management; Metadata and Metamodeling; Ontologies and the Semantic Web; Society, e-Business and e-Government; Web Information Systems and Technologies; Web Interfaces and Applications; Web Personalization; XML and Data Management
Abstract:
Although RSS is a promising way to track and personalize the flow of new Web information, many current Web sites do not yet provide RSS feeds. Convenient approaches to “RSSify” suitable existing Web content have therefore become a pressing need. This paper presents EHTML2RSS, an efficient system that translates semi-structured HTML pages into structured RSS feeds, using different approaches depending on the features of the HTML page. For information items that carry a release time, the system provides an automatic approach based on time pattern discovery. A second automatic approach, based on repeated tag pattern discovery, converts regular pages that lack a time pattern. A semi-automatic approach based on labelling handles irregular pages or specific sections of Web pages according to the user’s requirements. Experimental results show that the system is efficient and effective in facilitating RSS feed generation.
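To make the time-pattern idea concrete, the following is a minimal sketch and not the EHTML2RSS implementation. It assumes that release times appear as numeric year-month-day strings just before each item's link, and that dates can be treated as UTC midnight. The names `DATE_RE`, `LinkDateCollector`, and `html_to_rss` are hypothetical and chosen only for illustration.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# Assumed time pattern: YYYY-MM-DD, YYYY/MM/DD or YYYY.MM.DD.
# The patterns actually discovered by the paper's system are not given here.
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")


class LinkDateCollector(HTMLParser):
    """Pairs each link with the most recent date seen before it."""

    def __init__(self):
        super().__init__()
        self.items = []    # (datetime, title, href) triples
        self._date = None  # last date seen outside a link
        self._href = None  # href of the link currently open, if any
        self._text = []    # text fragments inside the open link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
            return
        match = DATE_RE.search(data)
        if match:
            year, month, day = (int(g) for g in match.groups())
            try:
                # Assumption: treat the release date as UTC midnight.
                self._date = datetime(year, month, day, tzinfo=timezone.utc)
            except ValueError:
                pass  # looks like a date but is not a valid calendar day

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if self._date is not None and title:
                self.items.append((self._date, title, self._href))
                self._date = None  # each date is used for one item only
            self._href = None


def html_to_rss(html, channel_title, channel_link):
    """Return an RSS 2.0 document (as a string) for the dated links in html."""
    collector = LinkDateCollector()
    collector.feed(html)
    collector.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Generated from " + channel_link
    for date, title, href in collector.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
        # RSS 2.0 expects RFC 822 style dates in pubDate.
        ET.SubElement(item, "pubDate").text = format_datetime(date, usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

Links with no preceding date are skipped in this sketch. Per the abstract, such pages would instead be handled by the repeated tag pattern or labelling approaches, which are not shown here.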