Authors:
Rachid Aknouche; Ounas Asfari; Fadila Bentayeb and Omar Boussaid
Affiliation:
ERIC Laboratory, France
Keyword(s):
Extract-Transform-Load, Textual Data, Text Warehousing, Text Warehouse Model, TWM, data integration, decisional architecture, information retrieval, 20 Newsgroups
Abstract:
In this paper, we propose an original approach to the text warehousing process. It is based on a decisional architecture that combines classical data warehousing tasks with information retrieval (IR) techniques. We first propose a new ETL process, named ETL-Text, for textual data integration, and then present a new Text Warehouse Model, denoted TWM, which takes into account both the structure and the semantics of the textual data. TWM is associated with new dimension types, including a metadata dimension and a semantic dimension. In addition, we propose a new analysis measure based on language modeling, which is widely used in the IR area. Moreover, our approach relies on Wikipedia as an external knowledge source to extract the semantics of the textual documents. To validate our approach, we develop a prototype composed of several processing modules that illustrate the different steps of ETL-Text. We also use the 20 Newsgroups corpus to perform our experiments.