the term “storm” with a relevance index greater than
50%. When evaluating an aggregation, the degree of
contribution of the facts described in a news item
must be proportional to the relevance of this news
item for the studied topic.
SELECT Avg(Paragraph/Rainfall)
FROM //Local_Section//News_Item
WHERE Title contains ‘storm’ > 0.5
GROUP BY CUBE (Paragraph/Location)
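One possible reading of the proportionality requirement is a relevance-weighted mean, in which each fact is weighted by the relevance of the news item that contains it. The following Python sketch illustrates that reading only; the sample facts, the function name weighted_avg_cube and the "ALL" key standing for the CUBE grand total are hypothetical and are not part of the query language shown above.

```python
from collections import defaultdict

# Hypothetical facts: (location, rainfall, relevance of the news item to "storm").
FACTS = [
    ("Valencia", 40.0, 0.9),
    ("Valencia", 20.0, 0.6),
    ("Castellon", 30.0, 0.8),
    ("Castellon", 90.0, 0.3),  # relevance not above 0.5, so it is filtered out
]


def weighted_avg_cube(facts, threshold=0.5):
    """Relevance-weighted average rainfall per location plus a grand total.

    Assumption: "contribution proportional to relevance" is modelled as a
    weighted mean, sum(w * x) / sum(w), with w the news item relevance.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # key -> [sum(w * x), sum(w)]
    for location, rainfall, relevance in facts:
        if relevance <= threshold:  # mirrors the "> 0.5" condition
            continue
        # CUBE over one dimension: the location itself and the "ALL" total.
        for key in (location, "ALL"):
            sums[key][0] += relevance * rainfall
            sums[key][1] += relevance
    return {key: num / den for key, (num, den) in sums.items()}


RESULT = weighted_avg_cube(FACTS)
for key, value in sorted(RESULT.items()):
    print(key, round(value, 2))
```

With these sample values, the two Valencia facts yield an average of about 32 rather than the unweighted 30, because the more relevant news item contributes more.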
3.5 IR terms as a dimension
Finally, the next query shows how the terms
specified in IR expressions can be used as an
additional analysis dimension. In this context,
thesauri and ontologies would allow classification
hierarchies to be defined over this new dimension.
In this way, the query below provides a new
dimension to study the aggregation of those news
items that contain the term “tidal wave”, or the
aggregation for the term “flood”. Note that these
two aggregations can be joined at higher levels of
the term hierarchy (e.g. “natural disaster”).
SELECT Avg(Paragraph/Rainfall)
FROM //Local_Section//News_Item
WHERE Title contains ‘tidal wave| flood’ > 0.5
GROUP BY CUBE (‘tidal wave| flood’, Paragraph/Location)
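The behaviour of the term dimension can be sketched in Python as follows. This is an illustration under several assumptions rather than the semantics of the query language: the term hierarchy, the sample facts and the function term_cube are hypothetical, the threshold is applied to each term separately, and when several terms are merged into one concept the highest relevance is used as the weight.

```python
from collections import defaultdict
from itertools import product

# Hypothetical term hierarchy (e.g. from a thesaurus): term -> parent concept.
TERM_PARENT = {"tidal wave": "natural disaster", "flood": "natural disaster"}

# Hypothetical facts: (location, rainfall, {term: relevance of the title}).
FACTS = [
    ("Valencia", 40.0, {"flood": 0.9}),
    ("Valencia", 10.0, {"tidal wave": 0.7}),
    ("Castellon", 30.0, {"flood": 0.8, "tidal wave": 0.6}),
    ("Castellon", 50.0, {"flood": 0.2}),  # no term exceeds the threshold
]


def term_cube(facts, threshold=0.5, roll_up=False):
    """Relevance-weighted average rainfall per (term, location) cube cell."""
    cells = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum(w * x), sum(w)]
    for location, rainfall, relevances in facts:
        weights = {}
        for term, relevance in relevances.items():
            if relevance <= threshold:
                continue
            # Optionally climb one level of the term hierarchy.
            value = TERM_PARENT[term] if roll_up else term
            # Assumption: merged terms keep the highest relevance.
            weights[value] = max(relevance, weights.get(value, 0.0))
        if not weights:
            continue
        # The "ALL" member counts each fact once, with its best relevance.
        weights["ALL"] = max(weights.values())
        for (value, weight), loc in product(weights.items(), (location, "ALL")):
            cell = cells[(value, loc)]
            cell[0] += weight * rainfall
            cell[1] += weight
    return {key: num / den for key, (num, den) in cells.items()}


BY_TERM = term_cube(FACTS)                   # cells for "flood", "tidal wave"
BY_CONCEPT = term_cube(FACTS, roll_up=True)  # cells for "natural disaster"
```

Calling term_cube with roll_up=True joins the “tidal wave” and “flood” cells into a single “natural disaster” cell per location, which corresponds to moving one level up the term hierarchy.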
4 CONCLUSIONS
In this paper we have explained how the data
warehouse and digital library communities, each
from its particular point of view (structured vs.
unstructured information, respectively), can both
benefit from semi-structured information. Recently,
some proposals have started to extend traditional
data warehouse technology towards semi-structured
information. However, to date none of these
approaches fully exploits all the properties of
these documents. Throughout this work we have
identified a set of requirements for this novel
technology: semi-structured information warehouses.
In our opinion, the development of such systems
must be based on an underlying document model
able to exploit the nature of this kind of information.
We are currently working on a semi-structured
model that combines IR techniques with the
evaluation of structural conditions to query a
collection of XML documents, and in which the facts
described in the selected documents are ranked by
relevance.
For the future, we plan to design a semi-structured
warehouse model built on top of this document
model. In order to take the relevance of facts into
account in the warehouse model, the semantics of
the aggregation operations will have to be revised.
In this context, we find of interest some works that
study the management of imprecise information
(Pedersen et al., 1999; Rundensteiner and Bic, 1992).
REFERENCES
Kimball, R., 2002. The Data Warehouse Toolkit. John
Wiley & Sons.
Codd, E. F.; Codd, S. B. and Salley, C.T., 1993. Providing
OLAP to user-analysts: An IT mandate. Technical
Report, E.F. Codd & Associates.
Baeza-Yates, R. and Ribeiro-Neto, B., 1999. Modern
Information Retrieval. Addison-Wesley.
Navarro, G. and Baeza-Yates, R., 1997. Proximal Nodes:
A Model to Query Document Databases by Contents
and Structure. ACM Trans. on Information Systems.
Xyleme, L., 2001. A dynamic warehouse for XML data of
the Web. IEEE Data Engineering Bulletin 24(2).
Aramburu, M. J. and Berlanga, R., 2001. A Temporal
Object-Oriented Model for Digital Libraries of
Documents. Concurrency: Practice and Experience
13(11), John Wiley.
Binh, N. T.; Tjoa, A. M. and Mangisengi, O., 2001. Meta
Cube-X: An XML Metadata Foundation for
Interoperability Search among Web Warehouses.
Proc. Intl. Workshop on Design and Management of
Data Warehouses.
Mangisengi, O.; Huber, J.; Hawel, C. and Essmayr, W.,
2001. A Framework for Supporting Interoperability of
Data Warehouse Islands Using XML. Proc. of the 3rd
Intl. Conference on Data Warehousing and
Knowledge Discovery. LNCS 2114.
Ishikawa, H. et al., 1999. Document Warehousing Based
on a Multimedia Database System. Proc. IEEE 15th
Intl. Conference on Data Engineering, pp. 168–173.
Pedersen, D.; Riis, K. and Pedersen, T. B., 2002.
XML-Extended OLAP Querying. Technical Report,
Department of Computer Science, Aalborg University.
Pedersen, T. B.; Jensen, C. S. and Dyreson, C. E., 1999.
Supporting Imprecision in Multidimensional
Databases Using Granularities. Proc. of the Eleventh
International Conference on Scientific and Statistical
Database Management, pp. 90–101.
Rundensteiner, E. and Bic, L., 1992. Evaluating
Aggregates in Possibilistic Relational Databases.
DKE, 7(3):239–267.
ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION