languages which make it possible to manage the
heterogeneity of document structure, content and
associated metadata.
2.1 Modelling
A document repository organizes and structures
information for content retrieval. In this context,
document modelling is one of the key issues.
Modelling determines which information should be
stored in a repository and reflects the
relationships between the document parts. To
handle the various types of data, including text,
images, video and audio, several models have been
proposed. These models can be classified into two
categories according to how completely they cover
the description of a multimedia document.
The first category gathers works that aim at
modelling each type of media separately. These
approaches do not handle the combination of several
media within a single document. (Loisant et al., 2002)
propose a metamodel that can be used to describe
any type of media. The goal of this metamodel is to
provide a media-independent base from which
specific models are generated, each corresponding
to only one type of media. (Moënne-Loccoz et al., 2004)
provide a model that handles the specificities of
video documents. This model takes into account
the temporal aspect and the diversity of video
document descriptors (high and low level).
The models of the second category cover all the
media composing the document. They capture the
links that connect the various mono-media
components of the same document. (Amous et al.,
2002) extend classic approaches by adding a set of
metadata specific to each type of media in order to
formalize information relating to the document
content. (Darmont et al., 2002) propose an
approach that represents multimedia documents
in a unified format using XML.
This facilitates their structuring in document
databases. More precisely, they propose a conceptual
model that generalizes any type of document and
represents it in the form of a complex object. They
use some characteristics (name, keywords, duration,
etc.) of these documents to index them.
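
The idea of a single complex object indexed by a few
characteristics can be illustrated with a minimal
sketch. It is not the conceptual model of (Darmont
et al., 2002): the class MediaDocument, its fields and
the function build_keyword_index are hypothetical
names chosen here for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MediaDocument:
    """Hypothetical complex object describing a document of any media type."""
    name: str
    media_type: str                       # e.g. "text", "video", "audio"
    keywords: List[str] = field(default_factory=list)
    duration: Optional[float] = None      # seconds; None for non-temporal media


def build_keyword_index(documents):
    """Map each lower-cased keyword to the names of the documents carrying it."""
    index = defaultdict(list)
    for doc in documents:
        for keyword in doc.keywords:
            index[keyword.lower()].append(doc.name)
    return dict(index)


# Illustrative data only: one video and one text document share a keyword.
docs = [
    MediaDocument("lecture.mp4", "video", ["XML", "modelling"], duration=1800.0),
    MediaDocument("report.xml", "text", ["XML", "repository"]),
]
keyword_index = build_keyword_index(docs)
```

Because both documents share the same object type, the
index does not need to know which media it is dealing
with; media-specific attributes such as duration simply
remain empty when they do not apply.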
All of these works assume that semi-structured
documents do not always have a pre-defined
structure and that each document has its own
structure. Nevertheless,
documents describing the same type of information
or serving the same intended use (example co,
documentary programme, etc.) usually have similar
structures and/or are annotated with the same set of
metadata. It would therefore be useful to
detect these similarities and to derive generic
document classes rather than remaining at the level
of individual documents. Using these generic classes
facilitates the exploitation of the contents of large
document repositories by restricting search to the
relevant collection (Mbarki and Soulé-Dupuy, 2004).
Moreover, most of these models do not
clearly separate the description of the document
structure from that of its content. This
lack of clarity makes documents harder to handle.
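
One simple way to picture the derivation of generic
classes from structural similarity is sketched below.
It is only an illustration under strong assumptions,
not the method of (Mbarki and Soulé-Dupuy, 2004): two
documents are placed in the same class only when their
sets of tag paths are identical, whereas a realistic
approach would tolerate partial similarity. The
function names and the sample documents are
hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def structure_signature(xml_text):
    """Return the sorted set of root-to-element tag paths of a document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return tuple(sorted(paths))


def group_by_structure(documents):
    """Group document names whose structure signatures are identical."""
    classes = defaultdict(list)
    for name, xml_text in documents.items():
        classes[structure_signature(xml_text)].append(name)
    return list(classes.values())


# Illustrative documents: the two reports differ in content and in the
# number of sections, but not in the kinds of elements they contain.
documents = {
    "report1": "<report><title>A</title><section><title>B</title></section></report>",
    "report2": "<report><title>C</title><section><title>D</title></section>"
               "<section><title>E</title></section></report>",
    "clip": "<video><title>F</title><shot/></video>",
}
groups = group_by_structure(documents)
```

The signature ignores text content and element
repetition, so it depends on structure alone; this also
shows why keeping structure and content descriptions
separate makes such grouping easier.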
2.2 Documents Exploitation
With semi-structured, and especially multimedia,
documents, search cannot be based solely on a
predefined tabular schema as in classic databases.
Such a schema would not allow the exploitation of the
different metadata used to annotate document
content. This need gave rise to a new generation of
query languages.
The contribution of LOREL (Lightweight Object
Repository Language) (Goldman et al., 1999)
lies in the flexibility with which semi-structured
documents can be queried, even when their
structure is not known. This ease of use comes from
the introduction of path expressions.
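
The principle of a path expression over data whose
structure is not known in advance can be sketched as
follows. This is not LOREL syntax or its
implementation: the functions children and match_path,
the use of "#" as a wildcard for any sequence of
labels, and the sample data are assumptions made for
this example.

```python
def children(node):
    """Yield (label, child) pairs; items of a list share the list's label."""
    if isinstance(node, dict):
        for label, value in node.items():
            if isinstance(value, list):
                for item in value:
                    yield label, item
            else:
                yield label, value


def match_path(node, path):
    """Return every value reached by following `path` (a list of labels).

    The hypothetical wildcard "#" stands for zero or more arbitrary labels.
    """
    if not path:
        return [node]
    head, rest = path[0], path[1:]
    results = []
    if head == "#":
        # The wildcard may consume no label at all ...
        results.extend(match_path(node, rest))
        # ... or at least one: descend and keep the wildcard active.
        for _, child in children(node):
            results.extend(match_path(child, path))
    else:
        for label, child in children(node):
            if label == head:
                results.extend(match_path(child, rest))
    return results


# Illustrative repository: the two documents do not share a structure.
repository = {
    "document": [
        {"title": "Annual report", "section": [{"title": "Summary"}]},
        {"video": {"title": "Interview", "duration": 95}},
    ]
}
titles = match_path(repository, ["document", "#", "title"])
```

The query finds every title regardless of how deeply it
is nested, so the user does not have to know whether a
title belongs to a document, a section or a video.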
XQuery (a Query Language for XML) (W3C, 2003)
makes it possible to query an XML document according
to different criteria. This language is often called
the SQL of XML. It supports the formulation of complex
queries and is a hybrid between XPath and
SQL. It can also traverse the tree
of an XML document, retrieve the information
required by the user and create a new document
containing only the needed granules.
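
The pattern described here (traverse the tree, select
the required elements, build a new document from them)
can be illustrated without an XQuery processor. The
sketch below uses the limited XPath subset of Python's
standard xml.etree.ElementTree module rather than
XQuery itself; the sample document and the function
extract_titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative source document, not taken from any real collection.
SOURCE = """
<library>
  <article year="2003"><title>Querying XML</title><author>A. Author</author></article>
  <article year="1999"><title>Semi-structured data</title><author>B. Writer</author></article>
</library>
"""


def extract_titles(xml_text, year):
    """Build a new <result> document holding only the titles from `year`."""
    root = ET.fromstring(xml_text)
    result = ET.Element("result")
    # ElementTree supports predicates of the form [@attribute='value'].
    for title in root.findall(f"./article[@year='{year}']/title"):
        item = ET.SubElement(result, "title")
        item.text = title.text
    return result


result = extract_titles(SOURCE, "2003")
print(ET.tostring(result, encoding="unicode"))
```

In XQuery the selection and the construction of the
result would be written in a single expression; here
they are two explicit steps, which makes the role of
the path expression easier to see.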
The power of these query languages
lies in the manipulation of document structure.
The documents they handle are
generally made up of short and precise elements
(title, authors, etc.). To query and manipulate
documents that contain large elements
(section, paragraph, etc.), such languages must be
combined with information retrieval techniques.
To overcome this shortcoming, we
propose a specific language for querying
documents according to both their structure and their
content. The notable difference between our
proposal and the previous languages concerns the
complexity of queries. In our approach, the
user can see the document organization (tree) while
querying. The graphical language that we
propose provides, on the one hand, better
management and understanding of the document
composition and, on the other hand, easy querying,
because the user needs no prior
knowledge of query languages.