solution for (i) semantically annotating the content
of documents and (ii) populating the ontology with
the new instances found in the documents
(Amardeilh, 2007). The solution uses domain-
specific knowledge acquisition rules which link the
results obtained from the information extraction
tools to the ontology elements, thus creating a more
formal representation (RDF or OWL) of the
document content (Amardeilh, 2006). The OntoPop
methodology has certain limitations: it does not
resolve synonyms, nor does it disambiguate multiple
instances that share the same lexical
representation. In this paper, we
address the identified limitations by extending the
OntoPop methodology with new processing steps
before populating the ontology.
SOBA is a system designed to create a soccer-specific
knowledge base from heterogeneous sources
(Buitelaar et al., 2006). The system performs (i)
automatic document retrieval from the Web, (ii)
linguistic annotation and information extraction
using the Heart-of-Gold approach (Schäfer, 2007)
and (iii) mapping of the annotated document parts
onto ontology elements (Buitelaar et al., 2006). Our
approach performs information extraction from
unstructured text and document annotation for a
specific domain and uses reasoning on the ontology
to infer properties for the newly added instances.
Ontea performs semi-automatic annotation using
regular expressions combined with lemmatization
and indexing mechanisms (Laclavik et al., 2007).
The methodology was implemented and tested on
English and Slovak content. Our system was
designed to process documents in multiple languages,
including Latin languages, and so far it has provided
good results for a corpus of Romanian documents
by using resources specific to the Romanian
language.
3 ARCHIVAL DOMAIN MODEL
This paper proposes a generic representation of the
archival domain as illustrated in Figure 1. The
archival domain is modelled starting from the raw
medieval documents provided by the Cluj County
National Archives (CCNA, 2008). These documents
are handwritten and contain many embellishments,
which makes them hard to process automatically.
Because of this difficulty, our case studies use
the document summaries written by the
archivists (see Figure 2).
Within our model, the central element is the
document. Documents belong to a specific domain
such as the historical domain or the medical domain.
In our research we have used the historical archival
domain, formally represented as domain knowledge
by means of a domain ontology (concepts and
relations) and rules. Documents can be obtained
from several data sources, such as external databases,
Web sites or digitized manuscripts.
Figure 1: The archival domain model.
The document content (see Figure 2) is expressed
in natural language in an unstructured manner. In
our case study, the document content actually
represents a summary of the associated original
document. Several documents may be related to one
another because they refer to the same topics,
even if they do not share the same
lexical representations (e.g. names, events). The
document also features a set of technical data, such
as the date of issue, archival fund or catalogue
number. In the case of the document shown in
Figure 2, the technical data specifies the document
number (“235”), the language in which the raw
document was written (“Latin”) and the edition in
which the original document has appeared
(“Zimmermaan-Werner 1892 –I, nr.169”).
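As an illustration only, the document-centred model described above could be sketched as a small Python data structure. The class and field names (`TechnicalData`, `ArchivalDocument`, `source`, `related`) are assumptions made for this sketch and are not taken from our implementation; the sample values are the ones quoted for the document in Figure 2, and the `source` and `content` values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TechnicalData:
    """Technical data attached to a document (hypothetical field names)."""
    document_number: str
    language: Optional[str] = None        # language of the raw document
    edition: Optional[str] = None         # edition in which the original appeared
    date_of_issue: Optional[str] = None
    archival_fund: Optional[str] = None


@dataclass
class ArchivalDocument:
    """Central element of the model: a document belonging to a domain."""
    domain: str                           # e.g. "historical" or "medical"
    source: str                           # e.g. database, Web site, manuscript
    content: str                          # unstructured natural-language text
    technical_data: TechnicalData
    # Documents that refer to the same topics (assumed representation).
    related: List["ArchivalDocument"] = field(default_factory=list)


# Values quoted in the text for Figure 2; source and content are placeholders.
doc = ArchivalDocument(
    domain="historical",
    source="digitized manuscript",
    content="(summary of the original document written by the archivists)",
    technical_data=TechnicalData(
        document_number="235",
        language="Latin",
        edition="Zimmermaan-Werner 1892 –I, nr.169",
    ),
)
```

Keeping the technical data in a separate record mirrors the distinction the model draws between the unstructured content and the structured metadata of a document.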
When searching the archival documents, it is
important to identify all documents that are related
to a specified topic. To enable information retrieval
from all relevant documents, the domain knowledge
is used to add a semantic mark-up level to the
document content.
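A minimal sketch may help to show why a semantic mark-up level supports retrieval of documents that share a topic but not a wording. The lexicon, the concept identifiers, the sample texts and the function names below are all invented for illustration; the approach described in this paper relies on a domain ontology and rules rather than on a flat dictionary.

```python
from typing import Dict, List, Set

# Hypothetical domain knowledge: surface forms mapped to concept identifiers.
LEXICON: Dict[str, str] = {
    "king": "concept:Ruler",
    "monarch": "concept:Ruler",
    "donation": "concept:LandGrant",
    "grant": "concept:LandGrant",
}


def annotate(text: str) -> Set[str]:
    """Return the concept identifiers whose surface forms occur in the text."""
    tokens = {token.strip(".,;:()").lower() for token in text.split()}
    return {concept for form, concept in LEXICON.items() if form in tokens}


def search(topic: str, documents: Dict[str, str]) -> List[str]:
    """Return the identifiers of documents annotated with the given concept."""
    return sorted(
        doc_id for doc_id, text in documents.items() if topic in annotate(text)
    )


# Made-up summaries, not taken from the archive.
documents = {
    "doc1": "The king confirms a donation to the monastery.",
    "doc2": "The monarch renews the grant.",
    "doc3": "A dispute between two villages is settled.",
}

print(search("concept:Ruler", documents))  # ['doc1', 'doc2']
```

Here `doc1` and `doc2` are both returned for the same concept although they use different words, which is the effect the semantic mark-up is meant to achieve.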
Figure 2: Example of a document which contains technical
data and the summary of the original archival document.
WEBIST 2009 - 5th International Conference on Web Information Systems and Technologies