Let us consider the case of a young researcher
looking for primary sources (leaflets, pictures, letters,
etc.) that report or comment on violent actions
performed by the police against students and workers
during the social protest in 1968. A digital “smart
archivist” would return all documents that refer in
some way to such actions, regardless of
the words actually used to report them in the primary
sources.
In order to reach such a result, a simple keyword
or tag-based search is not enough. As the long
tradition of studies in Knowledge Representation and
Reasoning in the field of Artificial Intelligence
tells us, in order to be “intelligent”, the system must
“know” the documents and grasp their content.
Therefore, the goal is to provide the system with
more machine-readable knowledge than is
actually conveyed by the words occurring in the
documents or in their textual metadata. Technically,
this means building a semantic layer over existing
archival metadata, including:
Computational ontologies (Guarino et al., 2009)
representing the semantic “vocabulary” (Goy et
al., 2015);
A knowledge base containing a detailed formal
description of: the events narrated in the
documents, the places where they happened, and the
people, organizations, and collectives involved in
them, together with the roles they played.
In order to guarantee the needed computational
interoperability, the standards of the Semantic Web
must be employed: OWL 2 (Hitzler et al., 2012) for
the computational ontologies and RDF (Hayes and
Patel-Schneider, 2014) and the Linked Data
principles (Heath and Bizer, 2011) for the knowledge
base.
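To make the idea of a semantic layer more concrete, the following is a minimal sketch, not the PRiSMHA implementation. It mimics RDF triples with plain Python tuples and uses only the standard library; every identifier (the `ex:` names, the class hierarchy, the properties `ex:hasAgent`, `ex:hasPatient` and `ex:reports`, and the sample texts) is a hypothetical assumption introduced for illustration. It shows how a query over event types and roles can retrieve a document that a keyword search misses.

```python
# Stdlib-only sketch: RDF-like triples as plain tuples.
# All names below (ex:..., properties, sample texts) are hypothetical.

SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

triples = {
    # Tiny "ontology": a hierarchy of event types.
    ("ex:BatonCharge", SUBCLASS_OF, "ex:ViolentAction"),
    ("ex:Beating", SUBCLASS_OF, "ex:ViolentAction"),
    ("ex:ViolentAction", SUBCLASS_OF, "ex:Event"),
    ("ex:Assembly", SUBCLASS_OF, "ex:Event"),
    # Tiny "knowledge base": events, participants with roles, documents.
    ("ex:event1", TYPE, "ex:BatonCharge"),
    ("ex:event1", "ex:hasAgent", "ex:police"),
    ("ex:event1", "ex:hasPatient", "ex:students"),
    ("ex:leaflet7", "ex:reports", "ex:event1"),
    ("ex:event2", TYPE, "ex:Assembly"),
    ("ex:event2", "ex:hasAgent", "ex:students"),
    ("ex:letter3", "ex:reports", "ex:event2"),
}

# Raw document texts, used only for the keyword baseline.
texts = {
    "ex:leaflet7": "They charged us with batons in front of the faculty.",
    "ex:letter3": "The assembly voted to continue the occupation.",
}


def subclasses_of(cls):
    """Return cls together with all its direct and indirect subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def documents_about(event_class, agent):
    """Documents reporting an event of the given class performed by agent."""
    classes = subclasses_of(event_class)
    events = {s for s, p, o in triples if p == TYPE and o in classes}
    events = {e for e in events if (e, "ex:hasAgent", agent) in triples}
    return sorted(s for s, p, o in triples
                  if p == "ex:reports" and o in events)


def keyword_search(word):
    """Baseline: documents whose text contains the given word."""
    return sorted(d for d, t in texts.items() if word.lower() in t.lower())


semantic_hits = documents_about("ex:ViolentAction", "ex:police")
keyword_hits = keyword_search("violent")
print(semantic_hits)  # ['ex:leaflet7']
print(keyword_hits)   # []
```

In this toy setting the leaflet never uses the word “violent”, yet it is retrieved because the event it reports is typed as a subclass of the queried class. In an actual Semantic Web setting, the class hierarchy would be expressed in OWL 2 and the triples stored as RDF, with the subclass inference delegated to a reasoner or query engine.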
However, providing a system with such deep and
complex knowledge is a well-known bottleneck for
knowledge-based systems (especially as regards
the knowledge acquisition step), which can threaten the
sustainability of the approach. One main goal of
PRiSMHA is to provide a solution to this
problem.
3 PRiSMHA: FRUITFUL SYNERGIES
The solution can be found by looking in two
directions:
Crowdsourcing collaborative approaches, if a
digital version of the archival resources is
available (Ashenfelder, 2015) (Beaudoin, 2015)
(MicroPasts, 2018).
Automatic Information Extraction techniques,
when full texts are available (Boschetti et al.,
2014). Note that automatic extraction from
documents other than texts (images, videos,
audio recordings) is currently outside the scope of
the project.
Thus, the specific goal of PRiSMHA is to
verify and demonstrate the feasibility of a solution based
on the integration of these two approaches.
3.1 Building the Ontology
PRiSMHA relies on two modular ontologies: a
top/core ontology called HERO (Historical Event
Representation Ontology), and a domain ontology
called HERO-900. Overall, the OWL 2 version of
HERO+HERO-900 counts more than 400 classes and
more than 350 properties; moreover, it is a strongly
axiomatized ontology (more than 4,000 logical
axioms).
We started from the definition of HERO,
representing the semantically rich common
vocabulary. “Common” means shared between:
The system, the users of the crowdsourcing
platform, and final users querying the digital
“smart archivist”;
Computer scientists and ontologists actually
designing and implementing the system, and
historians providing a historical, analytical
perspective on the documents.
This top-level semantic model contains concepts
such as place, time, event, organization, collective
entity, participant, different roles played in events,
etc. Table 1 shows the basic structure of HERO.
HERO is the result of the integration of an
analysis of existing models (Agora, 2018) (CIDOC-
CRM, 2018) (Raimond and Abdallah, 2007) (Doerr
et al., 2010) (van Hage et al., 2009) (Nanni et al.,
2017) (Sprugnoli and Tonelli, 2017) and the
outcomes of the dialog between computer scientists
and historians about the notion of event, its properties
(e.g., participation in events, roles played by
participants) and the relations between events (e.g.,
cause, influence).
Most of the existing models, the most famous of
which is probably CIDOC-CRM, are mainly
designed for representing production, preservation
and curation activities and have been employed in
several projects for describing document types,
creators, and geographical/temporal anchoring. Although
most of these models support the representation of
KMIS 2019 - 11th International Conference on Knowledge Management and Information Systems