the experiments are mainly based on symbolic pattern rules, due to the good performance empirically obtained with this methodology on this type of document.
To evaluate our current prototype, we used a sample of 144 Spanish notary acts. The data extracted from this type of document can be grouped into two different sets (a schematic sketch of the resulting structure follows the list):
• Document Parameters (Doc-Param in Figure 3), which are the data related to the document itself. These parameters are the title of the document, a protocol number given to the document, the date and location where the document was signed, and the notary's name.
• Person Parameters (Persons in Figure 3), which correspond to the data of the persons mentioned in the document. These parameters are name, surname, national identity number, marital status, address, region, and country.
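As an illustration of these two groups, the following Python sketch shows one possible in-memory representation of a single extraction result; the field names and the dataclass layout are our own illustrative assumptions and do not describe the actual data structures of the prototype.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DocParam:
        # Data related to the document itself (Doc-Param in Figure 3).
        title: Optional[str] = None
        protocol_number: Optional[str] = None
        date: Optional[str] = None
        location: Optional[str] = None
        notary_name: Optional[str] = None

    @dataclass
    class Person:
        # Data of one person mentioned in the document (Persons in Figure 3).
        name: Optional[str] = None
        surname: Optional[str] = None
        national_id: Optional[str] = None
        marital_status: Optional[str] = None
        address: Optional[str] = None
        region: Optional[str] = None
        country: Optional[str] = None

    @dataclass
    class NotaryActExtraction:
        # A notary act yields one set of document parameters and a list of persons.
        doc_param: DocParam = field(default_factory=DocParam)
        persons: List[Person] = field(default_factory=list)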
We assessed the performance of our approach using the standard measures in the field of Information Extraction (precision, recall, and F-measure). To calculate them, we took into account, for each document, the number of data items to be extracted, the number of items extracted, and the number of items retrieved correctly. The baseline was the application of a set of extraction rules based on regular expressions, without our ontology-based extraction approach and without our document cleaning methods. The baseline results were not very good because these documents were quite long (sometimes nearly one hundred pages) and full of data and names, so the baseline method frequently retrieved too many results, most of them erroneous.
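For reference, the three measures can be computed per document from the counts mentioned above (items to be extracted, items extracted, items extracted correctly); the function below is a minimal sketch of that computation, not the exact evaluation code used in the prototype.

    def evaluation_measures(expected: int, extracted: int, correct: int):
        # Precision: fraction of the extracted items that are correct.
        precision = correct / extracted if extracted else 0.0
        # Recall: fraction of the expected items that were correctly extracted.
        recall = correct / expected if expected else 0.0
        # F-measure: harmonic mean of precision and recall.
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    # Example: 20 items to extract, 25 extracted, 15 of them correct.
    print(evaluation_measures(expected=20, extracted=25, correct=15))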
We performed two experiments (see results in Figure 3). The first one (Exp. 1) measured the effect of introducing the ontologies (Steps 2 and 3) to guide the extraction, which led to a substantial improvement of the process. In the second experiment (Exp. 2), we added our ad-hoc combination of the spell checkers mentioned in Section 3.1 and a text cleaner to eliminate identifiers, page numbers, stamps, etc. With these new enhancements, our system achieved better results (an F-measure above 80%). Of course, the extraction rules were the same in all experiments in order to isolate influences derived from their quality.
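To illustrate the kind of text cleaning introduced in Exp. 2, the sketch below drops page numbers and stamp-like lines and strips identifier codes using simple regular expressions; the concrete patterns are our own hypothetical assumptions and only approximate the ad-hoc cleaner described above.

    import re

    # Hypothetical cleaning patterns; the real cleaner is tuned to Spanish
    # notary acts, these regexes only illustrate the idea.
    PAGE_NUMBER = re.compile(r"^\s*-?\s*\d{1,3}\s*-?\s*$")            # lone page numbers
    IDENTIFIER = re.compile(r"\b[A-Z]{2}\d{7,}\b")                    # long identifier codes
    STAMP_LINE = re.compile(r"^\s*(TIMBRE|PAPEL EXCLUSIVO).*$", re.I) # stamp headers

    def clean_text(raw: str) -> str:
        cleaned_lines = []
        for line in raw.splitlines():
            if PAGE_NUMBER.match(line) or STAMP_LINE.match(line):
                continue                         # drop the whole line
            cleaned_lines.append(IDENTIFIER.sub("", line))
        return "\n".join(cleaned_lines)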
We computed the results for the set of Document Parameters by averaging the results obtained for each of its data items (title, protocol number, etc.); within this group we obtained an average of 85% precision and 78% recall. Likewise, we computed the results for the set of Persons by averaging the results obtained for each of their attributes (name, address, etc.), obtaining 93% precision and 72% recall. In Figure 3, Global is the mean of both results.
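Concretely, the Global figures in Figure 3 follow from averaging the two groups:

\[
\mathrm{Precision}_{\mathit{Global}} = \frac{85\% + 93\%}{2} = 89\%, \qquad
\mathrm{Recall}_{\mathit{Global}} = \frac{78\% + 72\%}{2} = 75\%.
\]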
On the analyzed dataset, our proposed system obtained an average of 89% precision and 75% recall, which shows the interest of this proposal. We have also analyzed the erroneous results: the system could be enhanced by improving the extraction algorithms and by adjusting the preliminary cleaning and correction processes. We should continue exploring more types of documents, with more relationships and distinct entities; but it seems clear that having knowledge to guide the extraction has a very positive influence on this task for such legal documents, and it facilitates the maintenance and extension of the system.
5 STATE OF THE ART
The use of ontologies in the field of Information Extraction (Russell and Norvig, 1995) has increased in recent years. An ontology is defined as a formal and explicit specification of a shared conceptualization (Gruber, 1993). Thanks to their expressiveness, ontologies are successfully used to model human knowledge and to implement intelligent systems. Systems based on the use of ontologies for information extraction are called OBIE (Ontology-Based Information Extraction) systems (Wimalasuriya and Dou, 2010). The use of an ontological model as a guideline for the extraction of information from texts has been successfully applied in other works such as (Garrido et al., 2012; Kara et al., 2012; Garrido et al., 2013; Borobia et al., 2014).
Regarding legal issues, it is interesting to highlight works such as History Assistant (Jackson et al., 2003), which extracts rulings from court opinions and retrieves relevant prior cases from a citator database, combining natural language processing techniques with statistical methods. Another system, which classifies fragments of normative texts into provision types and extracts their arguments, was proposed by (Biagioli et al., 2005); that system was based on multiclass Support Vector Machine classification techniques and on Natural Language Processing techniques. More recently we find TRUTHS (Cheng et al., 2009), a system that modifies the classical Hobbs generic information extraction architecture (Appelt et al., 1993) to extract information from criminal case documents and fill in a template.
The definition of extraction ontologies in the con-
text of the Semantic Web was made in (Embley and