of the classification is the identification of mathematical and textual formulas from the detected definition; in this way the class Definition can be split into two subclasses: Mathematical formula and Textual formula.
The Mathematical formula subclass covers detected descriptions that can be easily translated into a mathematical formula because they contain strategic words, such as adding, divided by, rate, ratio, etc.
The Textual formula subclass instead covers definitions that give a purely textual description of the metric needed to compute the value of a parameter included in the SLO, and that cannot be represented by a mathematical formula. It should be noted that the difference between mathematical and textual formulas is very subtle, and a classification algorithm based on training is not enough to distinguish one from the other. This motivates splitting the class Definition into two subclasses, for which another information extraction technique is required.
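The strategic-word recognition mentioned above can be sketched as a simple keyword spotter. This is an illustrative sketch only: the keyword list is an assumption based on the examples in the text, not the actual vocabulary used by the classifier.

```python
import re

# Strategic words suggesting the definition can be turned into a
# mathematical formula (taken from the examples in the text;
# the full list used in practice is an assumption).
STRATEGIC_WORDS = ["adding", "divided by", "rate", "ratio"]

def find_strategic_words(definition: str):
    """Return the strategic words occurring in a Definition sentence."""
    lowered = definition.lower()
    return [w for w in STRATEGIC_WORDS
            if re.search(r"\b" + re.escape(w) + r"\b", lowered)]

def is_mathematical(definition: str) -> bool:
    """A Definition counts as 'Mathematical formula' if any strategic word appears."""
    return bool(find_strategic_words(definition))
```

Word boundaries in the pattern keep substrings such as "rate" inside "operated" from producing false hits; a trained classifier alone, as noted above, would not capture this distinction reliably.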
4 SLAs TEXT RECOGNITION
In order to design and develop an automatic SLA classifier, some Natural Language Processing (NLP) techniques can be utilised. Such techniques aim to process the document containing an SLA and to obtain the information about the SLOs necessary to feed a possible dedicated service monitoring tool.
The principal open-source Information Extraction
(IE) (Grishman, 1997) tools are: Apache OpenNLP,
OpenCalais, DBpedia and GATE. The principal tasks
of OpenNLP are tokenisation, sentence segmenta-
tion, part-of-speech tagging, named entity extraction,
chunking, parsing and co-reference resolution. All these tasks are also available in OpenCalais and GATE, with a more usable graphical interface. OpenCalais
can annotate documents with rich semantic meta-data,
including entities, events and facts. However, the out-
put of the tool is text enriched by annotations, which
are not user-customisable.
We chose GATE for the implementation of the automatic classifier for SLAs. GATE performs all the tasks described for the other tools and has the advantage of being customisable: it allows the creation of customised annotation types through a personalised Java Annotation Patterns Engine (JAPE) transducer, and the annotations that the tool produces can themselves be customised. The GATE component most used for this automation is the Information Extraction (IE)
one. The input to the system is a dataset of SLA docu-
ments in many common use formats, such as Microsoft
Word and Adobe Portable Document Format (PDF).
The output of the classifier is composed of the same
documents but with annotations. The annotations are
added to some sentences of the documents and they
correspond to the identified classes described before
(Definition, Value and Not definition).
The implementation of the classifier works following two sequential steps:
1. Classification of the sentences of the document according to the three classes described before;
2. Identification of the mathematical and the textual formulas included in the Definition class from the previous step.
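The two sequential steps could be chained as follows. This is an illustrative Python stand-in for the GATE pipeline; the stage functions are simplistic placeholders, not the authors' classifier.

```python
def step1_classify(sentence: str) -> str:
    """Step 1 placeholder: assign one of the three classes to a sentence."""
    lowered = sentence.lower()
    # Toy rules standing in for the trained classification of Step 1.
    if "is defined as" in lowered or "means" in lowered:
        return "Definition"
    if any(ch.isdigit() for ch in sentence):
        return "Value"
    return "Not definition"

def step2_refine(sentence: str) -> str:
    """Step 2 placeholder: split a Definition into its two formula subclasses."""
    math_words = ("adding", "divided by", "rate", "ratio")
    if any(w in sentence.lower() for w in math_words):
        return "Mathematical formula"
    return "Textual formula"

def annotate(sentences):
    """Run the two sequential steps over a document's sentences:
    Step 2 only refines sentences that Step 1 labelled Definition."""
    labels = []
    for s in sentences:
        label = step1_classify(s)
        if label == "Definition":
            label = step2_refine(s)
        labels.append(label)
    return labels
```

The design point the sketch preserves is the sequencing: Step 2 never sees a sentence that Step 1 did not classify as Definition.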
4.1 Step 1: Sentence Classification
The ANNIE (A Nearly-New Information Extraction system) plug-in is the principal and most used component of GATE. It takes a document as input and returns it annotated as output. Step 1 is composed of a pipe of activities, where each activity depends on the output of the previous one and provides the input for the subsequent one. Almost every activity in this pipe is responsible for implementing and executing a specific information extraction technique. The whole pipe of phases composing Step 1 is described in the following:
1. 'Document Reset': this phase is responsible for deleting all the annotations already included in the document; it cleans the file so that it is ready for the whole annotation process;
2. 'Tokenisation': the tokeniser component in GATE implements a word segmentation technique. It divides the document into tokens, such as words, numbers, punctuation marks, etc. The tokeniser implementation in GATE uses regular expressions to give an initial annotation of tokens. Subsequently, for each token the following features are recognised: the 'string' itself; the 'kind', which is the set the token belongs to, such as word, number, symbol or punctuation; the 'orth', meaning the orthographical structure;
3. 'Gazetteer': this component is responsible for performing a Named Entity Recognition (NER) technique. It identifies the names of entities based on some lists. Such lists are simple text files with one entry per line. Each list represents a set of names belonging to a domain, such as cities, weekdays, organisations, etc. An index provides access to these lists. By default, the Gazetteer creates a special annotation named 'Lookup' for each entry found in the text. GATE also gives the possibility to create a gazetteer with personalised lists;
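The tokenisation and gazetteer phases can be sketched together as two stages of the pipe, each consuming the previous stage's output. This is a simplified Python stand-in for GATE's components: the feature names ('string', 'kind', 'orth', 'Lookup', 'majorType') follow GATE's conventions, while the regular expression and the gazetteer lists are illustrative assumptions.

```python
import re

# Initial token annotation via a regular expression, in the spirit of
# GATE's tokeniser: words, numbers, and single punctuation symbols.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\w\s]")

def tokenise(text):
    """Divide the text into tokens with 'string', 'kind' and 'orth' features."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        s = m.group()
        kind = "number" if s.isdigit() else "word" if s.isalpha() else "punctuation"
        tok = {"string": s, "kind": kind, "start": m.start(), "end": m.end()}
        if kind == "word":
            # Orthographical structure, mirroring GATE's 'orth' values.
            tok["orth"] = ("allCaps" if s.isupper()
                           else "upperInitial" if s[0].isupper()
                           else "lowercase")
        tokens.append(tok)
    return tokens

# Illustrative gazetteer lists: one domain per list, one entry per line
# in GATE's plain-text list files.
GAZETTEER = {
    "weekday": ["monday", "tuesday", "wednesday", "thursday", "friday"],
    "organization": ["amazon", "microsoft", "google"],
}

def gazetteer_lookup(tokens):
    """Create a 'Lookup' annotation for each token matching a list entry."""
    lookups = []
    for tok in tokens:
        for major_type, entries in GAZETTEER.items():
            if tok["string"].lower() in entries:
                lookups.append({"type": "Lookup", "majorType": major_type,
                                "start": tok["start"], "end": tok["end"]})
    return lookups
```

Real GATE gazetteers also match multi-word entries and use an index over the lists; the token-by-token matching here is a simplification of that behaviour.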
CLOSER 2016 - 6th International Conference on Cloud Computing and Services Science