
Messages Extraction. This stage consists of three sub-stages. In the first we identify
the message types that exist in the input documents, while in the second we fill in
the messages’ arguments with the instances of the ontology concepts identified in the
previous stage. The third sub-stage comprises the identification of the temporal
expressions that may exist in the text, and the normalization of the messages’ referring
time in relation to the document’s publication time.
In most cases there was a one-to-one mapping between message types and sentences.
We once again used an ML approach, with lexical and semantic features for the creation
of the vectors. As lexical features we used a fixed number of verbs and nouns occurring
in the sentences. Concerning the semantic features, we used two kinds of information.
The first was the number of occurrences of each top-level ontology concept found in the
sentences; the created vectors thus had eight numerical slots, one for each top-level
ontology concept. For the second semantic feature we used what we have called trigger
words: several lists of words, each list “triggering” a particular message type. We
allocated six slots (the maximum number of trigger words found in a sentence), each
of which represented the message type that was triggered, if any. To perform our
experiments we used the WEKA platform. The algorithms we used were again Naïve
Bayes, LogitBoost and SMO, varying their parameters during the series of experiments
that we performed. The best results were achieved with the LogitBoost algorithm,
using 400 boost cycles. More specifically, 78.22% of the message types were correctly
classified, after performing ten-fold cross-validation on the input vectors.
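The feature-vector layout described above can be sketched as follows. All names, trigger
words and concept labels here are illustrative assumptions, not taken from the actual
system; the sketch only shows how the fixed lexical slots, the eight concept counters and
the six trigger slots fit together.

```python
# Hypothetical top-level ontology concepts (the text only says there are eight).
TOP_LEVEL_CONCEPTS = ["concept_%d" % i for i in range(8)]

# Hypothetical trigger-word lists, flattened to word -> triggered message type.
TRIGGER_WORDS = {
    "win": "performance",
    "lose": "performance",
    "sign": "transfer",
}

def sentence_vector(verbs, nouns, concept_counts, tokens,
                    n_verbs=5, n_nouns=5, n_triggers=6):
    """Build a flat feature vector: fixed verb/noun slots, one counter per
    top-level concept, and up to six triggered-message-type slots."""
    vec = (verbs + [""] * n_verbs)[:n_verbs]           # fixed number of verbs
    vec += (nouns + [""] * n_nouns)[:n_nouns]          # fixed number of nouns
    vec += [concept_counts.get(c, 0) for c in TOP_LEVEL_CONCEPTS]  # 8 slots
    triggered = [TRIGGER_WORDS[t] for t in tokens if t in TRIGGER_WORDS]
    vec += (triggered + [""] * n_triggers)[:n_triggers]            # 6 slots
    return vec
```

A vector like this would then be handed to a classifier (in the paper, WEKA’s
LogitBoost) that predicts the sentence’s message type.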
The second sub-stage is the filling in of the messages’ arguments. To perform this
stage we employed several domain-specific heuristics which take into account the
results of the previous stages. It is important to note that although we have a
one-to-one mapping from sentences to message types, this does not necessarily mean
that a message’s arguments (i.e. the extracted instances of ontology concepts) will
appear in the same sentence; in some cases the arguments are found in neighboring
sentences. For that reason, our heuristics use a window of two sentences, before and
after the one under consideration, in which to search for the arguments of a message
if they are not found in the original sentence. The overall evaluation results for the
combination of these two sub-stages of the messages extraction stage are shown in
Table 2. As in the previous cases, we used ten-fold cross-validation for the
evaluation of the ML algorithms.
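The window-based argument search can be sketched as below. The function name, the
data layout and the preference order (current sentence first, then nearest neighbors)
are assumptions for illustration; the paper only specifies the two-sentence window.

```python
def find_argument(sentences, idx, instances_by_sentence, concept, window=2):
    """Search sentence `idx` first, then neighbors up to `window` sentences
    away, for an extracted instance of the required ontology concept.

    `instances_by_sentence` maps a sentence index to a list of
    (concept, instance) pairs found in that sentence.
    """
    # Visit offsets in the order 0, -1, +1, -2, +2: the original sentence
    # first, then increasingly distant neighbors on either side.
    offsets = [0] + [sign * d for d in range(1, window + 1)
                     for sign in (-1, 1)]
    for offset in offsets:
        j = idx + offset
        if 0 <= j < len(sentences):
            for found_concept, instance in instances_by_sentence.get(j, []):
                if found_concept == concept:
                    return instance
    return None  # argument not found within the window
```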
The last of the three sub-stages of the messages extraction stage is the identification
of the temporal expressions found in the sentences that contain the messages and
alter their referring time, as well as the normalization of those temporal expressions
in relation to the publication time of the document containing the messages. For
this sub-stage we adopted a module that was developed earlier. As mentioned
earlier in this paper, the normalized temporal expressions alter the referring time of the
messages, information which we use during the extraction of the Synchronic and
Diachronic Relations (SDRs).
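A minimal sketch of this kind of normalization, anchoring a few relative temporal
expressions to the document’s publication date, is shown below. The lookup table and
function are hypothetical; the actual module handles a richer set of expressions.

```python
from datetime import date, timedelta

# Hypothetical table of relative expressions and their day offsets from the
# document's publication date.
RELATIVE_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "tomorrow": 1,
    "last week": -7,
    "next week": 7,
}

def normalize(expression, pub_date):
    """Resolve a relative temporal expression to an absolute date, anchored
    at the document's publication date; return None if unrecognized."""
    offset = RELATIVE_OFFSETS.get(expression.lower())
    if offset is None:
        return None
    return pub_date + timedelta(days=offset)
```

The resulting absolute date replaces the message’s default referring time (the
publication date itself), which is the information later used when extracting SDRs.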