5.2 Results and Discussion
In this section we present the results obtained by our system. As shown in Table 3, the results vary depending on the collection used for evaluation. It is worth emphasizing that we performed scope identification with automatic cue recognition, so the input to our system is the plain sentence, without any extra information. In other words, we use neither gold cues (which would mean the system does not need to find where the cue is and which cue it is) nor gold trees (which would mean the parse tree is given and guaranteed to be correct).
Table 3: Results of our speculation scope finding system, evaluated with the three collections of BioScope (PCS: percentage of correct scopes; PCHC: percentage of correct hedge cues).

Collection  Precision  Recall   F1       PCS      PCHC
Papers      82.78%     73.88%   78.08%   39.43%   80.38%
Abstracts   87.96%     75.35%   81.14%   46.75%   79.50%
Clinical    83.96%     67.15%   74.62%   36.20%   67.19%
Average     86.70%     74.62%   79.54%   43.96%   77.26%
As can be observed in Table 3, one of the main problems is correctly detecting the hedge cue. A major reason is that some wordforms do not always act as cues. For instance, the wordform "can" is not commonly used as a hedge cue in the papers collection (only 31.57% of its occurrences), but it acts as a hedge cue more frequently in the abstracts collection. Hedge cue classification is therefore a genuinely difficult task, as some candidate cues are not always used as hedge cues. In systems like ours, we must decide which cues to include in the lexicon and which to leave out, and mistakes in this decision cost accuracy, because the scopes of speculative sentences containing these uncommon hedge cues are not correctly annotated. A sketch of such a decision is shown below.
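To make this concrete, the following minimal sketch (our own illustration, not the actual system code) shows one way to decide which wordforms to keep in a hedge-cue lexicon, based on how often each form is actually annotated as a cue in a reference corpus. The function name, data layout, and threshold are assumptions made for the example.

# Hypothetical sketch: selecting wordforms for a hedge-cue lexicon by
# how often they are actually annotated as cues in a reference corpus.
from collections import Counter

def build_cue_lexicon(annotated_sentences, min_cue_ratio=0.5):
    """annotated_sentences: iterable of (tokens, cue_flags) pairs, where
    cue_flags[i] is True if tokens[i] is annotated as a hedge cue."""
    as_cue, total = Counter(), Counter()
    for tokens, cue_flags in annotated_sentences:
        for token, is_cue in zip(tokens, cue_flags):
            form = token.lower()
            total[form] += 1
            if is_cue:
                as_cue[form] += 1
    # Keep a wordform only if it acts as a cue often enough.
    return {form for form in as_cue
            if as_cue[form] / total[form] >= min_cue_ratio}

With min_cue_ratio set to 0.5, an ambiguous form such as "can", which acts as a cue in only about 32% of its occurrences in the papers collection, would be left out of the lexicon.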
5.3 Comparison with State-of-the-Art Systems
In this section we compare the results of our system with those of the best state-of-the-art systems, Morante and Daelemans (2009) and Zhu et al. (2010). The main comparison is shown in Table 4.
Our system does not need any training, so we tested it on the whole corpus. Morante et al. performed 10-fold cross-validation experiments on the abstracts collection; for the other two collections, they trained on the abstracts set and tested on the corresponding collection. It is also worth noting that the results they obtain for the abstracts collection are very high compared with their results for the other collections, probably because their system was trained on the abstracts collection.
Table 4: Results of our system, evaluated with the three collections of BioScope and compared with the systems of Morante et al. and Zhu et al.

Collection  System          Precision  Recall   F1       PCS      PCHC
Papers      Our results     82.78%     73.88%   78.08%   39.43%   80.38%
            Morante et al.  67.97%     53.16%   59.66%   35.92%   92.15%
            Zhu et al.      56.27%     58.20%   57.22%   –        –
Abstracts   Our results     87.96%     75.35%   81.14%   46.75%   79.50%
            Morante et al.  85.77%     72.44%   78.54%   65.55%   96.03%
            Zhu et al.      81.58%     73.34%   77.24%   –        –
Clinical    Our results     83.96%     67.15%   74.62%   36.20%   67.19%
            Morante et al.  68.21%     26.49%   38.16%   26.21%   64.44%
            Zhu et al.      70.46%     25.59%   37.55%   –        –
Average     Our results     86.70%     74.62%   79.54%   43.96%   77.26%
            Morante et al.  83.37%     61.51%   68.71%   54.68%   89.58%
            Zhu et al.      76.55%     62.54%   67.41%   –        –
The ways hedge scopes are annotated in the abstracts collection and the clinical reports collection are quite different, which leads to a loss of accuracy in those cases. The scientific papers collection is annotated more similarly to the abstracts collection, but some cues that are infrequent in the abstracts collection, such as "would", do appear in the scientific papers collection.
As can be observed in the results for the clinical reports collection, the differences in recall are larger than in the other cases. Judging from their precision, the system of Morante et al. correctly classifies most of the tokens it marks, but its low recall shows that it detects very few of them. Note that a system that included all the correct tokens except one for each scope would obtain very high precision and recall, yet a PCS of zero. This suggests that our system leaves out a few tokens of each scope but correctly includes most of them. We conclude that their results derive largely from training the models on the abstracts collection: this deeply affects recall on the clinical reports collection, because it contains somewhat different hedge cues and, more importantly, uses them in a different way. For us the problem is less severe, because we use a configurable lexicon of wordforms that is the same for the three collections and includes all the wordforms that appear in the whole corpus. The sketch below illustrates the contrast between the token-level measures and PCS.
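The following toy example (ours, not the evaluation code of any of the systems compared) contrasts token-level precision and recall with PCS; the scope pairing and the data are assumptions made purely for illustration.

def token_prf(gold_scopes, pred_scopes):
    """Each scope is a set of token indices; gold and predicted scopes
    are paired by position for simplicity."""
    tp = sum(len(g & p) for g, p in zip(gold_scopes, pred_scopes))
    pred_total = sum(len(p) for p in pred_scopes)
    gold_total = sum(len(g) for g in gold_scopes)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def pcs(gold_scopes, pred_scopes):
    """A scope counts as correct only if it matches the gold scope exactly."""
    exact = sum(1 for g, p in zip(gold_scopes, pred_scopes) if g == p)
    return exact / len(gold_scopes) if gold_scopes else 0.0

# Two scopes; the prediction misses exactly one token of each.
gold = [set(range(0, 10)), set(range(3, 12))]
pred = [set(range(0, 9)), set(range(3, 11))]
print(token_prf(gold, pred))  # (1.0, ~0.89, ~0.94): high token-level scores
print(pcs(gold, pred))        # 0.0: no scope is exactly correct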
In Table 5 we show the percentage of correct scopes (PCS) per speculation cue, for hedge cues that occur 20 or more times in one of the subcorpora, compared with the results of Morante et al.
The differences in the PCS measure show that their system correctly annotates more complete scopes than ours, but our precision and recall results show that we classify more correct tokens within the scope of speculation. Morante et al. use machine learning to predict the correct hedge cue, whereas we rely only on a lexicon.