where
\[
mp = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}, \qquad
mr = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)} \tag{2}
\]
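As an illustration only, the following is a minimal Python sketch of how these micro-averaged measures can be computed from the per-classifier counts; the function and variable names are our own, and we assume the F measure is the usual F1 (harmonic mean of precision and recall):

```python
from typing import Sequence

def micro_precision_recall(tp: Sequence[int],
                           fp: Sequence[int],
                           fn: Sequence[int]) -> tuple:
    """Micro-averaged precision (mp) and recall (mr), Eq. (2):
    the TP/FP/FN counts of the n classifiers are summed before dividing."""
    total_tp = sum(tp)
    mp = total_tp / (total_tp + sum(fp))
    mr = total_tp / (total_tp + sum(fn))
    return mp, mr

def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Hypothetical counts for three classifiers
mp, mr = micro_precision_recall(tp=[30, 5, 12], fp=[10, 2, 4], fn=[6, 3, 8])
mF = f_measure(mp, mr)   # micro-F
```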
The evaluation process described above will be carried out once we have selected the threshold $t_i$ for each classifier, and gives us an indirect idea of the quality of the threshold selection method being considered, by evaluating the system performance obtained after using this method. But first we have to select the most convenient threshold for each classifier. The candidate relevance thresholds $t_i$ considered in the search for the best one range from 0.1 to 0.9 with a step size of 0.1, independently of the approach used to estimate them.
For the experiments where we use the training set itself to estimate the best thresholds, we simply carry out the same process as before, but using the documents in the training set instead of those in the test set to compute the measures $F_i$ for each possible threshold, selecting the one offering the best F value.
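A sketch of this per-classifier grid search follows; the score and label arrays, and the helper itself, are illustrative names of our own, under the assumption that each classifier outputs a relevance score in [0, 1] for every document:

```python
import numpy as np

CANDIDATE_THRESHOLDS = [t / 10 for t in range(1, 10)]   # 0.1, 0.2, ..., 0.9

def best_threshold(scores: np.ndarray, labels: np.ndarray) -> tuple:
    """For one classifier MP_i, return the relevance threshold t_i (and its
    F value) that maximizes F_i on the documents used for the estimation
    (here, the classifier's own training documents)."""
    best_t, best_f = 0.5, -1.0
    for t in CANDIDATE_THRESHOLDS:
        pred = scores >= t                      # relevant iff score reaches t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        fn = int(np.sum(~pred & (labels == 1)))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```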
For the experiments where we use a validation set, we first randomly divide each training set to extract a new training subset (80% of the training instances) and a validation set (20% of the training instances). Then we build another set of classifiers from these training subsets and use them to evaluate the instances in the validation sets, once again computing, for each possible threshold, the measures $F_i$, and selecting the threshold which gets the best F value.
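A possible sketch of this split for a single MP, assuming scikit-learn is available (the function and argument names are ours):

```python
from sklearn.model_selection import train_test_split

def split_for_validation(documents, labels, seed: int = 0):
    """Split one MP's training data into a training subset (80%) and a
    validation set (20%).  A new classifier is then trained on the subset,
    its scores on the validation documents are computed, and the threshold
    search (see the previous sketch) is applied to those scores."""
    return train_test_split(documents, labels, test_size=0.2,
                            random_state=seed)
```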
In addition to using both the imbalanced and the balanced versions of the classifiers separately, we have also tried a combined method: for each MP$_i$ we evaluate (using either the validation set or the training set) both classifiers, obtain the best threshold for each one, $t_i^n$ and $t_i^b$, and select the classifier that gets the best results.
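A sketch of this combined selection for a single MP, reusing the hypothetical best_threshold helper from the sketch above (the two score arrays are again names of our own):

```python
import numpy as np

def select_version(scores_imbalanced: np.ndarray,
                   scores_balanced: np.ndarray,
                   labels: np.ndarray) -> tuple:
    """Estimate the best thresholds t_i^n and t_i^b for the imbalanced and
    balanced classifiers of one MP, and keep whichever version reaches the
    higher F value on the estimation documents."""
    t_n, f_n = best_threshold(scores_imbalanced, labels)
    t_b, f_b = best_threshold(scores_balanced, labels)
    if f_b > f_n:
        return "balanced", t_b
    return "imbalanced", t_n
```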
4.1 Results
The results of our experiments are summarized in Table 1. In addition to the experiments using the validation sets and the training sets themselves, we also display results of the baseline approach, which fixes the relevance thresholds at 0.5 for all the classifiers. We report results for the balanced, the non-balanced and the combined versions of the classifiers.
Regarding the baseline approach, we can observe that the results in terms of micro-F are quite similar for both the balanced and the imbalanced approaches (with a slight advantage for the latter), but the balanced approach is clearly better when using the macro-F measure. This seems to indicate that balancing the training data particularly improves the results of those MPs having fewer interventions (which are precisely those having more imbalanced training data). These MPs have the same importance as other MPs having more interventions from the point of view of computing MF, although they are less important when computing mF.
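A small numerical illustration of this difference in weighting, with entirely hypothetical counts for two MPs:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Hypothetical per-MP counts (TP, FP, FN)
counts = {"MP with many interventions": (90, 30, 30),   # F = 0.75
          "MP with few interventions":  (2, 4, 4)}      # F ~= 0.33

# Macro-F averages the per-MP F values: both MPs weigh the same (MF ~= 0.54)
MF = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro-F pools the counts first: the frequent MP dominates (mF ~= 0.73)
tp, fp, fn = (sum(col) for col in zip(*counts.values()))
mF = f1(tp, fp, fn)
```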
The results obtained when using a validation set are discouraging, as we systematically get worse results than the baseline (a worsening of between 3% and 8%). Therefore, although the use of a separate validation set is the standard practice to estimate the parameters of classifiers, in our case study this approach does not work properly. We believe that the reason may be the low number of documents in the validation sets associated with many MPs (only around 16% of the interventions of an MP will appear in her validation set$^5$), which is not enough to capture the characteristics of these MPs.
In order to overcome the problem of the low number of documents in the validation sets, we tried the same procedure with the whole training data. Our expectation is that the larger the number of documents, the better the thresholds will fit. Looking at the results in Table 1, we can corroborate that this assumption is mostly true in the non-balanced approach, where both the macro and micro F-measures improve with respect to the baseline results (by 9% and 5%, respectively). Regarding the micro F-measure, the improvement is most remarkable because we get the best result for this measure among all the classifiers. We think that this happens because the thresholds of the MPs with a strong training set are well estimated. Nevertheless, the macro F-measure is not so good in absolute terms. Perhaps this is due to the fact that, in this measure, we are giving the same importance to all the MPs, independently of the quality of their training set, and the thresholds of the MPs with a weak training set are not well estimated. The combined balanced/non-balanced approach does not, in this case, improve on the results of the non-balanced approach alone. Finally, the balanced approach once again obtains worse results than the baseline.
To put the results obtained using the validation and the training sets into perspective, we have also displayed in Table 1 the ideal results we could get with the balanced, non-balanced and combined approaches. These values are computed by selecting the best thresholds (and, in the last case, also choosing, for each MP, whether or not to balance their training data) on the basis of the results on the test set. These results show that the combined approach could be useful, at least in theory, if we were able to decide when the training data associated with an MP should be balanced or not:
$^5$For example, an MP having 20 interventions will have only 3 in the validation set.