Table 5: Data complexity measurements of the datasets.

Categories   Fisher Discriminant Ratio   VOR            J4        LOO error 1NN (%)   Imbalance
GO 0003677   1.162564308                 1.518414e-45   366.65    42                  1:7.68
GO 0003700   1.258898151                 8.325292e-43   153.09    54.7                1:10.76
GO 0003824   0.095424389                 1.503915e-39   114.67    41.3                1:2.74
GO 0005215   1.275657636                 1.974715e-67   3045.37   19.4                1:8.26
GO 0016787   0.004254501                 6.654359e-07   -0.472    53.9                1:4.63
GO 0030234   0.265168845                 1.247835e-26   14.712    79.3                1:23.87
GO 0030528   0.954652410                 1.125151e-37   255.43    37.6                1:7.22
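For reference, a minimal sketch of how two of these complexity measures can be computed, assuming the common maximum-over-features form of Fisher's discriminant ratio; the function names are illustrative, not taken from the paper:

```python
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Per-feature Fisher ratio f_i = (mu1_i - mu2_i)^2 / (s1_i^2 + s2_i^2);
    the dataset-level measure is the maximum over all features."""
    X1, X2 = X[y == 1], X[y == 0]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return float(np.max(num / den))

def imbalance_ratio(y):
    """Minority-to-majority ratio, reported in Table 5 as 1:r."""
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    return max(n_pos, n_neg) / min(n_pos, n_neg)
```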
Table 6: AUC and GM for each balancing strategy.

Categories   SMOTE          SPSO           CS             CSCu           MC             MCCu           Ada
             AUC     GM     AUC     GM     AUC     GM     AUC     GM     AUC     GM     AUC     GM     AUC     GM
GO 0003677   0.693   0.668  0.708   0.707  0.615   0.519  0.788   0.786  0.684   0.659  0.718   0.713  0.766   0.747
GO 0003700   0.654   0.599  0.721   0.721  0.679   0.629  0.821   0.817  0.617   0.566  0.668   0.655  0.773   0.744
GO 0003824   0.664   0.658  0.667   0.667  0.530   0.292  0.655   0.651  0.618   0.592  0.661   0.654  0.599   0.536
GO 0005215   0.778   0.752  0.811   0.810  0.643   0.562  0.829   0.823  0.803   0.788  0.839   0.835  0.812   0.766
GO 0016787   0.505   0.405  0.516   0.513  0.497   0.188  0.504   0.490  0.499   0.395  0.499   0.443  0.485   0.128
GO 0030234   0.568   0.429  0.663   0.642  0.618   0.613  0.699   0.686  0.515   0.205  0.617   0.518  0.675   0.502
GO 0030528   0.659   0.621  0.717   0.714  0.595   0.493  0.763   0.762  0.680   0.662  0.676   0.660  0.723   0.691
Total        0.646   0.590  0.686   0.682  0.596   0.470  0.723   0.717  0.630   0.552  0.668   0.640  0.690   0.588
the variance of the result. This may be because an adequate number of iterations for MetaCost was not chosen (10 iterations were used). MetaCost uses bootstrap resampling: in each iteration, a portion of the training set is drawn to create a subset; each subset is then used to train one base classifier (as many classifiers as iterations selected for the algorithm), and the final classification decision is made by a committee vote among the classifiers. When the number of iterations in MetaCost is not adequate and the dataset additionally has a substantial degree of imbalance, as in this case, the number of samples of interest, i.e., the samples belonging to the minority category available to each base classifier, may not be sufficient.
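To make the voting scheme concrete, the following is a minimal sketch of a MetaCost-style relabeling step (after Domingos, 1999). The subsample fraction `frac` and the base classifier are illustrative assumptions, not the exact configuration used in the experiments:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def metacost_relabel(X, y, cost, n_iter=10, frac=0.5, base=None):
    """Sketch of MetaCost-style relabeling for a binary problem with a
    2x2 cost matrix cost[true][pred]. Each of n_iter base classifiers is
    trained on a bootstrap subsample; their committee vote estimates
    P(j|x), and each training sample is relabeled with the class that
    minimizes the expected misclassification cost."""
    base = base or DecisionTreeClassifier()
    n = len(y)
    votes = np.zeros((n, 2))
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=True)  # bootstrap subset
        clf = clone(base).fit(X[idx], y[idx])
        votes[np.arange(n), clf.predict(X).astype(int)] += 1
    proba = votes / n_iter  # committee estimate of P(j|x)
    # Expected cost of predicting class k: sum_j P(j|x) * cost[j][k]
    exp_cost = proba @ np.asarray(cost)
    return exp_cost.argmin(axis=1)  # relabeled targets
```

In MetaCost proper, the final model is then retrained on the relabeled training set; with too few iterations and a strongly imbalanced dataset, each bootstrap subset may contain very few minority samples, which is the failure mode described above.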
In all categories, there are cases where some balancing techniques present GM values very similar to their AUC values, mainly for SPSO and CSCu. This occurs particularly when the numeric difference between sensitivity and specificity is small, i.e., when the values of sensitivity and specificity are close to each other. This fact can be corroborated with the relative sensitivity (RS) values (Su and Hsiao, 2007) shown in Table 7.
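The relationship can be made explicit with a standard argument, assuming the AUC here is estimated from a single operating point via the trapezoidal rule:

\[
\mathrm{GM} = \sqrt{\mathrm{Sens}\cdot\mathrm{Spec}} \;\le\; \frac{\mathrm{Sens}+\mathrm{Spec}}{2} \;\approx\; \mathrm{AUC},
\qquad \mathrm{RS} = \frac{\mathrm{Sens}}{\mathrm{Spec}},
\]

where, by the AM-GM inequality, GM reaches the AUC exactly when Sens = Spec, i.e., when RS = 1.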
Table 7: Relative sensitivity (RS) of each balancing strategy.

Categories   SMOTE   SPSO    CS       CSCu    MC      MCCu    Ada
GO 0003677   0.582   0.996   3.301    1.119   0.582   0.796   1.572
GO 0003700   0.427   1.067   2.185    1.217   0.433   0.676   1.701
GO 0003824   0.766   0.973   11.057   0.803   0.555   0.755   0.406
GO 0005215   0.593   1.002   2.885    0.800   0.688   0.842   0.762
GO 0016787   0.251   1.234   25.892   0.614   0.241   0.371   0.017
GO 0030234   0.208   1.652   0.783    0.677   0.044   0.297   0.269
GO 0030528   0.503   1.203   3.541    1.126   0.631   0.651   1.824
As can be seen, SPSO and CSCu are the techniques with the least bias in their classifications, with RS values closest to one. This indicates that precisely these two classifiers seek an equilibrium between sensitivity and specificity, a fact shown by the points in Figure 4 where AUC = GM. Contrary to what might be expected, SMOTE tends to be very specific, even though sampling techniques aim to become more sensitive by increasing the number of samples in the category with lower representation. It is noteworthy that both CS and MC obtained a quite substantial improvement when CuckooCost was used to optimize their parameters: initially CS was too sensitive but had low specificity, whereas MC showed the contrary behavior, with high specificity. When CuckooCost was used, the RS values of both strategies moved close to one, especially for CS.
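For completeness, a minimal sketch (a hypothetical helper, not from the paper) of how the reported GM and RS values can be computed from binary predictions:

```python
import numpy as np

def gm_and_rs(y_true, y_pred):
    """GM = sqrt(Sens * Spec) and RS = Sens / Spec (relative sensitivity,
    Su and Hsiao, 2007), for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return np.sqrt(sens * spec), sens / spec
```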
6 CONCLUSIONS AND FUTURE WORK
A method to optimize the free parameters associated with cost-sensitive learning, applied to the prediction of molecular functions in Embryophyta plants, was proposed, with the purpose of gaining direct control over the sensitivity and specificity of the classification (related to the costs involved in misclassifying samples belonging to each category). The optimization is carried out over the elements of the cost matrix, whose tuning was restricted to the elements outside the main diagonal, which together define the cost ratio. The variation of the cost ratio, along with the classification parameters, was used as the set of hyperparameters in the optimization problem, since this metric intrinsically modifies the fitness function. For this purpose, a metaheuristic optimization technique called Cuckoo Search was used. The methodology