Table 5: Standard deviation of the F1 scores for all the 18
labels, considering the two training sets (Random and Fair).
k       Random   Fair    Difference
1000    0.194    0.157   0.037
2000    0.191    0.163   0.029
3000    0.193    0.163   0.030
4000    0.196    0.169   0.027
5000    0.191    0.171   0.020
10000   0.187    0.174   0.013
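As an illustration of the statistic reported in Table 5, the following minimal sketch computes a per-label F1 score and then the standard deviation of those scores across labels. The function names and the confusion counts are hypothetical and chosen only for the example; the use of the population standard deviation is also an assumption, since the text does not state which estimator was used.

```python
from statistics import pstdev


def f1_score(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_spread(per_label_counts):
    """Population standard deviation of the per-label F1 scores.

    Assumption: population (not sample) standard deviation.
    """
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_label_counts]
    return pstdev(scores)


# Hypothetical (tp, fp, fn) counts for three labels, not taken from the paper.
counts_even = [(8, 2, 2), (8, 2, 2), (8, 2, 2)]    # every label has F1 = 0.8
counts_uneven = [(9, 1, 1), (5, 5, 5), (1, 9, 9)]  # F1 = 0.9, 0.5, 0.1

print(f1_spread(counts_even))
print(f1_spread(counts_uneven))
```

A lower value means the labels are classified with more uniform quality, which is how the Difference column is read: a positive difference indicates that the per-label F1 scores vary less under the Fair training set than under the Random one.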
In this work, we presented and discussed a new algorithm
to generate fair subsets from unbalanced datasets.
The results of the ILP algorithm in the multi-label image
classification task showed consistent improvements
over random sub-selection of the original
training set, both in the global scope (macro
F1 score) and in the F1 score of the less frequent labels.
As future research directions, we envision the investigation
of the computational complexity of the
Fairer Coverage Problem and the application of our
method to different datasets. The HPA dataset is a
special case where we have a single characteristic for
each sample, but our method could easily be adapted
to select a fairer dataset from a more complex dataset,
i.e., containing more than one attribute. We also believe
the method will be useful when applied to large
datasets that cannot be used in full for the training
phase due to computational limitations.
The authors would like to thank the São Paulo
Research Foundation [grants #2015/11937-9,
#2017/12646-3, #2020/16439-5]; Coordination for
the Improvement of Higher Education Personnel;
and the National Council for Scientific and
Technological Development [grants #304380/2018-0,
#306454/2018-1, #309330/2018-1, #161015/2021-2].
