that has a training cost linear in the number of training instances and a testing cost linear in the number of testing instances and the number of classes in the taxonomy. Despite its simplicity, the obtained results are very competitive in comparison with other algorithms. Another advantage of the centroid-based approach is that it summarizes the characteristics of each class in a centroid vector. The advantage of this summarization is that it combines multiple prevalent features, even if these features are not simultaneously present in a single instance, which makes it possible to capture individual features present in only a few examples. Also, in terms of computational time, although its evaluation was not the main focus of this work, the centroid-based approaches proposed here clearly required less time and resources than the rule-based (HLCS) and Naive Bayes (GMND) approaches.
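For illustration only, a minimal Python sketch of the centroid-based scheme described above is given below; the names build_centroids and classify_instance and the use of dense NumPy vectors are assumptions made for this sketch, not the exact implementation evaluated in this work.

    import numpy as np

    def build_centroids(X, y):
        # One centroid per class: the mean of the feature vectors of its
        # training instances (cost linear in the number of training instances).
        return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

    def classify_instance(x, centroids):
        # Assign the instance to the class whose centroid is most similar,
        # measured here with cosine similarity (cost linear in the number of classes).
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        return max(centroids, key=lambda label: cosine(x, centroids[label]))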
On the other hand, centroid-based classifiers depend on a good set of examples for each class and can lead to wrong classifications if the partitioning of examples is unbalanced. Also, in the context of hierarchical classification, the use of the children's data to train the centroids of the higher classes of the hierarchy needs further investigation, because the average of the vectors of two child classes does not always truly represent the characteristics of the parent class. In a centroid-based approach it is important to ensure that the instances belonging to the same class are proportionally distributed between the training and testing partitions; if all examples of one class remain in the same partition, the centroid of that class will either never be trained or will have no examples to classify.
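A minimal sketch of this partitioning concern is shown below, assuming a simple per-class split (the helper stratified_split is hypothetical and not part of the proposed algorithm); whenever a class has more than one example, both partitions receive some of its instances.

    import numpy as np

    def stratified_split(y, test_ratio=0.3, seed=0):
        # Split each class separately so that its instances are proportionally
        # distributed between the training and testing partitions.
        rng = np.random.default_rng(seed)
        train_idx, test_idx = [], []
        for label in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == label))
            # Keep at least one instance on each side whenever the class has
            # more than one example.
            n_test = min(max(int(round(test_ratio * len(idx))), 1), len(idx) - 1)
            test_idx.extend(idx[:n_test])
            train_idx.extend(idx[n_test:])
        return np.array(train_idx), np.array(test_idx)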
As future work we highlight a deeper analysis of the centroid relations between parent and child classes in the hierarchy using different datasets. The algorithm can also be improved to support DAG taxonomies and to perform multiple paths of label prediction (MPL). Another approach to be investigated is the selection of a set of k centroids for every instance being classified, with the final centroid that predicts the instance's class chosen by election, in a way similar to the k-NN algorithm.
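As an illustration of this last idea (a sketch under assumptions: the name predict_by_election is hypothetical, and each class is allowed to contribute more than one centroid, e.g. one per sub-cluster of its examples), the k most similar centroids could be retrieved and the final class elected by majority vote, as in k-NN.

    import numpy as np
    from collections import Counter

    def predict_by_election(x, labeled_centroids, k=3):
        # labeled_centroids: list of (class_label, centroid_vector) pairs; a class
        # may appear more than once if it is represented by several centroids.
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        ranked = sorted(labeled_centroids, key=lambda lc: cosine(x, lc[1]), reverse=True)
        votes = [label for label, _ in ranked[:k]]
        # The final class is elected by majority among the k nearest centroids.
        return Counter(votes).most_common(1)[0][0]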
REFERENCES
Alves, R. T., Delgado, M. R., and Freitas, A. A. (2008).
Multi-label hierarchical classification of protein func-
tions with artificial immune systems. In Proceed-
ings of the 3rd Brazilian symposium on Bioinformat-
ics: Advances in Bioinformatics and Computational
Biology, BSB ’08, pages 1–12, Berlin, Heidelberg.
Springer-Verlag.
Barros, R. C., Cerri, R., Freitas, A. A., and de Carvalho,
A. C. P. L. F. (2013). Probabilistic clustering for hi-
erarchical multi-label classification of protein func-
tions. In Machine Learning and Knowledge Discovery in Databases (ECML 2013), Prague, Czech Republic, volume 8189 of Lecture Notes in Computer Science.
Blockeel, H., De Raedt, L., and Ramon, J. (1998). Top-
down induction of clustering trees. In Proceedings of
the 15th International Conference on Machine Learn-
ing, pages 55–63. Morgan Kaufmann.
Cerri, R., Barros, R. C., de Carvalho, A. C. P. L. F., and
Freitas, A. A. (2013). A grammatical evolution algo-
rithm for generation of hierarchical multi-label clas-
sification rules. In IEEE Congress on Evolutionary
Computation, pages 454–461. IEEE.
Enembreck, F., Scalabrin, E. E., Tacla, C. A., and Ávila, B. C. (2006). Automatic identification of teams based on textual information retrieval. In CSCWD, pages 534–538. IEEE.
Ferrandin, M., Nievola, J. C., Enembreck, F., Scalabrin, E. E., Kredens, K. V., and Ávila, B. C. (2013). Hierarchical classification using FCA and the cosine similarity function. In Proceedings of the 2013 International Conference on Artificial Intelligence (ICAI'13), volume 1, pages 281–287.
Filmore, D. (2004). It’s a GPCR world. Modern Drug Dis-
covery, 7:24–28.
Guan, H., Zhou, J., and Guo, M. (2009). A class-feature-
centroid classifier for text categorization. In Proceed-
ings of the 18th International Conference on World Wide Web, WWW '09, pages 201–210, New York, NY,
USA. ACM.
Han, E.-H. and Karypis, G. (2000). Centroid-based doc-
ument classification: Analysis and experimental re-
sults. In Proceedings of the 4th European Conference
on Principles of Data Mining and Knowledge Discov-
ery, PKDD '00, pages 424–431, London, UK.
Springer-Verlag.
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen,
F. E., and Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids
Research, 31(1):294–297.
Kiritchenko, S., Matwin, S., and Famili, A. F. (2005). Func-
tional annotation of genes using hierarchical text cat-
egorization. In Proceedings of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology (held at ISMB-05).
Otero, F. E. B., Freitas, A. A., and Johnson, C. G. (2010). A
hierarchical multi-label classification ant colony algo-
rithm for protein function prediction. Memetic Com-
puting, pages 165–181.
Rocchio, J. (1971). Relevance feedback in information re-
trieval. In Salton, G., editor, The SMART Retrieval
System - Experiments in Automatic Document Pro-
cessing, pages 313–323. Prentice Hall.
Romão, L. M. and Nievola, J. C. (2012). Hierarchical classification of gene ontology with learning classifier systems. In Advances in Artificial Intelligence - IBERAMIA 2012, volume 7637 of Lecture Notes in Computer Science.