A Centroid-based Approach for Hierarchical Classification
Mauri Ferrandin¹, Fabrício Enembreck², Júlio César Nievola², Edson Emílio Scalabrin² and Bráulio Coelho Ávila²
¹Engenharia de Controle e Automação, Universidade Federal de Santa Catarina-UFSC, Campus Blumenau, Rua Pomerode, 710, 89065-300, Blumenau, SC, Brazil
²Programa de Pós-Graduação em Informática, Pontifícia Universidade Católica do Paraná-PUCPR, Rua Imaculada Conceição, 1155, Prado Velho, 80215-901, Curitiba, PR, Brazil
Keywords:
Data Mining, Hierarchical Classification, Centroid Classification.
Abstract:
Classification is a common task in Machine Learning and Data Mining. Some classification problems must take into account a hierarchical taxonomy that establishes an order among the involved classes; these are called hierarchical classification problems. Protein function prediction can be considered a hierarchical classification problem because protein functions may be arranged in a hierarchical taxonomy of classes. This paper presents an algorithm for hierarchical classification using a centroid-based approach, in two versions named HCCS and HCCSic, respectively. Centroid-based techniques have been widely used for text classification, and in this work we explore their adoption in a hierarchical classification scenario. The proposed algorithm was evaluated on eight real datasets and compared against two other recent algorithms from the literature. Preliminary results showed that the proposed approach is a viable alternative for hierarchical classification, having as main advantages its simplicity and low computational complexity combined with good accuracy.
1 INTRODUCTION
Classification is one of the most important problems in Machine Learning and Data Mining. Classification consists of associating one or more classes, from a set of predefined classes, to an unclassified example (instance) from a database. The features (attributes) of each example determine the classes it will be associated with.
The prediction of protein functions can be considered a classification problem. The set of different proteins is the example database, and the set of biological functions comprises the classes that can be associated with each example. Protein functions may be arranged in a hierarchical taxonomy of classes, so the prediction of these functions is considered a hierarchical classification problem.
Centroid-based classifiers have been largely applied to text categorization problems, showing good accuracy with low computational cost due to the simplicity of their information representation, which nevertheless retains the capacity to summarize the main aspects present in the training examples.
Given the wide diversity of hierarchical classification problems, specific algorithms in this area are being developed. This paper presents an algorithm for hierarchical classification that uses centroid-based techniques, adapted from the centroid-based algorithms used for text categorization. Experiments with biological datasets were carried out and the obtained results were compared with two other approaches - GMNB (Silla and Freitas, 2009) and HLCS (Romão and Nievola, 2012) - that were proposed to address the same hierarchical classification problem.
The remainder of this paper is organized as fol-
lows: Section 2 presents background on hierarchical
classification and centroid-based classifiers. Section
3 shows the related works in the subject of this pa-
per. Section 4 discusses the new proposed algorithms
for hierarchical classification. Section 5 presents the
experimental setup and reports the computational re-
sults obtained with the algorithms proposed in this pa-
per. Conclusions and some perspectives about future
works are stated in Section 6.
2 BASIC FOUNDATIONS REVIEW
This section presents an overview about hierarchical
classification problems and the main centroid-based
techniques that will be used in this work.
Ferrandin M., Enembreck F., Nievola J., Scalabrin E. and Ávila B.
A Centroid-based Approach for Hierarchical Classification.
DOI: 10.5220/0005339000250033
In Proceedings of the 17th International Conference on Enterprise Information Systems (ICEIS-2015), pages 25-33
ISBN: 978-989-758-096-3
Copyright © 2015 SCITEPRESS (Science and Technology Publications, Lda.)
2.1 Hierarchical Classification
A hierarchical classification problem has as main
characteristic a taxonomy that imposes a hierarchical
order between the set of classes present in the dataset.
This order is represented by (C, ≺), an "IS-A" relationship among the classes that is asymmetric, anti-reflexive and transitive, where:
- the only greatest element R is the root of the tree;
- ∀ ci, cj ∈ C, if ci ≺ cj then cj ⊀ ci;
- ∀ ci ∈ C, ci ⊀ ci;
- ∀ ci, cj, ck ∈ C, ci ≺ cj and cj ≺ ck imply ci ≺ ck.
According to (Silla and Freitas, 2011b), three main features distinguish hierarchical classification problems. The first is the type of hierarchical class taxonomy, which may be represented as a tree or as a directed acyclic graph (DAG). Figure 1 represents both types of taxonomies with a tree - Figure 1 (a) -
and a DAG - Figure 1 (b). In the representation, the "IS-A" relationship states that one instance that belongs to class 2.2.1 also belongs to classes 2.2, 2 and the root (R). When the taxonomy is represented as a DAG, the scenario is even more complex because one class can have more than one parent node; considering the representation in Figure 1 (b), the two classes named 1.2/2.1 and 2.2.1/2.3.1 have more than one parent and, as a consequence, they also belong to the classes of all of their ancestor nodes in different branches of the DAG.
[Figure 1: Different types of hierarchical class taxonomies. (a): tree-structured, with classes 1, 1.1, 1.1.1, 1.1.2, 1.2, 2, 2.1, 2.2, 2.2.1 and 2.2.2 under the root; (b): DAG-structured, including the multi-parent classes 1.2/2.1 and 2.2.1/2.3.1.]
The second characteristic of the hierarchical clas-
sification problems is related to how deep the classifi-
cation is performed in the hierarchy. That is, the hier-
archical classification method can be implemented to
always predict classes that are in the leaf nodes of the
taxonomy - this approach is named as mandatory leaf
node prediction (MLNP) - or the method can consider
stopping the classification at any node from any level
of the taxonomy - approach named non-mandatory
leaf node prediction (NMLNP).
The third criterion considers how the hierarchi-
cal structure of the taxonomy is explored. The ex-
ploration could be local, when the system employs a
set of local classifiers (i.e. one classifier per class is
induced); global, when a single classifier is used to
represent the entire class taxonomy; or flat, when the classifiers ignore the relationships among the classes, typically predicting only classes represented in the leaf nodes.
The algorithms for hierarchical classification can be characterized by the three cited features and also by an additional property that indicates the capability of making single or multi-label predictions. Considering that the classes are organized by a taxonomy, these capabilities are named Single Path of Labels (SPL) and Multiple Paths of Labels (MPL), respectively.
2.1.1 Performance Measures for Hierarchical
Classifiers
There are different measures used to evaluate the per-
formance of hierarchical classifiers. The most ac-
cepted and used approach is an adaptation of tradi-
tional measures for classifiers known as Precision,
Recall and F-Measure. In the context of a hierar-
chical classification problem, for each dataset the fi-
nal values of hierarchical Precision (hP), hierarchical
Recall (hR) and hierarchical F-measure (hF) are obtained by Equation 1, according to the proposal of (Kiritchenko et al., 2005). In Equation 1, following the authors, we assume that an instance i belongs to a set of classes C_i and receives a set of predicted classes C'_i. The extended sets \hat{C}_i and \hat{C}'_i represent, respectively, the classes in C_i and C'_i with the addition of all ancestor classes of each set, considering the taxonomy.

hP = \frac{\sum_i |\hat{C}_i \cap \hat{C}'_i|}{\sum_i |\hat{C}'_i|}, \quad hR = \frac{\sum_i |\hat{C}_i \cap \hat{C}'_i|}{\sum_i |\hat{C}_i|}, \quad hF = \frac{2 \cdot hP \cdot hR}{hP + hR}   (1)
Although no hierarchical classification measure
can be considered the best one in all possible hi-
erarchical classification scenarios and applications,
the main reason for recommending the hP, hR and
hF measures is that, broadly speaking, they can be
effectively applied to any hierarchical classification
scenario; i.e., tree-structured, DAG-structured, SPL,
MPL, MLNP or NMLNP problems (Silla and Freitas,
2011b).
2.2 Centroid-based Classification
Centroid-based approaches have been widely used
in text categorization problems. In a centroid-based
classification algorithm, the documents are repre-
sented using the vector-space model (Salton, 1989).
According to (Han and Karypis, 2000), in this model,
ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems
26
each document is considered to be a vector in the term-space. In its simplest form, each document is represented by the term-frequency (TF) vector, as shown in Equation 2, where tf_i is the frequency of the i-th term in the document.

d_{tf} = (tf_1, tf_2, \ldots, tf_n)   (2)
In addition, the inverse document frequency (IDF) is commonly used as a refinement to account for the fact that terms appearing frequently in many documents have limited discrimination power, and for this reason they need to be de-emphasized. This is done in (Salton, 1989) by multiplying the frequency of each term i by log(N/df_i), where N is the total number of documents in the collection, and df_i is the number of documents that contain the i-th term (i.e., its document frequency).
The tf-idf representation explained above leads to the representation of a document shown in Equation 3. Finally, to deal with documents of different lengths, the vectors are normalized so that ||d_{tfidf}||_2 = 1.

d_{tfidf} = (tf_1 \log(N/df_1), tf_2 \log(N/df_2), \ldots, tf_n \log(N/df_n))   (3)
Considering a set S of documents belonging to a class x, each represented by its tf-idf vector, a centroid C_x is obtained as the average of the documents, as shown in Equation 4.

C_x = \frac{1}{|S|} \sum_{d \in S} d   (4)
The classification process consists of computing a centroid for each class in the training dataset. If there are k classes in the training set, this leads to a set of centroid vectors {C_1, C_2, \ldots, C_k}, where each C_i is the centroid of the i-th class. The class of a new document x is determined as follows. First, we use the document frequencies of the various terms computed from the training set to build the tf-idf weighted vector-space representation of x. Then, we compute the similarity between x and all centroids using the cosine measure, as shown in Equation 5.

\cos(x, C) = \frac{x \cdot C}{\|x\| \, \|C\|}   (5)
Finally, based on the obtained similarity measures, we assign the example x being classified to the class corresponding to the most similar centroid. This test phase is represented by Equation 6.

\arg\max_{j=1,\ldots,k} \cos(x, C_j)   (6)
Although the centroid-based classifier approach is considered very simple, it has the advantage that the computational complexity of its learning phase is linear in the number of documents and the number of terms in the training set, and the amount of time required to classify a new document is at most O(km), where k is the number of classes and m is the number of terms present in x. Thus, the overall computational complexity of this algorithm is very low, and it is identical to that of fast document classifiers such as Naive Bayes (Han and Karypis, 2000).
3 RELATED WORKS
In this Section we present the main works in the areas related to this paper: firstly, the works related to hierarchical classification, covering the historical and main works on the subject; and secondly, the main works related to centroid-based classification.
3.1 Related Works in Hierarchical
Classification
There is a great number of studies and proposals addressing hierarchical classification problems. Some of them were created based on other proposals addressing text classification or other domains. These works employed many different approaches, considering the construction of both local and global classification systems. Below, the most recent and remarkable works are presented.
In the paper by (Vens et al., 2008), the authors developed a hierarchical classification model named Clus for the DAG structure, using the global approach, capable of making multi-label hierarchical classification (HMC). In this work the authors also discuss two other versions of Clus proposed before: single-label classification (SC) and hierarchical single-label classification (HSC). For the development of these classifiers the authors used the induction of decision trees, firstly supporting only taxonomies represented by a tree and afterwards showing how this model can be modified for use in hierarchical DAG structures. For the induction of decision trees,
a framework named predictive clustering trees (PCT)
(Blockeel et al., 1998) was used. The PCT views a
decision tree as a hierarchy of clusters: the top-node
corresponds to one cluster containing all data, which
is recursively partitioned into smaller clusters while
moving down the tree. PCTs are constructed so that
each split maximally reduces intra-cluster variance.
In the work of (Silla and Freitas, 2009) the authors proposed a method that extends the flat classification algorithm Naive Bayes to hierarchical classification. The authors compared the results of one local and two global versions of the algorithm using biological data for protein function prediction. The results of the global approach, named Global Multi-label Naive Bayes (GMNB), will be compared with the methods proposed in this work. Later, in (Silla and Kaestner, 2013), the authors also evaluated the GMNB performance in a different domain, predicting bird species in the presence of a taxonomy of species.
The proposal of (Romão and Nievola, 2012) adapted Learning Classifier Systems (LCS) in order to predict protein functions. The proposed approach, called HLCS (Hierarchical Learning Classifier System), builds a global classifier to predict all classes in the application domain, and it is expressed as a set of IF-THEN classification rules.
Many different techniques were tested in different models; among them, the following stand out: a model named Multi-Label Hierarchical Classification with an Artificial Immune System (MHC-AIS), which generates IF-THEN rules and was presented in two main versions exploring both local and global approaches (Alves et al., 2008); and the use of ant colony optimization to predict classes in hierarchical classification problems (Otero et al., 2010).
A global method called Grammatical Evolution for Hierarchical Multi-label classification (GEHM) was proposed in (Cerri et al., 2013). The approach makes use of grammatical evolution for generating hierarchical multi-label classification rules. In this approach, the grammatical evolution algorithm evolves the antecedents of classification rules in order to assign instances from a hierarchical multi-label classification dataset to a probabilistic class vector. The method is compared to bio-inspired algorithms on protein function prediction datasets. The empirical analysis conducted in the work showed that GEHM outperforms the bio-inspired algorithms with statistical significance, suggesting that grammatical evolution is a promising alternative to deal with hierarchical multi-label classification of biological datasets.
The authors of (Barros et al., 2013) developed
a hierarchical multi-label classification algorithm
for protein function prediction, named Hierarchical
Multi-label Classification with Probabilistic Cluster-
ing (HMC-PC) that was based on probabilistic clus-
tering making use of cluster membership probabil-
ities in order to generate the predicted class vector.
An extensive empirical analysis was performed comparing the proposed approach to four different hierarchical multi-label classification algorithms on protein function datasets structured both as trees and DAGs. The presented results showed that HMC-PC achieves superior or comparable results when compared to state-of-the-art methods for hierarchical multi-label classification.
Finally, (Ferrandin et al., 2013) presented an algorithm for hierarchical classification using the global approach, called Hierarchical Multi-label Classifier System using Formal Conceptual Analysis and Similarity of Cosine (HMCS-FCA-SC), for hierarchical multi-label classification. The proposed algorithm combined FCA techniques and cosine similarity for hierarchical classification.
For more content about hierarchical classification and related works, we suggest the survey by (Silla and Freitas, 2011b), a broad review of hierarchical classification that presents the major theories on the subject, categorizes the algorithms and proposes a standard nomenclature for classifying the works in this field of research.
3.2 Related Works in Centroid-based
Classification
The initial adoption of a centroid-based classifier was in information retrieval and text classification, with the work of (Rocchio, 1971). A centroid-based classifier applied to text classification using tf-idf vectors to represent documents is known as the Rocchio classifier.
In (Han and Karypis, 2000) experiments with text
categorization showed that the centroid-based clas-
sifier outperformed other algorithms such as Naive
Bayesian, k-nearest-neighbours, and C4.5, on a wide
range of datasets. The analysis showed that the sim-
ilarity measure used by the centroid-based scheme
allows it to classify a new document based on how
closely its behaviour matches the behaviour of the
documents belonging to different classes.
A method to improve the centroid-based classi-
fication accuracy was proposed by (Theeramunkong
and Lertnattee, 2001) considering a number of
statistical term weighting systems based on term-
distribution, including factors of intra-class, inter-
class, overall term frequency distribution and term
length normalization. A number of experiments using drug information web pages and newsgroup datasets were carried out. The results showed that the method outperforms standard tf-idf centroid-based, k-nearest neighbor and naive Bayesian classifiers to some extent.
d_{tfidfic} = \left( tf_1 \log(N/df_1) \frac{tfic_1}{tf_1}, \; tf_2 \log(N/df_2) \frac{tfic_2}{tf_2}, \; \ldots, \; tf_n \log(N/df_n) \frac{tfic_n}{tf_n} \right)   (7)

(Tibshirani et al., 2002) used a centroid-based classifier for cancer class prediction from gene expression profiling. The method, called nearest shrunken centroids, identifies subsets of genes that best characterize each class. The method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.
The authors of (Enembreck et al., 2006) used a centroid-based approach for identifying the people who have the most appropriate competencies to form a research and development team. The selection was done through the analysis of the curricula vitae of the candidate researchers.
For (Tan, 2008), in the context of text categorization, centroid-based classifiers proved to be a simple and yet efficient method, but they often suffer from the inductive bias or model misfit incurred by their assumptions. In order to address this issue, the author proposed a novel batch-updated approach to enhance the performance of centroid-based classifiers. The main idea behind this method is to take advantage of training errors to successively update the classification model in batches. The technique is simple to implement and flexible for text data. The experimental results indicate that the technique can significantly improve the performance of centroid-based classifiers.
A fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization, in which a centroid is built from two important class distributions (the inter-class term index and the inner-class term index), was proposed by (Guan et al., 2009). CFC proposes a novel combination of these indexes and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments showed that CFC consistently outperformed state-of-the-art SVM classifiers and is more effective and robust than SVM when data is sparse.
4 PROPOSED ALGORITHM
Considering the centroid-based techniques employed for text classification presented in Section 2.2, an adaptation was made to allow them to deal with hierarchical classification. Initially, only the tf-idf weighting of the attributes was used; this classifier was named Hierarchical Centroid-based Classifier System (HCCS) and is shown in Algorithm 1.
Every instance in the training partition receives the same treatment as a document in the text classification process, so every attribute of the instance is weighted using tf-idf (line 3), according to Equation 3. To deal with the relationships among the classes, every instance vector is added to the centroid vectors of all classes it belongs to regarding the taxonomy, i.e. the class that is explicitly assigned to the instance and all its ancestors in the taxonomy. In this way, the centroid of a parent class will be the average of the centroids of all its children classes (line 4) and of the instances directly assigned to this parent class. The average of the centroids (line 6) is computed using Equation 4.

Algorithm 1: HCCS.
Require: The sets of instances for training (TR) and testing (TE); the class taxonomy H.
1: Initialize a set of centroids for the classes in H;
2: for each (tr_i ∈ TR) do
3:   Represent the attributes of tr_i as a tf-idf vector trv_i;
4:   Add trv_i to the centroid of the class it belongs to and to the centroids of its ancestors in H;
5: end for
6: Compute the average for all centroids;
7: for each (te_i ∈ TE) do
8:   Represent the attributes of te_i as a tf-idf vector tev_i;
9:   Find the centroid most similar to tev_i;
10:  Predict the class of the chosen centroid for te_i;
11: end for
12: Compute the results of the classification process;

The testing phase of HCCS consists of finding the most similar centroid for every test instance; the selection (line 9) is done according to Equation 6. Finally, the classifier hits and misses are computed (line 12) through the measures hP, hR and hF, respectively, shown in Equation 1.
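A minimal sketch of the HCCS training and prediction loops, assuming the taxonomy is again a child-to-parent dictionary and the instance vectors are already tf-idf weighted (names are illustrative, not the authors' code):

```python
from collections import defaultdict
import numpy as np

def hccs_train(vectors, labels, parent):
    """Accumulate each vector into its class centroid and into the
    centroids of all ancestor classes, then average (lines 2-6)."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for v, cls in zip(vectors, labels):
        node = cls
        while node is not None:           # the class itself and every ancestor
            sums[node] = v.copy() if sums[node] is None else sums[node] + v
            counts[node] += 1
            node = parent.get(node)
    return {c: sums[c] / counts[c] for c in sums}

def hccs_predict(v, centroids):
    """Return the class of the most cosine-similar centroid (lines 8-10)."""
    def cos(a, b):
        return float(a @ b) / max(np.linalg.norm(a) * np.linalg.norm(b), 1e-12)
    return max(centroids, key=lambda c: cos(v, centroids[c]))
```

Note that, as in the NMLNP setting described above, the predicted class may be an internal node of the taxonomy whenever its averaged centroid is the closest one.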
An improved version of HCCS named HCCSic (with ic meaning intra-class) was created by adapting the tf-idf weighting of the attributes to consider the intra-class attribute frequency. This variant uses the same steps represented in Algorithm 1, except for line 3, where the weighting of the attributes is done as shown in Equation 7, where tfic_i is the frequency of attribute i in all training instances belonging to the same class as instance d.
The main intention of using the intra-class frequency in HCCSic is to weight the attributes considering their frequency among the instances of the same class. In summary, an attribute that is very frequent among all instances of the same class will have a bigger weight.
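The HCCSic weighting of Equation 7 can be sketched as below; this is our reading of the formula (the function name and array layout are assumptions). Since the instance's own tf_i cancels in the product, a nonzero entry effectively receives the weight log(N/df_i) * tfic_i.

```python
import numpy as np

def tfidf_ic(tf, y):
    """Equation 7: tf_i * log(N/df_i) * (tfic_i / tf_i) per instance,
    where tfic_i sums attribute i over all instances of the same class."""
    n = len(tf)
    df = np.count_nonzero(tf, axis=0)
    idf = np.log(n / np.maximum(df, 1))
    w = np.zeros_like(tf, dtype=float)
    for cls in np.unique(y):
        mask = (y == cls)
        tfic = tf[mask].sum(axis=0)        # intra-class frequency per attribute
        # tf_i cancels where tf_i > 0; attributes absent from the instance stay 0
        w[mask] = np.where(tf[mask] > 0, idf * tfic, 0.0)
    return w
```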
5 EXPERIMENTAL EVALUATION
In this Section, the experimental evaluation carried out with the proposed algorithms is presented, along with
ACentroid-basedApproachforHierarchicalClassification
29
Table 1: Results obtained by HCCS and HCCSic compared with the HLCS (Romão and Nievola, 2012) and GMNB (Silla and Freitas, 2009) algorithms.

Protein  Signature   HCCS                HCCSic              HLCS                GMNB
Type     Type        hP    hR    hF      hP    hR    hF      hP    hR    hF      hP    hR    hF
Enzyme   Interpro    79.76 83.00 81.35   88.63 92.26 90.55   87.80 85.36 86.56   94.96 89.58 90.53
         Pfam        74.73 78.61 76.62   86.75 90.95 88.80   86.34 81.47 83.83   95.15 86.94 88.72
         Prints      76.41 79.80 78.07   82.87 86.22 84.51   89.69 82.33 85.85   92.21 87.26 87.98
         Prosite     79.17 82.69 80.90   87.31 91.27 89.24   90.35 86.27 88.26   95.14 89.53 90.70
GPCR     Interpro    70.33 75.30 72.71   71.04 74.49 72.72   90.26 74.30 81.51   87.60 71.33 77.01
         Pfam        48.50 55.00 51.54   44.60 51.81 47.93   82.53 60.30 69.69   77.23 57.52 64.40
         Prints      67.14 73.21 70.04   68.27 73.07 70.59   86.50 68.18 76.26   87.06 69.42 75.38
         Prosite     46.86 52.65 49.59   42.14 47.32 44.58   79.42 60.45 68.65   75.64 53.73 61.14
the datasets used. Direct and statistical comparisons of the results against other algorithms from the literature are also presented.
5.1 Datasets
The two biological databases used in this article
are from the family of G-Protein Coupled Recep-
tor (GPCR) and Enzymes. The protein functional
classes are given by unique hierarchical indexes by
(Horn et al., 2003) in the case of GPCRs, and
by Enzyme Commission Codes (Tipton and Boyce,
2000) in the case of enzymes. These databases
were used in the works of (Silla and Freitas, 2009)
and (Rom˜ao and Nievola, 2012), and are available
at https://sites.google.com/site/carlossillajr/resources.
Enzymes are catalysts that accelerate chemical reac-
tions while GPCRs are proteins involved in signaling
and are particularly important in medical applications
as it is believedthat from 40% to 50% of currentmed-
ical drugs target GPCR activity (Filmore, 2004).
Each dataset has four different versions based on
different kinds of predictor attributes, and in each
dataset the classes to be predicted are hierarchical
protein functions. Each type of binary predictor at-
tribute indicates whether or not a “protein signature”
(or motif) occurs in a protein (Silla and Freitas, 2009).
The motifs used in this work were: Interpro Entries,
FingerPrints from the Prints database, Prosite Patterns
and Pfam. Apart from the presence/absence of sev-
eral motifs according to the signature method, each
protein has two additional attributes: the molecular
weight and the sequence length.
Table 2 shows the main characteristics of the datasets after the pre-processing steps, which are detailed in (Silla and Freitas, 2009). In all datasets, each protein (example) is assigned to at least one class at each level of the hierarchy.
Before performing the experiments, the following
preprocessing steps were applied to the datasets: (i)
Table 2: Enzyme and GPCR dataset main characteristics.

Protein Type/Signature   Attributes   Examples   Classes/Level
Enzyme
  Interpro               1,216        14,027     6/41/96/187
  Pfam                   708          13,987     6/41/96/190
  Prints                 382          14,025     6/45/92/208
  Prosite                585          14,041     6/42/89/187
GPCR
  Interpro               450          7,444      12/54/82/50
  Pfam                   75           7,053      12/52/79/49
  Prints                 283          5,404      8/46/76/49
  Prosite                129          6,246      9/50/79/49
Every class with fewer than 10 examples was merged with its parent class. If, after this merge, the class still had fewer than 10 examples, the process was repeated recursively until the examples were labeled with the root class. (ii) All examples whose most
specific class was the root class were removed. (iii) A
class blind discretization algorithm based on equal-
frequency binning (using 20 bins) was applied to
the molecular weight and sequence length attributes,
which were the only two continuous attributes in each
dataset.
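Step (iii) above, class-blind equal-frequency discretization with 20 bins, can be sketched as follows; the helper is illustrative, not the authors' preprocessing code, and would be applied to the molecular weight and sequence length columns.

```python
import numpy as np

def equal_frequency_bins(values, n_bins=20):
    """Class-blind equal-frequency binning: cut points at the 1/n, 2/n, ...
    quantiles so each bin holds roughly the same number of values."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")  # bin index per value
```

Equal-frequency (rather than equal-width) binning keeps every discrete symbol well populated, which matters here because the binned attributes feed frequency-based weights.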
The data used in this paper is a subset of pro-
tein function datasets and a more detailed description
about the dataset used in this work is presented in the
work of (Silla and Freitas, 2011a).
5.2 Obtained Results
The experiments were performedusing 10-fold cross-
validation. Table 1 shows obtained results with the
proposed approaches HCCS and HCCSic comparing
their results with the algorithms HCLS (Rom˜ao and
Nievola, 2012) and GMNB (Silla and Freitas, 2009).
The results are represented by the three hierarchical
measures hP, hR and hF.
A comparison of the results obtained by HCCS
and HCCSic is showed in Table 4 in which is pos-
sible to see that the HCCSic outperformed the results
ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems
30
Table 3: Results obtained by HCCS and HCCSic compared with HLCS (Rom˜ao and Nievola, 2012) and GMNB (Silla and
Freitas, 2009) algorithms. The indicates the best result.
Protein Signature HCCS HCCSic HLCS GMNB
Type Type hP hR hF hP hR hF hP hR hF hP hR hF
Enzyme
Interpro
Pfam
Prints
Prosite
GPCR
Interpro
Pfam
Prints
Prosite
of HCCS in most of the datasets. These results show the improvement afforded, in this particular classification problem, by the modification of the tf-idf weighting to the tf-idf with intra-class frequency presented in Equation 7.
Table 4: Comparison of the results obtained by HCCS and HCCSic. The indicates the best result.
Protein Signature HCCS HCCSic
Type Type hP hR hF hP hR hF
Enzyme
Interpro
Pfam
Prints
Prosite
GPCR
Interpro
Pfam
Prints
Prosite
Table 3 presents the comparison of the results of the proposed algorithms against the HLCS (Romão and Nievola, 2012) and GMNB (Silla and Freitas, 2009) algorithms. Considering the distribution of the best results, we see that HCCSic obtained the best values for recall, and consequently for hF, on the enzyme datasets, while GMNB obtained the best precision. The HLCS approach outperformed the other algorithms on the GPCR datasets.
There is an apparent difference of performance on the GPCR datasets when comparing the HCCS and HCCSic results against the other algorithms (except on the Prosite dataset, in which all classifiers had a low performance). This performance drop could be a consequence of the number of classes in the levels of the taxonomy and of the number of examples in the dataset, revealing a greater sensitivity of the proposed algorithms to these variables. Looking at Table 2, it is possible to see that the GPCR datasets have a bigger number of classes in the first levels of the hierarchy and a lower number of instances.
Statistical tests based on the Friedman test, comparing the hF values between the classifiers with α = 0.05, indicated significant differences between the classifiers. The post-hoc analysis represented in Table 5 demonstrated that there are no statistically significant differences between the results of HCCSic and the HLCS or GMNB algorithms. The statistical test also showed significance when comparing the HCCS and HCCSic algorithms, confirming that the improvement made in the second classifier achieves better results.
Table 5: Results of Friedman test considering the hF mea-
sure of classifiers.
Friedman Test using hF
Comparison p-value significance
HCCS - HCCSic 0.030488 *
HCCS - HLCS 0.000892 ***
HCCS - GMNB 0.000482 ***
HCCSic - HLCS 0.136864
HCCSic - GMNB 0.085510
HLCS - GMNB 0.799078
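The omnibus Friedman test over the per-dataset hF values (taken from Table 1) can be reproduced with SciPy, for example; note that the pairwise comparisons reported in Table 5 would additionally require a post-hoc procedure, which this sketch does not include.

```python
from scipy.stats import friedmanchisquare

# hF per classifier over the 8 datasets, in Table 1's row order
hccs   = [81.35, 76.62, 78.07, 80.90, 72.71, 51.54, 70.04, 49.59]
hccsic = [90.55, 88.80, 84.51, 89.24, 72.72, 47.93, 70.59, 44.58]
hlcs   = [86.56, 83.83, 85.85, 88.26, 81.51, 69.69, 76.26, 68.65]
gmnb   = [90.53, 88.72, 87.98, 90.70, 77.01, 64.40, 75.38, 61.14]

stat, p = friedmanchisquare(hccs, hccsic, hlcs, gmnb)
print(f"Friedman chi-square = {stat:.4f}, p = {p:.6f}")
```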
6 CONCLUSION
This paper presented a new algorithm for the hierarchical classification problem of predicting protein functions, supporting taxonomies organized as a tree. The algorithm is an adaptation of the centroid-based classifiers largely employed in text classification problems and was presented in two versions: the first, named HCCS, using only tf-idf to weight the attributes, and the second, named HCCSic, adding an intra-class frequency weight to the tf-idf weighting. Both approaches proposed here are classified as global classifiers that support taxonomies organized as a tree, predict one class per instance in the hierarchy (SPL) and can predict classes at all levels of the taxonomy (NMLNP).
The main advantage of the proposed algorithms is the simplicity of centroid-based classification, which has a training cost linear in the number of training instances and a testing cost linear in the number of testing instances and the number of classes in the taxonomy. Despite its simplicity, the obtained results are very competitive in comparison with other algorithms. Another advantage of the centroid-based approach is that it summarizes the characteristics of each class using a centroid vector. The advantage of the summarization performed by the centroid vectors is that it combines multiple prevalent features together, even if these features are not simultaneously present in a single instance. This is useful because it can capture individual features present in only a few examples. Also, in terms of computational time, although its evaluation was not the main focus of this work, the centroid-based approaches proposed here clearly required less time and fewer resources than the rule-based (HLCS) and Naive Bayes (GMNB) approaches.
On the other hand, centroid-based classifiers are
dependent on a good set of examples for each class
and can lead to wrong classifications if the partition-
ing of examples is unbalanced. Also, in the context
of hierarchical classification, the addition of children
data to train the centroids of the higher classes of the
hierarchy needs further investigation, because the
average of the vectors of two child classes may
not always truly represent the characteristics of the
parent class. In a centroid-based approach it is im-
portant to ensure that the instances belonging to the
same class are proportionally distributed between
the training and testing partitions: if all examples of
one class fall into the same partition, the centroid of
this class will either not be trained or have no examples to
classify.
As future research we highlight a deeper analy-
sis of the centroid relations between parent and chil-
dren classes in the hierarchy using different datasets.
The algorithm can also be improved to support DAG
taxonomies and to predict multiple paths of labels
(MPL). Another approach to be investigated is
the selection of a set of k centroids for every instance
being classified, with the final centroid that predicts
the instance's class chosen by election, in a way
similar to the k-NN algorithm.
REFERENCES
Alves, R. T., Delgado, M. R., and Freitas, A. A. (2008).
Multi-label hierarchical classification of protein func-
tions with artificial immune systems. In Proceed-
ings of the 3rd Brazilian symposium on Bioinformat-
ics: Advances in Bioinformatics and Computational
Biology, BSB ’08, pages 1–12, Berlin, Heidelberg.
Springer-Verlag.
Barros, R. C., Cerri, R., Freitas, A. A., and de Carvalho,
A. C. P. L. F. (2013). Probabilistic clustering for hi-
erarchical multi-label classification of protein func-
tions. In Machine Learning and Knowledge Discovery
in Databases (ECML 2013), Prague, Czech Republic,
volume 8189 of Lecture Notes in Computer Science.
Blockeel, H., De Raedt, L., and Ramon, J. (1998). Top-
down induction of clustering trees. In Proceedings of
the 15th International Conference on Machine Learn-
ing, pages 55–63. Morgan Kaufmann.
Cerri, R., Barros, R. C., de Carvalho, A. C. P. L. F., and
Freitas, A. A. (2013). A grammatical evolution algo-
rithm for generation of hierarchical multi-label clas-
sification rules. In IEEE Congress on Evolutionary
Computation, pages 454–461. IEEE.
Enembreck, F., Scalabrin, E. E., Tacla, C. A., and Ávila,
B. C. (2006). Automatic identification of teams based
on textual information retrieval. In CSCWD, pages
534–538. IEEE.
Ferrandin, M., Nievola, J. C., Enembreck, F., Scalabrin,
E. E., Kredens, K. V., and Ávila, B. C. (2013). Hi-
erarchical classification using FCA and the cosine sim-
ilarity function. In Proceedings of the 2013 Interna-
tional Conference on Artificial Intelligence (ICAI'13),
volume 1, pages 281–287.
Filmore, D. (2004). It’s a GPCR world. Modern Drug Dis-
covery, 7:24–28.
Guan, H., Zhou, J., and Guo, M. (2009). A class-feature-
centroid classifier for text categorization. In Proceed-
ings of the 18th international conference on World
wide web, WWW ’09, pages 201–210, New York, NY,
USA. ACM.
Han, E.-H. and Karypis, G. (2000). Centroid-based doc-
ument classification: Analysis and experimental re-
sults. In Proceedings of the 4th European Conference
on Principles of Data Mining and Knowledge Discov-
ery, PKDD ’00, pages 424–431, London, UK, UK.
Springer-Verlag.
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen,
F. E., and Vriend, G. (2003). GPCRDB information sys-
tem for G protein-coupled receptors. Nucleic Acids
Research, 31(1):294–297.
Kiritchenko, S., Matwin, S., and Famili, A. F. (2005). Func-
tional annotation of genes using hierarchical text cat-
egorization. In Proc. of the BioLINK SIG: Link-
ing Literature, Information and Knowledge for Biol-
ogy (held at ISMB-05).
Otero, F. E. B., Freitas, A. A., and Johnson, C. G. (2010). A
hierarchical multi-label classification ant colony algo-
rithm for protein function prediction. Memetic Com-
puting, pages 165–181.
Rocchio, J. (1971). Relevance feedback in information re-
trieval. In Salton, G., editor, The SMART Retrieval
System - Experiments in Automatic Document Pro-
cessing, pages 313–323. Prentice Hall.
Romão, L. M. and Nievola, J. C. (2012). Hierarchical
classification of gene ontology with learning classi-
fier systems. In Advances in Artificial Intelligence -
IBERAMIA 2012, volume 7637 of Lecture Notes in
Computer Science, pages 120–129. Springer Berlin
Heidelberg.

ICEIS 2015 - 17th International Conference on Enterprise Information Systems
Salton, G. (1989). Automatic text processing: the trans-
formation, analysis, and retrieval of information by
computer. Addison-Wesley Longman Publishing Co.,
Inc., Boston, MA, USA.
Silla, C. N. and Freitas, A. A. (2009). A global-model naive
bayes approach to the hierarchical prediction of pro-
tein functions. In Proceedings of the 2009 Ninth IEEE
International Conference on Data Mining, ICDM ’09,
pages 992–997, Washington, DC, USA. IEEE Com-
puter Society.
Silla, C. N. and Freitas, A. A. (2011a). Selecting dif-
ferent protein representations and classification algo-
rithms in hierarchical protein function prediction. In-
tell. Data Anal., 15(6):979–999.
Silla, C. N. and Kaestner, C. A. A. (2013). Hierarchi-
cal classification of bird species using their audio
recorded songs. In SMC, pages 1895–1900. IEEE.
Silla, Jr., C. N. and Freitas, A. A. (2011b). A survey of
hierarchical classification across different application
domains. Data Min. Knowl. Discov., 22:31–72.
Tan, S. (2008). An improved centroid classifier for text cat-
egorization. Expert Syst. Appl., 35(1-2):279–285.
Theeramunkong, T. and Lertnattee, V. (2001). Improv-
ing centroid-based text classification using term-
distribution-based weighting system and clustering.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu,
G. (2002). Diagnosis of multiple cancer types by
shrunken centroids of gene expression. Proceedings
of the National Academy of Sciences, 99(10):6567–
6572.
Tipton, K. F. and Boyce, S. (2000). History of the enzyme
nomenclature system. Bioinformatics, 16(1):34–40.
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., and Block-
eel, H. (2008). Decision trees for hierarchical multi-
label classification. Mach. Learn., 73:185–214.
ACentroid-basedApproachforHierarchicalClassification
33