56.20 cases in every 100 thousand women. Without
taking into account non melanoma skin tumors,
breast cancer ranks first and most frequent in the
South, Southeast, Northeast and Center of Brazil.
Cancer classification has been a central topic in
treatment research. The most classical approach is
based on tumor morphology, which presents
limitations such as a strongly biased identification
by specialists and also difficulties in distinguishing
between subtypes (Liu et al., 2004). In this context
data mining appears as a means of facilitating
decision making processes by health professionals
when it comes to diagnosis, treatment and patient
care (Tseng et al., 2015).
Data mining belongs to a stage in the Knowledge
Discovery in Database (KDD) process (Tseng et al.,
2015). It is defined as the process of discovering
hidden patterns in data, which can take place
automatic or semi automatically (Witten et al.,
2011).
The vast applications of cDNA and
oligonucleotide microarrays, with complete genomic
expression, scanning more than 40,000 clones in a
single experiment, made possible the development
of a new era of molecular genomics. At the same
time, they’re generating vast amounts of data.
Molecular expression based classification continues
to be a challenge partly due to the different
microarray platforms, identification methods,
scanners, image analysis tools but also currently
available classification algorithms. Moreover,
there’s a growing number of algorithms that are
being developed for analysis of high quality
microarray data (Greer and Khan, 2004).
Due to the high cost, genetic data are usually
collected from a limited number of patients.
Therefore, there’s a need for choosing the most
relevant information among the available data.
Irrelevant gene removal can contribute to the
reduction of noise, confusion and complexity.
Besides, it increases the chances of identifying
important genes, classifying diseases and predicting
several outcomes (e.g., type of cancer). Several
computacional strategies have been applied to gene
expression classification problems (Shah and
Kusiak, 2007).
Learning methods constitute an automatic and
intelligent technique, which has been used widely
for solving different real and complex situations.
Since its introduction in bioinformatics, learning
approaches have helped to speed up diverse
researches. Since they are inexpensive and efficient,
its applications have become more popular and
constantly growing (Liu et al., 2004).
Usually, there are two different learning
schemes, supervised and unsupervised learning. In
the first one, the output is given, or there is some
type of previous knowledge about the data. On the
second one, however, there is no previous
knowledge about the data. General tasks performed
are classification, characterization and clustering.
The supervised approach is the most used in
biological problems where two sets of samples are
presented. The program, then, must generate a
classifier which is able to distinguish between these
two datasets. Then, it can be used as a base for the
classification of unseen data (Liu et al., 2004).
In an article published by Ahmad et al. (2015), it
was mentioned that when analyzing data mining in
healthcare, classification is one of the most popular
methods. The authors then follow by giving a list of
the most commonly used classification algorithms in
healthcare, such as K Nearest Neighbor (KNN),
Decision Trees, Support Vector Machines, Artificial
Neural Networks and Bayesian Methods (Ahmad et
al., 2015).
Decision Trees (DT) are considered to be one of
the most popular approaches when it comes to
classifiers. They can be built from data which is
already available in several fields. Every non leaf
node denotes a test to be performed, while branches
are outcomes. The tree ends in a leaf node, which
represents a class label. The most common use of
DTs is to calculate conditional probabilities. They
allow for class separation based on information gain.
The main advantages presented by this method are
the fact that DTs are self-explanatory, easy to
follow, the ability to handle nominal and numeric
attributes, ability to handle missing values.
However, they also present several disadvantages.
Most algorithms require the trees to have discrete
values, for they use the divide and conquer method.
Their performance gets lower the more complex is
the interaction among attributes. In that way, other
classifiers can describe the relationship among
variables in a way DTs would make it really
challenging (Ahmad et al., 2015).
Artificial Neural Networks (ANN), on the other
hand, were considered to be the best classification
algorithm before the introduction of methods such as
DTs and Support Vector Machines. This allowed
them to be one of the most widely used in several
different fields. They have widely used in supporting
diagnosis of diseases such as cancers and in
predicting outcomes. Their basic elements are
neurons (also called nodes), which are
interconnected and work in parallel to produce
outcome functions. The main ability behind ANNs is