Data obtained from cancer-related gene expression
studies typically consist of expression level measurements
of thousands of genes. This complexity calls for data
analysis methodologies that efficiently aid in extracting
relevant biological information. Previous gene expression
analysis work emphasizes clustering techniques (unsupervised
classification), which aim at partitioning the set of genes
into subsets that are expressed similarly across different
conditions. Supervised classification techniques (also called
class prediction or class discrimination), on the other hand,
aim to assign examples to predefined categories (Golub et al.,
1999; Diaz-Uriarte and de Andrés, 2006; Nitsch et al., 2010).
The objectives of supervised classification techniques
are: 1) to build accurate classifiers that enable the
reliable discrimination between different cancer classes,
and 2) to identify biomarkers of diseases, i.e., a small set
of genes that leads to the correct discrimination between
different cancer states. This second purpose can be achieved
by classifiers that provide understandable results and
indicate which genes contribute to the discrimination.
Following this line, the goal of this paper is to apply
two techniques, one for classification and one for feature
selection, to tumor datasets in order to analyze these
datasets and to obtain information that provides
understandable results. We use the Fuzzy Random Forest
method (FRF) proposed in (Bonissone et al., 2010; Cadenas
et al., 2012a) and the Feature Selection Fuzzy Random Forest
method (FRF-fs) proposed in (Cadenas et al., 2013).
This paper is organized as follows. First, in Section 2,
some techniques applied to gene expression data reported
in the literature are briefly described. Next,
in Section 3, the applied methods are described. Then,
in Section 4 we perform an analysis of two tumor
datasets using these methods. Finally, in Section 5
remarks and conclusions are presented.
2 MACHINE LEARNING AND
GENE EXPRESSION DATA
In this section, we describe some of the machine
learning techniques used for the management of gene
expression data.
2.1 Cluster Analysis based Techniques
Clustering is one of the primary approaches to analyzing
such large amounts of data and discovering groups of
co-expressed genes. In (Mukhopadhyaya and Maulikb, 2009)
an attempt to improve a fuzzy clustering solution by using
an SVM classifier is presented. In this regard, two fuzzy
clustering algorithms, VGA and IFCM, have been used.
In (Alon et al., 1999) a clustering algorithm is used to
organize the data in a binary tree. The algorithm
was applied to both the genes and the tissues, reveal-
ing broad coherent patterns that suggest a high degree
of organization underlying gene expression in these
tissues. Coregulated families of genes clustered to-
gether. Clustering also separated cancerous from non-
cancerous tissue.
In (Golub et al., 1999) a self-organizing map (SOM) is used
to divide the leukemia examples into clusters. First, they
applied a two-cluster SOM to automatically discover the
two types of leukemia. Next, they applied
a four-cluster SOM. They subsequently obtained im-
munophenotype data on the examples and found that
the four classes largely corresponded to AML, T-
lineage ALL, B-lineage ALL, and B-lineage ALL, re-
spectively. The four-cluster SOM thus divided the ex-
amples along another key biological distinction.
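The two-cluster SOM idea described above can be sketched as follows. This is a minimal illustration, not the implementation used in (Golub et al., 1999): the one-dimensional node grid, the linear decay schedules, the deterministic initialization, all function and parameter names, and the toy data are assumptions made only for this example.

```python
import math

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], x)),
    )

def train_som(samples, n_nodes=2, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a 1-D SOM and return the list of node weight vectors."""
    # Assumed deterministic initialization: pick nodes evenly spaced over the samples.
    step = max(1, (len(samples) - 1) // max(1, n_nodes - 1))
    weights = [list(samples[min(i * step, len(samples) - 1)]) for i in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)               # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3  # neighborhood width shrinks over time
        for x in samples:
            bmu = best_matching_unit(weights, x)
            for k, w in enumerate(weights):
                # Gaussian neighborhood over the 1-D grid of node indices.
                h = math.exp(-((k - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Toy data (made up): two well-separated groups of three examples each.
toy_samples = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
    (0.9, 1.0), (1.0, 0.8), (0.8, 0.9),
]
som_weights = train_som(toy_samples, n_nodes=2)
som_clusters = [best_matching_unit(som_weights, x) for x in toy_samples]
```

Each example is assigned to the cluster of its best matching node; calling `train_som` with `n_nodes=4` would correspond to the four-cluster variant.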
In (Ben-Dor et al., 2000) a clustering-based classifier
is built. The classifier is constructed on the CAST
clustering algorithm, which takes as input a threshold
parameter t, which controls the granularity of the resulting
cluster structure, and a similarity measure between the
tissues. To classify an example, they cluster the training
data together with the example, maximizing compatibility
with the labeling of the training data. Then they examine
the labels of all elements of the cluster the example
belongs to and use a simple majority rule to determine the
unknown label.
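The majority-rule step described above can be sketched as follows. The sketch does not implement CAST: as a stand-in, clusters are formed as connected components of the graph that links elements whose similarity is at least t, and t is passed in as a fixed value instead of being chosen to maximize compatibility with the training labels. The similarity function, all names, and the toy data are hypothetical.

```python
from collections import Counter

def similarity(a, b):
    """Hypothetical similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return 1.0 / (1.0 + dist)

def threshold_clusters(points, t):
    """Stand-in for CAST: connected components of the 'similarity >= t' graph."""
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            linked = {j for j in unassigned if similarity(points[i], points[j]) >= t}
            unassigned -= linked
            cluster |= linked
            frontier.extend(linked)
        clusters.append(cluster)
    return clusters

def classify_by_cluster(train_points, train_labels, example, t):
    """Cluster training data plus the example, then vote within the example's cluster."""
    points = list(train_points) + [example]
    query = len(points) - 1
    for cluster in threshold_clusters(points, t):
        if query in cluster:
            votes = Counter(train_labels[i] for i in cluster if i != query)
            # None signals that the example ended up in a cluster of its own.
            return votes.most_common(1)[0][0] if votes else None

# Toy data (made up): two tissues per class, two features each.
train_points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
train_labels = ["normal", "normal", "tumor", "tumor"]
predicted = classify_by_cluster(train_points, train_labels, (4.9, 5.2), t=0.5)
```

Returning `None` for an example that clusters with no training element is a design choice of this sketch; it leaves such examples unclassified instead of forcing a label.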
2.2 Techniques for Feature Selection
and Supervised Classification
Discovering novel disease genes is still challenging
for constitutional genetic diseases (a disease involv-
ing the entire body or having a widespread array of
symptoms) for which no prior knowledge is available.
Genetic studies frequently result in large lists of
candidate genes, of which only a few can be followed up
for further investigation. Gene prioritization establishes
the ranking of candidate genes based on their relevance
with respect to a biological process of interest, from
which the most promising genes can be selected for further
analysis (Nitsch et al., 2010).
This is a special case of feature selection, a well-
known problem in machine learning.
In (Golub et al., 1999) a procedure that uses
a fixed subset of “informative genes” is developed.
These “informative genes” are chosen based on their
correlation with the class distinction.
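Selecting genes by their correlation with the class distinction can be sketched as follows. The text above does not specify the correlation measure, so this sketch assumes a signal-to-noise style score, (mean of class 0 - mean of class 1) / (std of class 0 + std of class 1), for a two-class problem. The function names, the handling of zero spread, and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev

def signal_to_noise(values, labels):
    """Score one gene: difference of class means over the sum of class std devs."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    diff = mean(class0) - mean(class1)
    spread = pstdev(class0) + pstdev(class1)
    if spread == 0:
        # Assumed convention: a constant gene scores 0, a perfectly
        # separating gene with no within-class spread scores +/- infinity.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / spread

def select_informative_genes(expression, labels, k):
    """Return the indices of the k genes with the largest absolute score.

    expression[i][g] is the expression level of gene g in example i.
    """
    n_genes = len(expression[0])
    scores = [
        signal_to_noise([row[g] for row in expression], labels)
        for g in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:k]

# Toy data (made up): 4 examples (rows) and 3 genes (columns).
toy_expression = [
    [5.0, 1.0, 3.0],
    [6.0, 1.2, 2.0],
    [1.0, 1.1, 3.0],
    [2.0, 0.9, 2.0],
]
toy_labels = [0, 0, 1, 1]
top_genes = select_informative_genes(toy_expression, toy_labels, k=1)
```

Ranking by the absolute score keeps genes that are strongly expressed in either class, so the selected subset is not biased toward one class.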
Using a Fuzzy Decision Tree Ensemble for Tumor Classification from Gene Expression Data