FUZZY HYPER-CLUSTERING FOR PATTERN CLASSIFICATION
IN MICROARRAY GENE EXPRESSION DATA ANALYSIS
Jin Liu and Tuan D. Pham
School of Engineering and Information Technology, University of New South Wales, Canberra ACT 2600, Australia
Keywords:
Fuzzy c-means, Hyperplanes, Microarray gene expression data.
Abstract:
Based on the motivation by computational challenges in microarray data analysis, we propose a fuzzy hyper-
cluster analysis as a new framework for pattern classification using such type of data. This approach uses
hyperplanes to represent the cluster centers in the fuzzy c-means algorithm. We present in this position paper
the formulation of a hyperplane-based fuzzy objective function and suggest possible solutions. Fuzzy hyper-
clustering approach appears to have potential as a novel alternative to analyze microarray gene expression
data. Furthermore, the proposed hyper-clustering algorithm is not only confined to microarray data analysis
but can be used as a general approach for classifying closely related features.
1 INTRODUCTION
Microarray technology has been developed during
last few years and becomes a popular analysis method
for studying gene expression. The advantage of mi-
croarray analysis lies in that it enables researchers to
work on the expression patterns of tens of thousands
genes simultaneously. Although microarray technol-
ogy provides convenient methods to analyze gene ex-
pression, the analysis process is complex and diffi-
cult. Many computational tools has been applied to
microarray gene expression data analysis. Accord-
ing to (Pham et al., 2006), these methods can be
classified into two categories, classification-based and
clustering-based. Classification-based methods like
support vector machines (SVMs) (Statnikov et al.,
2005) and k-nearest neighborhood (k-NN) (Pham,
2005). Clustering-based methods like fuzzy c-means
(FCM)(Asyali and Alci, 2005) and self-organizing
map (SOM) (Dougherty et al., 2002).
In this paper, we propose a fuzzy hyper-clustering
approach for pattern classification in microarray gene
expression data. The proposed method can be viewed
as an extension of the fuzzy c-means clustering which
uses hyperplanes as cluster centers. We formulate
the objective function for the fuzzy hyper-clustering
and discuss possible solutions using iterative numeri-
cal method or nature-inspired optimization method.
In literature review, there are some methods be-
ing similar to the proposed fuzzy hyper-clustering.
In (Bradley and Mangasarian, 2000), the authors
proposed to use k planes as cluster prototypes and
adopted eigenvalue decomposition to calculate these
fitting planes, the work is followed by a number of
research. Some of the following research used par-
allel or non-parallel fitting planes to perform binary
classification (Yang et al., 2009), some methods were
extended to multicategory classification by using one-
from-rest methods for each class (Jayadeva et al.,
2007).
The papers mentioned above are similar but dif-
ferent to the proposed approach. In the proposed
method, memberships of data samples assigned to
the cluster prototypes are of continuous values which
make it suitable for gene expression data in which
patterns often overlap, and the proposed clustering
method is a kind of unsupervised learning which is
also different from the work mentioned above.
The rest of the paper is organized as the follows.
In Section 2, we report current challenges in microar-
ray data analysis. Section 3 presents the proposed
fuzzy hyper-clustering and possible solutions. Fi-
nally, concluding remarks of this position paper is
given in Section 4.
2 CHALLENGES IN
MICROARRAY DATA
ANALYSIS
Although microarray data sets can be in different
forms due to various experiments platforms, they im-
pose common challenges in the analysis which we
415
Liu J. and D. Pham T. (2010).
FUZZY HYPER-CLUSTERING FOR PATTERN CLASSIFICATION IN MICROARRAY GENE EXPRESSION DATA ANALYSIS.
In Proceedings of the Third International Conference on Bio-inspired Systems and Signal Processing, pages 415-418
DOI: 10.5220/0002719504150418
Copyright
c
SciTePress
will discuss in the next section.
2.1 High-dimension Small-sample
Firstly, microarray data are often in high-dimension
and small-sample. Most of the publicly available
microarray data sets usually consist of less than 20
samples with more than thousands of gene features
(Golub et al., 1999). The high-dimension small-
sample problem is partly caused by the imbalanced
developing speed between slow sample collection and
rapid sequencing technology. As the development of
microarray chips also follows the Moore’s law from
the semiconductor, the high-dimension small-sample
problem would probably exist for long time.
2.2 High Redundancy
Another character of microarray gene-expression data
is that the data is highly redundant. Thus, the algo-
rithm which is to be utilized has to be able to dis-
cover expression patterns in a large amount of ir-
relevant genes. In this case, feature selection be-
comes important to improve the performance of pat-
tern classification. Unfortunately, most of the con-
ventional feature selection algorithms may not work
well in high-dimension and small-sample microarray
data sets (Ding and Peng, 2003). The requirements
for feature selection in microarray include that the
algorithm should be computationally efficient, and it
should work well with small-sample data.
2.3 Inherent Noise
The effect of noise introduced from the manufac-
turing process of microarray chips could not be ig-
nored. The noise could be introduced from different
stages and by different reasons. The inherent noise
makes the analysis difficult for some computational
tools like c-means clustering and hierarchical cluster-
ing as these methods are sensitive to incomplete or
inaccurate information. Although the missed feature
value can be imputed through some kind of estima-
tion, the estimated value could be even more unre-
liable, and may adversely affect the analysis results
(Suzuki et al., 2000). To produce reliable analysis re-
sults, the algorithms which are to be applied should
be robust to noise and reduce the negative effects.
2.4 Gene Expression Overlapping
Another problem in microarray data analysis is that
the clusters usually overlap (Baken et al., 2008). This
is because each gene can have more than one func-
tion. Popular computational tools like c-means and
SOM assign crisp membership to genes, which would
distort the clustering shape in overlapped gene ex-
pression analysis and could not identify co-expression
with different groups of genes. As an alternative,
fuzzy clustering, which assigns continuous member-
ship grades, can be used to analyze overlapped infor-
mation about gene multi-functionality and to reveal
the relative likelihood of each gene belonging to each
cluster.
To get a reliable results and meaningful explana-
tions from microarray gene expression data, the com-
putational tools have to be capable for analyzing data
sets with characters listed above. Facing challenges
for microarray data analysis, there is always a need
for new analysis method that could bear better perfor-
mance.
3 A FUZZY
HYPER-CLUSTERING
TECHNIQUE
Being motivated by the useful concepts of coupling
fuzzy c-means clustering with hyperplane mapping,
we aim to solve the expression-overlapping, high-
dimension, and small-sample problems in microarray
gene expression data analysis by proposing a fuzzy
hyper-clustering technique. We discuss the proposed
method in subsequent sections.
3.1 Hyperplane-based Fuzzy
Clustering
Being different from most current clustering tech-
niques such as c-means and fuzzy c-means clustering,
which represent the cluster centers using p-dimension
mean vectors of the data samples, the proposed fuzzy
hyper-clustering adopts the geometry of hyperplanes,
which was employed to develop support vector ma-
chines and other kernel-based methods (Cristianini
and Shawe-Taylor, 2000), to represent its cluster cen-
ters h
j
= (w
j
, v
j
), j = 1, ..., c, where c is the number
of clusters. From now on, we will refer h
j
as hyper-
cluster, an example of a two dimensional hypercluster
in a single class is shown in Figure 1.
In the proposed clustering technique, sample
points are assigned fuzzy memberships to each hy-
percluster according to its distances to the hyperclus-
ters. The aim of the fuzzy hyper-clustering is to find
a fuzzy partition matrix U = [u
ij
], i = 1, ..., n, n is the
number of samples; and hyperclusters h
j
that mini-
BIOSIGNALS 2010 - International Conference on Bio-inspired Systems and Signal Processing
416
0 5 10 15 20
0
50
100
150
Min: sum of d(x,v), x is all the sample points
Figure 1: Two-dimensional fuzzy hyper-clustering for a sin-
gle class.
mizes the sum of the distances from all sample points
to all hyperclusters.
U
n×c
=
u
11
u
12
. ... u
1c
u
21
u
22
. ... u
2c
. . . ... .
... . . u
ij
...
u
n1
. . ... u
nc
(1)
c
j=1
u
ij
= 1;i = 1, ..., n; j = 1, ..., c; (2)
u
ij
[0, 1] (3)
where u
ij
is the fuzzy membership of i-th object data
vector to j-th hypercluster. The resulting partition ma-
trix and hyperclusters would minimize the following
objective function:
J =
n
i=1
c
j=1
u
m
ij
d(x
i
, h
j
) (4)
where h
j
is the j th hypercluster (w
j
, v
j
),
h
j
= {w
1j
, w
2j
, w
3j
, ..., w
pj
, v
j
}
and w
j
= {w
1j
, w
2j
, w
3j
, ..., w
pj
} is a p-dimensional
normal vector to the j-th hypercluster. The distance
from a data point to the hypercluster is:
d(x
i
, h
j
) =
|w
j
· x
i
v
j
|
||w
j
||
2
(5)
||w
j
|| = 1; j = 1, ..., c;w
ij
6= 0 (6)
where w
j
·x
i
is the dot product between vector w
j
and
vector x
i
.
3.2 Proposed Solutions
To find a solution that minimizes the above objec-
tive function J, we adopt an iterative numerical model
which updates the fuzzy partition process until a con-
vergence of the solution is reached. This process is
analogous to the fuzzy c-means clustering algorithm.
In addition, we propose to use nature-inspired opti-
mization based methods as alternative solutions.
3.2.1 Iterative Numerical Method
For an iterative numerical method, by taking the first
derivatives of objective function J with respective to
the variants and setting them to zero, we can get
the necessary conditions for the objective function to
reach a minimum. The parameters can be updated ac-
cording to the following steps:
1. Initialize partition matrix U and hyperclusters
h
j
, j = 1, ..., c.
2. Calculate the new hyperclusters which minimize
the objective function J under the current partition
matrix U
t
, where t is the iteration count, then we
get the updated h
t+1
j
.
3. Following the above computation, we obtain the
newly updated fuzzy hyperclusters h
t+1
j
, then cal-
culate the new fuzzy partition matrix, the updated
fuzzy partition matrix U that minimizes the objec-
tive function J under the current fuzzy hyperclus-
ters h
t+1
j
.
4. If the algorithm converges, then the computation
stops. Otherwise, go to Step 2.
We consider the algorithm converges if the maxi-
mum change in the partition matrix between itera-
tions is less than a preset positive small number ε.
The resulting partition matrix U
and hyperclusters
h
j
, j = 1, ..., c, satisfy the solution which minimizes
the objective function J.
3.2.2 Particle Swarm Optimization
Particle swarm optimization (PSO) is a kind of evolu-
tionary computation which has been used in many ar-
eas including clustering (Feng et al., 2006). When ap-
plying PSO into the proposed fuzzy hyper-clustering,
we need to identify the positions of particles and the
fitness function. An intuitive choice is to use the c hy-
perclusters h
j
, j = 1, ..., c, to represent the position of
a particle, and the objective function J to be the fitness
function.
The PSO-based fuzzy hyperclustering can be up-
dated according to the following steps:
FUZZY HYPER-CLUSTERING FOR PATTERN CLASSIFICATION IN MICROARRAY GENE EXPRESSION DATA
ANALYSIS
417
1. Encode the position for each particle as c hy-
perclusters h
j
, j = 1, ..., c, then initialize partition
matrix U and first generation of particles.
2. Start the evolution under the current partition ma-
trix U
t
and fitness function J. The PSO updates
position and velocity for each particle. Evolution
continues until an optimal particle that minimizes
the objective function J under the current partition
matrix U
t
is found. Then we get the c hyperclus-
ters h
t+1
j
, j = 1, ..., c.
3. After we get the c hyperclusters h
t+1
j
, we then cal-
culate the new fuzzy partition matrix for each data
point, the updated fuzzy partition matrix U
t+1
would minimize the objectivefunction J under the
current fuzzy hyperclusters.
4. If the algorithm converges or reaches the max-
imum iteration numbers, the computation stops.
Otherwise, go to Step 2.
The convergence condition is similar to that of the
iterative numerical solution.
4 CONCLUSIONS
We have presented a proposed fuzzy hyper-clustering
algorithm for pattern classification in microarray
gene expression data. We formulated the objective
function for the proposed hyper-clustering and dis-
cussed possible solutions using numerical and nature-
inspired optimization methods. The proposed cluster-
ing method can be: 1) suitable for overlapping data
samples as fuzzy membership is utilized; 2) compu-
tationally efficient as the calculation for hyperclus-
ters may use generalized eigenvalue decomposition
which is simpler than that in standard SVMs; 3) po-
tential to handle nonlinear data as a kernelized ver-
sion of the proposed method can take advantage of
the kernel trick for nonlinear data analysis; 4) suit-
able for high dimensions small sample sizes data sets
as the supervised version of the proposed method can
be viewed as a variant of SVMs which currently is
known as the best high-dimension small-sample prob-
lem solver. Furthermore, the proposed approach can
be applied to many other different areas, not only con-
fined to microarray gene expression analysis.
REFERENCES
Asyali, M. H. and Alci, M. (2005). Reliability analysis of
microarray data using fuzzy c-means and normal mix-
ture modeling based classification methods. Bioinfor-
matics, 21:644–649.
Baken, K. A., Pennings, J. L., Jonker, M. J., Schaap, M. M.,
de Vries, A., van Steeg, H., Breit, T. M., and van Lov-
eren, H. (2008). Overlapping gene expression pro-
files of model compounds provide opportunities for
immunotoxicity screening. Toxicology and Applied
Pharmacology, 226:46–59.
Bradley, P. S. and Mangasarian, O. L. (2000). k-plane clus-
tering. J. Global Optimization, 16:23–32.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction
to Support Vector Machines and Other Kernel-based
Learning Methods. Cambridge University Press,
Cambridge.
Ding, C. and Peng, H. (2003). Minimum redundancy fea-
ture selection from microarray gene expression data.
In Proc. 2003 IEEE Computer Society Bioinformatics
Conference, pages 523–529.
Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar,
R. M., Chen, Y., Bittner, M., and Trent, J. M. (2002).
Inference from clustering with application to gene-
expression microarrays. J. Computational Biology,
9:105–126.
Feng, H. M., Chen, C. Y., and Ye, F. (2006). Adap-
tive hyper-fuzzy partition particle swarm optimiza-
tion clustering algorithm. Cybernetics and Systems,
37:463–479.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasen-
beek, M., Mesirov, J. P., Coller, H., Loh, M. L., Down-
ing, J. R., Caligiuri, M. A., Bloomfield, C. D., and
Lander, E. S. (1999). Molecular classification of can-
cer: Class discovery and class prediction by gene ex-
pression monitoring. Science, 286:531–537.
Jayadeva, Khemchandaniand, R., and Chandra, S. (2007).
Fuzzy multi-category proximal support vector classi-
fication via generalized eigenvalues. Soft Computing,
11:679–685.
Pham, T. D. (2005). An optimally weighted fuzzy k-NN al-
gorithm. In Proc. 2005 Int. Conf. Advances in Pattern
Recognition, pages 239–247.
Pham, T. D., Wells, C., and Crane, D. I. (2006). Analysis
of microarray gene expression data. Current Bioinfor-
matics, 1:37–53.
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D.,
and Levy, S. (2005). A comprehensive evaluation
of multicategory classification methods for microar-
ray gene expression cancer diagnosis. Bioinformatics,
21:631–643.
Suzuki, T., Hashimoto, S.-i., Toyoda, N., Nagai, S., Ya-
mazaki, N., Dong, H.-Y., Sakai, J., Yamashita, T.,
Nukiwa, T., and Matsushima, K. (2000). Comprehen-
sive gene expression profile of LPS-stimulated human
monocytes by SAGE. Blood, 96:2584–2591.
Yang, X., Chen, S., Chen, B., and Pan, Z. (2009). Proxi-
mal support vector machine using local information.
Neurocomputing, in-print.
BIOSIGNALS 2010 - International Conference on Bio-inspired Systems and Signal Processing
418