Applying models to data affected by outliers can
even produce incorrect inferences. In the past, the
influence of outliers was rarely considered when analyzing data from standard microarrays. The current practice, instead, is to use outlier detection as a pre-processing step that cleans the data. However, it is important to emphasize that, in many cases, outliers may simply be the result of natural variability in the data.
In this work we propose a novel outlier detection approach that combines Hierarchical Clustering and Robust Principal Component Analysis. This ensemble mechanism, which joins two techniques generally not adopted in this context, makes it possible to derive a pseudo-mathematical classification of outlier samples in GEP data. The obtained classification could then be used to build a new decision-making model. Since such a model is usually chosen based on how it separates the data into two or more clusters, we propose a data pre-processing mechanism that identifies the anomalies independently of this choice and integrates the anomaly detection tool into the microarray analysis context.
The paper is organized as follows. Section 2 briefly overviews the two algorithms adopted in our proposal, together with some of the main methodologies frequently used in the gene expression field. Section 3 illustrates the experimental results obtained by applying the proposed method to three different datasets (one artificial and two real medical datasets). Comparisons
with techniques commonly used to detect anomalies are
also presented and the advantages of the proposed ap-
proach are discussed. Finally, conclusions and direc-
tions of future research are sketched in Section 4.
2 METHODS FOR OUTLIER
DETECTION
The approach we propose is based on two important
techniques, which are already independently used for
anomaly detection.
Clustering is arguably the most important unsupervised learning problem: it aims to find a structure in a collection of unlabeled data. A cluster is a group of objects that are “similar” to each other and “dissimilar” to the objects belonging to other clusters. Outliers are therefore the samples belonging to a separate micro-cluster, since they lie far from most of the other data, and they are usually identified by increasing the number of clusters. In particular, we chose Hierarchical clustering because it allows the distance measure to be selected freely. In gene expression data analysis, when clusters of observations with the same overall profile are sought, correlation-based distance (used as a dissimilarity measure) is considered the appropriate choice.
Distance, however, is not the only parameter to be set in clustering algorithms: the linkage method, which defines how two clusters are separated (or merged), must also be chosen. In our experiments we used the Pearson correlation distance and the Average linkage method, according to the empirical criterion assessing their stability described in Section 3.
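The clustering step described above can be sketched with standard tools: average-linkage hierarchical clustering under Pearson correlation distance, flagging as outliers the samples that fall into very small (“micro”) clusters when the number of clusters is increased. The data, the cluster count, and the micro-cluster size threshold below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: hierarchical clustering with correlation distance,
# outliers = members of micro-clusters. Synthetic data for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
profile = rng.normal(size=100)                    # shared expression profile
X = profile + 0.3 * rng.normal(size=(30, 100))    # 30 correlated samples
X[0] = rng.normal(size=100)                       # one uncorrelated outlier

# Pearson correlation distance: d(u, v) = 1 - corr(u, v)
dist = pdist(X, metric="correlation")
Z = linkage(dist, method="average")

# Cut the dendrogram into several clusters; samples ending up in
# clusters of size <= 2 are treated as anomalies.
labels = fcluster(Z, t=4, criterion="maxclust")
sizes = np.bincount(labels)
outliers = [i for i, lab in enumerate(labels) if sizes[lab] <= 2]
print(outliers)
```

Because the correlation distance ignores shifts and rescalings of a sample's profile, the outlier here must be uncorrelated with the common profile rather than merely shifted in mean.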
The second technique involved in the proposed
approach is Robust Principal Component Analysis
(ROBPCA) method (Hubert et al., 2005), which combines the strengths of Projection-Pursuit (PP) techniques (Croux et al., 2007) and robust covariance estimation. The former is used for reducing the initial dimensionality, whereas the latter, in particular the Minimum Covariance Determinant (MCD) estimator, is applied to the resulting smaller data space.
Consider an n × p data matrix X = X_{n,p}, where n indicates the number of observations and p the original number of variables. The ROBPCA method proceeds in three main steps:
1. the data are pre-processed so that the transformed data lie in a subspace whose dimension is at most n − 1;
2. a preliminary covariance matrix S_0 is constructed and used to select the number of components k that will be retained in the sequel, yielding a k-dimensional subspace well fitted to the data;
3. the data points are projected onto this subspace, where their location and scatter matrix are robustly estimated; from the scatter matrix the k nonzero eigenvalues ℓ_1, ..., ℓ_k are computed, and the corresponding eigenvectors are the k robust principal components.
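These three steps can be sketched with off-the-shelf components. This is a simplified stand-in, not the actual ROBPCA algorithm of Hubert et al.: scikit-learn's `MinCovDet` plays the role of the MCD estimator, the projection-pursuit outlyingness ranking is omitted, and k is fixed by hand instead of being selected from S_0.

```python
# Simplified ROBPCA-like sketch (assumption: mirrors only the broad
# three-step outline above). Synthetic data for illustration.
import numpy as np
from sklearn.covariance import MinCovDet
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))          # n = 50 observations, p = 10 variables

# Step 1: pre-process into an affine subspace of dimension at most n - 1
# (here a plain SVD/PCA transform keeping all components).
X_pre = PCA(n_components=min(X.shape[0] - 1, X.shape[1])).fit_transform(X)

# Step 2: retain k components (fixed here; ROBPCA selects k from S_0).
k = 3
X_k = X_pre[:, :k]

# Step 3: robust location and scatter on the k-dim subspace via MCD,
# then eigendecompose the robust scatter matrix.
mcd = MinCovDet(random_state=0).fit(X_k)
eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
order = np.argsort(eigvals)[::-1]
robust_eigvals = eigvals[order]        # the k nonzero eigenvalues
robust_components = eigvecs[:, order]  # the k robust principal components
print(robust_eigvals.shape)            # (3,)
```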
Let P_{p,k} be the p × k eigenvector matrix (with orthogonal columns); the location estimate, denoted by the p-variate column vector ν̂, is called the robust center. The scores are the entries of the n × k matrix

T_{n,k} = (X_{n,p} − 1_n ν̂^⊤) · P_{p,k}    (1)
The k robust principal components generate a p × p robust scatter matrix S of rank k given by

S = P_{p,k} L_{k,k} P_{p,k}^⊤    (2)

where L_{k,k} is the diagonal matrix with the eigenvalues ℓ_1, ..., ℓ_k on its diagonal.
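Equations (1) and (2) translate directly into array operations. In the following numeric check, the data matrix, the robust center ν̂, the loading matrix P, and the eigenvalues ℓ are arbitrary placeholders, not outputs of ROBPCA:

```python
# Numeric illustration of Eqs. (1) and (2):
#   scores  T = (X - 1_n vhat^T) P   and   scatter  S = P L P^T (rank k).
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 20, 6, 2
X = rng.normal(size=(n, p))
nu_hat = X.mean(axis=0)                       # placeholder "robust" center
P = np.linalg.qr(rng.normal(size=(p, k)))[0]  # p x k orthonormal columns
ell = np.array([3.0, 1.5])                    # placeholder eigenvalues

T = (X - nu_hat) @ P                          # Eq. (1): n x k score matrix
L = np.diag(ell)
S = P @ L @ P.T                               # Eq. (2): p x p scatter, rank k

print(T.shape, np.linalg.matrix_rank(S))      # (20, 2) 2
```

Note that S is p × p but has rank k, since it is built from only k eigenvector columns.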
Like classical PCA, the ROBPCA method is location and orthogonally equivariant; these properties are not trivial for other robust PCA estimators. It should be noted that dimensionality reduction approaches are widely used in the context of microarray data analysis (Esposito et al., 2020) but, to the best of our knowledge, this is the first time that ROBPCA has been applied to outlier detection in microarray data.
Anomalies Detection in Gene Expression Matrices: Towards a New Approach