Applying models to data affected by outliers can
even produce incorrect inferences. In the past, the
influence of outliers was rarely considered when analyzing data from standard microarrays. The current practice, instead, is to use outlier detection as a pre-processing step that cleans the data. However, it is important to emphasize that, in many cases, outliers may simply be the result of natural variability in the data.
In this work we propose a novel outlier detection approach that combines Hierarchical Clustering and Robust Principal Component Analysis. This ensemble mechanism, which joins two techniques generally not adopted in this context, makes it possible to derive a pseudo-mathematical classification of outlier samples in GEP data. The obtained classification could then be used to build a new decision-making model. Since such a model is usually chosen based on how it separates the data into two or more clusters, we propose a data pre-processing mechanism that identifies the anomalies independently of this choice and integrates the anomaly detection tool into the microarray analysis context.
The paper is organized as follows. Section 2 briefly overviews the two algorithms adopted in our proposal, together with some of the main methodologies frequently used in the gene expression field. Section 3 illustrates the experimental results obtained by applying the proposed method to three different datasets (one artificial and two real medical datasets). Comparisons
with techniques commonly used to detect anomalies are
also presented and the advantages of the proposed ap-
proach are discussed. Finally, conclusions and direc-
tions of future research are sketched in Section 4.
2 METHODS FOR OUTLIER
DETECTION
The approach we propose is based on two important
techniques, which are already independently used for
anomaly detection.
Clustering is arguably the most important unsupervised learning problem: it aims to find a structure in a collection of unlabeled data. A cluster is a group of objects that are “similar” to each other and “dissimilar” to the objects belonging to other clusters. Outliers are therefore the samples belonging to a separate micro-cluster, since they lie far from most of the other data, and they are usually identified by increasing the number of clusters. In particular, we chose Hierarchical clustering because it allows the distance measure to be selected freely. In gene expression data analysis, when clusters of observations with the same overall profile are sought, correlation-based distance (used as a dissimilarity measure) is considered the appropriate choice.
Distance, however, is not the only parameter to be set in clustering algorithms: the linkage method, which defines how two clusters are separated (or merged), must also be chosen. In our experiments we used the Pearson correlation distance and the Average linkage method, according to the empirical criterion assessing their stability described in Section 3.
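The clustering step described above can be sketched with standard tools: average-linkage hierarchical clustering under Pearson correlation distance, flagging as outliers the samples that fall into very small (“micro”) clusters when the number of clusters is increased. The data, the cluster count, and the micro-cluster size threshold below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: hierarchical clustering with correlation distance,
# outliers = members of micro-clusters. Synthetic data for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
profile = rng.normal(size=100)                    # shared expression profile
X = profile + 0.3 * rng.normal(size=(30, 100))    # 30 correlated samples
X[0] = rng.normal(size=100)                       # one uncorrelated outlier

# Pearson correlation distance: d(u, v) = 1 - corr(u, v)
dist = pdist(X, metric="correlation")
Z = linkage(dist, method="average")

# Cut the dendrogram into several clusters; samples ending up in
# clusters of size <= 2 are treated as anomalies.
labels = fcluster(Z, t=4, criterion="maxclust")
sizes = np.bincount(labels)
outliers = [i for i, lab in enumerate(labels) if sizes[lab] <= 2]
print(outliers)
```

Because the correlation distance ignores shifts and rescalings of a sample's profile, the outlier here must be uncorrelated with the common profile rather than merely shifted in mean.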
The second technique involved in the proposed
approach is Robust Principal Component Analysis
(ROBPCA) method (Hubert et al., 2005), which combines the strengths of Projection-Pursuit (PP) techniques (Croux et al., 2007) and robust covariance estimation. The former is used for reducing the initial dimensionality, whereas the latter, in particular the Minimum Covariance Determinant (MCD) estimator, is applied to the resulting smaller data space.
Consider an n × p data matrix X = X_{n,p}, where n indicates the number of observations and p the original number of variables. The ROBPCA method proceeds in three main steps:
1. the data are pre-processed so that the transformed data lie in a subspace whose dimension is at most n − 1;
2. a preliminary covariance matrix S_0 is constructed and used to select the number of components k that will be retained in the sequel, yielding a k-dimensional subspace well fitted to the data;
3. the data points are projected onto this subspace, where their location and scatter matrix are robustly estimated; from the scatter matrix the k nonzero eigenvalues ℓ_1, ..., ℓ_k are computed, and the corresponding eigenvectors are the k robust principal components.
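These three steps can be sketched with off-the-shelf components. This is a simplified stand-in, not the actual ROBPCA algorithm of Hubert et al.: scikit-learn's `MinCovDet` plays the role of the MCD estimator, the projection-pursuit outlyingness ranking is omitted, and k is fixed by hand instead of being selected from S_0.

```python
# Simplified ROBPCA-like sketch (assumption: mirrors only the broad
# three-step outline above). Synthetic data for illustration.
import numpy as np
from sklearn.covariance import MinCovDet
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))          # n = 50 observations, p = 10 variables

# Step 1: pre-process into an affine subspace of dimension at most n - 1
# (here a plain SVD/PCA transform keeping all components).
X_pre = PCA(n_components=min(X.shape[0] - 1, X.shape[1])).fit_transform(X)

# Step 2: retain k components (fixed here; ROBPCA selects k from S_0).
k = 3
X_k = X_pre[:, :k]

# Step 3: robust location and scatter on the k-dim subspace via MCD,
# then eigendecompose the robust scatter matrix.
mcd = MinCovDet(random_state=0).fit(X_k)
eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
order = np.argsort(eigvals)[::-1]
robust_eigvals = eigvals[order]        # the k nonzero eigenvalues
robust_components = eigvecs[:, order]  # the k robust principal components
print(robust_eigvals.shape)            # (3,)
```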
Let P_{p,k} be the p × k eigenvector matrix (with orthogonal columns); the location estimate, denoted by the p-variate column vector ν̂, is called the robust center. The scores are the entries of the n × k matrix

T_{n,k} = (X_{n,p} − 1_n ν̂^⊤) · P_{p,k}    (1)
The k robust principal components generate a p × p robust scatter matrix S of rank k given by

S = P_{p,k} L_{k,k} P_{p,k}^⊤    (2)

where L_{k,k} is the diagonal matrix with the eigenvalues ℓ_1, ..., ℓ_k on its diagonal.
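Equations (1) and (2) translate directly into array operations. In the following numeric check, the data matrix, the robust center ν̂, the loading matrix P, and the eigenvalues ℓ are arbitrary placeholders, not outputs of ROBPCA:

```python
# Numeric illustration of Eqs. (1) and (2):
#   scores  T = (X - 1_n vhat^T) P   and   scatter  S = P L P^T (rank k).
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 20, 6, 2
X = rng.normal(size=(n, p))
nu_hat = X.mean(axis=0)                       # placeholder "robust" center
P = np.linalg.qr(rng.normal(size=(p, k)))[0]  # p x k orthonormal columns
ell = np.array([3.0, 1.5])                    # placeholder eigenvalues

T = (X - nu_hat) @ P                          # Eq. (1): n x k score matrix
L = np.diag(ell)
S = P @ L @ P.T                               # Eq. (2): p x p scatter, rank k

print(T.shape, np.linalg.matrix_rank(S))      # (20, 2) 2
```

Note that S is p × p but has rank k, since it is built from only k eigenvector columns.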
Like classical PCA, the ROBPCA method is location and orthogonally equivariant; these properties are not trivial for other robust PCA estimators. It should be noted that dimensionality reduction approaches are widely used in the context of microarray data analysis (Esposito et al., 2020) but, to the best of our knowledge, this is the first time that ROBPCA has been applied to outlier detection in microarray data.
Anomalies Detection in Gene Expression Matrices: Towards a New Approach