in this paper.
2. High Pairwise-Correlation Filtering to remove redundancy. Pairs of features showing a high pairwise correlation (i.e., Spearman correlation > 0.8) are filtered by removing the feature that has the higher mean correlation with all the other features in the dataset¹. The parallel algorithm we implemented to perform this task is outlined in Appendix A.
3. Blocking ID Estimation (Section 3.1.3). This
estimation approach is inspired by the blocking
analysis applied in (Facco et al., 2017) to com-
pute an ID estimate less affected by noise in the
data samples. In this paper, we revisit it in order
to provide an ID estimate (hereafter referred to as
blocking-ID) robust with respect to the curse of
dimensionality (affecting datasets characterized
by an extremely large dimensionality compared to
the limited sample cardinality). Furthermore, we
use it to identify the minimum number of features,
C, that should be kept by the next unsupervised
FS approach (Step 4 below) to ensure that, with
a certain degree of confidence, most of the salient
information is kept, while noise and redundancies
are minimized.
4. Unsupervised FS via Hierarchical Clustering (Solorio-Fernández et al., 2020; Gagolewski, 2021) to select C features that are the medoids of the corresponding C feature clusters (Section 3.2 and Appendix B).
5. Blocking ID Estimation to monitor potentially retained noise and information loss.
6. Dimensionality Reduction: The representative feature set is finally embedded into a lower-dimensional space, whose dimensionality is chosen based on the blocking-ID estimate computed in Step 3. To this aim, we compare four different DR techniques (UMAP, t-SNE, RPCA and RCUR) and choose the one that yields a global ID estimate comparable to the blocking-ID.
3.1 ID Estimation
The ID (Johnsson, 2011; Campadelli et al., 2015) of a dataset is the minimum number of parameters needed to maintain its characterizing structure; in other words, the ID is the minimum number of dimensions of a lower-dimensional space onto which the data can be projected (by a smooth mapping) while minimizing the information loss. When a dataset X_n = {x_i}_{i=1}^n ⊂ R^D has ID equal to d, the D-dimensional samples are assumed to be uniformly drawn from a manifold with topological dimension equal to d that has been embedded in the higher D-dimensional space through a nonlinear smooth mapping. Unfortunately, estimating the topological dimension of a manifold from a limited set of points uniformly drawn from it is a challenging and still unsolved task. All the SOTA ID estimation techniques exploit differing underlying theories, according to which they are often grouped into the following four main categories: Projective ID estimators, Topological-based ID estimators, Fractal ID estimators, and Nearest-Neighbors (NN) based ID estimators.

¹ It is worth mentioning that, from a biological perspective, the correlation between omics variables can often be meaningful: for example, correlated variables in expression data may relate to the same molecular pathway (Allocco et al., 2004), and thus to coregulated genes. From a statistical point of view, however, highly correlated variables may affect the reliability of estimates, leading to inflated values; thus, their removal is often advisable.
In the bioinformatics field, the available datasets
are often noisy and complex. In this context, Pro-
jective, Topological-based and Fractal ID estimators
are often outperformed by NN estimators; Fractal ID
estimators fail when the points are noisy and/or not
uniformly drawn from the underlying manifold, while
Projective and Topological-based ID estimators pro-
duce reliable estimates for data drawn from mani-
folds with mainly low curvature and low ID values.
On the other hand, NN estimators have shown their robustness on non-uniformly drawn, noisy, and complex datasets, where the two main assumptions at the
base of Fractal, Topological-based, and Projective ID
estimators are often violated. Indeed, (1) the points
cannot be assumed to be uniformly drawn from the
manifold where they are assumed to lie, and (2) the
complexity of the available datasets allows assuming
that the points lie on multiple, possibly intersecting manifolds, each characterized by a specific
topological dimension.
To account for the aforementioned issues, NN estimators often compute a reliable “global” ID estimate by integrating all the “local” IDs estimated over
point-neighborhoods. Based on these remarks, in this
work we estimated the ID of the available multi-omics
views by comparing two NN ID estimators, namely
DANCo (see subsection 3.1.1) and TWO-NN (see
subsection 3.1.2).
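As a concrete illustration of the NN-based approach, the TWO-NN estimator (Facco et al., 2017) compared in this work can be sketched as follows; the brute-force distance matrix and the fixed 10% discard fraction are simplifications of this sketch, not a reproduction of the exact implementation used in the paper:

```python
import numpy as np

def two_nn_id(X, discard_fraction=0.1):
    """TWO-NN ID estimate (Facco et al., 2017): only the ratio mu = r2/r1 of
    each point's two nearest-neighbor distances is used, which confines the
    density assumptions to very small neighborhoods."""
    n = X.shape[0]
    # brute-force squared Euclidean distances (a KD-tree would scale better)
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(d2, np.inf)
    r = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])  # r1, r2 for every point
    mu = np.sort(r[:, 1] / r[:, 0])
    k = int(n * (1.0 - discard_fraction))  # discard the largest ratios (outliers)
    x = np.log(mu[:k])
    y = -np.log(1.0 - np.arange(1, k + 1) / n)  # empirical CDF, F(mu) = 1 - mu^(-d)
    # slope of the through-origin fit y = d * x is the ID estimate
    return float(np.sum(x * y) / np.sum(x * x))
```

On points drawn from a low-dimensional manifold embedded in a higher-dimensional space, the estimate should approach the manifold's topological dimension (e.g., close to 2 for samples from a plane in R^10), which is the “global” behavior obtained by pooling the local two-neighbor statistics.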
3.1.1 DANCo
DANCo (Ceruti et al., 2014) estimates the (potentially
high) ID of a dataset by comparing the joint probabil-
ity density functions (pdfs) characterizing the point-
neighborhood distributions in the input dataset to a set
of pdfs, each characterizing the point-neighborhood
Intrinsic-Dimension Analysis for Guiding Dimensionality Reduction in Multi-Omics Data