Authors:
Valentina Guarino (1); Jessica Gliozzo (2, 1); Ferdinando Clarelli (3); Béatrice Pignolet (4, 5); Kaalindi Misra (3); Elisabetta Mascia (3); Giordano Antonino (3, 6); Silvia Santoro (3); Laura Ferré (3, 6); Miryam Cannizzaro (3, 6); Melissa Sorosina (3); Roland Liblau (5); Massimo Filippi (6, 7, 8, 9); Ettore Mosca (10); Federica Esposito (3, 6); Giorgio Valentini (1, 11) and Elena Casiraghi (1, 11, 12)
Affiliations:
(1) AnacletoLab - Computer Science Department, Università degli Studi di Milano, Via Celoria 18, 20135, Milan, Italy
(2) European Commission, Joint Research Centre (JRC), Ispra, Italy
(3) Laboratory of Neurological Complex Disorders, Division of Neuroscience, Institute of Experimental Neurology (INSPE), IRCCS San Raffaele Scientific Institute, 20132 Milan, Italy
(4) CRC-SEP, Neurosciences Department, CHU Toulouse, France
(5) Infinity, CNRS, INSERM, Toulouse University, UPS, Toulouse, France
(6) Neurology and Neurorehabilitation Unit, IRCCS San Raffaele Scientific Institute, 20132 Milan, Italy
(7) Vita-Salute San Raffaele University, 20132 Milan, Italy
(8) Neurophysiology Unit, IRCCS San Raffaele Scientific Institute, 20132 Milan, Italy
(9) Neuroimaging Research Unit, Division of Neuroscience, Institute of Experimental Neurology (INSPE), IRCCS San Raffaele Scientific Institute, 20132 Milan, Italy
(10) Institute of Biomedical Technologies, National Research Council, Segrate (Milan), Italy
(11) CINI, Infolife National Laboratory, Roma, Italy
(12) Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, U.S.A.
Keyword(s):
Dimensionality Reduction, Intrinsic Dimensionality, Feature Selection, Feature Clustering, Omics Datasets.
Abstract:
Multi-omics data are of paramount importance in biomedicine, providing a comprehensive view of the processes underlying disease. They are high-dimensional and are hence affected by the so-called "curse of dimensionality", which ultimately leads to unreliable estimates. This calls for effective Dimensionality Reduction (DR) techniques that embed the high-dimensional data into a lower-dimensional space. Although effective DR methods have been proposed, the high dimension of the initial dataset often requires unsupervised Feature Selection (FS) techniques to be applied beforehand. Unfortunately, both unsupervised FS and DR techniques require the dimension of the lower-dimensional space to be provided. This is a crucial choice, for which no well-accepted solution has been defined yet. The Intrinsic Dimension (ID) of a dataset is defined as the minimum number of dimensions that allow the data to be represented without information loss. Therefore, the ID of a dataset is related to its informativeness and complexity. In this paper, after proposing a blocking ID estimation to leverage state-of-the-art (SOTA) ID estimation methods, we present our DR pipeline, whose subsequent FS and DR steps are guided by the ID estimate.
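To make the notion of ID concrete, the following is a minimal sketch of one possible ID estimator, not the blocking procedure or the SOTA estimators used in the paper. It assumes a maximum-likelihood variant of the two-nearest-neighbour idea: the ratio between each point's second and first nearest-neighbour distances is treated as Pareto-distributed, with the ID as its shape parameter. The function name `twonn_id` and the synthetic data are illustrative assumptions.

```python
import numpy as np


def twonn_id(X):
    """Estimate the intrinsic dimension of X (rows are samples).

    Assumption: ratios r2/r1 of second to first nearest-neighbour
    distances follow a Pareto law whose shape parameter is the ID,
    so the maximum-likelihood estimate is N / sum(log(r2 / r1)).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)       # clip tiny negative rounding errors
    np.fill_diagonal(d2, np.inf)      # exclude each point from its own neighbours
    # After partitioning at index 1, columns 0 and 1 hold the two smallest values.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                     # drop exact duplicates
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.sum(np.log(mu))


# Synthetic check: a 2-dimensional sheet embedded in 10 dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(1000, 2))
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # orthonormal 10 x 2 embedding
data = latent @ basis.T
estimated_id = twonn_id(data)
print(f"ambient dimension: {data.shape[1]}, estimated ID: {estimated_id:.2f}")
```

On this toy data the ambient dimension is 10 while the estimate should land close to 2, which illustrates how an ID estimate could suggest the target dimension for subsequent FS and DR steps.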