Authors:
D.-T. Phan 1; P. Leray 1 and C. Sinoquet 2
Affiliations:
1 Polytech/ University of Nantes, France; 2 Faculty of Sciences and University of Nantes, France
Keyword(s):
Linkage Disequilibrium, Genome-wide Association Study, Multilocus Association Study, Data Dimension Reduction, Probabilistic Graphical Model, Bayesian Network.
Related Ontology Subjects/Areas/Topics:
Bioinformatics; Biomedical Engineering; Biostatistics and Stochastic Models; Data Mining and Machine Learning
Abstract:
Association genetics, and in particular genome-wide association studies (GWASs), aim at elucidating the etiology of complex genetic diseases. In the domain of association genetics, machine learning provides an
appealing alternative framework to standard statistical approaches. Pioneering work (Mourad et al., 2011) proposed the forest of latent trees (FLTM) to model genetic data at the genome scale. The FLTM is a
hierarchical Bayesian network with latent variables. A key to FLTM construction is the recursive clustering of variables, in a bottom-up subsuming process. In this paper, we study the impact of the choice of the clustering
method to be plugged in the FLTM learning algorithm, in a GWAS context. Using a real GWAS data set describing 41400 variables for each of 3004 controls and 2005 individuals affected by Crohn’s disease, we
compare the influence of three clustering methods. Data dimension reduction and ability to split or group putative causal SNPs in agreement with the underlying biological reality are analyzed. To assess the risk
of missing significant association results through subsumption, we also compare the methods through the corresponding FLTM-driven GWASs. In the GWAS context and in this framework, the choice of the clustering
method does not affect the satisfactory performance of the downstream application, in terms of both power and detection of false positive associations.