# Modeling Genetical Data with Forests of Latent Trees for Applications in Association Genetics at a Large Scale - Which Clustering Method should Be Chosen?

### D.-T. Phan, P. Leray, C. Sinoquet

#### Abstract

Association genetics, and in particular genome-wide association studies (GWASs), aims at elucidating the etiology of complex genetic diseases. In the domain of association genetics, machine learning provides an appealing alternative framework to standard statistical approaches. Pioneering works (Mourad et al., 2011) have proposed the forest of latent trees (FLTM) to model genetical data at the genome scale. The FLTM is a hierarchical Bayesian network with latent variables. A key to FLTM construction is the recursive clustering of variables, in a bottom-up subsuming process. In this paper, we study the impact of the choice of the clustering method to be plugged into the FLTM learning algorithm, in a GWAS context. Using a real GWAS data set describing 41400 variables for each of 3004 controls and 2005 individuals affected by Crohn’s disease, we compare the influence of three clustering methods. We analyze data dimension reduction and the ability to split or group putative causal SNPs in agreement with the underlying biological reality. To assess the risk of missing significant association results through subsumption, we also compare the methods through the corresponding FLTM-driven GWASs. In the GWAS context and in this framework, the choice of the clustering method does not affect the satisfactory performance of the downstream application, both in power and in detection of false positive associations.

#### References

- Abel, H. and Thomas, A. (2011). Accuracy and Computational Efficiency of a Graphical Modeling Approach to Linkage Disequilibrium Estimation. Statistical Applications in Genetics and Molecular Biology, 10(1):Article 5.
- Ackerman, M. and Ben-David, S. (2009). Clusterability: a Theoretical Study. In van Dyk, D. and Welling, M., editors, Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS09), Journal of Machine Learning Research, Proceedings Track, volume 5, pages 1-8.
- Balding, D. (2006). A Tutorial on Statistical Methods for Population Association Studies. Nature Reviews Genetics, 7(10):781-791.
- Barrett, J., Hansoul, S., Nicolae, D., et al. (2008). Genomewide Association Defines more than 30 Distinct Susceptibility Loci for Crohn's Disease. Nature Genetics, 40(8):955-962.
- Ben-Dor, A., Shamir, R., and Yakhini, Z. (1999). Clustering Gene Expression Patterns. In Third Annual International Conference on Research in Computational Molecular Biology (RECOMB99), pages 33-42.
- Browning, B. and Browning, S. (2007). Efficient Multilocus Association Testing for Whole Genome Association Studies Using Localized Haplotype Clustering. Genetic Epidemiology, 31:365-375.
- Cahill, J. (2002). Error-Tolerant Clustering of Gene Microarray Data. Bachelors Honors Thesis, Boston College, Massachusetts.
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Second International Conference on Knowledge Discovery and Data Mining (KDD96), pages 226-231.
- Fowlkes, E. and Mallows, C. (1983). A Method for Comparing Two Hierarchical Clusterings. Journal of the American Statistical Association, 78(383):553-569.
- Gabriel, S., Schaffner, S., Moore, J., et al. (2002). The Structure of Haplotype Blocks in the Human Genome. Science, 296(5576):2225-2229.
- Gibbs, R., Belmont, J., Hardenbol, P., et al. (2003). The International HapMap Project. Nature, 426(6968):789-796.
- Hubert, L. and Arabie, P. (1985). Comparing Partitions. Journal of Classification, 2(1):193-218.
- Meila, M. (2005). Comparing Clusterings: an Axiomatic View. In Twenty-second International Conference on Machine Learning (ICML05), ACM, pages 577-584.
- Mirkin, B. (1998). Mathematical Classification and Clustering: from How to What and Why. Classification, Data Analysis, and Data Highways, 690:172-181.
- Mourad, R., Sinoquet, C., and Leray, P. (2011). A Hierarchical Bayesian Network Approach for Linkage Disequilibrium Modeling and Data-dimensionality Reduction Prior to Genome-wide Association Studies. BMC Bioinformatics, 12:16.
- Pritchard, J. and Przeworski, M. (2001). Linkage Disequilibrium in Humans: Models and Data. The American Journal of Human Genetics, 69(1):1-14.
- Purcell, S., Neale, B., Todd-Brown, K., et al. (2007). PLINK: a Toolset for Whole-genome Association and Population-based Linkage Analysis. The American Journal of Human Genetics, 81(3):559-575.
- Rand, W. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846-850.
- The 1000 Genomes Project Consortium (2010). A Map of Human Genome Variation from Population-scale Sequencing. Nature, 467(7319):1061-1073.
- Verzilli, C., Stallard, N., and Whittaker, J. (2006). Bayesian Graphical Models for Genome-wide Association Studies. The American Journal of Human Genetics, 79:100-112.
- Wang, N., Akey, J., Zhang, K., Chakraborty, R., and Jin, L. (2002). Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: the Interplay of Population History, Recombination, and Mutation. The American Journal of Human Genetics, 71(5):1227-1234.
- WTCCC (2007). Wellcome Trust Case Control Consortium. Genome-wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls. Nature, 447(7145):661-678.

#### Paper Citation

#### in Harvard Style

Phan D., Leray P. and Sinoquet C. (2015). **Modeling Genetical Data with Forests of Latent Trees for Applications in Association Genetics at a Large Scale - Which Clustering Method should Be Chosen?** In *Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)*, ISBN 978-989-758-070-3, pages 5-16. DOI: 10.5220/0005179800050016

#### in Bibtex Style

@conference{bioinformatics15,

author={D.-T. Phan and P. Leray and C. Sinoquet},

title={Modeling Genetical Data with Forests of Latent Trees for Applications in Association Genetics at a Large Scale - Which Clustering Method should Be Chosen?},

booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},

year={2015},

pages={5-16},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0005179800050016},

isbn={978-989-758-070-3},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)

TI - Modeling Genetical Data with Forests of Latent Trees for Applications in Association Genetics at a Large Scale - Which Clustering Method should Be Chosen?

SN - 978-989-758-070-3

AU - Phan D.

AU - Leray P.

AU - Sinoquet C.

PY - 2015

SP - 5

EP - 16

DO - 10.5220/0005179800050016