Consensus Clustering for Cancer Gene Expression Data - Large-Scale Analysis using Evidence Accumulation Approach
Isidora Šašić, Sanja Brdar, Tatjana Lončar-Turukalo, Helena Aidos, Ana Fred
2017
Abstract
Clustering algorithms are widely used on patient tissue samples to group and visualize microarray data. High dimensionality and probe-specific noise make selecting an appropriate clustering algorithm a difficult task. This study presents a large-scale analysis of three clustering algorithms, k-means, hierarchical clustering (HC) and evidence accumulation clustering (EAC), on thirty-five cancer gene expression data sets selected to benchmark clustering performance. Separate performance analyses were carried out on data sets from the Affymetrix and cDNA chip platforms to examine the possible influence of the microarray technology. The study revealed that no consistent algorithm ranking can be inferred, though in general EAC offered the best compromise between adjusted Rand index (ARI) and variance. The results also indicated that the variance of ARI under repeated k-means initializations offers useful information on whether more complex clustering techniques are needed. If repeated k-means converges to the same partition, and HC confirms that partition, there is no need to run EAC. Under moderately or highly variable ARI in repeated k-means, however, EAC should be used to reduce the uncertainty of clustering and unveil the data structure.
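The decision rule in the abstract can be illustrated with a minimal sketch, using only the Python standard library. It is not the authors' implementation. The toy partitions stand in for repeated k-means runs, the variance cutoff `var_threshold` is a hypothetical value, and the consensus step is simplified to a fixed-threshold single-link cut of the co-association matrix, which may differ from the EAC variant used in the paper.

```python
from collections import Counter
from math import comb
from statistics import pvariance


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label sequences (pair-counting form)."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def co_association(partitions):
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                counts[i][j] += int(labels[i] == labels[j])
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_labels(coassoc, threshold=0.5):
    """Simplified consensus: connected components of the graph linking samples
    co-clustered in more than `threshold` of the runs (assumed cut level)."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def needs_consensus(partitions, reference, var_threshold=0.01):
    """Flag high ARI variance across runs; `var_threshold` is hypothetical."""
    aris = [adjusted_rand_index(p, reference) for p in partitions]
    return pvariance(aris) > var_threshold, aris


# Toy stand-ins for three k-means runs on six samples (illustrative only).
reference = [0, 0, 0, 1, 1, 1]
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition, cluster labels permuted
    [0, 0, 1, 1, 1, 1],  # one sample assigned to the other cluster
]

flag, aris = needs_consensus(runs, reference)
consensus = consensus_labels(co_association(runs)) if flag else runs[0]
print(flag, [round(x, 3) for x in aris], consensus)
```

In this toy case the third run disagrees with the other two, so the ARI values vary and the sketch falls through to the consensus step, where the pairwise voting in the co-association matrix outweighs the single misassigned sample.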
Paper Citation
in Harvard Style
Šašić I., Brdar S., Lončar-Turukalo T., Aidos H. and Fred A. (2017). Consensus Clustering for Cancer Gene Expression Data - Large-Scale Analysis using Evidence Accumulation Approach. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017), ISBN 978-989-758-214-1, pages 176-183. DOI: 10.5220/0006174501760183
in Bibtex Style
@conference{bioinformatics17,
author={Isidora Šašić and Sanja Brdar and Tatjana Lončar-Turukalo and Helena Aidos and Ana Fred},
title={Consensus Clustering for Cancer Gene Expression Data - Large-Scale Analysis using Evidence Accumulation Approach},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)},
year={2017},
pages={176-183},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006174501760183},
isbn={978-989-758-214-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)
TI - Consensus Clustering for Cancer Gene Expression Data - Large-Scale Analysis using Evidence Accumulation Approach
SN - 978-989-758-214-1
AU - Šašić I.
AU - Brdar S.
AU - Lončar-Turukalo T.
AU - Aidos H.
AU - Fred A.
PY - 2017
SP - 176
EP - 183
DO - 10.5220/0006174501760183