Metrics for Clustering Comparison in Bioinformatics

Giovanni Rossi

2016

Abstract

Motivated by a problem in bioinformatics, this work analyses alternative metrics between partitions. From both theoretical and applied perspectives, a useful distance between any two partitions is HD, which counts the atoms finer than exactly one of the two. While faithfully reproducing the traditional Hamming distance between subsets, HD is highly sensitive and is computable through scalar products between Boolean vectors. It deals properly with complements and axiomatically resembles the entropy-based variation of information (VI) distance. Entire families of metrics, including HD and VI, are obtained as minimal paths in the weighted graph given by the Hasse diagram: submodular weighting functions yield path-based distances that visit the join of the two partitions, whereas supermodular ones yield distances that visit the meet. This yields an exact, rather than heuristic, approach to the consensus partition problem, which is a combinatorial optimization problem.
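As an illustration of the scalar-product computation mentioned in the abstract, the following is a minimal Python sketch, not the paper's own code. It assumes the standard reading of "atom" in the partition lattice: a partition whose only non-singleton block is a pair {i, j}, so that an atom is finer than a partition exactly when i and j share a block. The function names (pair_vector, hd, hd_by_counting) and the encoding of a partition as a list of sets over 0..n-1 are hypothetical choices made for this example.

```python
from itertools import combinations


def pair_vector(partition, n):
    """Boolean vector indexed by the pairs {i, j} of 0..n-1.

    An entry is 1 when i and j lie in the same block, i.e. when the
    atom with block {i, j} is finer than the partition (assumed reading).
    The partition is assumed to cover every element of 0..n-1.
    """
    label = {}
    for b, block in enumerate(partition):
        for x in block:
            label[x] = b
    return [int(label[i] == label[j]) for i, j in combinations(range(n), 2)]


def dot(u, v):
    """Plain scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hd(p, q, n):
    """Atoms finer than exactly one of p, q, via scalar products."""
    u, v = pair_vector(p, n), pair_vector(q, n)
    return dot(u, u) + dot(v, v) - 2 * dot(u, v)


def hd_by_counting(p, q, n):
    """Same quantity by direct counting, used here as a cross-check."""
    u, v = pair_vector(p, n), pair_vector(q, n)
    return sum(a != b for a, b in zip(u, v))


P = [{0, 1, 2}, {3}]
Q = [{0, 1}, {2, 3}]
print(hd(P, Q, 4))  # 3 atoms below P, 2 below Q, 1 below both: 3 + 2 - 2 = 3
```

The shared term dot(u, v) counts the atoms finer than both partitions, which are the atoms finer than their common refinement. Under the assumed reading, this is how HD fits the abstract's description of a path-based distance visiting the meet.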


Paper Citation


in Harvard Style

Rossi G. (2016). Metrics for Clustering Comparison in Bioinformatics. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-173-1, pages 299-308. DOI: 10.5220/0005707102990308


in Bibtex Style

@conference{icpram16,
author={Giovanni Rossi},
title={Metrics for Clustering Comparison in Bioinformatics},
booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2016},
pages={299-308},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005707102990308},
isbn={978-989-758-173-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Metrics for Clustering Comparison in Bioinformatics
SN - 978-989-758-173-1
AU - Rossi G.
PY - 2016
SP - 299
EP - 308
DO - 10.5220/0005707102990308