SuperPhy - A Pilot Resource for Integrated Phylogenetic and Epidemiological Analysis of Pathogens

Matthew Whiteside, Chad R. Laing, Akiff Manji, Victor P. J. Gannon

2014

Abstract

Advances in DNA sequencing technology have created new opportunities in fields such as clinical medicine and epidemiology, where real-time, genome-based surveillance and identification of phenotypic characteristics of bacterial pathogens are now possible. New analytical tools and infrastructure are needed to analyze these genomic datasets, store the data, and provide the essential biological information to end-users. We have implemented an online whole-genome analysis platform called SuperPhy that uses Panseq as an engine to compare bacterial genomes, Fisher's exact test to identify subgroup-specific loci, and FastTree to create maximum-likelihood trees. SuperPhy facilitates the upload of genomes for both private and public use. Analyses include: 1) genomic comparisons of clinical isolates and in silico identification of virulence and antimicrobial resistance genes; 2) associations between specific genotypes and phenotypic metadata (e.g., geospatial distribution, host, source); 3) identification of group-specific genome markers (presence/absence of specific genomic regions, and single-nucleotide polymorphisms) in bacterial populations; 4) manipulation of the display of phylogenetic trees; and 5) identification of statistically significant clade-specific markers. The SuperPhy pilot database currently contains genome sequences for 1063 Escherichia coli strains. Future work will extend SuperPhy to include multiple pathogens.
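
As an illustration of how Fisher's exact test can flag subgroup-specific loci from presence/absence data, the following is a minimal sketch and not SuperPhy's actual implementation. The function names, the locus names, the counts, and the 0.05 cutoff are all hypothetical, and the choice of a two-sided test is an assumption, since the abstract does not say which variant is used. No multiple-testing correction is applied here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two genome groups; columns are locus present / locus absent.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x "present" genomes in group 1
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed
    # one (small relative tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9)
    )


def group_specific_loci(counts, alpha=0.05):
    """Return {locus: p_value} for loci whose p-value falls below alpha.

    counts maps a locus name to (present_in_A, absent_in_A,
    present_in_B, absent_in_B). The alpha cutoff is an illustrative assumption.
    """
    hits = {}
    for locus, (a, b, c, d) in counts.items():
        p = fisher_exact_two_sided(a, b, c, d)
        if p < alpha:
            hits[locus] = p
    return hits


# Hypothetical presence/absence counts for two groups of 10 genomes each.
example_counts = {
    "locus_1": (8, 2, 1, 9),  # mostly present in group A, mostly absent in B
    "locus_2": (5, 5, 5, 5),  # evenly distributed, no association
}
significant = group_specific_loci(example_counts)
```

In this toy data only `locus_1` passes the cutoff. A real analysis over thousands of loci would normally also adjust the p-values for multiple comparisons.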



Paper Citation


in Harvard Style

Whiteside, M., Laing, C. R., Manji, A. and Gannon, V. P. J. (2014). SuperPhy - A Pilot Resource for Integrated Phylogenetic and Epidemiological Analysis of Pathogens. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014), ISBN 978-989-758-012-3, pages 40-48. DOI: 10.5220/0004798800400048


in Bibtex Style

@conference{bioinformatics14,
author={Matthew Whiteside and Chad R. Laing and Akiff Manji and Victor P. J. Gannon},
title={SuperPhy - A Pilot Resource for Integrated Phylogenetic and Epidemiological Analysis of Pathogens},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014)},
year={2014},
pages={40-48},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004798800400048},
isbn={978-989-758-012-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014)
TI - SuperPhy - A Pilot Resource for Integrated Phylogenetic and Epidemiological Analysis of Pathogens
SN - 978-989-758-012-3
AU - Whiteside, M.
AU - Laing, C. R.
AU - Manji, A.
AU - Gannon, V. P. J.
PY - 2014
SP - 40
EP - 48
DO - 10.5220/0004798800400048