String Searching in Referentially Compressed Genomes

Sebastian Wandelt; Ulf Leser

doi:10.5220/0004143400950102

2012

Abstract

Background: Improved sequencing techniques have produced large amounts of biological sequence data, and storing this data efficiently is one of the main challenges in managing it. Referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have recently gained a lot of interest in this field. So far, however, sequences always have to be decompressed before they can be analyzed. There is a need for algorithms that work on compressed data directly and avoid costly decompression.

Summary: We address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, does not need an index for each genome, and uses only partial decompression.

Results: For large sets of genomes, our string search algorithm for referentially compressed genomes performs exact string matching faster than using an index structure, e.g. a suffix tree, for each genome, especially for short queries. We think that this is an important step towards space- and runtime-efficient management of large biological data sets.
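
To make the idea of searching with only partial decompression concrete, the following Python sketch may help. It is not the algorithm from the paper, whose details are not given in this abstract. The entry format (`("ref", start, length)` and `("raw", text)`), the function names, and the plain `str.find` scan standing in for an index over the reference are all assumptions made for illustration. The sketch finds the pattern in the reference once, maps hits that lie fully inside a copied block into the compressed sequence, and decompresses only a small window around each block boundary to catch matches that span two or more blocks.

```python
from typing import List, Tuple, Union

# Hypothetical entry format for this sketch (not the paper's encoding):
#   ("ref", start, length)  copy `length` characters of the reference at `start`
#   ("raw", text)           literal characters stored verbatim
Entry = Union[Tuple[str, int, int], Tuple[str, str]]


def entry_len(e: Entry) -> int:
    return e[2] if e[0] == "ref" else len(e[1])


def layout(entries: List[Entry]) -> Tuple[List[int], int]:
    """Start offset of each entry in the uncompressed sequence, plus total length."""
    offsets, pos = [], 0
    for e in entries:
        offsets.append(pos)
        pos += entry_len(e)
    return offsets, pos


def find_all(text: str, pattern: str) -> List[int]:
    """Start positions of all (possibly overlapping) occurrences of pattern."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def extract(reference: str, entries: List[Entry], offsets: List[int],
            lo: int, hi: int) -> str:
    """Partially decompress only positions [lo, hi) of the compressed sequence."""
    parts = []
    for e, off in zip(entries, offsets):
        a, b = max(lo, off), min(hi, off + entry_len(e))
        if a >= b:
            continue  # entry does not overlap the requested window
        if e[0] == "ref":
            parts.append(reference[e[1] + a - off:e[1] + b - off])
        else:
            parts.append(e[1][a - off:b - off])
    return "".join(parts)


def decompress(reference: str, entries: List[Entry]) -> str:
    """Full decompression, used here only to check the search results."""
    offsets, total = layout(entries)
    return extract(reference, entries, offsets, 0, total)


def search(reference: str, entries: List[Entry], pattern: str) -> List[int]:
    """Exact matches of pattern in the compressed sequence, without full decompression."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    offsets, total = layout(entries)
    # Searched once; the same hits can be reused for every genome that
    # is compressed against this reference.
    ref_hits = find_all(reference, pattern)
    results = set()
    # 1. Matches that lie completely inside one entry.
    for e, off in zip(entries, offsets):
        if e[0] == "ref":
            start, length = e[1], e[2]
            for h in ref_hits:
                if start <= h and h + m <= start + length:
                    results.add(off + h - start)
        else:
            for h in find_all(e[1], pattern):
                results.add(off + h)
    # 2. Matches that cross an entry boundary: decompress only the
    #    m - 1 characters on each side of the boundary.
    for b in offsets[1:]:
        lo, hi = max(0, b - (m - 1)), min(total, b + (m - 1))
        for h in find_all(extract(reference, entries, offsets, lo, hi), pattern):
            results.add(lo + h)
    return sorted(results)
```

As a usage example under the same assumptions, with `reference = "ACGTACGTTTGACA"` and `entries = [("ref", 0, 6), ("raw", "GG"), ("ref", 8, 6)]`, calling `search(reference, entries, "CGGT")` finds a match that starts in the first copied block and ends in the second one, while only a six-character window around each boundary is decompressed. The linear scans in `extract` and over `ref_hits` keep the sketch short; a practical implementation would presumably replace them with an index over the reference and a binary search over entry offsets.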

Paper Citation

in Harvard Style

Wandelt S. and Leser U. (2012). String Searching in Referentially Compressed Genomes. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 95-102. DOI: 10.5220/0004143400950102

in Bibtex Style

@conference{kdir12,
author={Sebastian Wandelt and Ulf Leser},
title={String Searching in Referentially Compressed Genomes},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={95-102},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004143400950102},
isbn={978-989-8565-29-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - String Searching in Referentially Compressed Genomes
SN - 978-989-8565-29-7
AU - Wandelt S.
AU - Leser U.
PY - 2012
SP - 95
EP - 102
DO - 10.5220/0004143400950102