Authors:
Sebastian Wandelt and Ulf Leser
Affiliation:
Humboldt-Universität zu Berlin, Germany
Keyword(s):
Genome Compression, Referential Compression, String Search.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; BioInformatics & Pattern Discovery; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Methodologies and Technologies; Operational Research; Optimization; Symbolic Systems
Abstract:
Background: Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained considerable interest in this field. So far, however, sequences always have to be decompressed prior to analysis. There is a need for algorithms that work directly on compressed data, avoiding costly decompression.
Summary: In our work, we address this problem by proposing an algorithm for exact string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and using only partial decompression.
Results: Our string search algorithm for referentially compressed genomes performs exact string matching over large sets of genomes faster than using an index structure, such as a suffix tree, for each genome, especially for short queries. We think that this is an important step towards space- and runtime-efficient management of large biological data sets.
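The idea of searching a referentially compressed sequence with only partial decompression can be illustrated with a minimal sketch. This is not the authors' algorithm: the entry format (a `(start, length)` pair copying from the reference, or a literal string), all function names, and the naive `find_all` scan that stands in for a reference-side index are assumptions made for illustration. The sketch finds the query once in the reference, maps those hits into each reference-copy entry, and decompresses only a small window around each entry boundary.

```python
from typing import List, Set, Tuple, Union

# Hypothetical entry format (an assumption, not the paper's encoding):
# (start, length) copies ref[start:start+length]; a str is a literal.
Entry = Union[Tuple[int, int], str]


def entry_len(e: Entry) -> int:
    return len(e) if isinstance(e, str) else e[1]


def entry_text(ref: str, e: Entry, lo: int, hi: int) -> str:
    """Decompress only characters lo..hi-1 of a single entry."""
    if isinstance(e, str):
        return e[lo:hi]
    return ref[e[0] + lo : e[0] + hi]


def decompress(ref: str, entries: List[Entry]) -> str:
    """Full decompression, used here only to check the search results."""
    return "".join(entry_text(ref, e, 0, entry_len(e)) for e in entries)


def find_all(text: str, q: str) -> List[int]:
    """All (possibly overlapping) start positions of q in text."""
    hits, i = [], text.find(q)
    while i != -1:
        hits.append(i)
        i = text.find(q, i + 1)
    return hits


def search(ref: str, entries: List[Entry], q: str) -> List[int]:
    """Exact matches of q in the compressed sequence, without full decompression."""
    if not q:
        raise ValueError("query must be non-empty")
    m = len(q)
    ref_hits = find_all(ref, q)  # done once; reusable for every genome
    starts, pos = [], 0
    for e in entries:
        starts.append(pos)
        pos += entry_len(e)
    total = pos
    found: Set[int] = set()
    # Case 1: matches lying entirely inside one entry.
    for e, off in zip(entries, starts):
        if isinstance(e, str):
            found.update(off + h for h in find_all(e, q))
        else:
            s, length = e
            found.update(off + h - s for h in ref_hits
                         if s <= h and h + m <= s + length)
    # Case 2: matches spanning an entry boundary; decompress only a
    # window of at most 2 * (m - 1) characters around each boundary.
    for b in starts[1:]:
        lo, hi = max(0, b - (m - 1)), min(total, b + (m - 1))
        window = "".join(
            entry_text(ref, e, max(lo - off, 0), min(hi - off, entry_len(e)))
            for e, off in zip(entries, starts)
            if off < hi and off + entry_len(e) > lo
        )
        found.update(lo + h for h in find_all(window, q))
    return sorted(found)


# Example: the sequence equals the reference with "GT" replaced by "GG".
ref = "ACGTACGTTTGACCA"
entries = [(0, 6), "GG", (8, 7)]
print(search(ref, entries, "CGG"))
```

In this sketch the per-genome work depends on the number of entries and reference hits rather than on the genome length, which is the intuition behind avoiding a separate index for each genome; a real implementation would replace the linear scans with an index over the reference.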