In our research, we will mainly focus on functionality and efficiency. Good compression (effectiveness) is important but might be inefficient in a network context. In general, full-file compression offers higher compression ratios than block-based compression. However, full-file compression requires transmitting the complete file, even when only a small part of the data is needed. As a result, much of the transmitted data is never used, causing unnecessary network load.
To maximise efficiency, a novel compression algorithm will be designed that splits the data into small blocks. Each of these blocks will be encoded separately using prediction tools, which predict blocks either with a reference (a previously encoded block) or without a reference. This block-based approach is expected to reduce the effectiveness of the compression algorithm, but will offer higher efficiency and more functionality; a sketch of the idea follows below.
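To make the block-based idea concrete, the following is a minimal sketch of an encoder that chooses, per block, between intra prediction (no reference) and inter prediction (a previously encoded block as reference). The block size, the XOR residual and the use of zlib as the entropy coder are our own illustrative assumptions, not the actual design.

```python
import zlib

BLOCK_SIZE = 4096  # illustrative block size (assumption)

def encode_blocks(sequence: bytes) -> list[tuple[str, bytes]]:
    """Split the input into blocks and encode each one independently.

    Every block is tried both without a reference (intra) and, when
    possible, against the previously encoded block (inter); the
    smaller encoding wins.
    """
    encoded = []
    previous = None
    for start in range(0, len(sequence), BLOCK_SIZE):
        block = sequence[start:start + BLOCK_SIZE]
        # Intra prediction: compress the block on its own.
        intra = zlib.compress(block)
        best = ("intra", intra)
        if previous is not None and len(previous) == len(block):
            # Inter prediction: compress the residual against the
            # reference block; similar blocks yield tiny residuals.
            residual = bytes(a ^ b for a, b in zip(block, previous))
            inter = zlib.compress(residual)
            if len(inter) < len(intra):
                best = ("inter", inter)
        encoded.append(best)
        previous = block
    return encoded
```

Because every block carries its own mode and payload, a decoder needs only the block itself plus, for inter-coded blocks, its reference, which is what enables the random access and streaming features listed below.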
For this block-based compression scheme, we will investigate features such as:
• Random Access, to allow transmission of partial data (see the index sketch after this list);
• Streaming, to offer live access to sequencing data;
• Parallel Processing, to speed up compression and decompression;
• Metadata, to facilitate data management and to support random access;
• Encryption, to protect privacy;
• Editing and Analysis in the Compressed Domain;
• Error Detection and Correction, to ensure that data transmission is error-free;
• Load Balancing, choosing a subset of prediction tools to balance computing time, compression ratio and network load;
• Smart Prediction Tool Selection, to optimise compression efficiency by selecting a subset of compression tools based on the characteristics of different species' genomes.
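As an illustration of how metadata can support random access, the sketch below builds a simple block index over the output of the encoder sketched above and selects only the blocks needed for a region of interest. The entry layout and the walk-back to the nearest intra-coded block are illustrative assumptions.

```python
def build_index(encoded_blocks):
    """Record mode, byte offset and size for every encoded block."""
    index, offset = [], 0
    for mode, payload in encoded_blocks:
        index.append({"mode": mode, "offset": offset, "size": len(payload)})
        offset += len(payload)
    return index

def blocks_for_region(index, start, end, block_size=4096):
    """Return the index entries needed to decode bases [start, end).

    Inter-coded blocks depend on their predecessor, so we walk back
    to the nearest intra-coded block (the random-access point); only
    these blocks, not the whole file, have to be transmitted.
    """
    first, last = start // block_size, (end - 1) // block_size
    while first > 0 and index[first]["mode"] == "inter":
        first -= 1
    return index[first:last + 1]
```

Such an index can be stored as part of the file's metadata; a client then fetches the listed byte ranges instead of the complete file, which directly addresses the network-load problem raised above.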
Despite the focus on efficiency and functionality, it is clear that effectiveness must not be neglected. During our initial research, we discovered that there is no reference benchmark set to compare algorithms (Wandelt et al., 2013). It is therefore a side goal to define such a reference benchmark set, which can be shared amongst researchers to compare different solutions in terms of speed, memory usage and compression ratio. To be generally accepted, the benchmark set will need to consist of several types of genomic data:
• Reads (with quality information) and full sequences;
• Aligned reads and unaligned reads;
• Genes, chromosomes and full genomes;
• Bacteria, viruses, humans and other species.
Note that the ideal genomic data compression tool supports all of these types of genomic data.
3 RESEARCH PROBLEM
In this doctoral research, we aim to answer the following overall research question:
“How can we apply existing technologies in media compression, storage, management and analysis to genomic data in such a way that they improve on current technologies in terms of compression, transmission and/or analysis efficiency?”
The above research question can be broken up into a number of smaller research questions:
“How can we design a file format for the compression of genomic data in such a way that it supports all features currently available for, e.g., video files? These features include, amongst others, random access, streaming capabilities, metadata, encryption, DRM, load balancing, compressed-domain manipulation, parallel processing, and error detection and correction.”
“How does random access affect the compression efficiency and network load?”
“How does error correction affect the compression efficiency? Does its overhead consume the compression gain?”
“Which set of prediction and coding tools provides the highest compression ratio? Can these tools compensate for the loss of efficiency due to the use of block-based compression?”
“How can we integrate the proposed file format with existing technologies so that existing file-streaming solutions can be re-used?”