parallelization.
In this paper we propose a distributed algorithm
for Frequency Sorting of DNA spectrograms that
achieves efficient and scalable distribution of the
computation, enabling significant speedup and
allowing the processing and analysis of large
genomic sequences, such as entire genomes. We
report the performance of the distributed Frequency
Sorting implementation in terms of speedup and
execution time, when applied to the entire human
chromosome 21 for several sets of parameters.
2 RELATED WORK
In (Anastassiou, 2000) an optimization procedure
improving upon traditional Fourier analysis
performance in detecting coding regions in DNA
sequences is introduced. Color spectrograms of
biomolecular sequences are used as visualization
tools providing information about the local nature,
structure and function of the sequences. Color maps
help to visually identify not only the protein coding
areas on both DNA strands, but also the coding direction
and the reading frame of each exon.
In (Sussillo, 2004) a slightly modified version of
the spectrogram development tool is applied to
explore characteristic patterns in the genomes of
various organisms (among them E. coli, M.
tuberculosis, C. elegans, D. melanogaster and H.
sapiens). Interesting features were detected, some of
which are common to all organisms and some of which
are unique to a particular organism.
In (Santo, 2007) the spectral analysis tool was
improved with hierarchical clustering in order to
optimize the viewing of spectra and to detect
patterns in large amounts of sequence data.
3 THE FREQUENCY SORTING METHOD
The Frequency Sorting method and several
algorithms used for sorting have been described in
detail in (Bucur, 2008). Frequency Sorting
comprises the following steps:
• Create a Spectrogram
• Apply a Binning Function and Build
Frequency Histograms
• Sorting
• Visualization using SpectroVideo
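The first two steps can be illustrated with a minimal sketch. It is not
the implementation described in (Bucur, 2008): it assumes the common
construction in which each nucleotide is mapped to a binary indicator
sequence and a sliding-window DFT is applied to it. The function names,
the use of DFT magnitudes, and the equal-width binning controlled by a
hypothetical `bin_width` parameter are all assumptions made for
illustration.

```python
import cmath

NUCLEOTIDES = "ACGT"

def spectrogram(seq, window, overlap):
    """Return spec[w][n][k]: DFT magnitude for window w, nucleotide n, frequency k.

    Assumption: one binary indicator sequence per nucleotide, and a plain
    (unoptimized) DFT over each sliding window.
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    spec = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        per_nucleotide = {}
        for n in NUCLEOTIDES:
            # 1.0 where the nucleotide occurs, 0.0 elsewhere
            indicator = [1.0 if c == n else 0.0 for c in chunk]
            per_nucleotide[n] = [
                abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / window)
                        for t, x in enumerate(indicator)))
                for k in range(window // 2 + 1)
            ]
        spec.append(per_nucleotide)
    return spec

def histogram(values, bin_width):
    """Count how many values fall into each equal-width bin (an assumed binning function)."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    return counts
```

A histogram for one frequency and nucleotide would then be built from the
values `spec[w][n][k]` collected over all windows `w`.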
In this paper we apply our Top Down Hierarchical
Sorting (TDHS) algorithm to sort the DNA
spectrogram. The intuitive visual representation
makes it easy to detect patterns. Once interesting
patterns have been detected, the actual Fourier
values, mapped to colors in the SpectroVideo,
should also be taken into account for an accurate
analysis.
4 THE DISTRIBUTED FS ALGORITHM
Combining Frequency Sorting with SpectroVideo
supports the discovery of novel frequency patterns
in large repositories of genomic sequences.
Applying Frequency Sorting to a large dataset is
data-intensive, requiring large amounts of
computation and memory. Additionally, a large
number of experiments, varying the values of several
parameters (window size, bin size, window overlap,
threshold of Fourier values), need to be run in order
to detect all relevant patterns. Therefore, an
algorithm needs to be designed that allows an
efficient distribution of the data and of the
computations, exploiting the potential for
parallelization.
In each iteration of FS, the bin sizes are
computed for each frequency and nucleotide
independently. The bin values are then compared
across all frequencies and nucleotides, and based on
the result of the comparison the domain of windows
is split and reordered. As histograms are built per
frequency and nucleotide, it is efficient to split
the data domain of Fourier values in the same way
among several processors and to build the
histograms in parallel.
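Because each histogram depends only on the Fourier values of a single
(frequency, nucleotide) pair, the work can be partitioned by frequency.
The sketch below illustrates this idea under stated assumptions: the
round-robin assignment, the thread pool standing in for separate
processors, and the names `assign_frequencies`, `build_histograms`,
`parallel_histograms` and `bin_width` are all hypothetical and are not
taken from the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_frequencies(num_frequencies, num_workers):
    """Fixed round-robin assignment of frequency indices to workers (an assumption)."""
    return [list(range(w, num_frequencies, num_workers)) for w in range(num_workers)]

def build_histograms(local_values, bin_width):
    """Build one histogram per (frequency, nucleotide) key held by this worker."""
    result = {}
    for key, window_values in local_values.items():
        counts = {}
        for v in window_values:
            b = int(v // bin_width)
            counts[b] = counts.get(b, 0) + 1
        result[key] = counts
    return result

def parallel_histograms(values, num_frequencies, num_workers, bin_width):
    """values[(k, n)] holds the Fourier values at frequency k, nucleotide n, over all windows.

    Each worker receives only the values of its assigned frequencies;
    the per-worker results are merged afterwards.
    """
    assignment = assign_frequencies(num_frequencies, num_workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = []
        for freqs in assignment:
            owned = set(freqs)
            # Slice of the data domain that belongs to this worker
            local = {key: vals for key, vals in values.items() if key[0] in owned}
            futures.append(pool.submit(build_histograms, local, bin_width))
        for f in futures:
            merged.update(f.result())
    return merged
```

Threads are used here only to keep the sketch self-contained; on a
cluster, each slice `local` would be sent once to a separate node.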
In our algorithm, a distributor node is
responsible for distributing the sub-domains of the
dataset among several worker nodes. The largest
source of overhead in the algorithm is the initial
distribution, to the worker nodes, of the Fourier
values corresponding to their assigned frequencies.
Each worker node is assigned a set of one or more
frequencies, for which it computes the bin sizes and
builds the histograms at each iteration step, and
receives the Fourier values at those frequencies across
all windows. The resulting histograms are compared
among the worker nodes and a decision concerning
the split of the domain is taken. To minimize the
overhead of data transfer, the frequencies are
assigned at the beginning of the execution and do
not change afterwards. The Fourier values are distributed per
BIOINFORMATICS 2010 - International Conference on Bioinformatics