for ACdisc is 0.1, which causes the most common tenth of the
AC-probes to be discarded. Lower ACdisc values cause more of the
common AC-probes to be taken into account. As a result, the average
AC-probe distinguishes less between blocks, which is not desirable.
With higher values, fewer AC-probes are taken into account, so fewer
approved AC-probes are found in the query, which makes the initial
search phase less tolerant of differences between the query and the
text. Suitable values in our experiments have been in the range
[0.001, 0.1].
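
The effect of ACdisc can be illustrated with a minimal sketch. It assumes that an AC-probe is the dinucleotide AC followed by a fixed number of nucleotides, and that "most common" refers to a ranking of the distinct probes by their occurrence counts in the text. The function names and the exact ranking rule are hypothetical and are not taken from the GAST implementation.

```python
from collections import Counter


def extract_ac_probes(seq, tail_len=10):
    """Return every window that starts with 'AC' and has tail_len further bases.

    Assumption: the dinucleotide AC is the prefix of the probe.
    """
    probe_len = 2 + tail_len
    return [
        seq[i:i + probe_len]
        for i in range(len(seq) - probe_len + 1)
        if seq.startswith("AC", i)
    ]


def approved_probes(text, ac_disc=0.1, tail_len=10):
    """Discard the most common fraction ac_disc of the distinct AC-probes.

    Assumption: the fraction is taken over distinct probes ranked by count.
    """
    counts = Counter(extract_ac_probes(text, tail_len))
    # Rank distinct probes from most to least common.
    ranked = [probe for probe, _ in counts.most_common()]
    n_discard = int(len(ranked) * ac_disc)
    return set(ranked[n_discard:])
```

Under these assumptions, a larger ac_disc removes more of the frequent probes, so the approved set shrinks and each remaining probe is more specific to the blocks in which it occurs.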
The parameter mp defines the minimum amount of approved AC-probes
that a block must have in common with the query sequence in order to
be considered potential. The default value for mp is 0.1. Lower mp
values allow blocks that differ more from the query sequence to be
taken into account. Higher mp values leave less room for differences
between the approved AC-probes of the block and the query sequence.
Together, the parameters ACdisc and mp have a major effect on run
times, especially when very low values are chosen. Combinations of
ACdisc values of 0.003 or smaller with mp values of 0.05 or smaller
should be avoided, as they increase run times by 40–100 orders of
magnitude. If the user wishes to take initially less promising blocks
into the final alignment stage, we suggest using ACdisc = 0.01 and
mp = 0.07. More such balanced combinations are shown in Section 3.
As the final cutoff affecting the initial search phase, each block
has to contain at least a portion rp of the maximum number of
approved AC-probes matching between a single block and the query
sequence in order to be considered potential. The default value for
rp is 0.8. As the occurrences of approved AC-probes can be scattered
over distant, unrelated regions within a block, values very close to
1 may lead to situations where blocks that would yield more
satisfactory alignments with the query are discarded. In our
experiments, raising the rp value above 0.8 has decreased the run
times by only 10–20%, and we do not advise using values higher than
this.
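
The two block-level cutoffs, mp and rp, can be sketched together as follows. This is an illustration only: it assumes that mp is applied as a fraction of the approved AC-probes found in the query, and that rp is applied relative to the best-matching block. The function name and the data layout (a dictionary from block identifier to the set of approved probes in that block) are hypothetical.

```python
def potential_blocks(block_probes, query_probes, mp=0.1, rp=0.8):
    """Return the identifiers of blocks that pass both the mp and rp cutoffs.

    block_probes: dict mapping a block id to the set of approved AC-probes
                  occurring in that block (hypothetical layout).
    query_probes: set of approved AC-probes found in the query.
    """
    if not block_probes or not query_probes:
        return []
    # Number of approved AC-probes each block shares with the query.
    shared = {b: len(p & query_probes) for b, p in block_probes.items()}
    best = max(shared.values())
    # Assumption: mp is a fraction of the query's approved probes.
    min_shared = mp * len(query_probes)
    return sorted(
        b for b, n in shared.items()
        if n > 0 and n >= min_shared and n >= rp * best
    )
```

In this sketch, lowering rp admits blocks that share clearly fewer probes with the query than the best block does, which corresponds to the behaviour described above for blocks whose matches are concentrated in one region.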
The alignment phase has one adjustable parameter, mt, which is the
minimum number of probe-hits a structured set must have in order to
be considered to correspond to an alignment between the query and the
text sequence. The default value for mt is 60. This parameter
effectively defines the minimum length of a constructed alignment. If
the user wants to take into account very short alignments between the
query sequence and the database text block, smaller values can be
used. In our experience, changing the value of mt from 60 to 10
increases the run time by slightly less than 10%. However, if shorter
alignments are not explicitly desired, the default value is
recommended, as smaller values will cause a larger number of short,
probably less interesting alignments to be output.
Additionally, a few values are built in and affect how our tool
functions. One of these is the length of the AC-probes, which was
chosen to be 10 nucleotides in addition to the dinucleotide AC.
Increasing the length of the probe increases the size of the index
and requires the database sequence to have longer regions identical
to the query sequence. Shorter AC-probes occur more commonly, causing
less distinction between blocks.
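
The trade-off in probe length can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a uniform random sequence over the four nucleotides, which real genomes do not follow, so the figures are indicative only. The function names are hypothetical, and the default block size of 500,000 is the one used in Section 3.

```python
def distinct_probes(tail_len):
    """Number of possible AC-probes when tail_len bases follow the fixed AC."""
    return 4 ** tail_len


def expected_hits_per_block(tail_len, block_size=500_000):
    """Expected occurrences of one specific AC-probe in a random block.

    Assumption: bases are independent and uniformly distributed.
    """
    probe_len = 2 + tail_len
    positions = block_size - probe_len + 1
    return positions / 4 ** probe_len
```

With 10 nucleotides after AC there are 4^10 = 1,048,576 possible probes, and under the random model a given probe is expected in a block about 0.03 times. Shortening the tail to 8 nucleotides raises this to roughly 0.48, which illustrates why shorter probes distinguish less between blocks.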
Similar balancing is required for the length and number of k-mers
included in a single BG-probe, which is used by the BG algorithm in
the alignment phase. The value of q must be small enough for the size
of the k-mer encoded database text files to be manageable. However,
it is beneficial to have BG-probes that have a high probability of
being unique in a block. Our choices for the lengths of the AC-probes
and BG-probes are balanced compromises that have proven to work well
in our experiments.
3 RESULTS
Our algorithm was compared with Mega BLAST (Zhang et al., 2000) and
BLAT (Kent, 2002). For GAST and Mega BLAST, searches were made
against a database consisting of the whole human genome obtained from
the Ensembl genome database (Hubbard et al., 2007). The release in
question was based on the NCBI 36 assembly of the human genome. In
the case of BLAT, the system used for the runs lacked the memory to
perform searches against the whole human genome. Therefore, another
set of searches with BLAT, Mega BLAST, and GAST was performed against
chromosome 1 of the same genome. All the runs were performed on a
machine with 1 GB of DDRII SDRAM (667 MHz) and an Intel Core 2 Duo
T5500 (1.66 GHz) processor, running Ubuntu 7.04. All the run times in
this section are the times used by the program itself and any library
subroutines it calls. The tests were later repeated on another
machine with 6 GB of RAM in order to eliminate possible paging
effects. No bias of this sort was detected.
The AC-index described in the previous section was created using a
block size of 500,000, an AC-probe length of 12, and an ACdisc value
of 0.1. In addition, the database files were encoded with 7-mers.
These steps were performed separately for the full genome and for
chromosome 1.
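
As an illustration of what encoding a sequence with 7-mers can look like, the following sketch maps each overlapping k-mer to an integer using two bits per nucleotide. The actual file format used by GAST is not described in this section, so the base-to-code mapping, the use of overlapping windows, and the handling of ambiguous bases are all assumptions.

```python
# Hypothetical 2-bit code for each nucleotide.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k=7):
    """Return one integer in [0, 4**k) for each overlapping k-mer of seq.

    Assumption: k-mers containing a character other than A, C, G, T
    are skipped rather than encoded.
    """
    codes = []
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            if base not in BASE_CODE:
                break
            value = value * 4 + BASE_CODE[base]
        else:
            # Reached only when every base in the window was valid.
            codes.append(value)
    return codes
```

With k = 7 there are 4^7 = 16,384 possible codes, so each code fits in two bytes, which indicates why a small k keeps the encoded database files manageable.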
BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms