Table 1: List of DNA sequences used in experiments.
Name                Accession #  Role        GC%   Length (bp)
E. coli             NC_000913    Background  50.8  4,639,675
M. tuberculosis     NC_000962    Train       65.6  4,411,532
B. subtilis         NC_000964    Train       43.6  4,214,630
B. fragilis         NC_003228    Train       43.2  5,205,140
G. metallireducens  NC_007517    Test        59.6  3,997,420
C. welchii          NC_008261    Test        28.4  6,513,368
H. pylori           NC_012973    Test        39.2  1,576,758
positions by F-measure. F-measure values of peculiar
compositions are one order of magnitude larger than those
of substrings extracted by z-score criteria. Thirdly, we show
how to set parameters using training data so that the
evaluation value becomes high, and then verify these
parameters using test data.
2 RELATED WORK
By restricting the syntax of patterns, contrast pattern
finding methods are expected to find infrequent patterns
(Beißbarth and Speed, 2004; Huang et al., 2003; Ji et al.,
2005). However, some domain-specific knowledge is necessary
to define such a syntax properly.
To find infrequent, or under-represented, patterns, scores
based on statistical testing, such as the z-score and the
$\chi^2$-score, have also been extensively studied
(Apostolico et al., 2000; Horng et al., 2002; Leung et al.,
1996; Marschall and Rahmann, 2009; Schbath, 1997; Robin et
al., 2005). These scores assume a probabilistic model and, to
find infrequent patterns, use the deviation of the frequencies
of candidate patterns from their expected frequencies.
However, mining algorithms based on statistical testing suffer
from the data sparseness problem, which is a manifestation of
Zipf’s law. It is therefore important to choose an appropriate
subsequence length, but this is difficult because small changes
in the length make a large difference in the number of
candidate patterns.
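The sensitivity to subsequence length can be made concrete with a small calculation. The sketch below assumes a four-letter alphabet and a uniform, independent base model, which is an illustrative simplification and not the probabilistic model of any cited method; the function names are hypothetical.

```python
# Sketch: how the number of candidate k-mers grows with k, and how few
# occurrences each candidate can have on average in one genome.
# Assumption: uniform i.i.d. bases over {A, C, G, T} (illustration only).


def candidate_count(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct length-k strings over the alphabet."""
    return alphabet_size ** k


def expected_occurrences(seq_len: int, k: int, alphabet_size: int = 4) -> float:
    """Average number of occurrences per candidate k-mer in one sequence."""
    windows = max(seq_len - k + 1, 0)  # number of length-k windows
    return windows / candidate_count(k, alphabet_size)


if __name__ == "__main__":
    seq_len = 4_214_630  # length of B. subtilis (Table 1)
    for k in range(8, 15):
        print(k, candidate_count(k), round(expected_occurrences(seq_len, k), 3))
```

Each additional base multiplies the number of candidates by four, so for a genome of about four million bases the average count per candidate falls below one between k = 11 and k = 12 under this simplified model.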
3 EXPERIMENTS
3.1 Data Sets
The data sets used in our experiments are whole DNA
sequences of 7 bacteria (Table 1). We use the whole
DNA sequence as input data.
As the common background set for all experiments in this
section, we choose E. coli since it is a well-studied species.
As training target data, we use B. subtilis, which is another
popular bacterium. In addition, we use B. fragilis and
M. tuberculosis because preliminary experiments showed that
the length and GC content of a target sequence affect the
peculiar compositions found. Compared to B. subtilis,
B. fragilis has a similar GC content and a longer length, while
M. tuberculosis has a larger GC content and a similar length.
Table 2: Trained parameters achieving the highest $F_{1/4}$
values, and the corresponding precisions and recalls, for
training sequences.
Accession #  $\theta_B$  $\eta$  precision  recall  $F_{1/4}$
NC_000964    1.9         8       0.8047     0.1534  0.6438
NC_003228    2.4         6       0.7567     0.1345  0.5949
NC_000962    1.9         7       0.4199     0.0327  0.2476
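For reference, the GC content used to compare the sequences is the percentage of G and C bases. The following is a minimal sketch; the function name is hypothetical, and skipping ambiguous symbols such as N is an assumption, since the text does not say how the GC% values in Table 1 were computed.

```python
# Sketch: GC content of a DNA sequence as a percentage.
# Assumption: only unambiguous bases (A, C, G, T) are counted.


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the A, C, G, T bases of the sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / total
```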
3.2 Training Parameters
FPCS requires three parameters, $\theta_T$, $\theta_B$, and
$\eta$. We set $\theta_T = 2$, which is the minimum integer
greater than 1, because $\theta_T$ has been shown to be the
least influential of these parameters (Ikeda and Suzuki,
2009). To decide the other two parameters, we calculate
$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R},$$
where $P$ and $R$ denote precision and recall, respectively,
and both of them are defined by positions of features on
target sequences.
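One way to read "defined by positions" is sketched below. It is an assumed interpretation, not the authors' stated procedure: precision is taken as the fraction of positions covered by found compositions that also lie inside a relevant feature, and recall as the fraction of feature positions covered by found compositions. The interval representation and function names are hypothetical.

```python
# Sketch: position-based precision and recall (assumed interpretation).
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in sequence coordinates


def covered_positions(intervals: Iterable[Interval]) -> Set[int]:
    """Set of all sequence positions covered by at least one interval."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def position_precision_recall(
    found: Iterable[Interval], relevant: Iterable[Interval]
) -> Tuple[float, float]:
    """Precision and recall counted over positions, not over whole features."""
    found_pos = covered_positions(found)
    relevant_pos = covered_positions(relevant)
    hits = len(found_pos & relevant_pos)
    precision = hits / len(found_pos) if found_pos else 0.0
    recall = hits / len(relevant_pos) if relevant_pos else 0.0
    return precision, recall
```

The explicit position sets keep the sketch short; for whole genomes, an interval-merging approach would use far less memory.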
We choose $\beta = 1/4$ for $F_\beta$, which weighs precision
four times as much as recall, although F-measure typically
means $F_1$, which weighs precision and recall equally. Our
goal is not to find these features but to show that the
peculiar compositions found match biological features. To this
end, precision values should be high, while high recall values
are not needed.
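The effect of this choice can be checked numerically. The sketch below implements the F-measure with a general weight and recomputes it from the precision and recall values reported in Table 2; the function name and the zero-input convention are assumptions made for illustration.

```python
# Sketch: F-measure with weight beta, checked against Table 2.


def f_beta(precision: float, recall: float, beta: float) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen here to avoid division by zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F_{1/4}) copied from Table 2
TABLE_2 = {
    "NC_000964": (0.8047, 0.1534, 0.6438),
    "NC_003228": (0.7567, 0.1345, 0.5949),
    "NC_000962": (0.4199, 0.0327, 0.2476),
}

if __name__ == "__main__":
    for accession, (p, r, reported) in TABLE_2.items():
        print(accession, round(f_beta(p, r, 0.25), 4), reported)
```

The recomputed values agree with the reported ones to within 0.001. With equal weighting, the low recall would dominate the score, which is why a precision-oriented weight suits the stated goal.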
Table 2 shows trained parameters, where RNAs
are considered as relevant features for B. subtilis and
B. fragilis, and transposons for M. tuberculosis. From
genetic maps, like 1, we find that the relevant features
differ. This is because the GC content of M. tuberculosis is
much larger than that of the other sequences.
Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis