orthologous genes in a group, having at least one
high scoring site, being $K_i$ or greater, given that
upstream regions are random sequences.
$$R_i = \log\bigl(P(k \ge K_i \mid N_i, L_i, S^*)\bigr)$$
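The probability inside the regulatory potential can be illustrated with a small sketch. It assumes, beyond what the text states, that every upstream region in the OG has the same length, that window positions are independent, and that a per-position probability `q` of reaching the threshold $S^*$ is available from some background model. Under those assumptions the count of genes with a site is binomial. The names `site_probability`, `binomial_tail`, `regulatory_potential`, `q`, and `width` are hypothetical and not taken from the original method.

```python
import math


def site_probability(q: float, length: int, width: int) -> float:
    """Chance that a random upstream region holds at least one site scoring >= S*.

    q      -- assumed per-position probability of reaching S* (hypothetical input)
    length -- upstream region length (L_i in the text)
    width  -- motif width (hypothetical parameter)
    """
    positions = max(length - width + 1, 0)
    return 1.0 - (1.0 - q) ** positions


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1)
    )


def regulatory_potential(k_obs: int, n_genes: int, length: int,
                         q: float, width: int) -> float:
    """log P(k >= K_i | N_i, L_i, S*) under the independence assumptions above."""
    p = site_probability(q, length, width)
    tail = binomial_tail(n_genes, k_obs, p)
    # A zero tail means the observation is impossible under the random model.
    return math.log(tail) if tail > 0.0 else float("-inf")


if __name__ == "__main__":
    # 7 of 10 orthologs carry a high scoring site in 200 bp upstream regions.
    print(regulatory_potential(k_obs=7, n_genes=10, length=200, q=1e-4, width=16))
```

More negative values mean the observed number of genes with a site is less likely to arise by chance.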
At the final step we utilize the “Bernoulli Estimator”
(BE) routine (Kalinina, 2004), which assumes that
input values are a mixture of two distributions
representing the noise and the signal. Only the
distribution that represents the noise is required for
automatic inference of the optimal threshold
distinguishing the signal from the noise. By applying
BE to the regulatory potentials calculated for all OGs
given $S^*$, the most probable content of a regulon
can be automatically identified. By considering all
possible values of $S^*$, the optimal threshold,
which delivers the minimum BE probability, can be
obtained.
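The thresholding step can be sketched as a binomial-tail cutoff selection. This is one plausible reading of a Bernoulli-type estimator, not a confirmed reproduction of the BE routine of Kalinina (2004). It assumes each OG comes with the probability of its value under the noise model (for example, the exponent of the regulatory potential), sorts the OGs, and picks the number of top OGs whose joint occurrence is least likely under noise alone. The function names and the input dictionary are hypothetical.

```python
import math


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def bernoulli_cutoff(noise_pvalues):
    """Choose how many OGs to call 'signal'.

    noise_pvalues[i] is the probability, under the noise model only, of a value
    at least as extreme as the one observed for OG i (an assumed input).
    Returns (k_best, tail_best, members): the number of OGs kept, the minimised
    probability, and the indices of the kept OGs.
    """
    n = len(noise_pvalues)
    order = sorted(range(n), key=lambda i: noise_pvalues[i])
    best_k, best_tail = 0, 1.0
    for k in range(1, n + 1):
        p_k = noise_pvalues[order[k - 1]]
        # Chance that noise alone yields k or more values this extreme.
        tail = binomial_tail(n, k, p_k)
        if tail < best_tail:
            best_k, best_tail = k, tail
    return best_k, best_tail, order[:best_k]


def best_threshold(pvalues_by_threshold):
    """Scan candidate S* values and keep the one with the smallest probability.

    pvalues_by_threshold maps each candidate S* to the per-OG noise probabilities
    computed at that threshold (hypothetical data layout).
    """
    results = {s: bernoulli_cutoff(p) for s, p in pvalues_by_threshold.items()}
    s_best = min(results, key=lambda s: results[s][1])
    return s_best, results[s_best]
```

With this layout, a single call to `best_threshold` returns the selected $S^*$ together with the cutoff and the OGs retained at that threshold.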
As a result, the optimal threshold for the TFBS score
$S^*$, the optimal threshold for the regulatory
potential, and the subset of OGs predicted to be
members of the regulon can be calculated. It should
be noted that in this case the same “universal” TFBS
score threshold is used for all OGs under
consideration.
We also implemented two additional
modifications of the developed approach that offer
different levels of sensitivity.
2.2 Individual TFBS Score Threshold
In this minor modification, instead of one universal
TFBS score threshold, we use an individual threshold
for each OG. This makes it possible to take into
account possible differences in the affinity of the TF
for DNA binding sites among different members of the
same regulon. It was shown that such differences can be
evolutionarily conserved and thus have functional
meaning (Kotelnikova, 2005).
2.3 No TFBS Score Threshold
This modification is fundamentally different from
the two previous ones. This version does not use a
threshold to filter out weak sites, but rather allows
all putative binding sites to contribute to the
regulatory potential of an OG. For a particular OG
of size N we consider a set of N best scores
$\{s_1, s_2, \ldots, s_N\}$. The regulatory potential is calculated
as the probability of observing an OG with maximum
scores $\{s_1, s_2, \ldots, s_N\}$ or better by chance.
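A minimal sketch of such a threshold-free potential follows. The source does not say how the joint probability is computed. The sketch assumes the N upstream regions are independent random sequences and that a per-position probability `q(s_j)` of reaching score `s_j` is available from a background score distribution. The joint probability is then the product of per-gene probabilities. All names here are hypothetical.

```python
import math


def best_score_pvalue(q_at_score: float, length: int, width: int) -> float:
    """P(best score in a random upstream region >= s).

    q_at_score -- assumed per-position probability of scoring >= s
    length     -- upstream region length
    width      -- motif width
    """
    positions = max(length - width + 1, 0)
    return 1.0 - (1.0 - q_at_score) ** positions


def log_potential_no_threshold(q_values, lengths, width):
    """Log probability that all N genes of an OG reach their best scores by chance.

    q_values[j] is q(s_j) for the best score of gene j; independence between
    genes is an assumption of this sketch, so log probabilities are summed.
    """
    total = 0.0
    for q, length in zip(q_values, lengths):
        p = best_score_pvalue(q, length, width)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


if __name__ == "__main__":
    # Three orthologs: two strong sites and one weak site.
    print(log_potential_no_threshold([1e-6, 5e-6, 1e-2], [200, 200, 150], width=16))
```

Because every gene contributes, a weak site lowers the potential only slightly instead of being discarded, which matches the intent of avoiding a hard score cutoff.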
3 TESTING
3.1 Comparison with the Results of Manual Analysis
The developed algorithms have been extensively
tested on 62 manually curated regulons from the
Shewanella collection retrieved from the RegPrecise
database (http://regprecise.lbl.gov). All regulons
were classified into three classes: i) local (1-2
operons), ii) medium (3-10 operons), and iii) global
(more than 10 operons). As expected, all three
versions performed similarly well on local regulons,
which represent the most abundant class of regulons in
microbial genomes. For 24 out of 39 local regulons,
the regulon content was predicted correctly.
For medium and large (global) regulons, the
"universal TFBS score threshold" approach predicted
63% and 36% of the true regulon members, respectively.
Careful analysis of the predicted regulon content
revealed that in both cases it comprises the core of
the true regulon, with very high TFBS scores and a high
level of TFBS conservation across all genomes
under analysis. At the same time, the specificity is
very high in both cases (95%). Thus the approach
can be used for accurate automatic reconstruction of
the regulon core and provides a good starting
point for detailed manual curation.
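The fraction of true members recovered and the specificity quoted above can be computed as in the following sketch. It assumes the usual definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) over the set of all OGs considered. The source does not state which definition of specificity was used, so this is an illustration only, and the function name is hypothetical.

```python
def sensitivity_specificity(predicted, true_members, all_ogs):
    """Compare a predicted regulon with a manually curated one.

    predicted    -- OGs predicted to belong to the regulon
    true_members -- OGs in the curated regulon
    all_ogs      -- every OG that was scored
    """
    predicted, true_members, all_ogs = set(predicted), set(true_members), set(all_ogs)
    tp = len(predicted & true_members)             # correctly predicted members
    fn = len(true_members - predicted)             # missed members
    fp = len(predicted - true_members)             # false predictions
    tn = len(all_ogs - predicted - true_members)   # correctly rejected OGs
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    ogs = [f"og{i}" for i in range(10)]
    print(sensitivity_specificity({"og0", "og1", "og4"},
                                  {"og0", "og1", "og2", "og3"}, ogs))
```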
3.2 PWM Quality
All methods of regulatory motif prediction output
several candidate motifs, which raises the problem of
estimating motif quality.
Conservation of nucleotides in PWM positions does
not always reflect the true quality of the motif found.
The best test of motif quality is the arrangement in
the genome of the sites recognized by the PWM: if sites
are found upstream of genes that can be assigned to
one metabolic pathway, the motif has certainly been
found correctly. Unfortunately, information about gene
function is not always available. On the other hand,
by applying comparative genomics, we can select
evolutionarily conserved sites. Our approach is
based on comparative genomics; in addition, we
compute the minimal probability of selecting these
sites by chance. This probability could reflect the
motif quality. To verify this assumption, we used the
following test.
It is known that, as a rule, each TF regulates its
own gene. Even when the other genes regulated by a TF
are unknown, one can search for motifs upstream of
the TF gene in a set of closely related genomes. Such a
motif is a first approximation, and can be improved