Table 1: Statistics of dbSELF and dbSNP. $N_{\mathrm{total}}$ and $N_{\mathrm{eff}}$ are the number of variants and the number of effective variants in each database, respectively. $N_{\mathrm{ebase}}$ and $N_{\mathrm{error}}$ are the number of effective bases and errors, respectively; the error rate is defined as the ratio of the number of errors to the total number of bases in the raw data. All quantities, except the number of variants in the dbSNP, are averaged over 10 samples, and the corresponding standard deviations are given in parentheses.
Database  Genotype
Human     $N_{\mathrm{total}}$     $N_{\mathrm{eff}}$       $N_{\mathrm{ebase}}$     $N_{\mathrm{error}}$     Error rate (%)
dbSNP     318,739,162              5,965,063 (857,126)      17,919,487 (4,950,378)   14,570,351 (3,424,599)   0.11 (0.04)
dbSELF    2,824,914 (571,181)      2,822,702 (570,685)      14,316,273 (4,649,247)   18,173,565 (4,020,284)   0.14 (0.05)
Rice      $N_{\mathrm{total}}$     $N_{\mathrm{eff}}$       $N_{\mathrm{ebase}}$     $N_{\mathrm{error}}$     Error rate (%)
dbSNP     12,185,568               1,411,561 (955,475)      8,936,172 (6,314,281)    7,410,641 (3,009,390)    0.24 (0.12)
dbSELF    1,505,469 (1,001,366)    1,464,242 (981,338)      10,815,057 (6,969,199)   5,531,758 (2,333,059)    0.18 (0.09)
ference is expressed as $\Delta Q_{ij} \equiv Q^{\mathrm{SNP}}_{ij} - Q^{\mathrm{SELF}}_{ij}$, where
$Q^{\mathrm{SNP}}_{ij}$ and $Q^{\mathrm{SELF}}_{ij}$ are the recalibrated quality scores of
base $j$ in sample $i$ obtained by using the dbSNP and
the dbSELF, respectively. Note that $\Delta Q_{ij}$ is an
integral value as the base quality score is an integer.
Figure 3 shows the relative frequency distributions of
$\Delta Q_{ij}$ for both human and rice, averaged over 10
samples, with the standard deviations represented by
error bars. From Fig. 3, we see that the distribution of
$\Delta Q_{ij}$ for human is symmetric about $\Delta Q_{ij} = 0$;
the majority of bases (about 64%) have $\Delta Q_{ij} = 0$,
and more than 95% of bases have $|\Delta Q_{ij}| \le 1$.
This means that more than 95% of the recalibrated base
quality scores obtained by using the two databases are
the same or differ by one Phred score. This result
suggests that the dbSELF can serve as
a reasonably good alternative to the dbSNP.
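As an illustration of how such a distribution can be tabulated for a single sample, the following is a minimal Python sketch. The function names and the toy score lists are hypothetical and are not taken from the analysis reported here; in practice the two score lists would be read, base by base, from the BAM files recalibrated with each database.

```python
from collections import Counter


def delta_q_distribution(q_snp, q_self):
    """Relative frequency of dQ = Q_SNP - Q_SELF over the bases of one sample."""
    if len(q_snp) != len(q_self):
        raise ValueError("both score lists must cover the same bases")
    if not q_snp:
        raise ValueError("need at least one base")
    counts = Counter(a - b for a, b in zip(q_snp, q_self))
    n = len(q_snp)
    # Sort by dQ so the result reads like a histogram.
    return {d: c / n for d, c in sorted(counts.items())}


def fraction_within(dist, k):
    """Fraction of bases whose two scores differ by at most k Phred units."""
    return sum(f for d, f in dist.items() if abs(d) <= k)


# Toy integer scores for one hypothetical sample (not data from the paper).
q_snp = [30, 31, 29, 35, 20, 33, 28, 30]
q_self = [30, 30, 29, 36, 23, 33, 28, 30]

dist = delta_q_distribution(q_snp, q_self)
for d, f in dist.items():
    print(f"dQ = {d:+d}: {f:.3f}")
print("fraction with |dQ| <= 1:", fraction_within(dist, 1))
```

Averaging the per-sample frequencies over the samples, and taking their standard deviation for each value of the difference, would give the kind of bars and error bars described for Fig. 3.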
In the case of rice, however, only about 22%
of bases have $\Delta Q_{ij} = 0$, and more than 70% of bases
have recalibrated quality scores obtained with the
dbSELF that are higher than those obtained with the
dbSNP (i.e., $\Delta Q_{ij} < 0$). Considering that BQSR
with the rice dbSNP underestimates the recalibrated
quality score compared to BQSR with the human dbSNP,
as discussed in Section 2.3, the rice dbSELF can
alleviate, at least in part, the underestimation of the
recalibrated scores. In this sense, the rice dbSELF may
substitute for the rice dbSNP to obtain a better BQSR result.
As stated in Section 1, the genotype of a base is
regarded as an error when the base in a BAM/SAM file
does not match the reference at a position not
listed in the database. As a complement to the error,
we define an effective base as a mismatched base
that is identified by an effective variant listed in the
database. Thus, a mismatched base is either an error
or an effective base. In Table 1, we list the number
of variants and effective variants in the two databases,
together with the statistics of effective bases and
errors. In addition, we estimate the error rate, which is
the ratio of the number of errors to the total number
of genotyped bases in the raw data (i.e., the FASTQ file).
Because all quantities, except the number of variants
in the dbSNP, depend on the sample, we report in Table 1
the mean and the standard deviation of each quantity
over 10 samples. Note that the standard deviations
of the numbers of variants, effective variants, and
effective bases for rice are larger than those for human
regardless of the database. This is due to the
characteristics of the reference sequence discussed in Section 2.3.
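The bookkeeping behind these definitions can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used for Table 1: sites are matched by position only (alleles are ignored), an effective variant is taken to be a database site hit by at least one mismatched base, and all names and toy coordinates are hypothetical.

```python
def classify_mismatches(mismatches, known_sites):
    """Split mismatched bases into effective bases and errors.

    mismatches  : iterable of (chrom, pos), one entry per read base that
                  differs from the reference
    known_sites : set of (chrom, pos) listed in the variant database
    """
    n_ebase = 0
    n_error = 0
    effective_variants = set()
    for site in mismatches:
        if site in known_sites:
            # Mismatch explained by a database variant: an effective base.
            n_ebase += 1
            effective_variants.add(site)
        else:
            # Mismatch at a position not in the database: counted as an error.
            n_error += 1
    return n_ebase, n_error, effective_variants


def error_rate_percent(n_error, total_bases):
    """Errors as a percentage of all genotyped bases in the raw data."""
    return 100.0 * n_error / total_bases


# Toy example (hypothetical coordinates, not data from the paper).
known_sites = {("chr1", 100), ("chr1", 250), ("chr2", 40)}
mismatches = [("chr1", 100), ("chr1", 100), ("chr1", 180),
              ("chr2", 40), ("chr2", 77)]

n_ebase, n_error, eff = classify_mismatches(mismatches, known_sites)
print("N_total =", len(known_sites))
print("N_eff   =", len(eff))
print("N_ebase =", n_ebase)
print("N_error =", n_error)
print("error rate (%) =", error_rate_percent(n_error, total_bases=1000))
```

In this toy case every mismatched base falls into exactly one of the two classes, so the numbers of effective bases and errors sum to the number of mismatches.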
We see from Table 1 that, for both human and rice,
while the dbSELF contains fewer variants
than the dbSNP, almost all variants in the dbSELF
are effective variants. This is expected because the
dbSELF is simply the set of variants called from the
samples without the BQSR step. In the case of human,
the dbSELF contains far fewer variants than the dbSNP
(about 0.9% as many). However, more than 99% of the
variants in the dbSELF are effective, whereas only
about 2% of those in the dbSNP are. More importantly,
although the dbSELF has less than half as many
effective variants as the dbSNP, the error rates
obtained by using the two databases differ by only
0.03 percentage points (0.14% versus 0.11%). This
difference is not significant compared to the
difference in the number of effective variants.
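The ratios derived from Table 1 can be rechecked with a few lines of Python. The sketch below uses only the mean counts from Table 1; the dictionary layout and function names are our own illustrative choices. Note that a ratio of means is not the same as the mean of per-sample ratios, so these values are approximate summaries only.

```python
# Mean counts copied from Table 1; standard deviations are omitted.
TABLE1 = {
    "human": {
        "dbSNP": {"n_total": 318_739_162, "n_eff": 5_965_063},
        "dbSELF": {"n_total": 2_824_914, "n_eff": 2_822_702},
    },
    "rice": {
        "dbSNP": {"n_total": 12_185_568, "n_eff": 1_411_561},
        "dbSELF": {"n_total": 1_505_469, "n_eff": 1_464_242},
    },
}


def percent(part, whole):
    return 100.0 * part / whole


def summarize(species):
    """Ratios (in %) between the dbSELF and dbSNP rows for one species."""
    snp = TABLE1[species]["dbSNP"]
    own = TABLE1[species]["dbSELF"]
    return {
        "dbSELF size as % of dbSNP": percent(own["n_total"], snp["n_total"]),
        "effective % of dbSNP": percent(snp["n_eff"], snp["n_total"]),
        "effective % of dbSELF": percent(own["n_eff"], own["n_total"]),
        "dbSELF N_eff as % of dbSNP N_eff": percent(own["n_eff"], snp["n_eff"]),
    }


for species in TABLE1:
    print(species)
    for label, value in summarize(species).items():
        print(f"  {label}: {value:.2f}")
```

The rice rows of Table 1 are included so that the same four ratios can be read off for both species.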
In the case of rice, we can see from Table 1 that
the dbSELF contains about 4% more effective variants
than the dbSNP, although it contains far fewer
variants in total (about 12% as many).
While only about 12% of the variants in the dbSNP are
effective, more than 97% of the variants in the dbSELF
are effective. The fact that the rice dbSELF identifies
more effective variants is a primary reason that BQSR
using the dbSELF gives higher recalibrated scores on
average than BQSR using the dbSNP. In addition, the
dbSELF generates more effective bases than the dbSNP
does; as a result, the error rate using the dbSELF
is smaller than that using the dbSNP. This
yields higher $Q^{\mathrm{SELF}}_{ij}$ than $Q^{\mathrm{SNP}}_{ij}$ on average.
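To see why a smaller error rate pushes the recalibrated scores upward, it may help to convert the error rates of Table 1 to the Phred scale, $Q = -10 \log_{10} p$. The sketch below is only a rough illustration: it applies the standard Phred formula to the overall error rates, whereas BQSR estimates empirical qualities separately for groups of bases sharing the same covariates, so the numbers printed here are not the recalibrated scores themselves. The function and variable names are hypothetical.

```python
import math


def phred(error_probability):
    """Phred-scaled quality: Q = -10 * log10(p)."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


# Mean error rates in percent, copied from Table 1.
ERROR_RATE_PERCENT = {
    ("human", "dbSNP"): 0.11,
    ("human", "dbSELF"): 0.14,
    ("rice", "dbSNP"): 0.24,
    ("rice", "dbSELF"): 0.18,
}

for (species, database), rate in ERROR_RATE_PERCENT.items():
    q = phred(rate / 100.0)
    print(f"{species:5s} {database:6s} error rate {rate:.2f}% -> Q of about {q:.1f}")
```

For rice, the lower error rate obtained with the dbSELF maps to a higher Phred value than that obtained with the dbSNP, which is the direction of the shift discussed above.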
Note that there is no a priori reason that the error
rate of rice should be greater than that of human.
Rather, we should expect about the same error rate
for both human and rice. In this sense, the fact that
the error rate using the dbSELF is almost comparable