being just a second-order statistic, can capture TFBS
information.
More information can be incorporated by taking the
missing values of the TFBS into account. When missing
values are treated explicitly, detector performance
increases. When only the nucleotide positions present
in at least 50% of the sequences are taken into
account, the AUC is greater than when all gaps are
kept in the model. The reason is that gaps are placed
at the beginning and end of the sequences, so in some
positions almost no information is available to
construct a model. A balance between the information
and the uncertainty incorporated must be reached for
each TFBS.
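The 50% occupancy criterion described above can be illustrated with a
minimal sketch. It is not the implementation used in this work: the
function names, the "-" gap symbol and the toy alignment are assumptions
made only for illustration.

```python
def column_occupancy(alignment, gap="-"):
    """Fraction of sequences holding a nucleotide (not a gap) at each column."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        sum(seq[j] != gap for seq in alignment) / n_seqs
        for j in range(length)
    ]


def keep_informative_columns(alignment, min_occupancy=0.5, gap="-"):
    """Keep only the columns whose occupancy reaches min_occupancy."""
    occupancy = column_occupancy(alignment, gap)
    kept = [j for j, frac in enumerate(occupancy) if frac >= min_occupancy]
    trimmed = ["".join(seq[j] for j in kept) for seq in alignment]
    return kept, trimmed


# Hypothetical aligned binding sites; gaps cluster at both ends.
aligned_sites = [
    "--ACGT-",
    "-TACGTA",
    "--ACGT-",
    "G-ACG--",
]
kept_columns, trimmed_sites = keep_informative_columns(aligned_sites)
print(kept_columns)
print(trimmed_sites)
```

Raising or lowering `min_occupancy` trades retained positions against
the uncertainty that sparsely observed columns bring into the model.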
A more complex estimation of missing values, BPCA,
has been shown to perform better when the percentage
of missing values is low, but its results quickly fall
below those of the simple approximation to the mean
of the chromosome when more missing values are
considered. BPCA fails when no information is
available at a certain position, because the method
tries to estimate a value from the existing
information.
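The failure case mentioned above, a position with no observed values,
can be made concrete with a small sketch. It does not implement BPCA.
It only flags fully missing positions and applies a plain mean fill as a
stand-in baseline. The numeric encoding, the use of the global mean of
the observed entries and all names are assumptions for illustration.

```python
import math


def missing_fraction_per_position(matrix):
    """Fraction of missing (NaN) entries in each column of a row-major matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [
        sum(math.isnan(row[j]) for row in matrix) / n_rows
        for j in range(n_cols)
    ]


def fill_with_global_mean(matrix):
    """Replace every NaN with the mean of all observed entries."""
    observed = [x for row in matrix for x in row if not math.isnan(x)]
    if not observed:
        raise ValueError("matrix has no observed values")
    mean = sum(observed) / len(observed)
    return [[mean if math.isnan(x) else x for x in row] for row in matrix]


nan = math.nan
# Hypothetical numerically encoded sites (rows) by position (columns).
encoded_sites = [
    [nan, 1.0, 2.0, nan],
    [nan, 1.0, 3.0, 3.0],
    [nan, 2.0, 2.0, nan],
]
missing = missing_fraction_per_position(encoded_sites)
# Positions with no data at all: nothing local to estimate from.
empty_positions = [j for j, frac in enumerate(missing) if frac == 1.0]
filled = fill_with_global_mean(encoded_sites)
print(empty_positions)
print(filled)
```

A mean fill still returns a value for an empty position, whereas an
estimator that relies on the data observed at that position has nothing
to work with there.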
ACKNOWLEDGEMENTS
This work has been partially supported by the Spanish
Ministerio de Ciencia y Tecnología through the CICYT
grant TEC2007-63637 and the Ramón y Cajal program.
CIBER-BBN is an initiative of the Spanish ISCIII.
E.P. thanks IBEC for financially supporting her PhD.
A SUBSPACE METHOD FOR THE DETECTION OF TRANSCRIPTION FACTOR BINDING SITES