space is found by a Principal Component Analysis
(PCA) of a training dataset containing known exam-
ples of TFBS. The structure of the sequences that are
recognized by a specific transcription factor can be
captured by its covariance, taking into account interpositional
dependence. While PWM-based methods
use only the information regarding the frequency of
each nucleotide at a given position, the development
of methods that capture interpositional dependence
opens new possibilities to improve detector perfor-
mance.
2 MATERIALS AND METHODS
The analysis was carried out on four groups of previously
aligned DNA sequences. Each group is recognized by a
transcription factor as a binding site. The first two groups
come from Dr. Thomas Schneider's data
(http://www.lmmb.ncifcrf.gov/~toms) and his work on the
characterization of transcription factor binding sites
(Schneider, 1997). The last two groups were obtained from the
TRANSFAC database
(http://www.gene-regulation.com/pub/databases.html), which
contains data on transcription factors, their binding sites and
the regulated genes, and were aligned using MUSCLE
(Edgar, 2004). Table 1 summarizes the characteristics of these
groups of sequences.
Not all the aligned sequences corresponding to a given
transcription factor have the same length, because data are
missing at the ends of some sequences. PCA requires sequences
of equal length, so the data were preprocessed and positions
with missing values were omitted. Although many techniques for
modelling missing data have been proposed, the present work
considers only positions where a nucleotide is present in all
sequences.
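The preprocessing step above can be sketched as follows. This is a minimal illustration, assuming that missing positions are marked with a gap character such as `-` in the aligned sequences; the function name `keep_complete_columns` is hypothetical and not taken from the original work.

```python
def keep_complete_columns(aligned, alphabet="ACGT"):
    """Keep only the alignment columns in which every sequence has a nucleotide.

    Assumption: any character outside `alphabet` (e.g. '-') marks missing data.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    # Indices of columns where all sequences carry a valid nucleotide.
    keep = [j for j in range(length)
            if all(seq[j].upper() in alphabet for seq in aligned)]
    return ["".join(seq[j].upper() for j in keep) for seq in aligned]


# Toy alignment: the first and last columns contain gaps and are dropped.
trimmed = keep_complete_columns(["-ACGT-", "TACGTA", "-ACGTT"])
```

After this step all sequences have the same length, which is what the PCA model requires.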
To perform PCA, the sequences must first be converted to a
numerical representation. Each nucleotide is assigned to a
vertex of a regular tetrahedron, so that all nucleotides are
equidistant from one another (Silverman and Linske, 1986).
Figure 1 shows schematically the position of each nucleotide in
the tetrahedron.
The vectors corresponding to the nucleotides of a sequence are
concatenated into a single vector, and the sequences
corresponding to a TFBS are arranged in matrix form. The result
is a matrix with 3N columns, where N is the number of
nucleotides, and as many rows as there are original sequences.
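The encoding described above can be illustrated with a short sketch. It assumes the unit-norm regular tetrahedron with vertices A = (0, 0, 1), T = (2√2/3, 0, −1/3), C = (−√2/3, √6/3, −1/3) and G = (−√2/3, −√6/3, −1/3); which of C and G takes the positive second coordinate is an assumption and does not change the geometry. The names `VERTEX`, `encode_sequence` and `encode_alignment` are hypothetical.

```python
import math

# Vertices of a unit-norm regular tetrahedron (assumed assignment of C and G).
VERTEX = {
    "A": (0.0, 0.0, 1.0),
    "T": (2 * math.sqrt(2) / 3, 0.0, -1 / 3),
    "C": (-math.sqrt(2) / 3, math.sqrt(6) / 3, -1 / 3),
    "G": (-math.sqrt(2) / 3, -math.sqrt(6) / 3, -1 / 3),
}


def encode_sequence(seq):
    """Concatenate the 3-D vertex of each nucleotide into one vector of length 3N."""
    row = []
    for base in seq.upper():
        row.extend(VERTEX[base])
    return row


def encode_alignment(seqs):
    """Stack the encoded sequences into an M x 3N matrix (list of rows)."""
    return [encode_sequence(seq) for seq in seqs]


# Three toy sequences of length N = 4 give a 3 x 12 matrix.
X = encode_alignment(["ACGT", "ACGA", "TCGT"])
```

Because all four vertices are at the same distance from one another, no pair of nucleotides is treated as more similar than any other.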
Figure 1: Schema illustrating the numerical representation of
DNA. Each nucleotide is placed at a vertex of a regular
tetrahedron: A = (0, 0, 1), T = (2sqrt(2)/3, 0, −1/3),
C = (−sqrt(2)/3, sqrt(6)/3, −1/3),
G = (−sqrt(2)/3, −sqrt(6)/3, −1/3).
Principal component analysis performs an eigenanalysis of the
covariance matrix and projects the data onto the subspace
defined by the set of eigenvectors capturing the maximum
variance. Equation 1 shows the PCA decomposition, where X is
the numerical DNA matrix with dimensions M × 3N (M is the
number of sequences and N is the binding site length), A
contains the scores, B the eigenvectors (loadings), and E the
residual error. A has dimensions M × npc and B has dimensions
3N × npc, where npc is the number of principal components in
the model.
$$X = AB^{T} + E \tag{1}$$
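The decomposition in Equation 1 can be sketched with NumPy. This is only an illustration under stated assumptions: the data are mean-centred before the eigenanalysis (the text does not say so explicitly), the loadings are obtained from a singular value decomposition, and the random matrix merely stands in for a real M × 3N encoded alignment. The function names `fit_pca` and `decompose` are hypothetical.

```python
import numpy as np


def fit_pca(X, npc):
    """Return the column means and the loadings B (3N x npc)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the eigenvectors of the covariance of the centred data,
    # ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    B = Vt[:npc].T
    return mean, B


def decompose(X, mean, B):
    """Return scores A (M x npc) and residuals E (M x 3N) so that Xc = A B^T + E."""
    Xc = np.asarray(X, dtype=float) - mean
    A = Xc @ B
    E = Xc - A @ B.T
    return A, E


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 12))  # stand-in for M = 20 sequences of length N = 4
mean, B = fit_pca(X, npc=3)
A, E = decompose(X, mean, B)
```

The residual E is orthogonal to the loadings, so it holds exactly the part of each sequence that the npc-component model does not explain.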
For DNA sequences, the variance is distributed almost evenly
across all dimensions, which reflects the complexity of DNA
data. To capture a large percentage of the variance, a larger
number of dimensions therefore has to be taken into account.
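One way to inspect how the variance is spread over the components is to compute the fraction explained by each one. The sketch below assumes mean-centred data and uses isotropic random numbers as a stand-in for an encoded alignment; the 90% target and the function names are illustrative choices, not values from the original work.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()


def components_needed(X, fraction=0.9):
    """Smallest number of components whose cumulative variance reaches `fraction`."""
    cum = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cum, fraction)) + 1
    return min(k, len(cum))


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # variance spread evenly over 12 dimensions
ratio = explained_variance_ratio(X)
k = components_needed(X, fraction=0.9)
```

When the variance is spread evenly, as in this toy data, no single component dominates and most of the components are needed to reach the target fraction.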
A detector was built using Q-residuals, calculated with
Equation 2, where E is the residual, i.e. the component of a
sequence orthogonal to the subspace defined by the principal
components. Since most of the variance of transcription factor
binding sites is captured by the model, the Q-residuals of a
sequence belonging to a binding site should be small, whereas
random sequences, whose variance is spread symmetrically over
the whole space, should have higher Q-residuals. By defining a
threshold on the Q-residuals, TFBS can be distinguished from
random sequences.
$$Q = EE^{T} \tag{2}$$
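A Q-residual detector along these lines can be sketched as follows. The example is a toy under stated assumptions: the "binding sites" are synthetic low-rank data rather than encoded sequences, the model is mean-centred, and the threshold is set at the 95th percentile of the training Q-residuals, which is an illustrative choice and not the rule used in the original work.

```python
import numpy as np


def q_residuals(X, mean, B):
    """Squared length of each row's residual orthogonal to the PCA subspace."""
    Xc = np.asarray(X, dtype=float) - mean
    E = Xc - (Xc @ B) @ B.T
    return np.sum(E ** 2, axis=1)  # Q = e e^T for every row e of E


rng = np.random.default_rng(0)
# Synthetic "sites": two latent directions plus a little noise (structured data).
W = rng.normal(size=(2, 12))
sites = rng.normal(size=(50, 2)) @ W + 0.05 * rng.normal(size=(50, 12))
# Synthetic background: variance spread symmetrically over the whole space.
background = rng.normal(size=(1000, 12))

# Fit a 2-component model on the sites only.
mean = sites.mean(axis=0)
_, _, Vt = np.linalg.svd(sites - mean, full_matrices=False)
B = Vt[:2].T

q_sites = q_residuals(sites, mean, B)
q_background = q_residuals(background, mean, B)

# Assumed threshold rule: 95th percentile of the training Q-residuals.
threshold = np.percentile(q_sites, 95)
flagged_as_site = q_background <= threshold
```

In this toy setting the structured data lie close to the model subspace and have small Q-residuals, while almost none of the background rows fall below the threshold.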
Models with different numbers of principal components were
built for each transcription factor and, to demonstrate that
binding sites have a structure, 100,000 random sequences were
used as test data. To evaluate the detector we propose the use
of Receiver Operating Characteristic (ROC) curves, which show
the true positive rate (TP) against the false positive rate
(FP). ROC curves have been computed for a range of numbers of
principal components in order to
A PRELIMINARY STUDY ON THE DETECTION OF TRANSCRIPTION FACTOR BINDING SITES