A SUBSPACE METHOD FOR THE DETECTION OF

TRANSCRIPTION FACTOR BINDING SITES

Erola Pairo, Santiago Marco

Institut de Bioenginyeria de Catalunya, Baldiri i Reixac 13, 08028, Barcelona, Spain

Departament d’electronica, Universitat de Barcelona, Mart´ı i Franqu`es 1, 08028, Barcelona, Spain

Alexandre Perera

Centre de Recerca en Enginyeria Biom´edica

CIBER de Bioingenier´ıa, Materiales y Nanomedicina (CIBER-BBN), Spain

Keywords:

Transcription factors, Binding sites, Numerical DNA, Principal components analysis, Missing values, BPCA.

Abstract:

Transcription Factor binding sites are short and degenerate sequences, located mostly at the promoter of the

gene, where some proteins bind in order to regulate transcription. Locating these sequences is an impor-

tant issue, and many experimental and computational methods have been developed. Algorithms to search

binding sites are usually based on Position Speciﬁc Scoring Matrices (PSSM), where each position is treated

independently. Mapping symbolical DNA to numerical sequences, a detector has been built with a Principal

Component Analysis of the numerical sequences, taking into account covariances between positions. When

a treatment of missing values is incorporated the Q-residuals detector, based on PCA, performs better than

a PSSM algorithm. The performance on the detector depends on the estimation of missing values and the

percentage of missing values considered in the model.

1 INTRODUCTION

The Central dogma of molecular biology establishes

that information ﬂows from DNA to RNA by means

of a process called transcription, and then RNA is

translated into proteins. Gene expression is highly

regulated by complex mechanisms that involve both

transcription and translation.

One of the most important mechanisms to regulate

transcription is the binding of some proteins, tran-

scription factors, to DNA speciﬁc sequences located

mostly near the gene start site. These transcription

factor binding sites (TFBS) are commonly short se-

quences (typically 5-20 bp), that show high variability

without loss of function, although they are evolution-

ary conserved. In order to unravel the mechanisms in-

volved in gene expression, ﬁnding and understanding

the function of these sequences is a major challenge

in biology.

In the last years there has been many computa-

tional and experimental advances in the discovery

of TFBS (Elnitski et al., 2006), and that, together

with the increasing availability of genome data, made

it possible to develop TFBS databases like JAS-

PAR (http://jaspar.binf.ku.dk)(Sandelin et al., 2004)

or TRANSFAC (http://www.gene-regulation.com)

(Wingender et al., 2000), and models to search for

TFBS within genome data.

Although evidences that interdependences between

nucleotides in TFBS exists (Bulyk et al., 2002), most

of the methods used to model or search for bind-

ing sites in databases, are based on Position Speciﬁc

Scoring Matrices (PSSM) methods (Stormo, 2000),

which assume that each position in the binding site

is independent. Some examples of algorithms using

PSSM, are MAST (Bailey and Gribskov, 1998), based

on the QFAST algorithm and available in Internet as

part of the MEME suite (Bailey and Elkan, 2006) and

MATCH (Kel et al., 2003), that uses information per

sequence, in order to construct a PSSM.

A large body of knowledge exists for speciﬁc event

detection in numerical sequences (signals). For this

reason it may be interesting to translate symboli-

cal DNA sequences into numerical sequences. This

translation has been advocated by different authors

with different methods, see for instance (Anastassiou,

102

Pairo E., Marco S. and Perera A. (2010).

A SUBSPACE METHOD FOR THE DETECTION OF TRANSCRIPTION FACTOR BINDING SITES.

In Proceedings of the First International Conference on Bioinformatics, pages 102-107

DOI: 10.5220/0002697301020107

 SciTePress

2001).

Last year we proposed a detector, based on a Princi-

pal Components Analysis of the numerical DNA se-

quences, using it to detect yeast and E. Coli TFBS

within synthetic and chromosomic data. The scope

of the project was to demonstrate that, even if the co-

variance is just a second order statistics, it can capture

information of position interdependencies in binding

sites, and, consequently, a detector can be built using

that information. In this paper we extent the analy-

sis of that detector and incorporate the treatment of

missing values, comparing the results of our detector

with a PSSM algorithm for real data using S. Cere-

visiae TFBS and with MAST algorithm for synthetic

data and D.Melanogaster TFBS.

2 MATERIALS AND METHODS

2.1 Data

In order to carry out models and subsequently search

for binding sites in chromosomes, Saccharomyces

cerevisiae and Drosophila Melanogaster TFBS have

been extracted from the TRANSFAC public database,

which contains data on transcription factors, their

binding sites and regulated genes.

In the case of yeast TFBS, the information on the

relative position to the gene has also been collected.

Chromosome sequences of all the yeast genome

and gene positions, belonging to genes regulated by

transcription factors modeled, have been taken from

the EMBL database (Baker et al., 2000).

For the analysis of Drosophila TFBS, 1923 promoter

sequences have been collected between -499 and

+100 relative to Transcription Start Site (TSS), from

the Eukaryotic Promoter Database (Schmid et al.,

2006) in order to build the background models used

to simulate Drosophila DNA. In table 1, information

on the TFBS used in this study is summed up,

showing the number of sequences and nucleotides for

each TF.

2.2 Preprocessing

2.2.1 Alignment

Sequences in TRANSFAC belonging to the same

binding site do not have the same length. In order to

analyze the TFBS is needed to align these sequences.

To construct models for yeast TFBS, two different

algorithms have been used, MUSCLE (Edgar, 2004)

and CLUSTALW (Thompson et al., 1994). The rea-

son to use different alignment methods is that they

producedifferent matrices with signiﬁcant differences

in the detection of binding sites. The method giving

a better result has been used in each case. To study

D.Melanogaster TFBS, the sequences in TRANSFAC

have been used as an input for the MEME program,

which is a motif discovery that return motifs mod-

els. The aligned sequences of the MEME output have

been used as an input for MAST and Q-residuals, to

ensure that the comparison is done using the same se-

quences.

Due to the differences in length, when sequences are

aligned gaps must appear. In some positions there are

nucleotides present in some sequences but missing in

others. These gaps (missing values) are located at the

beginning and end of the sequences. Although inser-

tions or deletions in intermediate positions are the-

oretically possible, they are not frequent, and align-

ments producing them have been discarded.

2.2.2 Conversion to Numerical Sequences

Once the sequences have been aligned, in order to

perform a PCA, they must be converted to a rectan-

gular matrix of numerical sequences. Two processes

are needed. First translate symbolical DNA to numer-

ical sequences, and then a treatment of the missing

values.

The conversion from symbolical to numerical DNA

used is that proposed for Silverman and Linske (Sil-

verman and Linske, 1986), where each nucleotide is

placed at the vertex of a regular tetrahedron. It is a

conversion symmetric for all nucleotides, as it can be

seen in ﬁgure 1. Each sequence of length M becomes

a sequence of length 3× M, concatenating numerical

vectors corresponding to each nucleotide. Then, the

N sequences belonging to the same transcription fac-

tor where arranged in a matrix form.

−0.5

0.5

−1

−0.5

0.5

T=(2sqrt(2)/3, 0, −1/3)

A=(0, 0, 1)

G=(−sqrt(2)/3, −sqrt(6)/3, −1/3)

C=(−sqrt(2)/3, sqrt(6)/3, −1/3)

Figure 1: Schema to illustrate the numerical representation

of DNA. Each nucleotide is placed in a vertex of a regular

tetrahedron.

A SUBSPACE METHOD FOR THE DETECTION OF TRANSCRIPTION FACTOR BINDING SITES

103

Table 1: Information about TFBS used, the alignment method, and the results using the different methods.

TF Organism Alignment No M.V PSSM 50% all

ROX1 S. Cerevisiae MUSCLE 8.1520 7.1040 X X

ABF1 S. Cerevisiae MUSCLE 7.1972 7.8609 9.5956 8.1868

MCM1 S. Cerevisiae MUSCLE 3.5518 3.1236 4.7872 3.9900

Repr. CAR1 S. Cerevisiae CLUSTALW 6.0748 7.6009 7.8240 X

MIG1 S. Cerevisiae CLUSTALW 5.3507 4.7297 7.6454 4.4844

ADF1 D. Melanogaster MEME 7.5159 8.7789 10.2449 X

Bcd D. Melanogaster MEME 5.7499 5.8511 5.9865 X

2.2.3 Missing Values Estimation

Three different treatments have been studied for the

missing values. The ﬁrst one is to omit missing val-

ues, taking into account only the positions where the

nucleotide is known for all sequences. That is the

common treatment when a construction of a DNA ma-

trix is needed, for example in PSSM algorithms.

The second treatment consists in the assumption that

missing values are nucleotides that do not affect the

binding of a TF to that particular sequence, but that

may be important in the sequences where the position

is present. To use the information in the sequences

where the nucleotide is present is need to estimate

these missing values. The approach taken is that,

using the nucleotides probability distribution in the

genome, and the numerical conversion, the nucleotide

can be located at the mean of the chromosome, as in

equation 1

−→

MV = P(A)

−→

A + P(C)

−→

C + P(G)

−→

G + P(T)

−→

T (1)

In order to do a more accurate estimation, a

Bayesian Principal Components Analysis (BPCA) is

performed. BPCA is a Bayesian estimation method

for a probabilistic reformulation of PCA. It was ﬁrst

proposed by Bishop to choose automatically the num-

ber of principal components (Bishop, 1999), and later

used in micro-array experiments to estimate missing

values, showing a high accuracy (Oba et al., 2003).

To implement this method we use the R PcaMethods

package (Stacklies et al., 2007).

Using these techniques, different percentages of miss-

ing values can be estimated, keeping the number of

nucleotides equal in all sequences in order to have a

rectangular matrix.

2.3 Deﬁnition of the Subspace Method

2.3.1 Principal Components Analysis

Principal Components Analysis reduces the dimen-

sionality of TFBS dataset while retaining as much

as the variance present in the original data. This is

achieved projecting the intercorrelated data into the

subspace of eigenvectorsretaining the maximum vari-

ance giving new variables which are uncorrelated. In

equation 2, the PCA bilinear decomposition is de-

ﬁned. X is the TFBS numerical matrix, A is the pro-

jected data, called scores, B is the loading matrix,

which deﬁne the subspace where data is projected and

E is the error obtained from reducing the dimension-

ality.

X = AB

+ E (2)

2.3.2 Q-residuals Detector

The detector has been built using the Q-residuals of

the Principal Components model, calculated as in

equation 3, where E is the error obtained modeling the

binding sites. Q-residuals are the Euclidean distance

from a given sequence to the subspace of principal

components.

Q = EE

(3)

Most of the variance must be explained by the

model. Q-residuals of sequences belonging to the

modeled TFBS should be smaller than Q-residuals of

random or other genomic sequences. Consequently,

deﬁning a threshold should be sufﬁcient to distinguish

between TFBS from other sequences.

2.4 Comparison

2.4.1 Comparison with MATCH within Real

Data

To compare our detector to existing PSSM meth-

ods, showing that calculating interdependencies can

lead to an improvement on the detection, we im-

plement the MATCH algorithm, but taking into ac-

count the probability distribution of nucleotides in

the yeast genome. The PSSM matrix is calculated

as in equation 4, where f

i,b

is the frequency of each

BIOINFORMATICS 2010 - International Conference on Bioinformatics

104

nucleotide in each position and I(i) is the informa-

tion vector. Then a Similarity Score for the se-

quence and the core (ﬁve ﬁrst consecutive more con-

served positions), are used to discriminate between

TFBS as in the MATCH program (http://www.gene-

regulation.com/pub/programs.html)

Score =

∑

i=1

I(i) f

i,b

(4)

2.4.2 Comparison with MAST within Synthetic

Data

A comparison with TFBS of a more complex or-

ganism has been done using the TFBS of D.

Melanogaster. The background sequences of

Drosophila have been simulated with a fourth-order

Markov Model, constructed using the Drosophila pro-

moter regions from the EPD and the Cosmo R pack-

age (Bembom et al., 2007). Drosophila DNA se-

quences have been simulated and each 1000 nu-

cleotides a TFBS sequence has been inserted.

TFBS sequences (without alignment) have been used

as the MEME input, and the aligned sequences from

the MEME output have been the input for MAST and

the Q-residuals detector, to ensure that the compari-

son using the same sequences, aligned the same way.

2.4.3 Comparison Method

Receiver Operating Characteristic (ROC) curves have

been computed to compare the performance of the

detectors, using a leave-one-out cross validation

method. When comparison has been performed

within real data, the sequences located at the chro-

mosome have been omitted.

The Area Under Curve (AUC) has been calculated to

show the accuracy of the different detectors. In the

case of the MATCH algorithm, the Core Similarity

Score has been ﬁxed in its maximum value, in order

to obtain the greatest AUC curve, varying just the Se-

quence Similarity Score threshold. In MAST com-

parison, different models have been constructed using

MEME, and the model with best AUC has been com-

pared to Q-residuals detector.

To have an accurate detector means to have an ex-

tremely small false positive rate, with an AUC near

than one. Differences between detector become al-

most zero, even if they are signiﬁcant. To avoid this

problem, a new parameter has been deﬁned in equa-

tion 5, which changes as the logarithm of AUC, mak-

ing differences in performance more visible.

α = −log(1 − AUC) (5)

Not only PSSM methods have been compared to

our detector, but AUC has also been used to com-

pare between detectors taking into account different

amounts of gaps and between the two methods to es-

timate missing values.

3 RESULTS

3.1 Comparison to PSSM Algorithms

3.1.1 Comparison with MATCH within Real

Data

Comparison to the MATCH algorithm has been done

in three different cases. First when missing values

have been omitted, then when only these positions

where the nucleotides are present at least in a 50% of

the sequences have been considered, and ﬁnally using

all the positions available. The estimation of the miss-

ing values in the last two cases, has been the assump-

tion that they are in the mean of the chromosome.

In ﬁgure 2, ROX1 and ABF1 binding sites have been

studied for different number of principal components.

ROX1 sequences have all the same length, no treat-

ment of missing values is needed, and it can be seen

in ﬁgure 2 a) that our detector has a better perfor-

mance than the PSSM algorithm. In ABF1 binding

sites, we need the treatment of missing values to out-

perform the PSSM algorithm. It can also be observed

that estimating all the missing values leads to a worse

detection than estimating just those present in at least

half of the sequences.

The same study has been done for different S.

Cerevisiae binding sites, table 1 summarizes the re-

sults, conﬁrming that, in some cases PSSM algorithm

obtains better results when no missing values are con-

sidered but the treatment of missing values increases

leads to best AUC than PSSM. In all cases, the esti-

mation of all missing values, deteriorates the detector

performance when is compared to the estimation of a

percentage of M.V. A compromise between informa-

tion and uncertainty introduced must be reached.

3.1.2 Comparison with MAST within Synthetic

data

Comparison to MAST has been done for D.

Melanogaster TFBS detection, when missing values

are omitted and when a percentage of missing values

has been estimated with the mean of the chromosome

method. In ﬁgure 3 the results using MAST and Q-

residuals detector have been shown in Bcd binding

A SUBSPACE METHOD FOR THE DETECTION OF TRANSCRIPTION FACTOR BINDING SITES

105

5 10 15

6.5

7.5

8.5

nPCs

−log(1−AUC)

PCA

PSSM

(a) AUC ROX1

2 4 6 8

nPcs

−log(1−AUC)

No M.V.

Match

50%

All

(b) AUC ABF1

Figure 2: AUC vs the number of PCs in ROX1 and ABF1,

using different percentages of missing values.

sites and synthetic data. Q-residuals detector outper-

forms MAST algorithm when information available

in at least 50% of the sequences is taken into account.

1 2 3 4 5

5.2

5.4

5.6

5.8

6.2

nPCs

−log(1−AUC)

No M.V.

MAST

50%

Figure 3: Comparison between Q-residuals detector and

MAST algorithm for D.melanogaster TFBS and synthetic

data. Q-residuals outperforms MAST when missing values

are taken into account.

The same analysis has been done in Adf1 bind-

ing sites, and results are summarized in table 1, con-

ﬁrming that Q-residuals detector outperforms PSSM

methods when missing values are taken into account.

3.2 BPCA Missing Value Estimation

Comparison between BPCA and the ﬁrst approxima-

tion of the missing values in ABF1 is shown in ﬁg-

ure 4, where the number of missing values has been

increased from 0% to 12% just incorporating to the

model positions missing in an increasing number of

sequences. It can be ﬁrst observed that the result

without missing values estimation of BPCA is not the

same as the result performing PCA, that is caused by

the fact that in BPCA estimation of missing values

vectors are not constrained to be orthogonal. Then it

can also be seen that BPCA estimation lead to better

results, for a small number of missing values consid-

ered, but then its results fall. Both BPCA and mean of

the chromosome have a maximum AUC in a percent-

age of missing values equal to 2.19%.

The percentage of missing values in TFBS matrices,

can be near 50% , and all of them are concentrated

at the beginning and end of the sequences. It leads

to positions where few nucleotides are present, and

more than ten must be estimated. BPCA method

needs information available in order to estimate miss-

ing values, when little information is available it be-

comes an unuseful method, leading to models where

all loadings are 0 vectors. In that cases, the mean-

of-the-chromosome method which needs no informa-

tion, performs better.

0 2 4 6 8 10

6.5

7.5

8.5

9.5

10.5

% M.V.

−log(1−AUC)

PCA

BPCA

Figure 4: AUC comparison between BPCA and mean of the

chromosome treatment of missing values in ABF1.

4 CONCLUSIONS

Performing a Principal Components Analysis of nu-

merical TFBS has been demonstrated to be an effec-

tive method to detect TFBS within real and synthetic

data, having always a better performance than PSSM

methods when missing values treatment is incorpo-

rated. This demonstrates that covariance, in spite of

BIOINFORMATICS 2010 - International Conference on Bioinformatics

106

being just a second order statistics can capture TFBS

information.

More information can be incorporated taken into ac-

count missing values of TFBS. When a treatment of

missing values is incorporated the detector perfor-

mance increases. When only that nucleotides present

in at least 50% of the sequences are taken into ac-

count, the AUC is greater than when all gaps are

present in the model. The reason is that gaps are

placed in the beginning and end of the sequences,

and in some positions we have almost no information

available to construct a model. An equilibrium be-

tween information and uncertainty incorporated must

be reached for each TFBS.

A more complex estimation of missing values, BPCA,

has been proved to perform better when the percent-

age of missing values is low, but to fall quickly to

worse results than the simple approximation to the

mean of the chromosome, when more missing val-

ues are considered. BPCA fails when no information

is available in a certain position because this method

tries to estimate a value using the existing informa-

tion.

ACKNOWLEDGEMENTS

This work has been partially supported by the Span-

ish Ministerio de Ciencia y Tecnologa through the CI-

CYT GRANT TEC2007-63637 and the Ramon y Ca-

jal program.CIBER-BBN is an initiative of the Span-

ish ISCIII. E.P. wants to thank IBEC for supporting

her PhD ﬁnancially.

REFERENCES

Anastassiou, D. (2001). Genomic signal processing. Signal

Processing Magazine, IEEE, 18(4):8–20.

Bailey, T. and Elkan, C. (2006). Meme:discovering and

analizing dna and protein sequence motifs. Nucleic

acids research, 34:W369–W373.

Bailey, T. and Gribskov, M. (1998). Combining evidence

using p-values: Application to sequence homology

searches. Bioinformatics, 14:48–54.

Baker, W., van den Broek, A., Camon, E., Hingamp, P.,

Sterk, P., Stoesser, G., and Tuli, M. A. (2000). The

EMBL Nucleotide Sequence Database. Nucl. Acids

Res., 28(1):19–23.

Bembom, O., Kelez, S., and van der Laan, M. J. (2007). Su-

pervised Detection of Conserved Motifs in DNA Se-

quences with Cosmo. Statistical Applications in Ge-

netics and Molecular Biology, 6:article 8.

Bishop, C. (1999). Variational principal components. In Ar-

tiﬁcial Neural Networks, 1999. ICANN 99. Ninth In-

ternational Conference on (Conf. Publ. No. 470), vol-

ume 1, pages 509–514 vol.1.

Bulyk, M. L., Johnson, P. L. F., and Church, G. M. (2002).

Nucleotides of transcription factor binding sites exert

interdependent effects on the binding afﬁnities of tran-

scription factors. Nucl. Acids Res., 30(5):1255–1261.

Edgar, R. (2004). Muscle: multiple sequence alignment

with high accuracy and high throughput. Nucleic

Acids Res, 32(5):1792–1797.

Elnitski, L., Jin, V. X., Farnham, P. J., and Jones, S. J.

(2006). Locating mammalian transcription factor

binding sites: A survey of computational and exper-

imental techniques. Genome Research, 16(12):1455–

1464.

Kel, A., Gossling, E., Reuter, I., Cheremushkin, E.,

Kel-Margoulis, O., and Wingender, E. (2003).

MATCHTM: a tool for searching transcription factor

binding sites in DNA sequences. Nucl. Acids Res.,

31(13):3576–3579.

Oba, S., Sato, M.-a., Takemasa, I., Monden, M., Matsub-

ara, K.-i., and Ishii, S. (2003). A Bayesian missing

value estimation method for gene expression proﬁle

data. Bioinformatics, 19(16):2088–2096.

Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W. W.,

and Lenhard, B. (2004). JASPAR: an open-access

database for eukaryotic transcription factor binding

proﬁles. Nucl. Acids Res., 32(suppl 1):D91–94.

Schmid, C. D., Perier, R., Praz, V., and Bucher, P. (2006).

EPD in its twentieth year: towards complete promoter

coverage of selected model organisms. Nucl. Acids

Res., 34:D82–85.

Silverman, B. and Linske, R. (1986). A measure of dna

periodicity. Journal of Theoretical Biology, 118:295–

300.

Stacklies, W., Redestig, H., Scholz, M., Walther, D., and

Selbig, J. (2007). pcaMethods a bioconductor package

providing PCA methods for incomplete data. Bioin-

formatics, 23(9):1164–1167.

Stormo, G. (2000). Dna binding sites: Representation and

discovery. Bioinformatics, 16:16–23.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994).

CLUSTAL W: improving the sensitivity of progres-

sive multiple sequence alignment through sequence

weighting, position-speciﬁc gap penalties and weight

matrix choice. Nucl. Acids Res., 22(22):4673–4680.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I.,

Matys, V., Meinhardt, T., Prubeta, M., Reuter, I., and

Schacherer, F. (2000). TRANSFAC: an integrated sys-

tem for gene expression regulation. Nucl. Acids Res.,

28(1):316–319.

A SUBSPACE METHOD FOR THE DETECTION OF TRANSCRIPTION FACTOR BINDING SITES

107