LINEAR–TIME MATCHING OF POSITION WEIGHT MATRICES

Nikola Stojanovic

Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, U.S.A.

Keywords:

DNA motifs, Position weight matrices, Genome–wide analysis, Algorithms, Genomics, Pattern matching.

Abstract:

Position Weight Matrices are a popular way of representing variable motifs in genomic sequences, and they

have been widely used for describing the binding sites of transcriptional proteins. However, the standard

implementation of PWM matching, while not inefﬁcient on shorter sequences, is too expensive for whole–

genome searches. In this paper we present an algorithm we have developed for efﬁcient matching of PWMs

in long target sequences. After the initial pre–processing of the matrix it performs in time linear to the size of

the genomic segment.

1 INTRODUCTION

Eukaryotic genes are generally regulated in com-

plex ways, through networks of protein–DNA and

protein–protein interactions, which direct chromatin

remodeling, histone modiﬁcations, formation of tran-

scriptional initiation complexes and RNA Polymerase

elongation. The prevailing opinion, corroborated by

some studies (Khambata-Ford et al., 2003; Young

et al., 2003) but being increasingly questioned (Nel-

son et al., 2004; The ENCODE Project Consortium,

2007; Stojanovic, 2009) is that most of these interac-

tions take place within a few hundred bases upstream

of the transcription start sites. Even as regions im-

portant for the regulation of genes have been found

at distal loci, around alternative ﬁrst exons, in introns

and sequences located downstream of the transcrip-

tion start sites or even genes themselves, core pro-

moter sites are still considered the most important for

the gene expression. It is also generally accepted that

the binding of transcriptional enzymes is in large part

directed by speciﬁc motifs in DNA sequence. How-

ever, while there are proteins which bind only to exact

layouts of bases, most transcription factors are rather

non-speciﬁc in their choice of a binding site.

Consequently, the search for transcription factor

binding sites has become one of the most popular sub-

ﬁelds of bioinformatics, and many algorithms have

been developed over about two decades of intensive

research. The early approaches relied on a rather

naive assumption that the motifs at the target sites

of proteins must feature information content sufﬁ-

cient for their recognition, but disillusionment soon

followed, as any attempt to isolate these and other

functional elements in DNA following this logic re-

sulted in a very large number of false positives. Re-

cent methods have thus concentrated on the incorpo-

ration of additional information to the raw sequence

data, although they have so far neglected many impor-

tant biochemical aspects. The additional information

often relied on clustering of important motifs in pu-

tative regulatory modules, phylogenetic conservation

or matching the sequence to databases of experimen-

tally conﬁrmed sites, such as TRANSFAC (Wingender,

2008) or Jaspar (Bryne et al., 2008).

Since many proteins important for transcriptional

regulation bind with low speciﬁcity, the experimen-

tally determined target motifs in DNA can be substan-

tially different. Nevertheless, they often feature a well

deﬁned consensus sequence, with speciﬁc loci vary-

ing only marginally from the consensus. The extent of

that variation is usually captured by Position Weight

Matrices (PWMs), described in more detail below.

Brieﬂy, the PWMs record the information about the

permissible variation of letters over the 4–letter DNA

alphabet, assigning a weight to each letter at each po-

sition in the motif in accordance with how often that

letter has been seen at that position within the exper-

Stojanovic N. (2010).

LINEAR–TIME MATCHING OF POSITION WEIGHT MATRICES.

In Proceedings of the First International Conference on Bioinformatics, pages 66-73

DOI: 10.5220/0002750500660073

 SciTePress

imentally determined binding sequences. The infor-

mation about functional motifs in DNA recorded in the

databases is nowadays usually in the form of PWMs,

and they can be used for searches in previously un-

characterized DNA fragments, for motifs sufﬁciently

similar to the consensus of a particular protein bind-

ing site. Over the years many programs have been

written for matching PWMs, including MATCH (Kel

et al., 2003) by the TRANSFAC database team, and

variants such as combinations of PWMs using mixture

model (Hannenhalli and Wang, 2005).

With the advances in microarray technology large

sets of putatively co-expressed genes became avail-

able, stimulating the development of methods to de-

tect conserved motifs in their upstream regions, such

as (Hughes et al., 2000), as well as the search for

putatively co-regulated genes by the identiﬁcation of

shared regulatory modules, such as in (Qin et al.,

2003). It is intuitive that if a group of genes is co-

ordinately regulated, it should be controlled by simi-

lar sets of transcription factors. From the hypothesis

that protein binding is largely directed by target DNA

sequence motifs it follows that same (or similar) mo-

tifs should be present in regulatory sequences of co-

expressed genes, moreover as a cluster, or clusters.

Consequently, this led to further exploitation of motif

over-representation in sets of target sequences (Apos-

tolico et al., 2000; van Helden, 2004).

In the course of our work on the study of these

genomic environments, we have developed software

which extracts the most signiﬁcant shared short (5–25

bases) approximate motifs found within the upstream

sequences (i.e. putative promoter regions) of genes

postulated to be co-expressed by microarray and

other experiments (Singh and Stojanovic, 2006). We

have recently expanded that work to genome–wide

searches for similar layouts, as reﬂected by conglom-

erations of a statistically signiﬁcant number of motifs

previously discovered to be shared within the promot-

ers of a training set of co-expressed genes (Singh and

Stojanovic, 2009). Unfortunately, even as we were

able to record the consensus of the motifs of interest

in the form of PWMs, our genome–wide search had to

be executed using string representation (standing for

multiple exact patterns), since the current methods for

PWM matching are not efﬁcient. When the matching

needs to be performed in limited environments repre-

senting gene upstream sequences, genomic domains

or gene clusters, naive approaches work well, how-

ever for whole–genome scans (having in mind that,

for instance, the human genome features more than

three billion bases) any performance worse than lin-

ear is not practical.

In this paper we describe an algorithm we have

devised to efﬁciently match patterns represented by

Position Weight Matrices. It relies on a somewhat ex-

pensive pre–processing step, however the cost of that

pre–processing is well compensated by the efﬁciency

gains once it is applied to whole–genome searches.

Before proceeding with the algorithm itself, in the

following chapter we shall present some basic ideas

governing the construction and use of the PWMs.

2 POSITION WEIGHT

MATRICES

If all sites where a same protein binds would feature

identical bases it would be a matter of simple pattern

matching to ﬁnd them all. Unfortunately, for gene

regulatory proteins that is usually not the case, and

one has to deal with approximate matching, consen-

sus sequences and ambiguity codes. This is more than

a technical complication, as it introduces unreliability

to motif recognition. The ambiguity codes do not dis-

tinguish between the bases they stand for, so rare or

even impossible (concerning protein binding) combi-

nations of characters in the represented motifs may

match, leading to a large fraction of false positives.

On the other hand, if a character which infrequently

appears in the binding sequences is completely elim-

inated from the consensus, sites which do bind the

protein may be skipped, adding poor sensitivity to al-

ready compromised speciﬁcity. Furthermore, binding

motifs often contain irrelevant positions, and trying

to match any particular character at these loci would

obviously be futile.

A partial remedy to this problem can be an as-

signment of weights to characters of the consensus

sequence, so that for each position where more than

one base is possible there is a probability of occur-

rence associated with each choice. Stormo (Stormo,

1990) thus proposed a matrix–based approach, where

a position in a protein binding site is represented by a

column of a matrix, with one row for each nucleotide.

The guiding idea of the method was that if every char-

acter possible at a given position is assigned a score

such that the addition of scores for all positions pro-

vides an estimate about how close the sequence is

to the known protein binding site patterns, one can

decide whether to accept or reject the site based on

whether the total score is above or below a certain

predeﬁned threshold. At any scanned sequence posi-

tion j aligned with matrix column i, a simple lookup

at the matrix entry at column i and row corresponding

to the letter found at j would then provide the score.

The threshold can be empirically determined

as the critical value of the ratio of probabilities

LINEAR--TIME MATCHING OF POSITION WEIGHT MATRICES

TACACCG

TACACTG

TACACGG

TACACTG

GACACCG

TACACAG

TATACAG

GATACCG

TACACCG

TAGACAG 0 10 0 10 0 3 0

2 0 1 0 0 1 10

8 0 2 0 0 2 0

0 0 7 0 10 4 0

−−− 2.00 −−− 2.00 −−− 0.26 −−−

−−− −−− 1.49 −−− 2.00 0.68 −−−

−0.32 −−− −1.32 −−− −−− −1.32 2.00

1.68 −−− −0.32 −−− −−− −0.32 −−−

a) b)

Figure 1: An example of a Position Weight Matrix: a) A set of motifs on which the matrix is built; b) Letter counts at each

position in the motif, in matrix representation; c) Full matrix, obtained by taking log

of the probability of the character

occurrence at the position, divided by the general probability of the character, here assumed to be 0.25 for each of the ‘A’, ‘C’,

‘G’, ‘T’. ‘—’ indicates an inﬁnitesimally small value.

(s)/P

(s), where s is the examined string, sub-

script F denotes that the probability is based on the

distribution of strings within the population of bind-

ing sites, and subscript G denotes that the probabil-

ity is based entirely on chance, given the overall dis-

tribution of nucleic acids in the DNA sequence un-

der consideration. As the probabilities of the occur-

rence of s are based on two population models, this

amounts to the likelihood ratio, as deﬁned in statis-

tics. It is also (somewhat unrealistically) assumed

that character distributions at different positions are

independent. That way, for a string s = c

...c

of length L, the probability of its random occurrence

would be P

(s) =

∏

i=1

), where p (c

) stands for

the probability of an individual character.

In order to estimate P

for a given string, one also

assumes the independence of probabilities for char-

acter occurrences at different positions. For any po-

sition i, we estimate the probability of occurrence of

any base p

f ,i

(b) based on how frequently it occurred

at the particular position i in known binding patterns.

Thus, having a string s = c

...c

, of length L, the

probability that it occurred under the distribution spe-

ciﬁc for the particular protein binding sites would

be P

(s) =

∏

i=1

f ,i

). This approach can suffer

from artifacts (for instance, if only combinations “AG”

or “TC” would equally likely occur within motif in-

stances used to construct the matrix, once it is con-

structed it would assign equally high scores to com-

binations of “AC” and “TG” which have never been

observed), which has led to the development of di-

nucleotide (Gershenzon et al., 2005) and even more

elaborate models, and it can also be imprecise due

to incomplete experimental data or overly permissible

thresholds. However, over many years of its applica-

tion it has been shown to yield better results than raw

pattern matching, either exact or approximate. If it is

only a relatively small number of motifs used for the

construction of a PWM, it may be more appropriate to

construct a ﬁnite state automaton which would recog-

nize all these motifs (Aho and Corasick, 1975), rather

than approximating through a PWM, but that then in-

troduces the risk of over-ﬁtting.

Expanded, the likelihood ratio P

(s)/P

(s) is then

∏

i=1

f ,i

∏

i=1

) =

∏

i=1

f ,i

)/p

)].

One can take a logarithm of this formula and con-

vert it to additive form

∑

i=1

log

f ,i

)/p

)], al-

though some correction factors are needed in order

to avoid taking a logarithm of zero when a charac-

ter does not appear at a given position in any of the

motifs used to construct the matrix. The matrix as-

sociated with the protein binding pattern of length L

has L columns and 4 rows, one for each nucleotide

‘A’, ‘C’, ‘G’ or ‘T’. If a base b occurs m

times at the

position corresponding to matrix column i within M

known binding sites, then p

f ,i

(b) = m

/M, and if it

occurs n

times within N nucleotides in the genome,

then p

(b) = n

/N. The matrix entry for base b in

column i thus contains log

N/n

M). An illustra-

tion of a set of motifs used to determine a PWM, and

the resulting matrix, is shown in Figure 1. In practice

there are several variants of PWMs, however they are

all based on the same principle.

Once a PWM is constructed, it can be used for

scanning genomic sequences, for motifs which are

sufﬁciently similar to these captured by the matrix.

Looking at a string s = c

...c

one locates the

corresponding row of the matrix for each c

, in col-

umn i, and adds these values to obtain the likelihood

ratio P

. If that likelihood ratio is above the

pre–determined (and inevitably heuristic) threshold,

a match is declared. Since the shift for just one posi-

tion can result in entirely different score, PWM match-

ing programs advance for one position, and start the

matching process over the entire matrix every time

after the shift has been made. As matrices usually

do not feature too many columns, reﬂecting the fact

that they model short motifs, this does not lead to

intractability, yet it slows down any search for an

order of magnitude, and on the genomic scale that

presents a problem. Pruning techniques have been

explored, although one can only stop further match-

BIOINFORMATICS 2010 - International Conference on Bioinformatics

min

A −1 2 4

C 3 1 −9

G 0 −2 1

T −5 3 3

A,G,T

G,T

A,G,T

2 5 9

Figure 2: An example of an FSA created for a simple PWM: a) Position weight matrix; b) Minimal scores which have to be

reached at every position in order for threshold value of 9 to be met at the end (a match); c) FSA constructed based on the

matrix. Only three strings. “CAA”, “CTA” and “CTT”, would score sufﬁciently high to meet the threshold, as reﬂected by three

paths from the root to the reporting nodes (level 3), shown as solid circles. Scores at each node are shown inside the circles.

Transitions are guided by the letters labeling the edges.

ing attempts from a position if it has been determined

that the score–so–far is insufﬁcient for a match. The

other option, recognizing a match before all charac-

ters have been examined is not possible, since some

letters at some positions may contribute (large) neg-

ative scores. Other algorithmic solutions have also

been applied, such as building indexing schemes to

facilitate matching of multiple matrices in the same

sequence (Liefooghe et al., 2006).

3 ALGORITHM

We here present an algorithm which matches a single

Position Weight Matrix against a genomic sequence

in linear time.

We start by pre-processing the matrix M , of di-

mension 4× L, into a tree–like structure, following the

basic ideas of classic algorithms of Knuth–Morris–

Pratt (Knuth et al., 1977) and Aho–Corasick (Aho and

Corasick, 1975). For every position p our algorithm

keeps track of its longest possible sufﬁx k..p,1 < k <

p, in M , which can still result in a match, and thus

the amount of shift which can be made after M has

been successfully matched, or when it has been de-

termined that a match at the current starting position,

represented by the root, is not possible.

In order to enable the tracking, the matrix M is

thus converted into a ﬁnite state machine, implement-

ing a tree T with cross–linked branches. At every

node of the tree a structure is maintained, with the

following ﬁelds:

level: Keeps track of the length of the branch starting

at the root and ending at the current node. This

information is used to determine whether a match

along the current branch is still possible, and to

check whether a match has already been success-

ful (when a node whose level equals L is reached).

score: Keeps track of the score achieved so far along

the path from the root through the current node.

If the score is less than required for the current

level it is an indication that no match is possible

at the position currently aligned with the root and

a shift to a sufﬁx has to be made; if a match has

been achieved, this score can be reported along

with match data.

sufﬁx: A pointer to another node in T , positioned

on the branch which represents a sufﬁx of the

branch ending in the current node (the longest suf-

ﬁx which can still lead to a viable match from its

starting position). The target node of this pointer

follows the transition on the same letter as the cur-

rent node

move: A four–element array, providing a pointer to

another node in T , directing the move on each

of the four letters of the DNA alphabet (‘A’, ‘C’,

‘G’ and ‘T’). It can lead to forward motion if a

match along the current branch is still possible, or

backward to another path starting from the root

(a sufﬁx) if the match at the currently examined

position cannot be achieved.

The preprocessing starts with the establishment of

a score array minimum, whose dimension L equals the

In our implementation we have kept this ﬁeld outside of

the FSA, as it is used only once during the pre-processing of

a matrix, in situations when the branch needs to be changed

on an occurrence of a letter.

LINEAR--TIME MATCHING OF POSITION WEIGHT MATRICES

Algorithm 3.1: PREPROCESS(matrix,dimension,threshold)

// Find the minimal values necessary at positions

minimum[dimension] ← threshold

for i ← dimension − 1 downto 0



max ←Maximal value in matrix column i + 1

minimum[i] ← minimum[i +1] − max

//Process the root

root.level ← 0

root.score ← 0

root.su f f ix ← root

for letter ←A,C,G,T











score ← matrix[letter][1]

if score < minimum[1]

then root.move[letter] ← root

else











child.score ← score

child.level ← 1

child.su f fix ← root

root.move[letter] ← child

ENQUEUE(child)

// Process the remaining nodes, breadth–ﬁrst

while Queue not empty











current ← DEQUEUE()

if current.level = dimension

then



for letter ←A,C,G,T

do current.move[letter] ← current.su f f ix.move[letter]

else











for letter ←A,C,G,T











score ← current.score + matrix[letter][current.level + 1]

if score < minimum[current.level + 1]

then current.move[letter] ← current.su f f ix.move[letter]

else











child.score ← score

child.level ← current.level + 1

child.su f fix ← current.su f f ix.move[letter]

current.move[letter] ← child

ENQUEUE(child)

return (root)

Algorithm 3.2: MATCH(matrix,text,threshold)

track ← PREPROCESS(matrix,matrix

column number,threshold)

for i ← 1 to text length − matrix column count + 1







track ← track.move[text[i]]

if track.level = matrix column count

then Report a match at i − matrix column count + 1, scoring track.score

Figure 3: Pseudo–code of the PWM matching algorithm.

length of the motif represented by M . Each entry in

minimum records the minimal score which must be

achieved at that position in order to lead to a possible

match. Since a match is deﬁned as a sequence scoring

at threshold value or higher, it must be minimum[L] =

threshold. If max[i] is the maximal score recorded

in the i-th column of M , then minimum[i − 1] =

minimum[i] − max[i]. Whenever a position p in M is

reached it is checked whether the score so far is still

greater than or equal to minimum[p].

The FSA T is constructed in the breadth–ﬁrst fash-

ion, after the root node has been formed, and its im-

mediate children have been placed in the queue (if

their score warranted the placement). Every time

a node is dequeued, it is ﬁrst checked if its level l

equals L. If that is the case (indicating time to re-

port a match and move on), its transitions on all four

letters are determined by following the transition on

that letter from the (sufﬁx) node pointed to by the

su f f ix pointer. If l < L, forward transitions (and the

increment of the level) are considered. If the score of

the transition on a letter falls below the minimum re-

quired at l, a back pointer is created to the extension

from the node pointed to by the su f f ix. Otherwise,

BIOINFORMATICS 2010 - International Conference on Bioinformatics

a child node is created and enqueued, deﬁning a for-

ward transition on the letter. An example of an FSA

created for a simple PWM is provided in Figure 2,

and the pseudo–code of the preprocessing algorithm

is provided in the function Preprocess in Figure 3.

The complete matching function is shown as function

Match.

After the preprocessing is completed, the matrix

matching itself is rather straightforward. The algo-

rithm maintains two pointers, one to the currently ex-

amined position i in the DNA sequence (initialized at

1), and the other to the current node in T (initial-

ized at the root). A transition in T is done depending

on the character at position i, and if the level of the

reached node equals L a match can be reported (with

appropriate score). This process continues until the

end of the sequence is reached.

4 ALGORITHM PERFORMANCE

Since the matching done by this algorithm is com-

pletely guided by the FSA T , it can be proven cor-

rect by observing that every matching substring would

lead to a traversal of a path from the root to a node

at level L, and that no mismatching substring would

reach level L. Since every time an insufﬁcient score

is obtained it would result in a decrease in the current

level, and every time a sufﬁcient score is obtained it

results in an increase of the current level, the above

properties will hold if it can be proven that the back

pointers indeed lead to a node N such that the path

from the root to N represents the longest feasible suf-

ﬁx of the currently examined substring (i.e. path from

the root).

The formal proof can be derived by mathematical

induction, showing that the following invariants hold

every time a node is dequeued and processed, during

the matrix pre-processing phase:

1. If there is a viable continuation from the position

represented by the node, on a particular letter, that

leads to the creation of a forward link, i.e. a node

at the next deeper level.

2. If there is no continuation from the position rep-

resented by the node, for each letter, that leads to

the creation of a backward link, to a node at the

same level as current, or closer to the root.

3. The su f f ix pointer of a newly enqueued node

points to another node in the FSA, at some level

closer to the root, which is the end of a path

from the root representing the longest sufﬁx of

the string represented by the path to the enqueued

node, and which can still yield a match.

4. If a path cannot be continued to a further level,

on some letter, then the transition on that letter is

made to a node representing the end of the longest

sufﬁx of the current path which can still yield a

match.

For the induction base, it is trivial to show that

properties 1 through 4 hold at the time the root is

processed (initialization), and its children are being

placed in the queue. We can assume that they are also

satisﬁed when k

node is being processed, and show

that they hold after the k + 1

has been handled. It

should be noted that the su f f ix pointers are already

ﬁxed at the time of enqueuing the node.

Invariant 1 holds trivially, as it is the decision

made in the code while processing the dequeued node.

Since the tree is processed breath–ﬁrst, that means

that all nodes at depths smaller than that of the cur-

rent one have already been processed. If there is no

viable continuation of a match on a currently consid-

ered letter, that leads to following the transition from

the node pointed to by the su f f ix pointer, which is,

by invariant 3, at a level closer to the root, and thus,

by invariants 1 and 2 being already satisﬁed at earlier

nodes, cannot lead to level deeper than the current.

This proves that invariant 2 also holds after the cur-

rent node is processed.

If there is a viable continuation from the current

node on any particular letter, then the su f f ix pointer

assigned to the newly created child is to the link of

the node at the other end of the current su f f ix pointer,

and thus at the level closer to the root than the newly

created child. If the node pointed to by the current

su f f ix featured a viable forward move on the letter,

that extends that sufﬁx which can yield a match. If

not, it extends its longest sufﬁx which can, by the in-

duction hypothesis, and thus represents a new longest

sufﬁx which can still yield a match. Therefore, invari-

ant 3 also holds.

When a forward move (one continuing the match)

cannot be done on a particular letter, it is being done

from the node pointed to by the su f f ix pointer, which,

by invariant 3, represents the longest sufﬁx of the cur-

rent path which can still yield a match. Therefore,

the transition is being made to the node extending the

other end of the su f f ix pointer on that letter, and the

invariant 4 holds.

During the search phase the traversal of the FSA

T proceeds in the forward direction (i.e. to deeper

levels) for as long as a match is still possible (invari-

ant 1). When it is no longer the case, invariant 4 guar-

antees that the jump is being made to the next position

from which there may be a possible match. As a deter-

mined mismatch cannot lead to a forward move in T

LINEAR--TIME MATCHING OF POSITION WEIGHT MATRICES

no mismatching positions can lead to reaching level L

(and thus be reported), and as the jump is being made

to the beginning of the longest sufﬁx of the currently

examined substring which can still yield a match (in-

variant 4), no matching positions can be skipped. We

thus conclude that the algorithm is correct.

4.1 Space Performance

This algorithm can have large space requirements.

For a matrix of length L it can theoretically build an

FSA with O(4

) nodes, since every possible match-

ing substring has to be represented by a path through

T reaching level L. However, all internal nodes, and

thus their children as well, for which it has been es-

tablished that they could not lead to a match, are

promptly pruned. Since in practice PWMs are con-

structed, and their thresholds set, so that they accept

a very limited number of variants of the target site,

it is expected that in most practical situations prun-

ing will be quite dramatic. The only situation when

there can exist a large conglomeration of nodes to-

wards the top of the tree, the most space–consuming

setting, is when there is a variety of possible char-

acters occurring in relatively high percentages at the

left hand side of the pattern, thus dictating extensive

branching near the root of the tree, and a more inclu-

sive threshold value. However, this situation practi-

cally never happens, since tails exhibiting substantial

variations normally do not get included in the patterns

captured by PWMs. Towards the bottom of the tree the

only branches which have not been pruned would be

these leading to a pattern variant matching the matrix

(as set by the given threshold).

Apart from the FSA encoding the matrix, the only

space requirement for the matching process itself is

the space needed to store the sequence in which the

matching is to be done.

4.2 Time Performance

Since, during the matching, no character in the text is

examined more than once (as signiﬁed by the single

for–loop with constant–time body in the algorithm in

Figure 3), this process clearly executes in linear time.

The construction of the FSA (pre-processing) takes

time proportional to the number of nodes which are

being created (as each node is being processed in con-

stant time), which, under the worst–case scenario of a

large number of matching strings, can be exponential.

However, since the children of the pruned nodes never

get processed, the exponential blowout is expected

to take place only at shallow depth, towards the top

of the tree. Furthermore, whereas pre-processing of

the long PWMs would take the longest time, these are

precisely the cases where the gain from subsequent

linear–time matching would be the largest.

Even as the PWMs are generally short (rarely ex-

ceeding 15 columns, and almost never exceeding 25)

and pruning extensive (as the thresholds are usually

very stringent), the pre-processing step has the po-

tential of annihilating the gains from the subsequent

linear–time processing. This algorithm is thus recom-

mended for use only in very large scale analyses, such

as our whole–genome scans (Singh and Stojanovic,

2009). In our recent tests, done on a genomic seg-

ment of about 500 million bases (corresponding to a

large chromosome), the naive match was taking an

average of 144.33 seconds per matrix, on an Apple

Mac Pro 2.5 GHz Intel Core Duo with 4 Gb 667 MHz

DDR2 SDRAM memory laptop computer, while an im-

plementation of our algorithm was taking an average

of 16.58 seconds per matrix. On whole–genome scale

the average gain would be about 10–20 fold per ma-

trix.

5 DISCUSSION

The space requirements of this algorithm can be

further reduced by a simple extension of the pre-

processing step, alas at the price of an increase in the

computational time. This would require the mainte-

nance of two additional ﬁelds at the nodes of the FSA:

parent pointer and number of forward links. An ad-

ditional traversal pass through the FSA can then elim-

inate all partial paths which do not extend until the

bottom (match), reducing the number of nodes to less

than the sum of the lengths of all strings which would

score above the threshold. However, we have not

implemented this modiﬁcation, as the space require-

ments of the FSA have never been a limiting factor,

and time was critical.

Often a PWM will feature letters which must not

be found at a speciﬁc position, with the correspond-

ing score in the PWM of (theoretically) −∞. This sit-

uation is particularly favorable to our algorithm, as

it leads to immediate pruning, especially when these

positions are concentrated towards the beginning of

the matrix. If they are concentrated towards the end,

the least favorable setting, one can do the matching

in the inverse complement of the sequence. For long

matrices one can also attempt the match of their most

speciﬁc core, then attempt the full match, using the

naive approach, only around the positions where the

core has matched.

Despite of its relatively expensive pre-processing,

we have found this algorithm very useful for whole–

BIOINFORMATICS 2010 - International Conference on Bioinformatics

genome scans, such as our search for the conglomer-

ations of variable motifs, with a potential of reducing

days of computation to just a few hours. This can

be of particular importance for the tools implemented

as a part of a web server. Our earlier version of the

matcher (Singh and Stojanovic, 2009) implemented at

http://bioinformatics.uta.edu/toolkit/motifs/ used di-

rect pattern matching (i.e. not based on PWMs), and

the development of this algorithm has allowed us to

consider the matrix–based approach, too.

ACKNOWLEDGEMENTS

The author is grateful to Abanish Singh, whose ef-

fort on motif ﬁnding and the implementation of the

whole–genome motif search has made us aware of the

need for this algorithm. This work has been partially

supported by NIH grant 5R03LM009033-02.

REFERENCES

Aho, A. and Corasick, M. (1975). Efﬁcient string matching:

an aid to bibliographic search. Comm. Assoc. Com-

put. Mach., 18:333–340.

Apostolico, A., Bock, M., Lonardi, S., and Xu, X. (2000).

Efﬁcient detection of unusual words. J. Comput. Biol.,

7:71–94.

Bryne, J., Valen, E., Tang, M., Marstrand, T., Winther, O.,

da Piedade, I., Krogh, A., Lenhard, B., and Sandelin,

A. (2008). JASPAR, the open access database of

transcription factor–binding proﬁles: new content

and tools in the 2008 update. Nucleic Acids Res.,

36:D102–D106.

Gershenzon, N. I., Stormo, G. D., and Ioshikhes, I. P.

(2005). Computational technique for improvement

of the position–weight matrices for the DNA/protein

binding sites. Nucleic Acids Res., 33:2290–2301.

Hannenhalli, S. and Wang, L.-S. (2005). Enhanced position

weight matrices using mixture models. Bioinformat-

ics, 21:i204–i212.

Hughes, J., Estep, P., Tavazoie, S., and Church, G. (2000).

Computational identiﬁcation of cis–regulatory ele-

ments associated with groups of functionally related

genes in Saccharomyces cerevisiae. J. Mol. Biol.,

296:1205–1214.

Kel, A. E., G

ossling, E., Reuter, I., Cheremushkin, E., Kel-

Margoulis, O. V., and Wingender, E. (2003). Match:

A tool for searching transcription factor binding sites

in dna sequences. Nucleic Acids Res., 31(13):3576–

3579.

Khambata-Ford, S., Liu, Y., Gleason, C., Dickson, M., Alt-

man, R., Batzoglou, S., and Myers, R. (2003). Iden-

tiﬁcation of promoter regions in the human genome

by using a retroviral plasmid library–based functional

reporter gene assay. Genome Res., 13:1765–1774.

Knuth, D., Morris, J., and Pratt, V. (1977). Fast pattern

matching in strings. SIAM J. Computing, 6:323–350.

Liefooghe, A., Touzet, H., and Varr, J.-S. (2006). Large

Scale Matching for Position Weight Matrices. In Pro-

ceedings of the 7

Annual Symposium on Combina-

torial Pattern Matching, CPM 2006, volume 4009 of

LNCS, pages 401–412. Springer–Verlag.

Nelson, C., Hersh, B., and Carroll, S. B. (2004). The reg-

ulatory content of intergenic DNA shapes genome ar-

chitecture. Genome Biol., 5:R25.

Qin, Z., McCue, L., Thompson, W., Mayerhofer, L.,

Lawrence, C., and Liu, J. (2003). Identiﬁcation of co-

regulated genes through Bayesian clustering of pre-

dicted regulatory binding sites. Nature Biotechnology,

21:435–439.

Singh, A. and Stojanovic, N. (2006). An efﬁcient algorithm

for the identiﬁcation of repetitive variable motifs in

the regulatory sequences of co–expressed genes. In

Proceedings of the 21

International Symposium on

Computer and Information Sciences, volume 4263 of

LNCS, pages 182–191. Springer–Verlag.

Singh, A. and Stojanovic, N. (2009). Genome–wide search

for putative transcriptional modules in eukaryotic se-

quences. In Proceedings of BIOCOMP’09, pages

848–854.

Stojanovic, N. (2009). A study on the distribution of phylo-

genetically conserved blocks within clusters of mam-

malian homeobox genes. Genetics and Molecular Bi-

ology, 32:666–673.

Stormo, G. (1990). Consensus patterns in DNA. Methods

Enzym., 183:211–221.

The ENCODE Project Consortium (2007). The ENCODE

pilot project: Identiﬁcation and analysis of func-

tional elements in 1% of the human genome. Nature,

447:799–816.

van Helden, J. (2004). Metrics for comparing regulatory se-

quences on the basis of pattern counts. Bioinformatics,

20:399–406.

Wingender, E. (2008). The TRANSFAC project as an exam-

ple of framework technology that supports the analy-

sis of genomic regulation. Brieﬁngs in Bioinformatics,

9:326–332.

Young, J. E., Vogt, T., Gross, K. W., and Khani, S. C.

(2003). A short, highly active photoreceptor–speciﬁc

enhancer/promoter region upstream of the human

rhodopsin kinase gene. Investigative Ophtamology

and Visual Science, 44:4076–4085.

LINEAR--TIME MATCHING OF POSITION WEIGHT MATRICES