a linked list, Baeza-Yates and Perleberg (1996)
proposed an $O(N + N f_{\max})$ time and $O(2M + \sigma)$
space algorithm, where $f_{\max}$ is the frequency of the
most commonly occurring character in the pattern. Based
on the Boolean convolution of the pattern and the
text, Abrahamson (1987) solves the problem in
O(N√(M log M)) time and O(N) space. Recently,
Nicolae and Rajasekaran (2013) have shown that for
pattern matching with wild-cards, the algorithm
(Abrahamson, 1987) can be modified to obtain an
O(N√(g log M)) time bound, where g is the number of
non-wild-card positions in the pattern. The algorithm
of Clifford and Clifford (2007) is also based on
convolution and takes O(N log M) time. The
randomized algorithm of Kalai (2002), which is based
on Karp and Rabin (1987), also takes O(N log M)
time. Atallah, Chyzak, and Dumas (2001)
approximate the number of mismatches for every
alignment in O(rN log M) time, where r is the
number of iterations the algorithm has to make.
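To make the convolution idea concrete, the following minimal Python sketch counts matching characters at every shift by correlating 0/1 indicator vectors, one character at a time. It is an illustration only and not any of the cited algorithms: the function name `match_counts_by_correlation` is hypothetical, and the correlation is evaluated with a naive O(NM) loop, whereas convolution-based methods typically evaluate it with a fast Fourier transform.

```python
def match_counts_by_correlation(text, pattern):
    """Count matching characters at every shift 0..N-M.

    For each distinct character c of the pattern, build 0/1 indicator
    vectors for the text and the pattern and cross-correlate them.
    Summing the per-character results gives the match count per shift.
    The correlation below is a naive loop (illustrative assumption);
    it is not an FFT-based implementation.
    """
    n, m = len(text), len(pattern)
    counts = [0] * (n - m + 1) if n >= m else []
    for c in set(pattern):
        t_ind = [1 if ch == c else 0 for ch in text]
        p_ind = [1 if ch == c else 0 for ch in pattern]
        for i in range(len(counts)):
            counts[i] += sum(t_ind[i + j] * p_ind[j] for j in range(m))
    return counts


if __name__ == "__main__":
    counts = match_counts_by_correlation("DBCDAB", "DAB")
    print(counts)                          # matches per shift
    print([3 - c for c in counts])         # mismatches per shift
```

The number of mismatches at a shift is simply the pattern length minus the match count at that shift.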
In the literature, string-processing algorithms have
made extensive use of suffix trees, suffix arrays, and
automata. Most of these methods are covered in
Crochemore, Hancart, and Lecroq (2007). Algorithms
based on automata give the best worst-case time,
O(N). However, their exponential time and space
dependence on M and k limits their practicality
(Navarro, 2001). Therefore, automaton-based
algorithms are best suited to short patterns with low
error rates (Navarro, 2001). Suffix trees, on the other
hand, consume space linear in the size of the text,
which may be a challenge when dealing with
large texts.
Breaking with this trend, in this paper we follow a
novel approach to the ‘string matching with
mismatches’ problem. The method we propose is
based on the frequencies of individual characters in
the pattern and the text, which sets it apart from
the methods proposed in the past. The algorithm
avoids all complex data structures, yet achieves
average-case O(N) time for patterns of length ≤ σ.
The rest of the paper is structured as follows. In
section 2, we introduce a few terms and notations
used in this paper. Section 3 gives the lemma,
corollaries, and examples that form the theoretical
basis of the algorithm. Section 4 provides an
algorithm to pre-process the pattern, which is a
prerequisite for the main algorithm given in
section 5; the run-time behavior of the algorithm is
described in the same section. In section 6, we
discuss the time and space requirements of the
algorithms. Experimental results on real-life data
are provided in section 7. Finally, we conclude our
work in section 8.
2 PRELIMINARIES
The symbol ‘λ’ represents the alphabet, a finite
non-empty ordered set of characters, such that |λ| = σ is
the size of the alphabet. We use the symbols T and P
to represent non-empty text and pattern strings of
length N and M respectively. Both T and P are
defined over the alphabet λ. T[i] or $t_i$ represents the
$i$th character of T, where ‘i’ is referred to as the shift,
location, or index in T. Throughout the paper, we
make extensive use of the phrase “number of matches
of P at shift t in T”, which refers to the total number
of characters that match when pattern P is
aligned at shift t in T. In the algorithm, we refer to
this as the number of hits at shift t in T by the pattern
P. Note that, to keep a clear relationship among the
lemma, corollaries, examples, and the algorithms, we
count character matches rather than mismatches.
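As an illustration of the phrase “number of matches of P at shift t in T”, a minimal Python sketch is given below. The function name `matches_at_shift`, the 0-based indexing, and the restriction to shifts at which P lies entirely inside T are assumptions made for this sketch only.

```python
def matches_at_shift(text, pattern, shift):
    """Return the number of positions j with text[shift + j] == pattern[j].

    Assumption for this sketch: indices are 0-based and the pattern must
    lie entirely inside the text, i.e. 0 <= shift <= N - M.
    """
    n, m = len(text), len(pattern)
    if not 0 <= shift <= n - m:
        raise ValueError("shift must satisfy 0 <= shift <= N - M")
    return sum(1 for j, p in enumerate(pattern) if text[shift + j] == p)


if __name__ == "__main__":
    # Illustrative strings chosen for this sketch.
    print(matches_at_shift("DBCDAB", "DAB", 3))
    print(matches_at_shift("DBCDAB", "DAB", 0))
```

With these inputs, shift 3 aligns DAB with DAB (every character hits), while shift 0 aligns DAB with DBC (only the first character hits).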
3 LEMMA
Consider the text $T = t_0 t_1 t_2 t_3 t_4 t_5$ = DBCDAB of
size N = 6, and the pattern P = DABCD of size M = 5.
It is easy to see that a one-character match is
found when P is aligned at location -4 in T
(assuming that there is such a location). Similarly, a
three-character match is found when P is
aligned at locations -1 and 3 in T. Traditionally,
pattern P is aligned with all locations i in T such that
0 ≤ i ≤ N-M. However, considering values of i in the
extended range (1-M) ≤ i ≤ (N-1) may also provide
useful information, particularly when the pattern and
the text are almost the same length and the character
matches exist at opposite ends of the strings being
matched. Therefore, with the extended search space,
the ‘string matching with mismatches’ problem can
be re-formulated as follows:
Given a text T and a pattern P, for every i
such that (1-M) ≤ i ≤ (N-1), output the Hamming
distance $hd_i = \mathrm{ham}(P, t_i t_{i+1} \ldots t_{i+M-1})$,
where $t_i$ = null if i < 0 or i > N-1. For the text
and the pattern given above, P agrees with
$t_{-1} t_0 t_1 t_2 t_3$ in three positions, so, counting
matches as stated in section 2,
$hd_{-1} = \mathrm{ham}(P, t_{-1} t_0 t_1 t_2 t_3) = 3$
(equivalently, two of the M = 5 positions mismatch).
Similarly, $hd_3 = \mathrm{ham}(P, t_3 t_4 t_5 t_6 t_7) = 3$. The
algorithm given in section 5 solves the problem
outlined above with the extended search space.
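The extended search space can be checked with a brute-force Python sketch. This is not the algorithm of section 5; it is a direct O(NM) enumeration, and the function name `hits_extended` and the dictionary return type are assumptions made for illustration. Positions outside T are treated as null and therefore never match.

```python
def hits_extended(text, pattern):
    """Return {i: number of matches of pattern at shift i} for every
    shift i in the extended range (1 - M) <= i <= (N - 1).

    Text positions outside 0..N-1 are treated as null (never matching).
    Brute-force sketch for illustration, not the paper's algorithm.
    """
    n, m = len(text), len(pattern)
    hits = {}
    for i in range(1 - m, n):
        count = 0
        for j, p in enumerate(pattern):
            k = i + j                      # text index aligned with pattern[j]
            if 0 <= k < n and text[k] == p:
                count += 1
        hits[i] = count
    return hits


if __name__ == "__main__":
    T, P = "DBCDAB", "DABCD"
    hits = hits_extended(T, P)
    for i in sorted(hits):
        # matches and the corresponding number of mismatches (M - matches)
        print(i, hits[i], len(P) - hits[i])
```

For T = DBCDAB and P = DABCD, the sketch reports one match at shift -4 and three matches at shifts -1 and 3, in agreement with the example above.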