Table 1: Experimental Results (number of characters = 64900).

Keyword              Naive    Extended BMH  Extended BMH  Recall  Precision
                     (#comp)  (#comp)       (#skip)       (%)     (%)
Prime Minister       79127.8  21973.2       17649.6       88.57   80.87
Potsdam Declaration  78833.6  16850.2       13488.0       61.82   100.00
is not assured to be perfect. Although it is possible
to improve the accuracy by increasing the number of
trials or adjusting other parameters, this involves a
trade-off with the computational complexity. This is
the nature of randomized algorithms.
As Table 1 illustrates, the number of character-to-character
comparisons made by the Extended BMH algorithm is lower
than that of the naive method for both keywords. Moreover,
it is lower than the total number of characters (64900),
which means that retrieval in sublinear time was
accomplished.
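To illustrate how a skip-based search can need fewer comparisons than there are characters in the text, the following is a minimal sketch of the classic Boyer-Moore-Horspool algorithm on ordinary character strings, next to a naive search. It is not the Extended BMH algorithm evaluated above, which operates on sequences of real vectors; the function names and the comparison counters are assumptions made for illustration only.

```python
def naive_search(text, pattern):
    """Naive search; returns (match positions, number of comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    matches, comparisons = [], 0
    for pos in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1  # one character-to-character comparison
            if text[pos + j] != pattern[j]:
                break
            j += 1
        if j == m:
            matches.append(pos)
    return matches, comparisons


def horspool_search(text, pattern):
    """Classic BMH search; returns (match positions, number of comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    # Skip table: for each character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches, comparisons = [], 0
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0:  # compare right to left
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            matches.append(pos)
        # Skip according to the text character aligned with the pattern end;
        # characters that do not occur in the pattern allow a full skip of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches, comparisons


if __name__ == "__main__":
    text = "abcdefghij" * 100 + "needle" + "abcdefghij" * 100
    for search in (naive_search, horspool_search):
        found, comps = search(text, "needle")
        print(search.__name__, found, comps, "comparisons for", len(text), "characters")
```

In this toy setting both functions report the same match positions, while the skip table lets the BMH variant inspect only a fraction of the text.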
However, the number of evaluations of the skip function
must be considered as an additional cost that the
Extended BMH algorithm incurs and the naive method does
not. Although one ‘comp’ process and one ‘skip’ process
are not exactly equal in cost (the former evaluates the
match/non-match of d′ integers, while the latter obtains
the minimum among d′ integers), we can roughly estimate
the cost of the Extended BMH algorithm by simply adding
(#skip) and (#comp). In this case, too, we can conclude
that the computational cost of the Extended BMH algorithm
is below that of the naive method.
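The rough estimate described above can be checked directly against Table 1. The short sketch below adds (#comp) and (#skip) for each keyword and compares the sum with the naive count; the variable and function names are illustrative, and treating one ‘comp’ and one ‘skip’ as having equal cost is the simplifying assumption stated in the text.

```python
# Values copied from Table 1.
TEXT_LENGTH = 64900
results = {
    "Prime Minister": {"naive_comp": 79127.8, "ext_comp": 21973.2, "ext_skip": 17649.6},
    "Potsdam Declaration": {"naive_comp": 78833.6, "ext_comp": 16850.2, "ext_skip": 13488.0},
}


def rough_cost(row):
    """Rough Extended BMH cost: #comp + #skip, each counted as one unit."""
    return row["ext_comp"] + row["ext_skip"]


if __name__ == "__main__":
    for keyword, row in results.items():
        total = rough_cost(row)
        ratio = total / row["naive_comp"]
        print(f"{keyword}: {total:.1f} vs naive {row['naive_comp']:.1f} "
              f"({ratio:.0%} of naive)")
```

Under this unit-cost assumption the totals are 39622.8 and 30338.2, roughly 50% and 38% of the naive counts, and both remain below the text length of 64900.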
6 CONCLUSIONS
In this paper we have proposed an Extended BMH
algorithm, developed from the BM and BMH algorithms,
to allow searching for specific strings based on a
sequence of real vectors. As an example of its
application, we performed a string matching experiment
on low-quality document images for which optical
character recognition does not work well. The results
showed that the algorithm reduces the computational
cost compared with the naive method.
Our future work will focus on developing an efficient
algorithm that realizes Inexact Matching rather than
Exact Matching. Such an algorithm would make it possible
to develop a fast LSPC-based method applicable to more
difficult problems, such as string matching of
handwritten documents.
REFERENCES
Andoni, A. and Indyk, P. (2006). Near-optimal hashing
algorithms for approximate nearest neighbor in high
dimensions. In Proc. Symposium on Foundations of
Computer Science, FOCS’06, pp. 459–468.
Boyer, R. S. and Moore, J. S. (1977). A fast string searching
algorithm. In Communications of the ACM, vol. 20,
pp. 762–772.
Datar, M., Indyk, P., Immorlica, N., and Mirrokni, V.
(2004). Locality-sensitive hashing scheme based on p-
stable distributions. In Proc. 20th ACM Symposium on
Computational Geometry, SoCG2004, pp. 253–262.
Gionis, A., Indyk, P., and Motwani, R. (1999). Similarity
search in high dimensions via hashing. In Proc.
25th Int. Conf. on Very Large Data Bases, VLDB1999,
pp. 518–529.
Horspool, R. N. (1980). Practical fast searching in strings.
In Software – Practice & Experience, vol. 10, issue 6,
pp. 501–506.
Lowe, D. G. (2004). Distinctive image features from
scale-invariant keypoints. In International Journal of
Computer Vision, vol. 60, no. 2, pp. 91–110.
Terasawa, K., Nagasaki, T., and Kawashima, T. (2006).
Improved handwritten text retrieval using gradient
distribution features (written in Japanese). In Proc.
Meeting on Image Recognition and Understanding,
MIRU2006, pp. 1325–1330.
Terasawa, K. and Tanaka, Y. (2007a). Locality sensitive
pseudo-code for document images. In Proc. 9th Int.
Conf. on Document Analysis and Recognition,
ICDAR2007, vol. 1, pp. 73–77.
Terasawa, K. and Tanaka, Y. (2007b). Spherical LSH for
approximate nearest neighbor search on unit
hypersphere. In Proc. 10th Workshop on Algorithms and
Data Structures, WADS2007, LNCS4619, pp. 27–38.
THE EXTENDED BOYER-MOORE-HORSPOOL ALGORITHM FOR LOCALITY-SENSITIVE PSEUDO-CODE