Figure 10: Running time comparison of the HD-tree and
the Prefix B-tree; average query string length is 6.
[Plot: x-axis, Number of Distinctive Queries; y-axis,
Total Running Time (seconds).]
B-tree in terms of total running time, including both
the RAM processing time and the I/O time. The
experiments were conducted in the same computing
environment (a Linux PC with 512MB RAM and a
1.8GHz Pentium 4 processor). Figure 10 shows the
running time of the HD-tree and the Prefix B-tree
for 1000 queries with different numbers of distinctive
queries. The actual running time of the HD-tree is
comparable to that of the Prefix B-tree even when the
1000 queries are all the same. The reason is that, with
a large amount of RAM available, the operating system
provides LRU caching for the HD-tree as well. The
HD-tree becomes increasingly faster than the Prefix
B-tree as the number of distinctive queries increases.
For 1000 distinctive queries, the HD-tree is more than
an order of magnitude faster than the Prefix B-tree.
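The dependence of running time on the number of distinctive queries can be illustrated with a small simulation. The sketch below is not the experimental setup used here: it assumes a plain LRU cache, treats each cache miss as one disk access, and uses a hypothetical capacity of 64 entries. The function names `count_misses` and `make_workload` are introduced only for illustration.

```python
from collections import OrderedDict
import random


def count_misses(workload, capacity):
    """Replay a query workload through an LRU cache and count cache misses.

    Each miss stands in for one disk access (an assumption of this sketch).
    """
    cache = OrderedDict()
    misses = 0
    for query in workload:
        if query in cache:
            cache.move_to_end(query)  # mark as most recently used
        else:
            misses += 1
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return misses


def make_workload(total, distinct, seed=0):
    """Build `total` queries drawn from exactly `distinct` different keys."""
    rng = random.Random(seed)
    keys = list(range(distinct))
    # Include every key once, then fill the rest with random repeats.
    workload = keys + [rng.choice(keys) for _ in range(total - distinct)]
    rng.shuffle(workload)
    return workload


if __name__ == "__main__":
    # Hypothetical cache capacity; the real OS cache size is not modeled.
    for distinct in (1, 10, 100, 1000):
        workload = make_workload(1000, distinct)
        print(distinct, count_misses(workload, capacity=64))
```

With few distinctive queries the whole working set fits in the cache, so almost every query is a hit regardless of the index structure; as the number of distinctive queries grows past the cache capacity, misses dominate and the cost of each uncached lookup becomes the deciding factor.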
4 CONCLUSION
There is an increasing demand for efficient indexing
techniques to support various types of queries
on large string databases. Most existing string
indexing techniques are either RAM-based or disk-based.
RAM-based index structures are not suitable
for string matching queries on large databases when
only a limited amount of RAM is available. Disk-based
structures, on the other hand, can index large
databases but usually do not fully utilize the available
RAM.
The HD-tree is proposed as a novel hybrid
RAM/disk-based structure that takes advantage of the
strengths of both RAM-based and disk-based structures.
The HD-tree not only scales well with the sizes
of the RAM and the database, but is also efficient
for various types of queries. The experimental results
show that the HD-tree outperforms the Prefix B-tree
for prefix and substring searches. For random distinctive
queries, the number of disk I/Os is reduced by a
factor of two to three, while the running time is reduced
by an order of magnitude. Therefore, we conclude
that a hybrid RAM/disk-based index structure
such as the HD-tree is promising for supporting efficient
searches in large string databases whose indexes
cannot fit entirely in RAM.
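The general idea of splitting an index between RAM and disk can be sketched with a toy example. The code below is not the HD-tree and does not reproduce its layout or algorithms; it is a deliberately simplified illustration that assumes a RAM-resident directory keyed by the first few characters of each string, with the strings themselves stored in bucket files on disk. The class name `ToyHybridIndex`, the `directory_depth` parameter, and the pickle-based bucket files are all assumptions made for this sketch.

```python
import os
import pickle
import tempfile
from collections import defaultdict


class ToyHybridIndex:
    """RAM-resident prefix directory over disk-resident buckets (illustrative only)."""

    def __init__(self, strings, directory_depth=2, storage_dir=None):
        self.depth = directory_depth
        self.storage_dir = storage_dir or tempfile.mkdtemp()
        self.directory = {}  # RAM part: short prefix -> bucket file path
        self.disk_reads = 0  # number of bucket files loaded from disk

        # Group strings by their first `depth` characters.
        buckets = defaultdict(list)
        for s in strings:
            buckets[s[:self.depth]].append(s)

        # Disk part: write each group to its own bucket file.
        for i, (key, members) in enumerate(sorted(buckets.items())):
            path = os.path.join(self.storage_dir, f"bucket_{i}.pkl")
            with open(path, "wb") as f:
                pickle.dump(sorted(members), f)
            self.directory[key] = path

    def _load(self, path):
        self.disk_reads += 1
        with open(path, "rb") as f:
            return pickle.load(f)

    def prefix_search(self, prefix):
        """Return all indexed strings that start with `prefix`."""
        results = []
        for key, path in self.directory.items():
            # A bucket can hold matches only if its key and the query
            # prefix agree over their common length.
            n = min(len(key), len(prefix))
            if key[:n] == prefix[:n]:
                results.extend(s for s in self._load(path) if s.startswith(prefix))
        return sorted(results)
```

In this toy version, a query whose prefix is at least `directory_depth` characters long touches at most one bucket file, because the in-RAM directory rules out every other bucket without any disk access. A larger RAM budget would allow a deeper directory and smaller buckets, which is the kind of RAM/disk trade-off a hybrid structure is meant to exploit.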
THE HYBRID DIGITAL TREE: a new indexing technique for large string databases