increase in the encoding time as the file size grows.
For the ST encoder, we get a larger increase.
Using the same files, we analyzed the amount of
memory required by both encoders. The results are
depicted in Figure 9, and their analysis leads us to
conclude that the SA encoder needs a much smaller
amount of memory, which is the same for all files. The
ST encoder uses a variable amount of memory, and
an increase in the file size does not always imply an
increase in the amount of memory required.
7 CONCLUSIONS
In this work, we have explored the use of suffix trees
(ST) and suffix arrays (SA) for the Lempel-Ziv 77
family of data compression algorithms, namely LZ77
and LZSS. The use of ST and SA was evaluated in
different scenarios, using standard test files of differ-
ent types and sizes. Naturally, we focused on the en-
coder side, in order to see how we could perform an
efficient search without spending too much memory.
A comparison between the ST and the SA encoders
was carried out, using the following metrics: encod-
ing time, memory requirement, and compression ra-
tio. Our main conclusions are:
• ST-based encoders require more memory than the
SA counterparts;
• the memory requirement of ST- and SA-based
encoders is linear in the dictionary size; for the
SA-based encoders, it does not depend on the
contents of the file to be encoded;
• for small dictionaries, there is no significant
difference between ST and SA in terms of encoding
time and compression ratio;
• for larger dictionaries, ST-based encoders are
slower than SA-based ones; however, in this case,
the compression ratio with ST is slightly better
than the one with SA.
These results support the claim that the use of SA
is a very competitive choice when compared to ST,
for Lempel-Ziv compression. The memory requirement
of the SA is known exactly in advance, since it depends on
the dictionary length. In application scenarios where
the length of the dictionary is large and the available
memory is scarce (e.g., a mobile device), it is prefer-
able to use SA instead of ST.
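As a rough illustration of why the SA memory requirement can be computed in advance, the following sketch estimates it from the dictionary length alone. The function name and the per-symbol and per-index sizes (1 byte and 4 bytes) are assumptions made for this example, not figures taken from our experiments.

```python
def sa_encoder_memory_bytes(dictionary_len, symbol_bytes=1, index_bytes=4):
    """Estimate the memory of an SA-based dictionary (hypothetical model).

    Assumes the encoder stores the dictionary itself plus one integer
    index per dictionary position. Both sizes are illustrative defaults.
    """
    return dictionary_len * (symbol_bytes + index_bytes)


if __name__ == "__main__":
    # The estimate grows linearly with the dictionary length and never
    # looks at the contents of the file being encoded.
    for length in (1024, 4096, 32768):
        print(length, sa_encoder_memory_bytes(length))
```

Under this model, doubling the dictionary length doubles the estimate, whatever the file contents.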
As future work, we intend to develop the SA
encoder further, combining LCP with the simple accelerant and
super accelerant (Gusfield, 1997, pp. 152–153), to
speed up the search over the dictionary. This issue is
of greater importance for large dictionaries.
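To make the idea concrete, the sketch below searches a suffix array for the longest prefix of the lookahead buffer, in the spirit of the simple accelerant: each comparison skips the symbols already known to match both ends of the current search interval. It is a minimal illustration under stated assumptions, not the encoder evaluated in this work. The function names are hypothetical, the suffix array is built naively by sorting, and matches are restricted to the dictionary (they do not extend into the lookahead buffer).

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_from(text, pos, pattern, start):
    # Common prefix length of text[pos:] and pattern, given that the first
    # `start` symbols are already known to match.
    k = start
    while pos + k < len(text) and k < len(pattern) and text[pos + k] == pattern[k]:
        k += 1
    return k


def pattern_is_greater(text, pos, pattern, k):
    # True if pattern sorts strictly after text[pos:], where k is their
    # common prefix length.
    if k == len(pattern):
        return False  # pattern is a prefix of the suffix
    if pos + k == len(text):
        return True   # suffix is a proper prefix of pattern
    return pattern[k] > text[pos + k]


def longest_match(text, sa, pattern):
    """Return (position, length) of the longest prefix of pattern in text."""
    if not sa or not pattern:
        return (0, 0)
    lo, hi = 0, len(sa) - 1
    l = lcp_from(text, sa[lo], pattern, 0)
    r = lcp_from(text, sa[hi], pattern, 0)
    if not pattern_is_greater(text, sa[lo], pattern, l):
        return (sa[lo], l)  # pattern sorts at or before the first suffix
    if pattern_is_greater(text, sa[hi], pattern, r):
        return (sa[hi], r)  # pattern sorts after the last suffix
    # Invariant: suffix at lo < pattern <= suffix at hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Every suffix between lo and hi shares min(l, r) symbols with pattern.
        m = lcp_from(text, sa[mid], pattern, min(l, r))
        if pattern_is_greater(text, sa[mid], pattern, m):
            lo, l = mid, m
        else:
            hi, r = mid, m
    # The longest match is with one of the two neighbours of the pattern.
    return (sa[lo], l) if l >= r else (sa[hi], r)


if __name__ == "__main__":
    dictionary = "abracadabra"
    sa = build_suffix_array(dictionary)
    print(longest_match(dictionary, sa, "abrax"))
    print(longest_match(dictionary, sa, "cad"))
```

The super accelerant would additionally use precomputed LCP values between suffix array entries to avoid most of the remaining symbol comparisons; that refinement is not shown here.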
REFERENCES
Abouelhoda, M., Kurtz, S., and Ohlebusch, E. (2004). Re-
placing suffix trees with enhanced suffix arrays. Jour-
nal of Discrete Algorithms, 2(1):53–86.
Fiala, M. and Holub, J. (2008). DCA using suffix arrays. In
Data Compression Conference DCC2008, page 516.
Gusfield, D. (1997). Algorithms on Strings, Trees and Se-
quences. Cambridge University Press.
Kärkkäinen, J., Sanders, P., and Burkhardt, S. (2006). Linear
work suffix array construction. Journal of the ACM,
53(6):918–936.
Larsson, N. (1996). Extended application of suffix trees to
data compression. In Data Compression Conference,
page 190.
Larsson, N. (1999). Structures of String Matching and Data
Compression. PhD thesis, Department of Computer
Science, Lund University, Sweden.
Manber, U. and Myers, G. (1993). Suffix arrays: a new
method for on-line string searches. SIAM Journal on
Computing, 22(5):935–948.
McCreight, E. (1976). A space-economical suffix tree con-
struction algorithm. Journal of the ACM, 23(2):262–
272.
Sadakane, K. (2000). Compressed text databases with effi-
cient query algorithms based on the compressed suffix
array. In ISAAC’00, volume LNCS 1969, pages 410–
421.
Salomon, D. (2007). Data Compression: The Complete
Reference. Springer-Verlag London Ltd, London, fourth
edition.
Šesták, R., Lánský, J., and Žemlička, M. (2008). Suffix array
for large alphabet. In Data Compression Conference
DCC2008, page 543.
Storer, J. and Szymanski, T. (1982). Data compression via
textual substitution. Journal of the ACM, 29(4):928–951.
Ukkonen, E. (1995). On-line construction of suffix trees.
Algorithmica, 14(3):249–260.
Weiner, P. (1973). Linear pattern matching algorithm. In
14th Annual IEEE Symposium on Switching and Au-
tomata Theory, volume 27, pages 1–11.
Zhang, S. and Nong, G. (2008). Fast and space efficient
linear suffix array construction. In Data Compression
Conference DCC2008, page 553.
Ziv, J. and Lempel, A. (1977). A universal algorithm for
sequential data compression. IEEE Transactions on
Information Theory, IT-23(3):337–343.
SIGMAP 2008 - International Conference on Signal Processing and Multimedia Applications