Compressing Inverted Files using Modified LZW

Vasileios Iosifidis, Christos Makris

Abstract

In this paper, we present a compression algorithm based on a modification of the well-known Lempel-Ziv-Welch (LZW) algorithm; it creates an index that treats terms as characters and stores encoded document identifier patterns efficiently. We also equip our approach with a set of preprocessing methods (reassignment of document identifiers, Gaps) and post-processing methods (Gaps, IPC encoding, GZIP) in order to attain more significant space improvements. We evaluate two different combinations of these steps to determine which one yields the best performance for our modified LZW algorithm. Experiments on the Wikipedia dataset demonstrate the superior space compaction of the proposed technique.
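
The abstract does not spell out the modified algorithm, so the following is only a minimal sketch of the two generic building blocks it names: converting a posting list to gaps, then running textbook LZW over the resulting integer sequence. It is not the authors' method. The function names (`to_gaps`, `from_gaps`, `lzw_encode`, `lzw_decode`) are hypothetical, and treating each gap value as one LZW symbol is an assumption made purely for illustration.

```python
def to_gaps(doc_ids):
    """Turn a strictly increasing docID list into gaps (first value kept as-is)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def lzw_encode(symbols):
    """Textbook LZW over integer symbols (assumption: one gap = one symbol).

    Returns (alphabet, codes); the alphabet seeds the dictionary on both sides.
    """
    alphabet = sorted(set(symbols))
    table = {(s,): i for i, s in enumerate(alphabet)}
    codes, current = [], ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            codes.append(table[current])     # emit longest known pattern
            table[candidate] = len(table)    # learn the new pattern
            current = (s,)
    if current:
        codes.append(table[current])
    return alphabet, codes


def lzw_decode(alphabet, codes):
    """Rebuild the symbol sequence, regrowing the same dictionary."""
    if not codes:
        return []
    table = [(s,) for s in alphabet]
    previous = table[codes[0]]
    out = list(previous)
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # Code not yet in the table: it must be previous + its own first symbol.
            entry = previous + (previous[0],)
        out.extend(entry)
        table.append(previous + (entry[0],))
        previous = entry
    return out


doc_ids = [3, 6, 9, 12, 15, 18, 21, 24]      # illustrative posting list
gaps = to_gaps(doc_ids)
alphabet, codes = lzw_encode(gaps)
restored = from_gaps(lzw_decode(alphabet, codes))
```

In this toy posting list every gap is identical, so the dictionary quickly learns long runs and the code sequence is shorter than the gap sequence. Real posting lists are far less regular, which is presumably why the abstract pairs the approach with document identifier reassignment and further post-processing.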


Paper Citation


in Harvard Style

Iosifidis V. and Makris C. (2016). Compressing Inverted Files using Modified LZW. In Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-186-1, pages 156-163. DOI: 10.5220/0005857201560163


in Bibtex Style

@conference{webist16,
author={Vasileios Iosifidis and Christos Makris},
title={Compressing Inverted Files using Modified LZW},
booktitle={Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2016},
pages={156-163},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005857201560163},
isbn={978-989-758-186-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Compressing Inverted Files using Modified LZW
SN - 978-989-758-186-1
AU - Iosifidis V.
AU - Makris C.
PY - 2016
SP - 156
EP - 163
DO - 10.5220/0005857201560163