Handling Weighted Sequences Employing Inverted Files and Suffix Trees

Klev Diamanti, Andreas Kanavos, Christos Makris, Thodoris Tokis

Abstract

In this paper, we address the problem of handling weighted sequences by taking advantage of the inverted files machinery. We target text processing applications in which the documents cannot be separated into words (such as texts representing biological sequences) or in which word separation is difficult and requires extra linguistic knowledge (such as texts in Asian languages). Besides handling weighted sequences using n-grams, we also study the construction of space-efficient n-gram inverted indexes. The proposed techniques combine classic, straightforward n-gram indexing with the recently proposed two-level n-gram inverted file technique. The outcome is a set of new data structures for n-gram indexing that consume less space than existing ones. Our experimental results are encouraging and indicate that these techniques handle n-gram indexes more space-efficiently than existing methods.
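
For readers unfamiliar with the classic n-gram inverted index that the abstract builds on, the following is a minimal Python sketch. It is not the authors' implementation: the function names `build_ngram_index` and `find` are hypothetical, the sequences are plain (unweighted) strings, and the two-level scheme is not modeled. It only illustrates the baseline idea of mapping every length-n substring to a posting list of (document, offset) pairs, which works without any word segmentation.

```python
from collections import defaultdict


def build_ngram_index(docs, n):
    """Map each n-gram to a posting list of (doc_id, offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Slide a window of length n over the text, one character at a time.
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].append((doc_id, offset))
    return index


def find(index, docs, pattern, n):
    """Return (doc_id, offset) pairs where pattern occurs; needs len(pattern) >= n."""
    if len(pattern) < n:
        raise ValueError("pattern must be at least n characters long")
    # The first n-gram of the pattern yields candidate positions;
    # each candidate is then verified against the stored text.
    candidates = index.get(pattern[:n], [])
    return [(d, o) for d, o in candidates if docs[d].startswith(pattern, o)]


docs = ["ACGTACGT", "TTACG"]  # toy sequences, assumed for illustration
index = build_ngram_index(docs, 3)
matches = find(index, docs, "ACGT", 3)
```

Because every position of every document contributes one posting, the index grows with the total text length. That redundancy is the space cost that motivates more compact n-gram index layouts.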



Paper Citation


in Harvard Style

Diamanti K., Kanavos A., Makris C. and Tokis T. (2014). Handling Weighted Sequences Employing Inverted Files and Suffix Trees. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 231-238. DOI: 10.5220/0004788502310238


in Bibtex Style

@conference{webist14,
author={Klev Diamanti and Andreas Kanavos and Christos Makris and Thodoris Tokis},
title={Handling Weighted Sequences Employing Inverted Files and Suffix Trees},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={231-238},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004788502310238},
isbn={978-989-758-024-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - Handling Weighted Sequences Employing Inverted Files and Suffix Trees
SN - 978-989-758-024-6
AU - Diamanti K.
AU - Kanavos A.
AU - Makris C.
AU - Tokis T.
PY - 2014
SP - 231
EP - 238
DO - 10.5220/0004788502310238