a more compressed index. The second technique can
offer higher efficiency, especially when handling
large amounts of data. Moreover, this new approach
of treating the handling of DNA sequences as a
geometrical problem could lead in the future to
new and efficient ideas for DNA algorithms.