Handling Weighted Sequences Employing Inverted Files and Sufﬁx Trees

Klev Diamanti, Andreas Kanavos, Christos Makris and Thodoris Tokis

Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Sweden

Department of Computer Engineering and Informatics, University of Patras, Greece

Keywords:

Searching and Browsing, Web Information Filtering and Retrieval, Text Mining, Indexing Structures, Inverted Files, n-gram Indexing, Sequence Analysis and Assembly, Weighted Sequences, Weighted Suffix Trees.

Abstract:

In this paper, we address the problem of handling weighted sequences by taking advantage of the inverted files machinery. We target text processing applications where the involved documents cannot be separated into words (such as texts representing biological sequences) or where word separation is difficult and involves extra linguistic knowledge (such as texts in Asian languages). Besides handling weighted sequences using n-grams, we also study the construction of space efficient n-gram inverted indexes. The proposed techniques combine classic straightforward n-gram indexing with the recently proposed two-level n-gram inverted file technique. The final outcomes are new data structures for n-gram indexing that consume less space than the existing ones. Our experimental results are encouraging and indicate that these techniques handle n-gram indexes more space efficiently than existing methods.

1 INTRODUCTION

In this paper we focus on handling weighted se-

quences (Makris and Theodoridis, 2011). The diffe-

rence between weighted sequences and regular strings

is that in the former, we permit in each position the

appearance of more than one character, each with a

certain probability (Makris and Theodoridis, 2011).

Specifically, a weighted word $w = w_1 w_2 \cdots w_n$ is a sequence of positions, where each position $w_i$ consists of a set of couples; each couple has the form $(s, \pi_i(s))$, where $\pi_i(s)$ is the probability of having the character $s$ at position $i$. Also, for every position $w_i$, $1 \le i \le n$, $\sum_{s} \pi_i(s) = 1$. Moreover, it is usually assumed that a

possible subword is worth the effort to be examined if

the probability of its existence is larger than 1/k; with

k being a user-defined parameter. In order to handle weighted sequences, the Weighted Suffix Tree data structure was introduced (Iliopoulos et al., 2006). We consider this specific data structure a proper generalization of the suffix tree.
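To make the definition concrete, the following minimal Python sketch represents a weighted sequence as a list of per-position character distributions and computes the probability of a subword as the product of the per-position probabilities. The function names and the dictionary-based representation are illustrative assumptions and are not taken from the cited works; the sketch also accepts a probability exactly equal to 1/k, which is an assumption about the boundary case.

```python
from math import isclose, prod

def is_valid_weighted_sequence(ws):
    """Check that the probabilities at every position sum to 1."""
    return all(isclose(sum(pos.values()), 1.0, abs_tol=1e-9) for pos in ws)

def occurrence_probability(ws, start, word):
    """Probability that `word` occurs at 0-based offset `start` of `ws`."""
    if start < 0 or start + len(word) > len(ws):
        return 0.0
    return prod(ws[start + j].get(ch, 0.0) for j, ch in enumerate(word))

def worth_examining(ws, start, word, k):
    """Assumed test: keep a subword whose probability is at least 1/k."""
    return occurrence_probability(ws, start, word) >= 1.0 / k

# Hypothetical example: position 1 is either C or G with equal probability.
example = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"T": 1.0}]
```

With this example, the subword ACT at offset 0 has probability 0.5, so it would be kept for k = 2 but discarded for k = 1.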

The novelty in our approach is that, for the first time, we exploit inverted files and n-grams in the handling of weighted sequences, thus providing an alternative to weighted suffix trees for a variety of applications that involve weighted sequences. Such an inverted file alternative to approaches based on suffix arrays and suffix trees was missing from the literature, in contrast to traditional pattern search applications such as search engines, where both alternatives have been offered (see for example (Puglisi et al., 2006)). We do not delve into the details of various pattern matching operations but merely focus on how to space efficiently transform weighted sequences into normal ones and then handle them using the well known technique of n-grams. We target not only biological but also natural language applications. n-grams are sequences of consecutive text elements (either words or symbols); they are widely used in Information Retrieval (Ogawa and Iwasaki, 1995), (Lee and Ahn, 1996), (Navarro and Baeza-Yates, 1998), (Millar et al., 2000), (Navarro et al., 2000), (Navarro et al., 2001), (Gao et al., 2002), (Mayfield and McNamee, 2003), (Kim et al., 2007), (Yang et al., 2007), especially in applications employing text that cannot be separated into words.

The indexes produced with the n-gram inverted index technique have a number of advantages. One of them is that they work on any kind of sequence, even if the sequence consists of words which have no practical meaning, such as DNA and protein sequences. Moreover, the n-gram technique is language neutral, since it can be applied to different languages. Another major benefit is that this indexing method is error-tolerant, putting up with errors that occur during the construction of the index, because the index is constructed with the 1-sliding technique.

Diamanti K., Kanavos A., Makris C. and Tokis T..

Handling Weighted Sequences Employing Inverted Files and Suffix Trees.

DOI: 10.5220/0004788502310238

In Proceedings of the 10th International Conference on Web Information Systems and Technologies (WEBIST-2014), pages 231-238

ISBN: 978-989-758-024-6

© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

Nevertheless, the n-gram inverted index also has some drawbacks: its size tends to be very large and the performance of queries tends to be inefficient. For this reason, a considerable amount of research has been devoted to using this technique space efficiently (Kim et al., 2005), (du Mouza et al., 2009), (Tang et al., 2009).

In (Kim et al., 2005), an efficient method for constructing a two-level index is proposed. Specifically, this method significantly reduces the size of the index and improves the query performance compared to the straightforward n-gram inverted index technique, while preserving all the advantages of the n-gram inverted index. This technique extracts substrings of fixed length m from the original sequence and then applies the classic n-gram technique on each of those extracted substrings. As shown in (Kim et al., 2005), this technique can provide significant space improvements; however, as can be observed in our experimental results, when the original sequence is not repetitive enough, the performance of this two-level indexing technique deteriorates.

In detail, we propose three new techniques for handling weighted sequences using n-gram indexing. We additionally propose a new framework for space compaction that aims to address the aforementioned space shortcomings of (Kim et al., 2005). In our space efficient framework, instead of resorting to the two-level indexing scheme for the whole input, we judiciously select a set of substrings of the initial sequences; for the n-grams of these substrings we employ the two-level indexing scheme, while for the rest we employ the straightforward one-level indexing scheme. The substrings are selected based on the frequency of their appearance in the whole document set. We also implement and test different lengths of the substrings covering the initial sequences, as well as two distinct variants of the algorithmic scheme: a variant that selects these substrings employing a forest of suffix trees and a variant that employs a generalized suffix tree. It should be noted that these generalized suffix trees are the weighted suffix trees derived from the initial set of weighted sequences.

Furthermore, experiments on both synthetic and real data are performed in order to validate the performance of our constructions and the space reduction that they offer. Our work can be considered both an experimental study of weighted sequences and a survey validating the space efficiency of newly and previously proposed constructions in the area of n-gram indexing.

The rest of the paper is organized as follows. In section 2, the related work as well as our contribution is presented. In section 3, we present the techniques for handling weighted sequences. Subsequently, in section 4, we describe our space compaction heuristics. Section 5 presents our experimental results. Finally, section 6 concludes the paper and outlines future steps and open problems.

2 RELATED WORK AND CONTRIBUTION

In (Christodoulakis et al., 2006), a set of efficient algorithms for string problems involving weighted sequences arising in the computational biology area was presented, adapting traditional pattern matching techniques to the weighted scenario. Moreover, in order to approximately match a pattern in a weighted sequence, a method was presented in (Amir et al., 2006) for the multiplicative model of probability estimation. In particular, two different definitions were given for the Hamming distance as well as for the edit distance in weighted sequences. Furthermore, we should refer to some more recent techniques (Zhang et al., 2010a), (Zhang et al., 2010b), (Alatabbi et al., 2012) that, besides extending previous approaches, also employ the Equivalence Class Tree for the problem at hand. Among these papers, the work in (Zhang et al., 2010a) deserves special mention, as it generalizes the approach in (Iliopoulos et al., 2006) so as to handle effectively various approximate and exact pattern matching problems in weighted sequences.

In addition, there is a connection with the probabilistic suffix tree, which is basically a stochastic model that employs a suffix tree as its index structure. This model aims to represent compactly the conditional distribution of probabilities for a set of sequences. Each node of the corresponding probabilistic suffix tree is associated with a probability vector that stores the probability distribution for the next symbol, given the label of the node as the preceding segment (Marsan and Sagot, 2000), (Sun et al., 2004).

In our work, we mainly employ the preprocessing techniques presented in (Iliopoulos et al., 2006), where an efficient data structure for computing string regularities in weighted sequences, called the Weighted Suffix Tree, was presented. Our approach, however, can also be modified to incorporate the techniques presented in (Zhang et al., 2010a).

The main motivation for handling weighted sequences comes from Computational Molecular Biology. However, there are possible applications in Cryptanalysis and musical texts (for a discussion, albeit for the related area of Indeterminate Strings, which are strings having sets of symbols in their positions, see (Holub and Smyth, 2003), (Holub et al., 2008)). In Cryptanalysis, undecoded symbols may be modeled as sets of letters with several probabilities, while in music, single notes may match chords or notes with several probabilities. In addition, our representation of n-grams and our space compaction heuristics are of a general nature and concern the efficient handling of multilingual documents in web search engines and, in general, in information retrieval applications.

Character n-grams are used especially in CJK (Chinese, Japanese and Korean) languages, which by nature cannot be easily separated into words. In these languages, 2-gram indexing seems to work well. For example, in (Manning et al., 2008) it is mentioned that in these languages the characters are more like syllables than letters and that most words consist of a small number of characters; in addition, the word boundaries are not always clear, and in these cases it is better to use n-grams. Moreover, n-grams are helpful in Optical Character Recognition, where the text is difficult to comprehend and it is not possible to introduce word breaks. Additionally, n-grams are useful in applications such as wildcard queries and spelling correction.

3 ALGORITHMS

We initially describe the n-gram based techniques for handling normal sequences, which are presented in (Kim et al., 2005). Then we explain how these can be adapted so that we can handle weighted sequences. The algorithm proposed in (Kim et al., 2005) tries to improve the straightforward inverted file scheme, which produces n-grams on the fly using a sliding window and stores them in an inverted file, by replacing it with a two-level scheme, which is shown to be more space efficient.
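As an illustration of the straightforward one-level scheme, the following minimal Python sketch slides a window of length n one symbol at a time over each sequence and records every n-gram in a posting list. The function names and the (document id, offset) posting format are assumptions made for this example and are not prescribed by (Kim et al., 2005).

```python
from collections import defaultdict

def ngrams(text, n):
    """Yield (offset, n-gram) pairs with a window sliding one symbol at a time."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]

def build_one_level_index(docs, n):
    """Map each n-gram to a posting list of (doc_id, offset) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, gram in ngrams(text, n):
            index[gram].append((doc_id, offset))
    return dict(index)

# Hypothetical usage on a single short DNA-like sequence.
index = build_one_level_index(["ACGAC"], 2)
```

In this example the 2-gram AC receives two postings (offsets 0 and 3), which shows how every occurrence of every n-gram is stored separately in the one-level scheme.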

In particular, this novel two-level scheme is based on the following approach: (i) each of the initial sequences is processed and a set of substrings of length m is extracted so that consecutive substrings overlap with each other by n − 1 symbols, (ii) an inverted index (called back-end index) is built for these substrings over the initial sequence set, considering the substrings as distinct words, (iii) all the n-grams in each of the substrings are extracted, (iv) an inverted index (called front-end index) is built, regarding the substrings as documents and the n-grams as words. This scheme, called n-gram/2L by its authors, can be applied to any text and, in some cases, results in significant space reduction.
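The four steps above can be sketched in Python as follows. This is a simplified illustration under our own assumptions, not the implementation of (Kim et al., 2005): the names are hypothetical, substrings are identified by integer ids, and the last substring of a sequence is simply left shorter than m instead of being padded.

```python
from collections import defaultdict

def split_into_substrings(text, m, n):
    """Cut `text` into pieces of length m that overlap by n - 1 symbols."""
    step = m - (n - 1)
    if step <= 0:
        raise ValueError("m must be at least n")
    pieces, i = [], 0
    while text:
        pieces.append((i, text[i:i + m]))
        if i + m >= len(text):
            break
        i += step
    return pieces

def build_two_level_index(docs, m, n):
    """Return (substring ids, back-end index, front-end index)."""
    ids = {}                        # distinct substring -> integer id
    back_end = defaultdict(list)    # substring id -> [(doc_id, offset)]
    front_end = defaultdict(list)   # n-gram -> [(substring id, offset inside it)]
    for doc_id, text in enumerate(docs):
        for offset, piece in split_into_substrings(text, m, n):
            if piece not in ids:
                ids[piece] = len(ids)
                # The n-grams of a distinct substring are indexed only once.
                for j in range(len(piece) - n + 1):
                    front_end[piece[j:j + n]].append((ids[piece], j))
            back_end[ids[piece]].append((doc_id, offset))
    return ids, dict(back_end), dict(front_end)

# Hypothetical usage: a repetitive sequence with m = 4 and n = 2.
ids, back_end, front_end = build_two_level_index(["ACGACGACGA"], 4, 2)
```

Here the sequence decomposes into three occurrences of the same substring ACGA, so the front-end index stores its three 2-grams once, while the back-end index stores only the three substring offsets. This is the source of the space savings when substrings repeat.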

If the text can be partitioned into words (natural language text), another scheme, termed n-gram/2L-v, is provided. In this scheme, the subsequences are defined as consecutive sequences of the text words, exploiting the intuitive remark that words exhibit repetitiveness in natural language text. Their experiments show that, when applied to natural text, n-gram/2L-v produces ample space savings compared to the initial technique.

We adapt their techniques by presenting three algorithms for handling weighted sequences; these algorithms are based on the technique presented in (Kim et al., 2005), adjusted to the problem at hand.

3.1 1st Technique - Subsequences Identification

In the first technique, we form separate sequences by splitting each weighted sequence into weighted substrings, each of length m. Each of these weighted substrings is used to produce normal substrings by employing the normal substrings generation phase of (Iliopoulos et al., 2006) (p. 267, algorithm 2). In this phase, the generation of a substring stops when its cumulative probability has reached the 1/k threshold. The cumulative probability is calculated by multiplying the relative probabilities of appearance of each character in every position. Each produced substring has maximum size m, and for every substring we produce all the possible n-grams. After this procedure, we store all the produced n-grams in the n-gram/2L-v scheme.

Concerning the generation phase, all the positions in the weighted sequences are thoroughly scanned and, at each branching position, a list of possible substrings starting from this position is created. Then, moving from left to right, the current subwords are extended by adding the same single character whenever a non-branching position is encountered; in contrast, new subwords are created at branching positions, where potentially many choices are supplied.
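The following Python sketch illustrates the idea of the generation phase on a single weighted segment. It is a deliberately simplified assumption-based version and not algorithm 2 of (Iliopoulos et al., 2006): it only enumerates the strings that start at the first position of the segment, it branches recursively at every position with more than one character, and it stops extending a string as soon as the cumulative probability would drop below 1/k. All names are hypothetical.

```python
def generate_substrings(segment, k):
    """Enumerate (string, probability) pairs spelled by a weighted segment.

    `segment` is a list of dicts mapping a character to its probability.
    A string is extended while its cumulative probability stays >= 1/k.
    """
    threshold = 1.0 / k
    results = []

    def extend(pos, prefix, prob):
        if pos == len(segment):
            results.append((prefix, prob))
            return
        extended = False
        for ch, p in sorted(segment[pos].items()):
            if prob * p >= threshold:      # branch only on viable characters
                extended = True
                extend(pos + 1, prefix + ch, prob * p)
        if not extended and prefix:
            # No viable continuation: report the maximal string found so far.
            results.append((prefix, prob))

    extend(0, "", 1.0)
    return results

# Hypothetical segment of length m = 3 with two branching positions.
segment = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"A": 0.5, "T": 0.5}]
```

For k = 2 the generation stops after the second position and yields AC and AG, whereas for k = 4 all four strings of length 3 are produced. The n-grams of each produced string can then be extracted with the usual sliding window.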

3.2 2nd Technique - On the fly n-grams Identification

This technique is much simpler, as we do not need to deploy all the generic sequences. Unlike the previous technique, we just need to produce all the possible n-grams and then report, for each one, its corresponding weighted sequences as well as their offsets. As a matter of fact, we do not have to form separate sequences, as in the previous approach; instead, we only split each generalized sequence into segments, each of size m, and for each segment just produce the requested n-grams.

Hence, this particular scheme is by nature one-level, and we propose its use due to its simplicity. However, as will be highlighted in the experiments, there are cases where this technique outperforms the previous one in terms of space complexity.
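A minimal Python sketch of producing n-grams directly from a weighted sequence is given below. It is an illustration under our own assumptions: the segmentation into pieces of size m is omitted, the function name is hypothetical, and the 1/k probability threshold of the introduction is applied to each n-gram, which the description above does not state explicitly.

```python
from itertools import product
from math import prod

def weighted_ngrams(ws, n, k):
    """Return (n-gram, offset, probability) triples read off a weighted sequence.

    `ws` is a list of dicts mapping a character to its probability.
    Only n-grams with probability at least 1/k are reported (assumption).
    """
    threshold = 1.0 / k
    found = []
    for i in range(len(ws) - n + 1):
        window = ws[i:i + n]
        # Every combination of one character per position is a candidate n-gram.
        for combo in product(*(sorted(pos.items()) for pos in window)):
            p = prod(prob for _, prob in combo)
            if p >= threshold:
                found.append(("".join(ch for ch, _ in combo), i, p))
    return found

# Hypothetical weighted sequence with one branching position.
example = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"T": 1.0}]
grams = weighted_ngrams(example, 2, 2)
```

Each reported triple can be appended to a one-level posting list keyed by the n-gram, without ever materializing the normal sequences themselves.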

4 SPACE EFFICIENT INVERTED FILE IMPLEMENTATIONS FOR NORMAL SEQUENCES

Our crucial remark is that, in order for the n-gram/2L technique to provide space savings, the substrings into which the initial sequences are separated should appear a large number of times and should cover a broad extent of the initial sequences. Otherwise (e.g. if there is a large number of unique substrings), the space occupancy increases instead of shrinking.

Hence, it would be preferable to use a hybrid scheme instead of a purely two-level one, in which we extract from the initial sequences substrings that appear repetitively enough and cover a large extent of the initial sequences. For these specific substrings, we employ a two-level scheme, while for the remaining parts of the sequences we use the straightforward one-level representation. During this separation, we elongate each selected substring by n − 1 symbols, as in (Kim et al., 2005).

In order to achieve our goal and build a hybrid one- and two-level inverted index, we introduce three techniques:

4.1 One Simple Technique

A variant of the algorithm described in (Kim et al., 2005), called Hybrid indexing Algorithm version 0 - hybrid(0), is implemented. In this implementation, we store in the back-end inverted file of the two-level scheme only those substrings of length m whose number of occurrences is greater than a trigger. The user is asked to provide the value of the trigger; the trigger is set equal to 1 for the results presented in the corresponding section.

The substrings whose number of occurrences is less than or equal to the provided trigger are just decomposed into their n-grams and then saved in a one-level index. The substrings stored in the two-level scheme are also decomposed into their n-grams, which we forward to the front-end index of the two-level scheme.
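A possible reading of hybrid(0) is sketched below in Python. The sketch is an assumption-laden illustration rather than the implementation used in the experiments: names are hypothetical, substrings are keyed by their text instead of by an id, and the last substring of a sequence is left shorter than m.

```python
from collections import Counter, defaultdict

def split_into_substrings(text, m, n):
    """Cut `text` into pieces of length m that overlap by n - 1 symbols."""
    step = m - (n - 1)
    if step <= 0:
        raise ValueError("m must be at least n")
    pieces, i = [], 0
    while text:
        pieces.append((i, text[i:i + m]))
        if i + m >= len(text):
            break
        i += step
    return pieces

def build_hybrid0(docs, m, n, trigger=1):
    """Send substrings occurring more than `trigger` times to two levels."""
    pieces = [(doc_id, off, piece)
              for doc_id, text in enumerate(docs)
              for off, piece in split_into_substrings(text, m, n)]
    counts = Counter(piece for _, _, piece in pieces)
    back_end = defaultdict(list)    # substring -> [(doc_id, offset)]
    front_end = defaultdict(list)   # n-gram -> [(substring, offset inside it)]
    one_level = defaultdict(list)   # n-gram -> [(doc_id, offset)]
    for doc_id, off, piece in pieces:
        grams = [(j, piece[j:j + n]) for j in range(len(piece) - n + 1)]
        if counts[piece] > trigger:
            if piece not in back_end:          # index its n-grams only once
                for j, gram in grams:
                    front_end[gram].append((piece, j))
            back_end[piece].append((doc_id, off))
        else:
            for j, gram in grams:
                one_level[gram].append((doc_id, off + j))
    return dict(back_end), dict(front_end), dict(one_level)

# Hypothetical usage with the trigger value 1 mentioned in the text.
back_end, front_end, one_level = build_hybrid0(["ACGACGATTT"], 4, 2, trigger=1)
```

In this example the substring ACGA occurs twice and is therefore stored in the two-level part, while the unique substring ATTT contributes its n-grams directly to the one-level index.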

4.2 Two Techniques based on Suffix Trees

In these techniques, we locate substrings that (in contrast to hybrid(0)) can be of varying size, are highly repetitive and cover a large extent of the initial sequences. In order to locate them, we employ suffix trees (McCreight, 1976), which have been previously used in similar problems of locating frequent substrings (Gusfield, 1997). In particular, we provide two different variants of the implementation of our space efficient heuristic scheme. These two distinct versions share a common initial phase, while differing in their subsequent workings.

More analytically, we insert all the sequences in a generalized suffix tree, as described in (Gusfield, 1997), and then use this tree for counting the repetitions of each substring of the stored documents. This counting is performed during the building of the generalized suffix tree; after that, each node of the tree keeps the information concerning the repetitions of the substrings stored in it. Note that if the sequences have been produced by using mappings from weighted sequences, then the produced suffix tree is similar to the weighted suffix tree of the initial sequences.

Subsequently, in each iteration, our algorithm chooses a substring and a subset of its occurrences. These two objects are then included in the two-level index. The selection procedure is as follows:

1. The substring needs to have a length greater than or equal to s; s is the least acceptable length of a substring and is a user-defined parameter set at the start of the algorithm's execution.

2. The substring has to be highly repetitive. This means that it should have more than a specific number of occurrences (trigger) in the set of indexed documents; this trigger is also a user-defined parameter.

3. The appearances of the selected substring that are to be included in the two-level index should not overlap by more than half the length of the substring; i.e., if the substring has a length of 10 characters, consecutive appearances of this substring should not overlap on more than 5 characters. By setting this criterion, we keep only the discrete appearances of the selected substring.
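The three criteria can be illustrated with the short Python sketch below. The function names are hypothetical, and the greedy left-to-right choice of occurrences is an assumption: the criteria only state which overlaps are disallowed, not the order in which occurrences are examined.

```python
def is_candidate(substring, positions, s, trigger):
    """Criteria 1 and 2: minimum length s and more than `trigger` occurrences."""
    return len(substring) >= s and len(positions) > trigger

def select_occurrences(positions, length):
    """Criterion 3: keep occurrences that overlap the previously kept one
    by at most half of the substring length (greedy, left to right)."""
    kept = []
    for pos in sorted(positions):
        overlap = kept[-1] + length - pos if kept else 0
        if 2 * overlap <= length:
            kept.append(pos)
    return kept

# Hypothetical occurrences of a substring of length 10 in one sequence.
kept = select_occurrences([0, 3, 5, 12, 30], 10)
```

With a substring of length 10, the occurrence at offset 3 overlaps the one at offset 0 on 7 characters and is dropped, whereas the occurrence at offset 5 overlaps it on exactly 5 characters and is kept.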

Figure 1: Visualizing hybrid(1) and hybrid(2) techniques.

After the end of the procedure, we have selected a collection of substrings. We then sort this collection based on the total length of the original sequences that the distinct occurrences cover (according to criterion 3). Furthermore, we select as best the occurrences of the specific substring that cover the majority of the length of the initial sequences. We extract all these occurrences from the initial sequences and include them in the two-level index. As a result, the initial sequences are split into a set of partitions that are not included in the two-level index. Next, we elongate these partitions by n − 1 symbols, where n is the n-gram length, so as not to miss any n-gram. Finally, we keep all these elongated substrings in a list. This completes the preprocessing step, which allows us to follow one of the two methods described below (see the procedure in Fig. 1):

(i) Hybrid Indexing Algorithm version 1 - hybrid(1). We construct a separate suffix tree for each elongated substring and process it with the same selection method as above. Then, our algorithm continues executing the process for each suffix tree constructed in this way. This process is repeated as many times as the user chooses at the beginning of the algorithm execution.

(ii) Hybrid Indexing Algorithm version 2 - hybrid(2). We include all the elongated substrings mentioned above in a unified generalized suffix tree. Then, our algorithm executes the process for the constructed generalized suffix tree. This process is repeated as many times as requested. Generally, the more recursions we made, the better the results we obtained; however, because of the limited system resources, we opted for 50 recursions in our experiments.

5 EXPERIMENTS

5.1 Experimental Setting

In our experiments, we used random weighted sequences to test our n-gram mapping techniques, as well as one file (of size 1 GB) containing Protein data and DNA data to test our space compaction heuristics. We also performed experiments with 10MB and 100MB of data, with similar results. Due to lack of space, only figures and comments for the 1GB data are presented in the main body of the article. Our experimental data were downloaded from the NCBI databases (ftp://ftp.ncbi.nih.gov/genomes/). Furthermore, we use initials to designate both m (length of substrings) and the parameter s (size in bytes) in our space compaction heuristics.

The experiments were performed on a computer system with an Intel Core i5-2410M 2.3 GHz CPU and 3GB of RAM (1x1GB and 1x2GB in 2xDual Channel). The techniques we implemented and applied on the experimental data mentioned above were:

1. Weighted Sequences Identiﬁcation:

(i) Subsequences Identiﬁcation,

(ii) On the ﬂy n-grams Identiﬁcation and

(iii) Ofﬂine Identiﬁcation.

2. Space compaction heuristics:

(i) One-Level Inverted File (using the classic

straightforward technique),

(ii) Two-Level Inverted File (using the technique

in (Kim et al., 2005)),

(iii) Hybrid Inverted File using the Simple Tech-

nique - hybrid(0),

(iv) Hybrid Inverted File with separate sufﬁx trees

- hybrid(1) and

(v) Hybrid Inverted File with a uniﬁed generalized

sufﬁx tree - hybrid(2).

For our space compaction heuristics, we ran all the techniques proposed in this paper (hybrid(0), hybrid(1) and hybrid(2)) in order to identify the most space efficient solution available. In order to assess the space compaction effectiveness of our approach, we applied it to real data of significant size and performed several experiments. As the experiments show, our approach reduces the space occupancy and constitutes a considerable improvement.

5.2 Weighted Sequences Results

As depicted in Fig. 2, the offline approach is the worst in terms of the attained space complexity, as expected. The reason is that all possible combinations of sequences are produced, not only those that are needed by the two-level scheme. On the other hand, the offline approach is more flexible, since it can incorporate different values of the variables n and s.

Figure 2: Weighted Sequences 10MB for varying size of s

(a) n=2, (b) n=3 and (c) n=4.

With regard to the other two techniques, the on the fly approach is the most robust and stable in performance, due to its fixed algorithmic behavior when handling every possible input. The identification of the subsequences, although better for small values of s, behaves worse for larger values. This can be attributed to the shortage of repetitions as the value of s increases; repetitions are a vital ingredient of the success of this method's heuristic.

5.3 Protein Data Results

In the performed experiments, we never needed to make more than 50 recursions, as by this number we had obtained the best possible results from the index method. Moreover, we ran experiments with substrings of length from 4 to 10, in order to demonstrate the improvements that the two-level technique produces in the inverted file size.

Our hybrid(2) technique seems to be less efficient than hybrid(1). Although it theoretically handles highly repetitive sequences more efficiently than the hybrid(1) technique, it does not seem to produce satisfactory results. A probable explanation is that, by using separate suffix trees, hybrid(1) permits more choices in the sequences that will be selected for separate indexing than the generalized suffix tree, which demands the selection of the same substring across different substrings. Furthermore, the hybrid(2) technique is sensitive to the number of performed recursions and needs a vast number of them to work effectively.

Figure 3: Protein Data 1GB for varying size of s (a) n=2,

(b) n=3 and (c) n=4.

Another finding is that the hybrid(0) technique is quite similar to the two-level technique for substrings of length 4 and 5; beyond that, it is not as efficient as our hybrid(1) technique. This behavior can be explained by the fact that hybrid(0) takes advantage of the positive characteristics of the two-level technique as long as it is better than the one-level technique; otherwise, it resorts to the one-level technique.

Generally, on Protein data, our methods achieve better results because they take advantage of the repetitiveness of the initial sequence even when the number of repetitions is quite low. This does not hold for the two-level scheme, whose performance is clearly degraded.

5.4 DNA Data Results

In the results shown below, the maximum number of recursions was fixed to 50 for each experiment. In the case of DNA data, we experimented with substrings of length from 4 to 13. We examined more substring sizes so as to clarify the inefficiency of the two-level technique when the repetitiveness becomes lower. The two-level technique increases the size of the produced inverted file when the substring length becomes larger than 11.

Analyzing the DNA data results presented in Fig. 4, we can see that our hybrid(1) technique is not as efficient as the two-level index for substring lengths from 6 to 11. The reason for this inefficiency is that the two-level index takes advantage of the substrings of length from 6 to 11, which seem to be highly repetitive in the DNA sequences examined. As soon as the size of the substring becomes lower than 6 or larger than 11, our method becomes clearly better. This occurs because the DNA data file used is not so highly repetitive for subsequences of length <6 or >11.

Figure 4: DNA Data 1GB for varying size of s (a) n=2, (b)

n=3 and (c) n=4.

In cases where the two-level technique performs better than hybrid(1), we use hybrid(0) to store our data. Hybrid(0) performs very similarly to the two-level technique; the differences between the files produced by these two techniques are negligible. This phenomenon is due to the highly repetitive nature of DNA data (the limited alphabet) on limited size sequences.

As for our hybrid(2) method, it is again less efficient than hybrid(1); this was also noted for Protein data and can be explained in a similar way. Perhaps a better tuning of the involved algorithmic parameters and a combination with hybrid(1) would result in a more efficient scheme, but this is left as future work.

By choosing the hybrid(0) or hybrid(1) technique to save the DNA data in inverted indexes, we obtain very compact inverted file sizes. These sizes generally outperform, or at least approximate, those of the two-level index.

In conclusion, our experiments indicate that our techniques can significantly reduce the space occupied by n-gram indexes and constitute considerable improvements.

6 GENERAL CONCLUSIONS AND FUTURE WORK

In this article, we presented a set of algorithmic techniques for efficiently handling weighted sequences by using inverted files and the n-gram machinery. Three techniques, which act as alternatives to other techniques that mainly use suffix trees, were presented. We furthermore completed our discussion by presenting a general framework that can be employed to reduce the space complexity of the two-level inverted files for n-grams.

In the future, we intend to experiment with various inverted file intersection algorithms (Culpepper and Moffat, 2010), in order to test the time efficiency of our scheme when handling such queries. We could perhaps also incorporate some extra data structures, such as those in (Kaporis et al., 2003). Last but not least, we also plan to apply our technique to natural language texts.

ACKNOWLEDGEMENTS

This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Thales. Investing in knowledge society through the European Social Fund.

REFERENCES

Alatabbi, A., Crochemore, M., Iliopoulos, C. S., and Okanlawon, T. A. (2012). Overlapping repetitions in weighted sequence. In International Information Technology Conference (CUBE), pp. 435-440.

Amir, A., Iliopoulos, C. S., Kapah, O., and Porat, E. (2006). Approximate matching in weighted sequences. In Combinatorial Pattern Matching (CPM), pp. 365-376.

Christodoulakis, M., Iliopoulos, C. S., Mouchard, L.,

Perdikuri, K., Tsakalidis, A. K., and Tsichlas, K.

(2006). Computation of repetitions and regularities of

biologically weighted sequences. In Journal of Com-

putational Biology (JCB), Volume 13, pp. 1214-1231.

Culpepper, J. S. and Moffat, A. (2010). Efﬁcient set inter-

section for inverted indexing. In ACM Transactions

on Information Systems (TOIS), Volume 29, Article 1.

du Mouza, C., Litwin, W., Rigaux, P., and Schwarz, T. J. E.

(2009). As-index: a structure for string search using

n-grams and algebraic signatures. In ACM Conference

on Information and Knowledge Management (CIKM),

pp. 295-304.

Gao, J., Goodman, J., Li, M., and Lee, K.-F. (2002). Toward a unified approach to statistical language modeling for Chinese. In ACM Transactions on Asian Language Information Processing, Volume 1, Number 1, pp. 3-33.

Gusﬁeld, D. (1997). Algorithms on Strings, Trees and Se-

quences: Computer Science and Computational Bio-

logy. Cambridge University Press.

Holub, J. and Smyth, W. F. (2003). Algorithms on indeter-

minate strings. In Australasian Workshop on Combi-

natorial Algorithms.

Holub, J., Smyth, W. F., and Wang, S. (2008). Fast pattern-

matching on indeterminate strings. In Journal of Dis-

crete Algorithms, Volume 6, pp. 37-50.

Iliopoulos, C. S., Makris, C., Panagis, Y., Perdikuri, K.,

Theodoridis, E., and Tsakalidis, A. K. (2006). The

weighted suffix tree: An efficient data structure for

handling molecular weighted sequences and its appli-

cations. In Fundamenta Informaticae (FUIN), Volume

71, pp. 259-277.

Kaporis, A. C., Makris, C., Sioutas, S., Tsakalidis, A. K.,

Tsichlas, K., and Zaroliagis, C. D. (2003). Improved

bounds for finger search on a RAM. In ESA, Volume

2832, pp. 325-336.

Kim, M.-S., Whang, K.-Y., and Lee, J.-G. (2007). n-

gram/2l-approximation: a two-level n-gram inverted

index structure for approximate string matching. In

Computer Systems: Science and Engineering, Volume

22, Number 6.

Kim, M.-S., Whang, K.-Y., Lee, J.-G., and Lee, M.-J.

(2005). n-gram/2l: A space and time efficient two-level

n-gram inverted index structure. In International

Conference on Very Large Databases (VLDB),

pp. 325-336.

Lee, J. H. and Ahn, J. S. (1996). Using n-grams for Korean

text retrieval. In ACM SIGIR, pp. 216-224.

Makris, C. and Theodoridis, E. (2011). Algorithms in

Computational Molecular Biology: Techniques, Ap-

proaches and Applications. Wiley Series in Bioinfor-

matics.

Manning, C. D., Raghavan, P., and Schütze, H. (2008).

Introduction to Information Retrieval. Cambridge

University Press.

Marsan, L. and Sagot, M.-F. (2000). Extracting structured

motifs using a suffix tree - algorithms and application

to promoter consensus identification. In International

Conference on Research in Computational Molecular

Biology (RECOMB), pp. 210-219.

Mayfield, J. and McNamee, P. (2003). Single n-gram

stemming. In ACM SIGIR, pp. 415-416.

McCreight, E. M. (1976). A space-economical suffix

tree construction algorithm. In Journal of the ACM

(JACM), Volume 23, pp. 262-272.

Millar, E., Shen, D., Liu, J., and Nicholas, C. K. (2000).

Performance and scalability of a large-scale n-gram

based information retrieval system. In Journal of Dig-

ital Information, Volume 1, Number 5.

Navarro, G. and Baeza-Yates, R. A. (1998). A practical q-

gram index for text retrieval allowing errors. In CLEI

Electronic Journal, Volume 1, Number 2.

Navarro, G., Baeza-Yates, R. A., Sutinen, E., and Tarhio,

J. (2001). Indexing methods for approximate string

matching. In IEEE Data Engineering Bulletin, Volume

24, Number 4, pp. 19-27.

Navarro, G., Sutinen, E., Tanninen, J., and Tarhio, J. (2000).

Indexing text with approximate q-grams. In Combina-

torial Pattern Matching (CPM), pp. 350-363.

Ogawa, Y. and Iwasaki, M. (1995). A new character-

based indexing organization using frequency data for

Japanese documents. In ACM SIGIR, pp. 121-129.

Puglisi, S. J., Smyth, W. F., and Turpin, A. (2006). Inverted

files versus suffix arrays for locating patterns in

primary memory. In String Processing and Information

Retrieval (SPIRE), pp. 122-133.

Sun, Z., Yang, J., and Deogun, J. S. (2004). MISAE: A new

approach for regulatory motif extraction. In Computa-

tional Systems Bioinformatics Conference (CSB), pp.

173-181.

Tang, N., Sidirourgos, L., and Boncz, P. A. (2009). Space-

economical partial gram indices for exact substring

matching. In ACM Conference on Information and

Knowledge Management (CIKM), pp. 285-294.

Yang, S., Zhu, H., Apostoli, A., and Cao, P. (2007). N-gram

statistics in English and Chinese: Similarities and

differences. In International Conference on Semantic

Computing (ICSC), pp. 454-460.

Zhang, H., Guo, Q., and Iliopoulos, C. S. (2010a). An al-

gorithmic framework for motif discovery problems in

weighted sequences. In International Conference on

Algorithms and Complexity (CIAC), pp. 335-346.

Zhang, H., Guo, Q., and Iliopoulos, C. S. (2010b). Varieties

of regularities in weighted sequences. In Algorithmic

Aspects in Information and Management (AAIM), pp.

271-280.

WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies
