Compressing Inverted Files using Modified LZW
Vasileios Iosifidis and Christos Makris
Department of Computer Engineering and Informatics, University of Patras, Rio 26500, Patra, Greece
Keywords: Inverted File, Compression, LZ78, LZW, GZIP, Binary Interpolative Encoding, Gaps, Reorder, Searching
and Browsing, Metrics and Performance.
Abstract: In this paper we present a compression algorithm that employs a modification of the well-known Lempel-Ziv-Welch (LZW) algorithm; it creates an index that treats the document identifiers in each term's list as characters and stores encoded
document identifier patterns efficiently. We also equip our approach with a set of pre-processing
{reassignment of document identifiers, Gaps} and post-processing methods {Gaps, IPC encoding, GZIP} in
order to attain more significant space improvements. We used two different combinations of these discrete
steps to see which one maximizes the performance of our modification of the LZW algorithm.
Experiments performed on the Wikipedia dataset depict the superiority in space compaction of the proposed
technique.
1 INTRODUCTION
Inverted files are considered to be the best structures
for indexing in information retrieval search engines
(Baeza-Yates and Ribeiro-Neto, 2011; Büttcher,
Clarke and Cormack, 2010). The main problem one
has to tackle with them in an information retrieval
system is that, as the number of documents in the
collection increases, the size of the data indices grows
significantly; hence, scalability in terms of efficient
compression techniques is a mandatory requirement.
At present, large-scale information retrieval
systems commonly use term-oriented inverted index
technology. An inverted index consists of two parts:
a search index storing the distinct terms of the
documents in the collection and, for each term, a list
storing the documents that contain the term. Each
document appears in this list either as a bare
identifier or accompanied by extra information, such
as the number of appearances of the term in the
document. When only the document identifiers
appear, the list is usually an ascending list of
identifiers so that it can be easily compressed
(Baeza-Yates and Ribeiro-Neto, 2011; Büttcher,
Clarke and Cormack, 2010; Witten, Moffat and Bell,
1999).
In this paper we envisage a new compression
scheme for inverted files that is based on an elegant
combination of previously published solutions. In
particular, we try to find common appearances inside
the terms' lists and store those common appearances
as encodings that require less space. Our basic idea
comes from two of the most widely used algorithms,
LZ78 (Ziv and Lempel, 1978) and LZW (Welch,
1984), and from a method that reassigns the
document identifiers of the corpus (Arroyuelo et al.,
2013).
As we describe later, the modified LZW tries to
find patterns inside the inverted file and encodes
them as numbers. The reassignment method
(Arroyuelo et al., 2013) helps us keep the encoded
values 'small' and also produces patterns for our
method to find. After the reassignment, the document
identifier values require fewer digits to represent,
because they span a smaller range than before.
In section 2 we describe related work. In section
3 we present an analysis of the methods we used. In
section 4 we present our ideas, the modifications we
made and an example of how they work. In section 5
we present the results obtained from experimentation
on the Wikipedia dataset and compare our technique
with the one that produces arithmetic progressions
(Makris and Plegas, 2013), which is considered to
achieve more effective compression than previous
techniques. In sections 6 and 7 we draw conclusions
and discuss future work and open problems.
2 RELATED WORK
Many file compression methods have been proposed
in the literature. The majority of these compression
methods use gaps between document identifiers
(DocIds) in order to represent data with fewer bits.
The most well-known methods for compressing
integers are the Binary code, Unary code, Elias
gamma, Elias delta, Variable-byte and the Golomb
code (Witten, Moffat and Bell, 1999).
Over the past decade some of the most successful
methods have been developed: Binary Interpolative
Coding (Moffat and Stuiver, 2000) and the OptPFD
method (Yan et al., 2009), an optimized version of
the techniques appearing in (Heman, 2005;
Zukowski, Heman, Nes and Boncz, 2006) that are
known as the PForDelta (PFD) family.
Interpolative Coding (IPC) (Moffat and Stuiver,
2000) can code a sequence of integers with very few
bits. Instead of creating gaps from left to right,
compression and decompression proceed through
recursion. Consider a list of ascending integers
< a_1, a_2, ..., a_n >. IPC splits the list into two
sub-lists around the middle element, a_{n/2}, which
is binary encoded using the first and last elements of
the list as bounds; the smallest binary representation
requires log(a_n - a_1 + 1) bits. The method then
runs recursively on the two sub-lists.
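To make the recursion concrete, the following is a minimal sketch (ours, not taken from Moffat and Stuiver, 2000) that only counts the bits IPC would spend on a sorted list, assuming the middle element is written with a fixed-width code of ceil(log2(range)) bits; the actual coder uses a slightly tighter minimal binary code. Function and variable names are our own.

import math

def ipc_bit_count(a, lo, hi):
    # Bits needed to interpolative-code the ascending list a, whose values lie in [lo, hi].
    if not a:
        return 0
    mid_pos = len(a) // 2
    mid = a[mid_pos]
    # mid is known to lie in [lo + mid_pos, hi - (len(a) - 1 - mid_pos)]
    span = (hi - (len(a) - 1 - mid_pos)) - (lo + mid_pos) + 1
    bits = 0 if span <= 1 else math.ceil(math.log2(span))
    # recurse on the elements to the left and right of the middle, with tightened bounds
    bits += ipc_bit_count(a[:mid_pos], lo, mid - 1)
    bits += ipc_bit_count(a[mid_pos + 1:], mid + 1, hi)
    return bits

print(ipc_bit_count([3, 8, 9, 11, 12, 13, 17], lo=1, hi=20))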
Some methods for performance optimization are
pruning-based (Ntoulas and Cho, 2007; Zhang et al.,
2008). Other methods try to take advantage of the
close resemblance that may exist between different
versions of the same document in order to avoid size
expansion of the inverted indexes (He, Yan and
Suel, 2009; He and Suel, 2011). Further methods
store the frequency of appearance of document
identifiers for the various terms as well as the term
positions (Akritidis and Bozanis, 2012; Yan et al.,
2009).
Moreover, a variety of research efforts focus on
the LZ family of algorithms. The statistical
Lempel-Ziv algorithm (Kwong and Ho, 2001) takes
into consideration the statistical properties of the
source information. There is also LZO (Oberhumer,
1997/2005), which supports overlapping compression
and in-place decompression.
Furthermore, there is a case study in which the
most common compression methods were applied to
evaluate the hypothesis that the terms in a page are
stochastically generated (Chierichetti, Kumar and
Raghavan, 2009). In parallel, a recent method
converts the lists of document identifiers into a set of
arithmetic progressions, each represented by three
numbers (Makris and Plegas, 2013). Finally,
Arroyuelo et al. (2013) proposed a reassignment
method that allows one to focus on a subset of
inverted lists and improve their query performance
and compression ratio.
In our approach we used a combination of
methods that we describe below. We compared our
method with a recent method (Makris and Plegas,
2013) that has a very good compression ratio. Below
we present two different combinations of the
methods we used, so as to evaluate the behavior of
our modification and to see which combination
achieves the maximum compression ratio.
3 USED TECHNIQUES
In our scheme we employed several algorithmic
tools in order to produce better compression ratios.
The tools we used for this purpose are described
below.
3.1 LZ78 Analysis
LZ78 (Ziv and Lempel, 1978) achieves
compression by replacing repeated occurrences of
data with references to a dictionary that is built
based on the input data stream. Each dictionary entry
is of the form dictionary[...] = {index, character},
where index is the index to a previous dictionary
entry, and character is appended to the string
represented by dictionary[index]. The algorithm
initializes last matching index = 0 and next available
index = 1. For each character of the input stream, the
dictionary is searched for a match: {last matching
index, character}. If a match is found, then the last
matching index is set to the index of the matching
entry, and nothing is output. If a match is not found,
then a new dictionary entry is created: dictionary
[next available index] = {last matching index,
character}, and the algorithm outputs last matching
index, followed by character, then resets last
matching index = 0 and increments next available
index. Once the dictionary is full, no more entries
are added. When the end of the input stream is
reached, the algorithm outputs last matching index.
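The following minimal Python sketch (ours, for illustration only) emits the (last matching index, character) pairs exactly as described above; index 0 stands for the empty prefix.

def lz78_compress(data):
    # Emit (last matching index, character) pairs; index 0 denotes the empty prefix.
    dictionary = {}        # phrase -> dictionary index
    next_index = 1
    last = ''              # longest phrase matched so far
    output = []
    for ch in data:
        if last + ch in dictionary:
            last += ch                              # keep extending the match
        else:
            output.append((dictionary.get(last, 0), ch))
            dictionary[last + ch] = next_index      # remember the new phrase
            next_index += 1
            last = ''
    if last:                                        # flush a pending match at the end of input
        output.append((dictionary[last], ''))
    return output

print(lz78_compress("abababa"))   # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]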
3.2 LZW Analysis
LZW (Welch, 1984) is an LZ78-based
algorithm that uses a dictionary pre-initialized with
all possible characters (symbols), (or emulation of a
pre-initialized dictionary). The main improvement of
LZW is that when a match is not found, the current
input stream character is assumed to be the first
character of an existing string in the dictionary
(since the dictionary is initialized with all possible
characters), so only the last matching index is output
(which may be the pre-initialized dictionary index
corresponding to the previous (or the initial) input
character).
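For contrast with LZ78, here is a minimal sketch of classic LZW (again ours, not the implementation used in our experiments): the dictionary is pre-initialized with all single characters, so only codes are emitted.

def lzw_compress(text):
    # Classic LZW over single characters: the output is a list of dictionary codes only.
    dictionary = {chr(i): i for i in range(256)}   # pre-initialized with all single characters
    next_code = 256
    current = ''
    output = []
    for ch in text:
        if current + ch in dictionary:
            current += ch                          # keep extending the current match
        else:
            output.append(dictionary[current])     # emit the code of the longest match
            dictionary[current + ch] = next_code   # remember the extended phrase
            next_code += 1
            current = ch                           # the mismatching character starts the next match
    if current:
        output.append(dictionary[current])         # flush the final match
    return output

print(lzw_compress("abababa"))   # [97, 98, 256, 258]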
3.3 Reassignment Analysis
DocIds contained inside an inverted file may be
large numbers that need many bytes to be stored.
Using the reorder method (Arroyuelo et al., 2013),
all the DocIds are reassigned to new numbers,
focusing on a given subset of inverted lists in order
to improve their querying and compression
performance. For example, consider a term < T1 >
that contains the DocIds [100, 101, 1001, 1002,
1003]. The reordering step re-enumerates all the
DocIds inside the term and the whole corpus of the
inverted file. After the reordering the result would
be:
100 -> 1, 101 -> 2, 1001 -> 3, 1002 -> 4, 1003 -> 5
So the term would be like:
<T1> = [1, 2, 3, 4, 5]
This step helps us reduce the gaps between
DocIds and save some space in the inverted file.
Another use of this method is to decrease the starting
encoding value of the modified LZW; thus we also
use the reassignment method to keep the encoded
values small.
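A short sketch of our reading of this re-enumeration step (function and variable names are ours): new identifiers are handed out in order of first appearance while scanning the posting lists, which reproduces the < T1 > example above.

def reorder_docids(inverted_file):
    # inverted_file is assumed to map each term to its list of original DocIds.
    new_id = {}            # old DocId -> new DocId
    next_id = 1
    reordered = {}
    for term, postings in inverted_file.items():
        fresh = []
        for doc in postings:
            if doc not in new_id:      # first time this DocId is seen in the corpus
                new_id[doc] = next_id
                next_id += 1
            fresh.append(new_id[doc])
        reordered[term] = fresh
    return reordered, new_id

lists, mapping = reorder_docids({'T1': [100, 101, 1001, 1002, 1003]})
print(lists['T1'])   # [1, 2, 3, 4, 5]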
We also used a modified reassignment method
(Arroyuelo et al., 2013) in which we reordered the
pages based on the term intersections. For example,
consider two terms < T1 > and < T2 >, which contain
the DocIds [100, 105, 110, 120] and [29, 100, 105,
106, 107, 110, 120, 400] respectively. If we use the
first reorder method, the result would be:
100 -> 1, 105 -> 2, 110 -> 3, 120 -> 4, 29 -> 5,
106 -> 6, 107 -> 7, 400 -> 8
So the terms < T1 > and < T2 > would become:
< T1 > = [1, 2, 3, 4]
< T2 > = [5, 1, 2, 6, 7, 3, 4, 8]
Now if we apply the second reorder method, the
encoding would be the same in this case, but the
output of the terms would be:
< T1 > = [1, 2, 3, 4]
< T2 > = [1, 2, 3, 4, 5, 6, 7, 8]
We use this method (Arroyuelo et al., 2013) in
order to create more repeated patterns for the
modified LZW. The above example shows that, if we
used the first method, < T1 > and < T2 > would not
have the same n-th elements in common. The
modified LZW achieves a good compression ratio
when it locates common sequences, and the second
method produces sequences that are common with
the previous lists. In this example, the first method
would produce more encoded values than the second
method (assuming we have already encoded the
unique pages inside the index). After applying this
technique, we noticed a slight improvement in the
compressed files of the modified LZW when using
the second reordering method instead of the first.
3.4 GZIP Analysis
GZIP (Witten, Moffat and Bell, 1999) is a method of
higher-performance compression based on LZ77.
GZIP is using hash tables to locate previous
occurrences of strings. GZIP is using Deflate
algorithm (Deutsch, L. Peter, 1996) which is a mix
of Huffman (Huffman, David A. et al. 1952) and
LZ77 (Ziv et al. 1977). GZIP is “greedy”; it codes
the upcoming characters as a pointer if at all
possible. The Huffman codes for GZIP are generated
semi-statically. Because of the fast searching
algorithm and compact output representation based
upon Huffman codes, GZIP outperforms most other
Ziv-Lempel methods in terms of both speed and
compression effectiveness.
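As an illustration of how GZIP fits in as a post-processing step, the snippet below simply applies Deflate, via Python's standard gzip module, to a byte string standing in for the serialized output of an earlier stage (the sample payload is hypothetical).

import gzip

# Final post-processing step: Deflate-compress the serialized output of the previous stage.
payload = b"30 31 32 33 34 35 36 14 17\n" * 1000   # hypothetical serialized posting data
compressed = gzip.compress(payload)
print(len(payload), '->', len(compressed))          # the repeated pattern shrinks dramatically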
4 OUR CONTRIBUTION
We present two different schemes, each a
combination of the described methods. The main
intuition behind them is to compress "greedily" by
repetitively applying algorithmic compression
schemes. We used these methods because they are
considered state-of-the-art compression methods
(gaps, binary interpolative coding, GZIP), and so
they were used to achieve the maximum compression
ratio.
One of the pre-processing methods is the
reordering of the corpus, where we re-enumerate all
the DocIds of the inverted file starting from the value
1. The re-enumeration is done based on one or more
lists (Arroyuelo et al., 2013); when we used the
re-enumeration based on more than one list, the
modified LZW performed slightly better than with
the re-enumeration based on a single list. Another
step, after reordering the inverted file, is to sort the
re-enumerated DocIds and store the intervals between
them inside the inverted file. This step achieves a good
compression ratio, and for it we coin the term gap
method. We also use binary interpolative encoding
(Moffat and Stuiver, 2000) and GZIP
(http://en.wikipedia.org/wiki/GZIP; Witten, Moffat
and Bell, 1999) compression.
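A minimal sketch of the gap step as described here (names are ours): each re-enumerated posting list is sorted and stored as its first DocId followed by the successive differences.

def to_gaps(docids):
    # Sort a posting list and store it as the first DocId followed by the gaps.
    s = sorted(docids)
    return [s[0]] + [cur - prev for prev, cur in zip(s, s[1:])]

def from_gaps(gaps):
    # Rebuild the sorted posting list from its gap representation.
    out = [gaps[0]]
    for g in gaps[1:]:
        out.append(out[-1] + g)
    return out

print(to_gaps([1, 2, 3, 7, 11, 12]))    # [1, 1, 1, 4, 4, 1]
print(from_gaps([1, 1, 1, 4, 4, 1]))    # [1, 2, 3, 7, 11, 12]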
In the first scheme we apply four different
methods: first the reorder method (Arroyuelo et al.,
2013), second our modified LZW, third binary
interpolative coding (Moffat and Stuiver, 2000), and
last GZIP (Witten, Moffat and Bell, 1999). In the
second scheme we again use four different methods,
but this time they are slightly different: we again
apply the reorder method as the first step, but as a
second step we employ the gap method, as a third
step the modified LZW, and as the last step GZIP.
We propose this combination of techniques
because they are state of the art. We experimented
with other compression methods such as gamma,
delta and Golomb encodings, but their results were
not as good as interpolative encoding. The gap
method was also easy to implement and achieved a
great compression ratio. The reorder method was
primarily used to enhance our modification of LZW,
while GZIP was used to minimize the output so that
we could achieve the maximum compression ratio.
Figures 1 and 2 show the steps of the first and second
scheme, respectively, in pseudo code.
Figure 1: 1st scheme.
Figure 2: 2nd scheme.
In the first scheme, we reorder the inverted file
in order to reduce the range of the numbers and to
create patterns, based on the second approach we
described in section 3.3. Then we use the modified
LZW, and after that we proceed with binary
interpolative coding and GZIP in order to reduce the
inverted file even further.
In the second scheme we again reorder the
inverted file, for the same reason as previously, but
now we use the gap technique. The gap method
combined with the reorder method has a great
compression ratio, but it makes the modified LZW
inefficient, as we will explain below.
4.1 Modification of LZW
In our algorithm we use a modification of LZW.
Instead of characters, the modified LZW reads
DocIds as characters and tries to find patterns inside
the terms. Standard LZW has an index that contains
letters and digits with encoded values up to 255, so it
starts encoding new entries after 255. We instead
build an index, initially completely empty, that
consists of the patterns that are found and their
encoded numbers. As we know, the DocIds are
enumerated web pages. In order to avoid collisions
during decompression, we must locate the largest
number inside the inverted file and take
max(DocId) + 1 as the starting encoding number of
the modified LZW (this is provided by the previous
re-enumeration step, so we do not have to scan the
whole file from the start).
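In code, the starting code is simply one more than the largest re-enumerated DocId; a tiny sketch with variable names of our choosing and hypothetical sample data:

# posting_lists maps each term to its re-enumerated DocId list (hypothetical sample)
posting_lists = {'T1': [1, 2, 3, 4, 5, 9, 10], 'T2': [1, 2, 3, 4, 5, 9, 10, 14, 17]}

bound = max(max(docids) for docids in posting_lists.values())
first_code = bound + 1       # the modified LZW starts handing out codes from here
print(bound, first_code)     # 17 18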
4.2 Compression with Modified LZW
After we find the maximum document identifier, we
are ready to begin building the index. For each term
we build a list that contains its DocIds. The algorithm
proceeds as follows: for each DocId in the list, check
whether it exists inside the index. There are two
cases:
Case 1: The DocId does not exist inside the
index. In this case the DocId is encoded and inserted
into the index. The index contains key-value pairs
(the keys are DocIds or unions of DocIds and the
values are their encoded values). After the insertion,
the compressor outputs the DocId itself, not the
encoded value, to the compressed inverted file, so
that when the decompressor starts decoding it can
rebuild the index exactly the way the compressor did.
Case 2: The DocId of the list exists inside the
index. In this case we have two sub-cases:
o Sub Case 1: The current DocId is united with
the next DocId of the list and we check whether
their union exists inside the index. If the
union does not exist, then the union is encoded
and inserted into the index. The compressor
outputs the encoded value of the current DocId
(which exists inside the index), and the next
DocId is also checked: if it does not exist inside
the index, it is inserted (and output as a raw
DocId); if it exists, its encoded value is output.
We then proceed with the element that follows.
o Sub Case 2: The union of the current DocId
with the next DocId in the list is already stored
inside the index. In this sub-case the algorithm
iteratively extends the union with the next
DocId in the list and checks whether the
extended union exists inside the index. This
goes on until the list finishes or the union is no
longer stored inside the index. In the first case,
when we reach the end of the list, the
compressor simply outputs the encoded value
of the longest union that is already stored
inside the index. If the extended union does not
exist, then Sub Case 1 is executed on it.
So for each term we build a list which contains the
document identifiers and we check if their unions
exist inside the index.
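The following Python sketch is our rendering of the procedure just described (function and variable names are ours); the index is shared across all terms and the codes start at bound + 1. Run on the worked example of section 4.4, it reproduces the outputs listed there.

def compress_list(doc_ids, index, next_code):
    # index maps space-separated DocId strings to codes and is shared across all lists;
    # next_code is the next free code (bound + 1 at the very start).
    out = []
    i, n = 0, len(doc_ids)
    while i < n:
        cur = str(doc_ids[i])
        if cur not in index:                       # Case 1: unseen DocId
            index[cur] = next_code
            next_code += 1
            out.append(doc_ids[i])                 # emit the raw DocId, not its code
            i += 1
            continue
        # Case 2: grow the phrase while the union is already in the index (Sub Case 2)
        phrase, j = cur, i + 1
        while j < n and phrase + ' ' + str(doc_ids[j]) in index:
            phrase += ' ' + str(doc_ids[j])
            j += 1
        out.append(index[phrase])                  # emit the code of the longest match
        if j < n:                                  # Sub Case 1: remember the failed extension
            index[phrase + ' ' + str(doc_ids[j])] = next_code
            next_code += 1
            nxt = str(doc_ids[j])
            if nxt in index:
                out.append(index[nxt])             # known extension DocId: emit its code
            else:
                index[nxt] = next_code             # unseen extension DocId: add it ...
                next_code += 1
                out.append(doc_ids[j])             # ... and emit it raw
            i = j + 1
        else:
            i = j                                  # the match reached the end of the list
    return out, next_code

# Compressing the lists of section 4.4 (bound = 29, so codes start at 30):
terms = {
    'T1': [1, 2, 3, 4, 5, 9, 10],
    'T2': [1, 2, 3, 4, 5, 9, 10, 14, 17],
    'T3': [1, 2, 3, 4, 5, 9, 10, 17],
    'T4': [1, 2, 3, 4, 5, 6, 7, 8, 21, 23],
    'T5': [1, 2, 3, 4, 5, 6, 7, 8, 21, 23, 29],
}
index, code = {}, 30
for name, lst in terms.items():
    compressed, code = compress_list(lst, index, code)
    print(name, compressed)    # e.g. T3 -> [37, 32, 33, 34, 35, 36, 42]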
4.3 Decompression with Modified
LZW
Decompression works in the same way as
compression, by rebuilding the index. The encoded
values begin just above the maximum value produced
by the re-enumeration method. The modified LZW
decompressor therefore creates a list for every term,
containing raw DocIds or the encoded values of
patterns, and for each element inside the list it checks
whether the element is inside the index. Again there
are two cases:
Case 1: The element does not exist inside the
index and its value does not exceed the bound that
separates DocIds from encoded values. The
decompressor therefore processes the element as a
DocId: it encodes the element and stores it in the
index. After the insertion, the decompressor outputs
the current list element and continues with the next
element inside the list.
Case 2: The element exists inside the index and
its value is bigger than the bound. In this case the
decompressor knows that the element is the encoded
value of a DocId or of a union of DocIds. The
decompressor retrieves the DocId or the union of
DocIds from the index and outputs it to the file. But
the algorithm does not stop here: the decompressor
knows that the compressor output this encoded value
because its union with the next element of the list did
not exist in the index at that point. So the output
value is united with the DocId represented by the
next element inside the list, and the union is encoded
and stored into the index. After that, the
decompressor continues with the next element inside
the list.
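A matching decoder sketch (again our rendering; the handling of the element that follows each code is inferred from the worked example in section 4.4): values above the bound are codes, and a code that is not the last symbol of a list is always followed by the extension element the compressor examined, which lets the decoder rebuild exactly the same index.

def decompress_list(symbols, index, bound, next_code):
    # index maps codes back to space-separated DocId strings and is shared across all lists.
    out = []
    i, n = 0, len(symbols)
    while i < n:
        s = symbols[i]
        if s <= bound:                             # Case 1: a raw, previously unseen DocId
            index[next_code] = str(s)
            next_code += 1
            out.append(s)
            i += 1
            continue
        phrase = index[s]                          # Case 2: a code for a DocId or a union
        out.extend(int(d) for d in phrase.split())
        if i + 1 < n:
            ext = symbols[i + 1]                   # the extension element the compressor saw
            ext_doc = ext if ext <= bound else int(index[ext])
            index[next_code] = phrase + ' ' + str(ext_doc)   # rebuild the new union entry
            next_code += 1
            if ext <= bound:                       # an unseen extension DocId also gets a code
                index[next_code] = str(ext_doc)
                next_code += 1
            out.append(ext_doc)
            i += 2
        else:
            i += 1                                 # a code at the end of a list stands alone
    return out, next_code

Feeding the compressed lists of section 4.4 into this decoder, with bound = 29 and next_code = 30, restores the original lists T1 to T5 and rebuilds the same index as the compressor.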
4.4 Index Creation
As we described in section 4.1, the pattern-matching
method we applied is based on building an index. We
scan the list of document identifiers of each term,
and for each element we check whether it exists
inside the index; we then encode it or search for
DocId unions that are not yet encoded.
The example below shows exactly how the
compression and decompression algorithms work.
Let us assume we have 5 terms, T1, T2, T3, T4 and
T5, which consist of the following DocIds:
T1: < 1, 2, 3, 4, 5, 9, 10 >
T2: < 1, 2, 3, 4, 5, 9, 10, 14, 17 >
T3: < 1, 2, 3, 4, 5, 9, 10, 17 >
T4: < 1, 2, 3, 4, 5, 6, 7, 8, 21, 23 >
T5: < 1, 2, 3, 4, 5, 6, 7, 8, 21, 23, 29 >
The bound is 29, so the encoding numbers will
begin at 30. We run the modified LZW and get:
T1: < 1, 2, 3, 4, 5, 9, 10 >
T2: < 30, 31, 32, 33, 34, 35, 36, 14, 17 >
T3: < 37, 32, 33, 34, 35, 36, 42 >
T4: < 43, 33, 34, 6, 7, 8, 21, 23 >
T5: < 46, 34, 48, 49, 50, 51, 52, 29 >
The encoded values of DocIds and unions:
First list: '1': 30, '2': 31, '3': 32, '4': 33, '5': 34, '9': 35, '10': 36
Second list: '1 2': 37, '3 4': 38, '5 9': 39, '10 14': 40, '14': 41, '17': 42
Third list: '1 2 3': 43, '4 5': 44, '9 10': 45
Fourth list: '1 2 3 4': 46, '5 6': 47, '6': 48, '7': 49, '8': 50, '21': 51, '23': 52
Fifth list: '1 2 3 4 5': 53, '6 7': 54, '8 21': 55, '23 29': 56, '29': 57
In this case the data do not seem very compressed
because the input is small, but if the input were
gigabytes of DocIds we would see a difference.
Decompression takes as input the compressed
inverted file and, with the same logic (reading the
DocIds and rebuilding the index), restores the
original inverted file.
5 RESULTS
We ran both schemes to see which one was
better. The machine we used has the following specs:
AMD Phenom II X6 1100T at 3.3 GHz, 16 GB RAM,
1 TB HDD, 64-bit Linux 12.04.
As the inverted file we used a 21 GB Wikipedia
text file (Callan, 2009). As we noticed in the dataset,
Wikipedia has almost 6.5 million pages, but those
pages are not enumerated sequentially; some pages
had numbers bigger than 20 million. So if we used
Wikipedia's page labels, the modified LZW would
start the encoding from the biggest number, which
would require more digits to be stored inside the
compressed file. In order to start from the smallest
possible number we used the reorder method.
Table 1: First scheme (ratio of compression per step).
Reorder: 22%
Modified-LZW (+ above steps): 38%
IPC (+ above steps): 65%
GZIP (+ above steps): 82%
Table 2: Second scheme (ratio of compression per step).
Reorder: 22%
Gaps (+ above steps): 72%
Modified-LZW (+ above steps): 73%
GZIP (+ above steps): 90%
The ratio is computed with respect to the original
inverted file (the 21 GB Wikipedia file) for both
Table 1 and Table 2. The modified LZW step in the
first scheme adds a further 16% (from 22% to 38% in
Table 1), which could be improved on a machine
with more RAM, because in our case we had to split
the reordered file and run the modified LZW on each
sub-file separately. In total, the first scheme output a
compressed file that is 82% smaller than the original
inverted file.
In the second scheme we see that the modified
LZW adds only a further 1%. In this case (second
scheme) we also had to slice the file into sub-files to
run the modified LZW faster, so the results might
differ if we could build the index for the whole
inverted file rather than for separate sub-files.
A main drawback is the fact that we cannot
decompress a specific term; we have to go all the
way back, decompressing each file at each step, in
order to obtain the initial (after the reorder method)
inverted file. Another drawback of our technique is
that decompression is extremely slow. In
compression we use dictionaries where we hash the
keys, so we can search for the patterns in constant
time. In decompression we also hash the values, so
we can retrieve the keys, which are the original
values. On our machine, which lacks sufficient
memory for this purpose, RAM is used for hashing
both keys and values. In order to avoid memory
overflow we use external memory (hard disk) to
store the key-value pairs: after we reach 90% of the
RAM space we start appending key-value pairs to
disk. For every encoded value we have to search
inside RAM, and if it is not there we also have to
search on disk, which is very time consuming.
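To illustrate the RAM/disk split described above, here is a toy sketch (ours, not the actual implementation) of a two-level key-value store: entries are kept in memory up to a fixed count and the rest are spilled to disk through Python's shelve module. The count threshold stands in for the 90%-of-RAM check and the file name is hypothetical.

import shelve

class SpillingIndex:
    # Keep up to max_in_ram entries in a RAM dictionary, spill the rest to a disk store.
    def __init__(self, path='index_spill.db', max_in_ram=1_000_000):
        self.ram = {}
        self.disk = shelve.open(path)    # shelve keys must be strings
        self.max_in_ram = max_in_ram

    def put(self, key, value):
        if len(self.ram) < self.max_in_ram:
            self.ram[key] = value        # fast in-memory hash lookup
        else:
            self.disk[str(key)] = value  # spill further entries to external memory

    def get(self, key):
        if key in self.ram:              # search RAM first ...
            return self.ram[key]
        return self.disk.get(str(key))   # ... then fall back to the (slow) disk lookup

    def close(self):
        self.disk.close()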
We now compare these results with a recent
technique (Makris and Plegas, 2013). That
construction achieves a good compression ratio and
has been evaluated on the same dataset (Wikipedia).
It initially converts the lists of DocIds into a set of
arithmetic progressions, each represented by three
numbers. To do so, it assigns different identifiers to
the same document in order to fill the gaps between
the original identifiers that remain in the initial
representation. A secondary index is used to handle
the overhead produced by the multiple identifiers that
have to be assigned to the documents, and an
additional compression step (PForDelta or
Interpolative Coding) is used to represent it. Tables 3
and 4 show the experiments performed on the
Wikipedia dataset.
Table 3: The compression ratio achieved by the (Makris
and Plegas, 2013) algorithms, with the secondary index
uncompressed.
Wikipedia: Base 78%, Multiple Sequences 70%, IPC 44%, PFD 42%.
Table 4: The compression ratio achieved by the (Makris
and Plegas, 2013) proposed algorithms, when compressing
the secondary index.
Wikipedia: Base + IPC 40%, MS + IPC 38%, Base + PFD 39%, MS + PFD 38%.
Table 3 shows the compression ratio which was
achieved in relation to the original size for the
proposed techniques (when the secondary index is
uncompressed) and the existing techniques. Table 4
depicts the compression ratio which was achieved by
the compression techniques in relation to the original
size when combining the proposed methods with the
existing techniques for compressing the secondary
index.
As we can see, the algorithm in this paper
achieves a better compression ratio than the recent
technique (Makris and Plegas, 2013). The main
difference between the two papers is that in (Makris
and Plegas, 2013) the authors try to find numerical
sequences and compress them using PForDelta or
Binary Interpolative Encoding, whereas in this paper
we try to find patterns and then compress them with
Binary Interpolative Encoding and GZIP. The reason
our approach performs better is that our algorithm is
"greedy": we use all of the state-of-the-art
compression techniques, including GZIP, which is
not used in the other method (Makris and Plegas,
2013). Without GZIP we get 65% and 73%
compression ratios on the Wikipedia dataset. We also
tested the dataset using GZIP as the only compression
method, in order to see whether it achieves a greater
compression ratio than our two schemes; GZIP alone
compressed the dataset by 72%. It is clear that both
schemes achieve a greater compression ratio than
GZIP alone.
6 CONCLUSIONS
In our schemes we employed a set of pre-processing
and compression steps in order to achieve greater
compression gains than previous algorithms. Let us
explain why we used this order. The reorder step
minimizes the range of the DocIds, so the new
inverted file contains smaller DocIds. Furthermore,
this step also helps the modified LZW: if we applied
the modified LZW to the original inverted file,
without the reorder step, the initial value of the codes
would be a bigger number than for the reordered
inverted file. Next, we use the modified LZW to
look for 'word' patterns inside the inverted file,
which gives better results in our first scheme than in
the second. More analytically, the modified LZW ran
better in the first scheme because the gap method
changes the structure of the whole inverted file: the
gaps between the DocIds are not constant, and the
codes that the modified LZW produces are longer
than the actual gap patterns being compressed. So in
many cases the lists that the gap method produced
had numbers with fewer digits than the encoded
values of the modified LZW.
The last two steps, Binary Interpolative
Encoding and GZIP, are used for a greedy approach.
Binary Interpolative Encoding is a very good integer
compression method and GZIP, which uses the
Deflate algorithm, is one of the most effective
general-purpose compression techniques, so we used
them in order to see how much smaller an inverted
file we can get.
Furthermore, a general disadvantage of the
modified LZW is that it demands machines with a
large amount of main memory. In our experiments
we had to slice the re-enumerated inverted file into
smaller sub-files, because we ran into memory
overflow problems when we used the whole
re-enumerated inverted file.
7 FUTURE WORK AND OPEN
PROBLEMS
We presented a set of steps that achieve good
compression when handling inverted files. In further
work we would like to test the schemes on a larger
dataset using more powerful machines, because we
believe that the modified LZW will achieve a better
compression ratio if we can store more patterns
inside the index. Furthermore, we would like to
modify the algorithm so that we can compress and
decompress individual terms rather than the whole
dataset. We also want to implement the PForDelta
method as a third step instead of Binary
Interpolative Encoding; this seems worthwhile since
it is expected to improve the performance of our
techniques.
REFERENCES
Akritidis, L., Bozanis, P., 2012, Positional data
organization and compression in web inverted indexes,
DEXA 2012, pp. 422-429.
Anh, Vo Ngoc, and Alistair Moffat. "Inverted index
compression using word-aligned binary codes."
Information Retrieval 8.1 (2005): 151-166.
Arroyuelo D., S. González, M. Oyarzún, V. Sepulveda,
Document Identifier Reassignment and Run-Length-
Compressed Inverted Indexes for Improved Search
Performance, ACM SIGIR 2013.
Baeza-Yates, R., Ribeiro-Neto, B. 2011, Modern
Information Retrieval: the concepts and technology
behind search, second edition, Essex: Addison
Wesley.
Büttcher, S. Clarke, C. L. A., Cormack, G. V.,
2010, Information retrieval: implementing and
evaluating search engines , MIT Press, Cambridge,
Mass.
Callan, J. 2009, The ClueWeb09 Dataset. available at
http://boston.lti.cs.cmu.edu/clueweb09 (accessed 1st
August 2012).
Chierichetti, F., Kumar, R., Raghavan, P., 2009.
Compressed web indexes. In: 18th Int. World Wide
Web Conference, pp. 451–460.
Deutsch, L. Peter. "DEFLATE compressed data format
specification version 1.3." (1996).
He, J., Suel, T., 2011. Faster temporal range queries over
versioned text, In the 34th Annual ACM SIGIR
Conference, China, pp. 565-574.
He, J., Yan, H., Suel, T., 2009. Compact full-text indexing
of versioned document collections, Proceedings of the
18th ACM Conference on Information and knowledge
management, November 02-06, Hong Kong, China.
Heman, S. 2005. Super-scalar database compression
between RAM and CPU-cache. MS Thesis, Centrum
voor Wiskunde en Informatica, Amsterdam.
Huffman, David A., et al. A method for the construction of
minimum redundancy codes. proc. IRE, 1952, 40.9:
1098-1101.
Jean-Loup Gailly and Mark Adler, GZIP Wikipedia
[http://en.wikipedia.org/wiki/GZIP]
Kwong, Sam, and Yu Fan Ho. "A statistical Lempel-Ziv
compression algorithm for personal digital assistant
(PDA)." Consumer Electronics, IEEE Transactions on
47.1 (2001): 154-162.
Makris, Christos, and Yannis Plegas. "Exploiting
Progressions for Improving Inverted Index
Compression." WEBIST. 2013.
Moffat, A. and Stuiver, L. Binary interpolative coding for
effective index compression. Information Retrieval,
3(1): 25–47, 2000.
Ntoulas A., Cho J., 2007. Pruning policies for two-tiered
inverted index with correctness guarantee, Proceedings
of the 30th Annual International ACM SIGIR
conference on Research and development in
Information Retrieval, July 23-27, Amsterdam, The
Netherlands.
Oberhumer, M. F. X. J. "LZO real-time data compression
library." User manual for LZO version 0.28, URL:
http://www.infosys.tuwien.ac.at/Staff/lux/marco/lzo.html
(February 1997) (2005).
Welch, Terry (1984). "A Technique for High-Performance
Data Compression". Computer 17 (6): 8–19.
doi:10.1109/MC.1984.1659158.
Witten, Ian H., Alistair Moffat, and Timothy C. Bell.
Managing gigabytes: compressing and indexing
documents and images. Morgan Kaufmann, 1999.
Yan, H., Ding, S., Suel, T., 2009, Compressing term
positions in Web indexes, pp. 147-154, Proceedings
of the 32nd Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval.
Zhang, J., Long, X., and Suel, T. 2008. Performance of
compressed inverted list caching in search engines. In
the 17th International World Wide Web Conference
WWW.
Ziv, Jacob, and Abraham Lempel. "A universal algorithm
for sequential data compression." IEEE Transactions
on information theory 23.3 (1977): 337-343.
Ziv, Jacob; Lempel, Abraham (September 1978).
"Compression of Individual Sequences via Variable-
Rate Coding". IEEE Transactions on Information
Theory 24 (5): 530–536.
doi:10.1109/TIT.1978.1055934.
Zukowski, M., Heman, S., Nes, N., and Boncz, P. 2006.
Super-scalar RAM-CPU cache compression. In the
22nd International Conference on Data Engineering
(ICDE) 2006.