union does not exist, their union is encoded and inserted into the index. The compressor outputs the encoded value of the DocId (or union) that exists in the index, and the next DocId is then checked against the index: if it does not exist there, it is inserted and output unencoded; if it does exist, its encoded value is output. The compressor then proceeds with the element that follows.
o Sub Case 2: The union of the current DocId with the next DocId in the list is already stored in the index. In this sub case the algorithm iteratively checks whether the union of the previous step, united with the next DocId in the list, exists in the index. This continues until the list is exhausted or until the union is no longer stored in the index. In the first case, on reaching the end of the list, the compressor simply outputs the encoded value of the union that is already stored in the index. If the union does not exist, Sub Case 1 is executed.
So for each term we build a list containing the document identifiers, and we check whether their unions exist in the index.
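The two sub cases above can be sketched in Python. This is a minimal illustrative sketch, not the paper's implementation: the name compress_list, the tuple-keyed dictionary used as the index, and the next_code counter are our own choices, inferred from the worked example in Section 4.4.

```python
def compress_list(docids, index, next_code):
    """Compress one term's DocId list with the modified LZW scheme.

    index: dict mapping a pattern (tuple of consecutive DocIds) to its code.
    next_code: first unused code (codes start just above the bound).
    Returns the encoded list and the updated next_code.
    """
    out = []
    i = 0
    while i < len(docids):
        current = (docids[i],)
        if current not in index:
            # New DocId: insert it into the index and output it unencoded.
            index[current] = next_code
            next_code += 1
            out.append(docids[i])
            i += 1
            continue
        # Sub Case 2: keep extending the pattern while its union with
        # the next DocId is already stored in the index.
        pattern, j = current, i + 1
        while j < len(docids) and pattern + (docids[j],) in index:
            pattern += (docids[j],)
            j += 1
        if j == len(docids):
            # End of list: output the code of the stored union.
            out.append(index[pattern])
        else:
            # Sub Case 1: the union does not exist; insert it and
            # output the pattern's code, then handle the next DocId.
            index[pattern + (docids[j],)] = next_code
            next_code += 1
            out.append(index[pattern])
            nxt = (docids[j],)
            if nxt in index:
                out.append(index[nxt])
            else:
                index[nxt] = next_code
                next_code += 1
                out.append(docids[j])
            j += 1
        i = j
    return out, next_code
```

Run over the five example lists of Section 4.4 (with next_code starting at 30), this sketch reproduces the encoded lists shown there.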
4.3 Decompression with Modified LZW
Decompression works the same way as compression, by building the index as it goes. The encoded values begin just above the maximum value produced by the re-enumerate method. The modified LZW decompressor thus creates a list for every term, storing the DocIds or the encoded values of patterns. For each element in the list it checks whether the element is in the index. Again there are two cases:
Case 1: The element does not exist in the index and its value does not exceed the bound that separates DocIds from encoded values. The decompressor therefore processes the element as a DocId: it assigns the element the next encoded value and stores it in the index. After the insertion the decompressor outputs the current list element and continues with the next element in the list.
Case 2: The element exists in the index and its value is greater than the bound. In this case the decompressor knows that the element is the encoded value of a DocId or of a union of DocIds. The decompressor retrieves the DocId or the union of DocIds from the index and writes it to the output file. But the algorithm does not stop here. The decompressor knows that the compressor emitted the encoded value because the union with the next element of the list did not exist in the index. So the output value is united with the (decoded) next element of the list, the union is encoded and stored in the index, and that next element is output as well. After that, the decompressor continues with the element that follows.
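The two cases can likewise be sketched in Python. Again this is an illustrative sketch rather than the paper's code: the name decompress_list, the code-to-pattern dictionary, and the bound parameter are our own; the bound is taken to be the largest DocId, so encoded values start just above it, as in the example of Section 4.4.

```python
def decompress_list(encoded, index, next_code, bound):
    """Restore one term's DocId list from its modified-LZW encoding.

    index: dict mapping a code back to its pattern (tuple of DocIds).
    bound: largest possible DocId; encoded values start just above it.
    Returns the restored list and the updated next_code.
    """
    out = []
    i = 0
    while i < len(encoded):
        e = encoded[i]
        if e <= bound:
            # Case 1: a plain DocId; encode it, store it, and output it.
            index[next_code] = (e,)
            next_code += 1
            out.append(e)
            i += 1
        else:
            # Case 2: an encoded DocId or union of DocIds.
            pattern = index[e]
            out.extend(pattern)
            if i + 1 < len(encoded):
                # The compressor stopped here because the union with the
                # next element was missing: rebuild and store that union,
                # then output the next element too.
                e2 = encoded[i + 1]
                nxt = (e2,) if e2 <= bound else index[e2]
                index[next_code] = pattern + nxt
                next_code += 1
                if e2 <= bound:
                    # A raw DocId was new to the compressor's index as well.
                    index[next_code] = nxt
                    next_code += 1
                out.extend(nxt)
                i += 2
            else:
                i += 1
    return out, next_code
```

Applied in order to the encoded lists of the Section 4.4 example, this sketch restores the original DocId lists.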
4.4 Index Creation
As described in Section 4.1, the pattern matching method we apply is based on building an index. We scan the list of document identifiers of each term and, for each element, check whether it exists in the index; we then encode it or search for DocId unions that have not yet been encoded.
The example below shows exactly how the compression and decompression algorithms work. Assume we have 5 terms T1, T2, T3, T4 and T5, which consist of the following DocIds:
T1: < 1, 2, 3, 4, 5, 9, 10 >
T2: < 1, 2, 3, 4, 5, 9, 10, 14, 17 >
T3: < 1, 2, 3, 4, 5, 9, 10, 17 >
T4: < 1, 2, 3, 4, 5, 6, 7, 8, 21, 23 >
T5: < 1, 2, 3, 4, 5, 6, 7, 8, 21, 23, 29 >
The bound is 29, so the encoded values begin at 30. Running the Modified LZW we get:
T1: < 1, 2, 3, 4, 5, 9, 10 >
T2: < 30, 31, 32, 33, 34, 35, 36, 14, 17 >
T3: < 37, 32, 33, 34, 35, 36, 42 >
T4: < 43, 33, 34, 6, 7, 8, 21, 23 >
T5: < 46, 34, 48, 49, 50, 51, 52, 29 >
The encoded values of DocIds and unions:
First list
'1': 30, '2': 31, '3': 32, '4': 33, '5': 34, '9': 35, '10': 36
Second list
'1 2': 37, '3 4': 38, '5 9': 39, '10 14': 40, '14': 41, '17': 42
Third list
'1 2 3': 43, '4 5': 44, '9 10': 45
Fourth list
'1 2 3 4': 46, '5 6': 47, '6': 48, '7': 49, '8': 50, '21': 51, '23': 52
Fifth list
'1 2 3 4 5': 53, '6 7': 54, '8 21': 55, '23 29': 56, '29': 57
In this example the data do not appear very compressed because the input is small; if the input were gigabytes of DocIds, the difference would be clear.
Decompression takes the compressed inverted file as input and, following the same logic (reading the DocIds and building the index), restores the original inverted file.