scription to refer to the 2nd and 3rd bit of the first byte
of the compressed data, which is defined in the deflate
compression specifications.
IV Check 1. If possible (when B is 2), after decrypting
11 bytes of the IV, we check whether the 11th byte matches
the second byte of one of the provided checksums
(CS and TC). If not, we abort.
IV Check 2. After decrypting 12 bytes of the IV, we
check whether the 12th byte matches the first byte of one
of the checksums (CS and TC). If not, we abort.
CODE0. If the encoding method is 0 (raw/stored
block), the only bit that may be set is the first one.
We abort if any other bit is set.
CODE1. If the encoding method is 1 (static Huff-
man), we can check if there is a proper encoding
present in the next 36 bytes. If not, we abort.
CODE2. If the encoding method is 2 (dynamic Huff-
man), we check if the next 10 bytes contain valid
encoded data, otherwise we abort.
CODE3. If the encoding method is 3, we abort as
such value is reserved and should not be used.
Inflate. If the inflation algorithm reports any problem
in reading the data, we abort.
CRC32. We calculate the CRC32 of the inflated full
file data and check if it matches the checksum pro-
vided. If yes, we have found the password.
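The two IV checks can be sketched in Python as follows. This is a minimal sketch, assuming the traditional PKZIP stream cipher as described in the public ZIP application note and assuming that the IV is the 12-byte encryption header. The names `ZipCryptoKeys`, `encrypt_header` and `passes_iv_checks`, as well as the representation of CS and TC as (first byte, second byte) pairs, are illustrative assumptions and not the authors' kernel code.

```python
def _make_crc_table():
    # Standard reflected CRC-32 table (polynomial 0xEDB88320).
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


CRC_TABLE = _make_crc_table()


def crc32_update(crc, byte):
    """One CRC-32 step on a single byte, without pre/post inversion."""
    return CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)


class ZipCryptoKeys:
    """Key state of the traditional PKZIP stream cipher (hypothetical helper)."""

    def __init__(self, password):
        self.k0, self.k1, self.k2 = 0x12345678, 0x23456789, 0x34567890
        for b in password:
            self.update(b)

    def update(self, plain_byte):
        self.k0 = crc32_update(self.k0, plain_byte)
        self.k1 = (self.k1 + (self.k0 & 0xFF)) & 0xFFFFFFFF
        self.k1 = (self.k1 * 134775813 + 1) & 0xFFFFFFFF
        self.k2 = crc32_update(self.k2, self.k1 >> 24)

    def stream_byte(self):
        t = (self.k2 | 2) & 0xFFFF
        return ((t * (t ^ 1)) >> 8) & 0xFF

    def decrypt(self, cipher_byte):
        plain = cipher_byte ^ self.stream_byte()
        self.update(plain)
        return plain

    def encrypt(self, plain_byte):
        cipher = plain_byte ^ self.stream_byte()
        self.update(plain_byte)
        return cipher


def encrypt_header(password, plain_header):
    """Only used to build demo data for the checks below."""
    keys = ZipCryptoKeys(password)
    return bytes(keys.encrypt(b) for b in plain_header)


def passes_iv_checks(password, enc_header, checksums, two_bytes):
    """checksums: list of (first_byte, second_byte) pairs, e.g. for CS and TC.

    two_bytes: True when the hash provides two checksum bytes (B is 2).
    """
    keys = ZipCryptoKeys(password)
    for c in enc_header[:10]:
        keys.decrypt(c)
    byte11 = keys.decrypt(enc_header[10])
    # IV Check 1: only possible when two checksum bytes are available.
    if two_bytes and all(byte11 != second for _, second in checksums):
        return False
    byte12 = keys.decrypt(enc_header[11])
    # IV Check 2: compare the last header byte with the first checksum bytes.
    return any(byte12 == first for first, _ in checksums)
```

Returning as soon as the 11th byte mismatches mirrors the early abort described above: the 12th byte is decrypted only for candidates that survive the first check.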
Empirical results from our implementation show the
proportion of candidates that each check rejects, which
lets us abort at an earlier stage. We distinguish two
cases, depending on whether one or two checksum bytes
are available. Table 1 shows the rejection percentages
with two checksum bytes, and Table 2 shows them with
only one checksum byte. When the hash provides two
checksum bytes, more candidates can be rejected at an
earlier stage of the process, which avoids the more
costly checks.
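The effect of the extra checksum byte can be made concrete with a little arithmetic on the counts reported in Tables 1 and 2. The sketch below copies only the main stages from the tables; the helper names `chain_is_consistent` and `fraction_reaching` are illustrative.

```python
# (check name, candidates entering the check, candidates rejected by it)
TABLE1_TWO_BYTES = [
    ("IV Check 1", 543257459, 541136146),
    ("IV Check 2", 2121313, 2113156),
    ("CODEx", 8157, 8069),
    ("Inflate", 88, 83),
    ("CRC32", 5, 4),
]

# With one checksum byte, IV Check 1 is not available and is left out.
TABLE2_ONE_BYTE = [
    ("IV Check 2", 543257459, 541136764),
    ("CODEx", 2120695, 2099858),
    ("Inflate", 20837, 18984),
    ("CRC32", 1853, 1852),
]


def chain_is_consistent(stages):
    """Survivors of each check must be the candidates of the next one."""
    return all(
        stages[i][1] - stages[i][2] == stages[i + 1][1]
        for i in range(len(stages) - 1)
    )


def fraction_reaching(stages, name):
    """Share of all candidates that gets as far as the named check."""
    total = stages[0][1]
    for check, candidates, _ in stages:
        if check == name:
            return candidates / total
    raise KeyError(name)


if __name__ == "__main__":
    for label, stages in (("2-byte", TABLE1_TWO_BYTES), ("1-byte", TABLE2_ONE_BYTE)):
        share = fraction_reaching(stages, "Inflate")
        print(f"{label}: {share:.2e} of all candidates reach the inflate step")
```

With these counts, 88 candidates reach the costly inflate step in the 2-byte case against 20837 in the 1-byte case, roughly 237 times as many.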
Table 1: Number of rejected candidates per check for 2-byte
hashes (Total 543’257’459 candidates).

Check        Candidates   Rejected
IV Check 1   543257459    541136146 (99.6%)
IV Check 2   2121313      2113156 (99.6%)
CODEx        8157         8069 (98.9%)
CODE0        2024         1948 (96.2%)
CODE1        2054         2050 (99.8%)
CODE2        2026         2018 (99.6%)
CODE3        2053         2053 (100%)
Inflate      88           83 (94.3%)
CRC32        5            4 (80.0%)

Table 2: Number of rejected candidates per check for 1-byte
hashes (Total 543’257’459 candidates).

Check        Candidates   Rejected
IV Check 1   543257459    – (–%)
IV Check 2   543257459    541136764 (99.6%)
CODEx        2120695      2099858 (99.0%)
CODE0        530845       514353 (96.8%)
CODE1        530247       527827 (99.5%)
CODE2        529821       527896 (99.6%)
CODE3        529782       529782 (100%)
Inflate      20837        18984 (91.1%)
CRC32        1853         1852 (99.9%)

4.3 OpenCL Implementations
The first challenge in porting the PKZIP cracking process
to OpenCL was the ability to inflate data. In CPU
implementations, the libzip library bindings⁵ can be used
to achieve this, but in OpenCL this is not possible. The
libzip implementation is quite complex and heavily
dependent on all the components in the library. Miniz
(Miniz, 2018) is an alternative implementation that is very
compact and released under the MIT license. We extracted
only the functions required for inflation and included them
in our OpenCL kernel. Getting the code to work in OpenCL
required some minor modifications to remove specific
pointer casts that crashed the kernel.

⁵ The libzip library is widely used to handle/modify/create
zip archives and is provided as a C implementation which
can be used by many other languages and applications.
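The deflate-level checks (CODEx, Inflate and CRC32) can be sketched on the CPU side as follows. This is a rough Python illustration in which the standard `zlib` module stands in for the Miniz-derived inflate code of the kernel. The function names are hypothetical, and the CODE1/CODE2 lookups are omitted because their tables are not described here.

```python
import zlib


def block_type(first_byte):
    """BTYPE is stored in the 2nd and 3rd bit; the 1st bit is BFINAL."""
    return (first_byte >> 1) & 0b11


def passes_code_precheck(first_byte):
    """Cheap checks on the first byte of the decrypted deflate stream."""
    btype = block_type(first_byte)
    if btype == 0:
        # CODE0: for a stored block only the first bit may be set.
        return (first_byte & 0xFE) == 0
    if btype == 3:
        # CODE3: reserved value, never valid.
        return False
    # CODE1 / CODE2: the Huffman validity checks are left out of this sketch.
    return True


def inflate_and_verify(compressed, expected_crc32):
    """Return True only if the data inflates cleanly and the CRC32 matches."""
    if not compressed or not passes_code_precheck(compressed[0]):
        return False
    try:
        # Negative window bits select a raw deflate stream (no zlib header).
        inflater = zlib.decompressobj(-zlib.MAX_WBITS)
        data = inflater.decompress(compressed) + inflater.flush()
    except zlib.error:
        return False  # Inflate check: any decoding problem rejects the candidate
    if not inflater.eof:
        return False  # stream ended before the final block was complete
    return zlib.crc32(data) == expected_crc32
```

The ordering matters more than the individual checks: the single-byte test runs before any call into the inflate routine, and the CRC32 over the full file is computed last.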
Because zip2john outputs a variety of inputs (hash
formats), the OpenCL implementation differs from case to
case, as some checks and code parts are only needed in
specific cases. To always use the implementation best
suited to the hash type and obtain the best possible
speed, we decided to split all the hash variants into
three families.
Compressed. If there is a single compressed file in
the archive, the normal attack described in Figure 3
and Section 4.2 is performed.
Uncompressed. If there is a single uncompressed file
in the archive, we cannot benefit from the CODEx
checks, but we also do not have to include the inflate
code in the kernel. The kernel therefore runs slower
(depending on the file size), but it is also cleaner
and does not require the CODEx lookup tables.
Multifile. The process can use the 1-byte (or 2-byte)
checksums of the files without knowing more than the first
few bytes of each file. John the Ripper already proposed
a similar approach for exactly 3 files with 2-byte
checksums available. The size of the output space,
namely $(2^{8} \times 2^{8})^{3} = 2^{48}$, is relatively
small. Therefore, this approach would quickly lead to collisions
with our GPU implementation. For example, a col-
ICISSP 2019 - 5th International Conference on Information Systems Security and Privacy