value of $\mathrm{BUC_{MAX}}$ is determined according to the size
of the cache memory so that during the bucket-pop
operation, cache-miss rates are minimized. During
each insertion, we check whether any bucket exceeds its capacity. If it does, we pop all the elements from that bucket and update the sieving array at the stored locations. This strategy also eliminates the need for malloc and free operations on bucket entries after individual insert and pop operations. These memory operations are atomic, so avoiding them inside the loop boosts parallelism in multi-threaded implementations.
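The flush-on-full strategy described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the names (`Bucket`, `bucket_insert`, `flush_bucket`), the capacity value, and the sieving-array size are all assumptions.

```c
#define BUC_MAX 4096              /* assumed capacity, tuned to cache size */
#define SIEVE_LEN (1 << 20)       /* assumed sieving-array length */

typedef struct {
    unsigned idx[BUC_MAX];        /* stored sieving-array locations */
    unsigned char logp[BUC_MAX];  /* log p to subtract at each location */
    int count;
} Bucket;

static unsigned char sieve[SIEVE_LEN];

/* Pop every stored entry and apply its log subtraction
 * to the sieving array. */
static void flush_bucket(Bucket *b) {
    for (int i = 0; i < b->count; i++)
        sieve[b->idx[i]] -= b->logp[i];
    b->count = 0;
}

/* Insert one entry; if the bucket is full, flush it first.
 * The fixed-size arrays mean no malloc/free ever happens
 * for individual insert and pop operations. */
static void bucket_insert(Bucket *b, unsigned idx, unsigned char logp) {
    if (b->count == BUC_MAX)
        flush_bucket(b);
    b->idx[b->count] = idx;
    b->logp[b->count] = logp;
    b->count++;
}
```

Because each bucket is a preallocated array, the hot loop touches only memory it owns, which is what makes the approach friendly to multi-threading.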
Algorithm 2 summarizes these implementation
ideas. The workings of the pseudo-functions used in
this algorithm are explained in Table 2.
4 EXPERIMENTAL RESULTS
4.1 Hardware and Software Setup
We use Intel’s Xeon Gold Series (Model No. 6130)
processor clocked at 2.10 GHz with an L3 Cache of
size 22 MB. The gcc compiler (version 9.2.0), the GMP library (version 6.1.2), and the OpenMP API (version 4.5) are used. For calculating the prime ideals, we use Victor Shoup's NTL library (version 11.3.2) (Shoup et al., 2020). The optimization flag -O3 and the intrinsic flag -mavx=native are used. In the multi-core
implementations, we use all of the 16 cores of a sin-
gle processor. The operating system is CentOS Linux
release 7.4.1708 (Core).
4.2 Data Setup
As a test bench, we consider the two numbers RSA-512 and RSA-768, which were factored as reported in (Cavallar et al., 2000) and (Kleinjung et al., 2010). In each case, we consider the same polynomials that were used in the actual factorization attempts. A suitable partitioning of the factor base between block- and bucket-sieving primes has a major impact on the overall running time. We vary the small-versus-large demarcation boundary $\mathrm{FB}_S$ based on the sieving range $\mathrm{MAX}_A$ across various test cases.
For our multi-threaded implementation, we use
the OpenMP directive #pragma omp parallel for
to launch 16 threads expected to map to the individ-
ual cores. We allocate different segments of b values
to the different threads in order to avoid concurrent
writes. The read-only p and log p arrays are shared
by all the threads, so that they can stay loaded in the
cache. We have chosen the same upper limit for both the factor bases: $B_r = B_a = \mathrm{MAX_{FB}}$.
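The thread layout described above can be sketched as follows. The range bound and the stand-in `sieve_line` routine are illustrative assumptions; the real per-line work updates the sieving array using the shared p and log p tables.

```c
#define B_RANGE 1024  /* assumed number of b values to sieve */

/* Illustrative stand-in for sieving one line b; returns a
 * checkable value instead of updating a sieving array. */
static long sieve_line(int b) { return (long)b; }

/* Static scheduling hands each of the 16 threads a contiguous,
 * disjoint segment of b values, so no two threads write to the
 * same sieving segment; read-only tables would be shared. */
static long parallel_sieve(void) {
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total) num_threads(16)
    for (int b = 0; b < B_RANGE; b++)
        total += sieve_line(b);
    return total;
}
```

Without OpenMP enabled, the pragma is ignored and the loop runs sequentially with the same result, which makes the partitioning easy to validate.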
4.3 Timing Results
Table 3 reports the timings $T^{\pm v}_{\pm b}$ of our implementations of sieving. The subscript indicates whether cache-friendly (block/bucket) sieving is used ($+b$) or not ($-b$), whereas the superscript indicates whether vectorization is used ($+v$) or not ($-v$). For example, $T^{-v}_{+b}$ indicates the timing of our non-vectorized implementation with block and bucket sieving. All
the times are in seconds, and stand for the com-
bined times of rational sieving and algebraic siev-
ing. Each sieving includes the time taken by the pre-
computation of initial indices, index increments and
log subtractions, and locating potential sieving loca-
tions. The time for final trial divisions (relation gen-
eration) is excluded here. The number of threads uti-
lized is denoted as N
θ
. Each of the reported times is
the average over 100 test cases.
Based on these four sets of timings, we calculate four relevant sets of speedup figures. The speedup of $[T+]$ over $[T-]$ is calculated as $\frac{[T-] - [T+]}{[T-]} \times 100\%$, where both the signs $\pm$ appear either in the subscript or in the superscript, with the other kept unchanged. For example, $\psi_{+b} = \left(\frac{T^{-v}_{+b} - T^{+v}_{+b}}{T^{-v}_{+b}}\right) \times 100\%$ indicates the speedup obtained by vectorization on cache-friendly sieving, and $\psi^{-v} = \left(\frac{T^{-v}_{-b} - T^{-v}_{+b}}{T^{-v}_{-b}}\right) \times 100\%$ indicates the speedup obtained by cache-friendly sieving without vectorization.
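As a sanity check on the definition above, the speedup computation reduces to a one-line helper; the sample timings in the usage note are made up for illustration.

```c
/* Percentage speedup of timing t_plus over t_minus, following
 * the definition (t_minus - t_plus) / t_minus * 100%. */
static double speedup(double t_minus, double t_plus) {
    return (t_minus - t_plus) / t_minus * 100.0;
}
```

For instance, if a non-vectorized run takes 200 s and the vectorized run 100 s, the speedup is 50%.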
The experimental data establishes two facts. First,
AVX-512-based vectorization achieves a speedup of
up to 56% in non-cache-friendly sieving and up to
25% in cache-friendly block and bucket sieving over
non-vectorized implementations. Second, the effec-
tiveness of cache-friendly sieving is manifested by a
speedup of up to 63%, both with and without vectorization. In particular, the best running times are obtained with both cache-friendly sieving and vectorization (the column headed $T^{+v}_{+b}$).
5 CONCLUSION
In this paper, we report the practical effectiveness
of block and bucket sieving and AVX-512-based
vectorization. This study establishes the usefulness of exploiting the latest hardware features for implementing time-consuming algorithms like the GNFSM for factoring integers. There are several ways in which our study can be extended. Both cache-friendly sieving
SECRYPT 2021 - 18th International Conference on Security and Cryptography