GMP library (gmp, ). Curve operations simply follow the formulas provided in (hyp, ) corresponding to inverted projective coordinates in twisted Edwards curves.
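To illustrate what an addition in inverted coordinates looks like, the following minimal sketch represents an affine point $(x, y)$ with $xy \neq 0$ as $(X : Y : Z)$ with $x = Z/X$ and $y = Z/Y$, and checks the inversion-free addition against the affine twisted Edwards law. The toy curve over $\mathbb{F}_{101}$, the constants `P_MOD`, `CURVE_A`, `CURVE_D` and all function names are assumptions chosen for illustration; they are not the Curve25519 parameters or the code used in our implementation.

```python
# Hypothetical toy curve: CURVE_A*x^2 + y^2 = 1 + CURVE_D*x^2*y^2 over F_101.
# These parameters are illustrative only (not Curve25519).
P_MOD, CURVE_A, CURVE_D = 101, -1, 45

def inv(v):
    """Modular inverse in F_p (requires Python 3.8+)."""
    return pow(v % P_MOD, -1, P_MOD)

def add_affine(p1, p2):
    """Reference affine addition law on the twisted Edwards curve."""
    (x1, y1), (x2, y2) = p1, p2
    t = CURVE_D * x1 * x2 * y1 * y2 % P_MOD
    x3 = (x1 * y2 + y1 * x2) * inv(1 + t) % P_MOD
    y3 = (y1 * y2 - CURVE_A * x1 * x2) * inv(1 - t) % P_MOD
    return x3, y3

def to_inverted(pt):
    """Affine (x, y), x*y != 0  ->  (X : Y : Z) = (y : x : x*y)."""
    x, y = pt
    return y % P_MOD, x % P_MOD, x * y % P_MOD

def from_inverted(pt):
    """(X : Y : Z) -> affine (Z/X, Z/Y)."""
    X, Y, Z = pt
    return Z * inv(X) % P_MOD, Z * inv(Y) % P_MOD

def add_inverted(p1, p2):
    """Addition in inverted coordinates, without any field inversion."""
    X1, Y1, Z1 = p1
    X2, Y2, Z2 = p2
    zz = Z1 * Z2 % P_MOD                      # A = Z1*Z2
    dzz = CURVE_D * zz * zz % P_MOD           # B = d*A^2
    xx = X1 * X2 % P_MOD                      # C = X1*X2
    yy = Y1 * Y2 % P_MOD                      # D = Y1*Y2
    e = xx * yy % P_MOD                       # E = C*D
    h = (xx - CURVE_A * yy) % P_MOD           # H = C - a*D
    i = ((X1 + Y1) * (X2 + Y2) - xx - yy) % P_MOD  # I = X1*Y2 + Y1*X2
    return (e + dzz) * h % P_MOD, (e - dzz) * i % P_MOD, zz * h * i % P_MOD
```

With the base point `(2, 3)`, `from_inverted(add_inverted(to_inverted(P), to_inverted(P)))` agrees with `add_affine(P, P)`. The toy curve is not complete, so this sketch does not handle the exceptional cases of the addition law.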
4.2 Implementations for $E(\mathbb{F}_{2^m})$
Our implementations deal with the NIST curves B233 and B409, defined over the fields $\mathbb{F}_{2^{233}} = \mathbb{F}_2[x]/(x^{233} + x^{74} + 1)$ and $\mathbb{F}_{2^{409}} = \mathbb{F}_2[x]/(x^{409} + x^{87} + 1)$, respectively. For a field multiplication, we apply a small number of recursions of the Karatsuba algorithm, which breaks the m-bit polynomial multiplication into several 64-bit polynomial multiplications. These 64-bit multiplications are computed with the PCLMULQDQ carry-less multiplication instruction, available on Intel Core i7 processors. Due to the special form of the irreducible polynomials, the reduction is done with a small number of shifts and bitwise XORs on 64-bit words. We compute the field inversion with the Itoh-Tsujii algorithm, that is, a sequence of field multiplications and multisquarings performed with look-up tables. For field squaring, square root and the quadratic solver (needed in halvings), we also use a look-up table method, which was the fastest option in our tests. For the curve operations, we use the projective lambda coordinates with the corresponding formulas provided in (Oliveira et al., 2014).
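The field arithmetic described above can be sketched as follows for $\mathbb{F}_{2^{233}}$, with polynomials stored as Python integers (bit $i$ is the coefficient of $x^i$). This is an illustrative model only, not our optimized code: `clmul` is a software stand-in for one PCLMULQDQ call, the squaring table works byte by byte, and the inversion uses repeated single squarings where the implementation uses table-based multisquarings. All function names are assumptions.

```python
M = 233  # field F_2[x] / (x^233 + x^74 + 1)

def clmul(a, b):
    """Carry-less product in F_2[x]; models PCLMULQDQ when a, b < 2**64."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, nbits=256):
    """Karatsuba over F_2[x]: two recursion levels reach 64-bit operands."""
    if nbits <= 64:
        return clmul(a, b)
    h = nbits // 2
    mask = (1 << h) - 1
    a0, a1, b0, b1 = a & mask, a >> h, b & mask, b >> h
    lo = karatsuba(a0, b0, h)
    hi = karatsuba(a1, b1, h)
    mid = karatsuba(a0 ^ a1, b0 ^ b1, h) ^ lo ^ hi  # additions are XORs
    return lo ^ (mid << h) ^ (hi << (2 * h))

def reduce233(c):
    """Reduce with shifts and XORs, using x^233 = x^74 + 1."""
    while c >> M:
        hi = c >> M
        c = (c & ((1 << M) - 1)) ^ (hi << 74) ^ hi
    return c

def mul(a, b):
    return reduce233(karatsuba(a, b))

# Squaring spreads the bits: sum a_i x^i  ->  sum a_i x^(2i).
SQ_TABLE = [sum(((byte >> i) & 1) << (2 * i) for i in range(8))
            for byte in range(256)]

def square(a):
    """Table-based squaring, one byte at a time, followed by reduction."""
    r, shift = 0, 0
    while a:
        r |= SQ_TABLE[a & 0xFF] << shift
        a >>= 8
        shift += 16
    return reduce233(r)

def inverse(a):
    """Itoh-Tsujii: a^-1 = (a^(2^(M-1) - 1))^2 with beta_k = a^(2^k - 1)."""
    beta, k = a, 1
    for bit in bin(M - 1)[3:]:          # bits of M-1 below the leading one
        t = beta
        for _ in range(k):              # multisquaring: beta_k^(2^k)
            t = square(t)
        beta, k = mul(t, beta), 2 * k   # beta_2k = beta_k^(2^k) * beta_k
        if bit == "1":
            beta, k = mul(square(beta), a), k + 1
    return square(beta)
```

For any nonzero element `a`, `mul(a, inverse(a))` returns `1`. With a 256-bit operand size, the two Karatsuba levels produce nine 64-bit products per field multiplication.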
4.3 Timing Results for the Proposed Parallel Approaches
Table 4 reports the timings obtained for the three parallel approaches discussed in Section 3. We also provide the timings of the two-thread parallel (double,halve)-and-add approach with w = 4 for B233 and B409, and the timings of the non-parallelized double-and-add approach with w = 2, 3 and 4 for $E(\mathbb{F}_p)$. For each parallel scalar multiplication, we give the split value $s$ (and $s'$ for the three-thread case). Additionally, we provide timings found in the literature for the same processor and for similar curves and fields.
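The window size w above is the width of the NAF recoding of the scalar. As a reminder, the standard width-w NAF recoding can be sketched as below; the function name `wnaf` is an assumption and this is a textbook version, not the recoding routine of our implementation.

```python
def wnaf(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first.

    Every nonzero digit is odd, lies in (-2^(w-1), 2^(w-1)), and is
    followed by at least w-1 zero digits.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)          # k mod 2^w
            if d >= 1 << (w - 1):     # map to the signed residue
                d -= 1 << w
            k -= d                    # now k is divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

For example, `wnaf(7, 2)` returns `[-1, 0, 0, 1]`, that is, 7 = 8 - 1. A larger w gives fewer nonzero digits (fewer additions) at the cost of more precomputed points.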
Concerning the curve B233, the proposed parallelization does not show any speed-up compared to the two-thread (double,halve)-and-add approach (106 versus 104 in Table 4, in units of $10^3$ CC). This could be explained by the cost induced by thread management. On the other hand, the approach is clearly effective for the curve B409: its timing (303) is better than all the timings found in the literature for the (double,halve)-and-add approach (cf. Section 2).
Table 4: Timings (in $10^3$ clock-cycles (CC)) of parallel approaches over $E(\mathbb{F}_{2^m})$ and $E(\mathbb{F}_p)$.

| Source | Curve | Method | NAF size w | #CC/$10^3$ | Split $s$ | Split $s'$ | Nb of cores |
|---|---|---|---|---|---|---|---|
| proposed | B233 | three-thread | 4 | 106 | 110 | 83 | 3 |
| our code | B233 | (db,hv)-&-add | 4 | 104 | 98 | - | 2 |
| Taverne et al. | B233 | (db,hv)-&-add | 4 | 100 | - | - | 2 |
| Negre et al. | B233 | (db,hv)-&-add | 4 | 117 | - | - | 2 |
| proposed | B409 | three-thread | 4 | 303 | 187 | 143 | 3 |
| our code | B409 | (db,hv)-&-add | 4 | 338 | 175 | - | 2 |
| Taverne et al. | B409 | (db,hv)-&-add | 4 | 349 | - | - | 2 |
| Negre et al. | B409 | (db,hv)-&-add | 4 | 452 | - | - | 2 |
| proposed | C25519 | two-thread | 2 | 186 | 185 | - | 2 |
| proposed | C25519 | opt-two-thd | 2 | 180 | 168 | - | 2 |
| our code | C25519 | db-&-add | 4 | 239 | - | - | 1 |
| our code | C25519 | db-&-add | 3 | 219 | - | - | 1 |
| our code | C25519 | db-&-add | 2 | 221 | - | - | 1 |
| Langley (*) | C25519 | Montg. ladder | - | 229 | - | - | 1 |
| Bernstein | C25519 | Montg. ladder | - | 194 | - | - | 1 |
| Hamburg | Mtg251 | Montg. ladder | - | 153 | - | - | 1 |

(*) Compiled and run on our platform.

In the case of Curve25519, the proposed optimizations behave as expected: the two-thread approach with w = 2
is 14% faster than the double-and-add with w = 2, and the optimized-two-thread approach has a speed-up of 17%. The speed-up is smaller than the one predicted by the value of α. This might be due to thread management and, in the case of the optimized-two-thread approach, to the penalty of the costly square-root computation. Our approach compares favorably with the code of Langley and Bernstein, but not with the timings of Hamburg. However, the approach of Hamburg involves a smaller field and also a smaller key length.
5 CONCLUSION
We have presented in this paper parallel approaches to speed up the scalar multiplication in $E(\mathbb{F}_{2^m})$ and $E(\mathbb{F}_p)$. The proposed parallelizations split the scalar into two or three parts. Each part of the scalar multiplication is then performed in parallel, the upper part requiring an additional sequence of doublings or halvings. These approaches have been implemented on an Intel Core i7, and the resulting timings show that the proposed parallelization is effective for the NIST curve B409 and for the twisted Edwards curve Curve25519 defined over $\mathbb{F}_p$ with $p = 2^{255} - 19$.
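As a recap of the two-part split, write $k = k_0 + 2^s k_1$, so that $[k]P = [k_0]P + [2^s]([k_1]P)$: the two partial scalar multiplications are independent, and the upper part needs $s$ extra doublings. The sketch below illustrates this structure only. The toy complete Edwards curve over $\mathbb{F}_{101}$, the plain double-and-add loop and all names are assumptions for illustration, and Python threads show the task decomposition without delivering the speed-up of the native implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy curve x^2 + y^2 = 1 + 34*x^2*y^2 over F_101
# (d = 34 is a non-square, so the addition law has no exceptional cases).
P_MOD, CURVE_D = 101, 34
NEUTRAL = (0, 1)

def add(p1, p2):
    """Complete affine Edwards addition (also used for doubling)."""
    (x1, y1), (x2, y2) = p1, p2
    t = CURVE_D * x1 * x2 * y1 * y2 % P_MOD
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % P_MOD, -1, P_MOD) % P_MOD
    y3 = (y1 * y2 - x1 * x2) * pow((1 - t) % P_MOD, -1, P_MOD) % P_MOD
    return x3, y3

def scalar_mul(k, pt):
    """Left-to-right double-and-add, k >= 0."""
    acc = NEUTRAL
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, pt)
    return acc

def upper_part(k1, pt, s):
    """[k1]P followed by the s additional doublings: [2^s * k1]P."""
    acc = scalar_mul(k1, pt)
    for _ in range(s):
        acc = add(acc, acc)
    return acc

def split_scalar_mul(k, pt, s):
    """Two-thread scalar multiplication with split value s."""
    k0, k1 = k & ((1 << s) - 1), k >> s
    with ThreadPoolExecutor(max_workers=2) as pool:
        low = pool.submit(scalar_mul, k0, pt)       # lower s bits
        high = pool.submit(upper_part, k1, pt, s)   # upper bits + s doublings
        return add(low.result(), high.result())
```

For any split value, `split_scalar_mul(k, P, s)` equals `scalar_mul(k, P)`. The choice of s matters because the upper thread carries the extra doublings, which is why the split values in Table 4 are not simply half the scalar length.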
REFERENCES
Explicit formula database. http://www.hyperelliptic.org/EFD/.
SECRYPT 2015 - International Conference on Security and Cryptography