Parallel Approaches for Efficient Scalar Multiplication over Elliptic
Curve
Christophe Negre and Jean-Marc Robert
Equipe DALI, Université de Perpignan, Perpignan, France
LIRMM, Univ. de Montpellier, CNRS, Montpellier, France
Keywords:
Elliptic Curve, Parallel Implementation, Scalar Multiplication, Prime Field, Binary Field.
Abstract:
This paper deals with the parallel implementation of scalar multiplication over an elliptic curve. We present
parallel approaches which split the scalar into two parts for $E(\mathbb{F}_p)$ or three parts for $E(\mathbb{F}_{2^m})$ and perform in
parallel the scalar multiplication with each part of the scalar. We present timing results of these approaches
implemented on an Intel Core i7 for the NIST binary curves B233 and B409 and for the twisted Edwards curve
Curve25519 (Bernstein, 2006). For the curves B409 and Curve25519, the proposed approaches reduce the
computation time of the scalar multiplication by at least 10%.
1 INTRODUCTION
Elliptic curve cryptographic protocols generally involve one or two scalar multiplications on the curve. A scalar multiplication consists in computing $kP = P + \cdots + P$ where $k$ is an integer of several hundred bits and $P$ is a point on the curve. This is generally performed through a sequence of hundreds of doublings and additions on the curve using the so-called double-and-add algorithm. It is thus a time-consuming part of the cryptographic protocols.
Current and future processors in personal computers and embedded devices will include an increasing number of cores, enabling more parallelism. The implementation of scalar multiplication should be adapted to these new platforms. A recent work of Taverne et al. (Taverne et al., 2011) on binary elliptic curves showed that a curve scalar multiplication can be parallelized through a (double,halve)-and-add approach, resulting in an interesting speed-up. GLV (Gallant et al., 2001) is another popular approach to perform a scalar multiplication in a parallel fashion (Longa and Sica, 2014; Oliveira et al., 2014), but unfortunately GLV cannot be used on standard curves (P. Gallagher and Furlani, 2009).
We explore in this paper an alternative parallel approach. We split the scalar $k$ into an upper and a lower part, and we compute the scalar multiplication of each part in parallel. This requires an additional sequence of doublings or halvings to finish the upper part computations before adding the two computed points. We have experimented with this approach and the resulting timings show some interesting speed-up for the NIST curve B409 and also for the twisted Edwards curve Curve25519 (Bernstein, 2006).
The remainder of the paper is organized as follows: in Section 2 we review basic results on elliptic curves over binary and prime fields. In Section 3 we present our approaches for the parallelization of scalar multiplication. In Section 4 we provide experimental results and we end the paper with some concluding remarks in Section 5.
2 REVIEW OF ELLIPTIC CURVE
SCALAR MULTIPLICATION
An elliptic curve over a finite field $\mathbb{F}_q$ is the set of points $(x, y) \in \mathbb{F}_q^2$ satisfying a smooth equation of degree 3, plus a point at infinity $\mathcal{O}$. A commutative group law can be defined on this set of points using the chord and tangent method. This provides doubling and addition formulas consisting of a number of field operations on the coordinates of the points. The complexity of these formulas can be reduced by choosing an appropriate curve equation and system of coordinates. In the remainder of this section, we review curve equations and point operation formulas for elliptic curves defined over binary fields and prime fields. We also review classical algorithms for scalar multiplication.
Negre C. and Robert J..
Parallel Approaches for Efficient Scalar Multiplication over Elliptic Curve.
DOI: 10.5220/0005512502020209
In Proceedings of the 12th International Conference on Security and Cryptography (SECRYPT-2015), pages 202-209
ISBN: 978-989-758-117-5
Copyright © 2015 SCITEPRESS (Science and Technology Publications, Lda.)
Table 1: Complexity of curve operations in $E(\mathbb{F}_p)$.
Curve form | Coordinates | Doubling | Mixed addition | Addition
Weierstrass ($a = -3$) | Jacobian | 4M + 4S | 8M + 3S | 12M + 4S
Twisted Edwards | Inverted | 3M + 4S + $M_a$ + $M_{2d}$ | 8M + 1S + $M_a$ + $M_d$ | 9M + 1S + $M_a$ + $M_d$
Jacobi quartic | XXYZZ coord. | 3M + 4S | 6M + 3S + $M_{a1}$ | 7M + 4S + $M_{a1}$
Montgomery | Projective | 2M + $M_{(a+2)/4}$ + 2S | - | 4M + 2S
2.1 Point Operations in $E(\mathbb{F}_p)$
An elliptic curve $E(\mathbb{F}_p)$ over a prime field $\mathbb{F}_p$ is generally defined by a short Weierstrass equation:
\[ E : y^2 = x^3 + ax + b \quad \text{with } (a,b) \in \mathbb{F}_p^2. \]
In this case the addition and doubling formulas given by the chord and tangent method are as follows: let $P_1 = (x_1,y_1)$, $P_2 = (x_2,y_2)$ and $P_3 = (x_3,y_3)$ be three points on $E(\mathbb{F}_p)$ such that $P_3 = P_1 + P_2$; then we have
\[ x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda(x_1 - x_3) - y_1, \]
where
\[ \lambda = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P_1 \neq P_2, \\[2mm] \dfrac{3x_1^2 + a}{2y_1} & \text{if } P_1 = P_2. \end{cases} \tag{1} \]
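The chord and tangent formulas (1) can be sketched in a few lines of Python. This is only an illustration: the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, the names `P_MOD`, `A`, `B`, `on_curve` and `add`, and the use of `None` for the point at infinity are assumptions made for this example and do not come from the paper.

```python
# Toy parameters (assumption for illustration): y^2 = x^3 + 2x + 3 over F_97.
P_MOD, A, B = 97, 2, 3

def on_curve(pt):
    """True if pt is the point at infinity (None) or satisfies the curve equation."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def add(p1, p2):
    """Affine addition/doubling following equation (1)."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                       # P2 = -P1, the sum is the point at infinity
    if p1 != p2:                          # chord: lambda = (y2 - y1) / (x2 - x1)
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                 # tangent: lambda = (3 x1^2 + a) / (2 y1)
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)
```

For instance, with the point `(3, 6)`, which lies on this toy curve, `add((3, 6), (3, 6))` returns a point that again satisfies `on_curve`. The single field inversion per operation, here `pow(..., -1, P_MOD)`, is the cost that the projective coordinate systems discussed below are designed to avoid.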
In the sequel we will use the following two alternative elliptic curve equations:
Twisted Edwards (Hisil et al., 2008). The twisted Edwards elliptic curve equation is as follows:
\[ ax^2 + y^2 = 1 + dx^2y^2 \]
where $a$ and $d$ are two elements of $\mathbb{F}_p$. In this case there is a unified addition and doubling formula: if $P_1 = (x_1,y_1)$ and $P_2 = (x_2,y_2)$ are in $E(\mathbb{F}_p)$, then the affine coordinates $(x_3,y_3)$ of $P_3 = P_1 + P_2$ are as follows:
\[ x_3 = \frac{x_1y_2 + y_1x_2}{1 + dx_1x_2y_1y_2}, \qquad y_3 = \frac{y_1y_2 - ax_1x_2}{1 - dx_1x_2y_1y_2}. \tag{2} \]
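The unified formula (2) is short enough to check exhaustively on a toy curve. The sketch below assumes the parameters $p = 13$, $a = -1$, $d = 2$ (chosen here so that $a$ is a square and $d$ a non-square, which keeps the denominators non-zero); these values and the names `edwards_add`, `on_curve` and `POINTS` are illustrative and are not taken from the paper.

```python
# Toy twisted Edwards curve (assumed parameters): a*x^2 + y^2 = 1 + d*x^2*y^2 over F_13.
P_MOD = 13
A = 12   # a = -1 mod 13, a square (5^2 = 25 = 12 mod 13)
D = 2    # d = 2, a non-square mod 13

def on_curve(pt):
    x, y = pt
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P_MOD == 0

def edwards_add(p1, p2):
    """Unified addition/doubling of equation (2) in affine coordinates."""
    (x1, y1), (x2, y2) = p1, p2
    t = D * x1 * x2 * y1 * y2 % P_MOD
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % P_MOD, -1, P_MOD) % P_MOD
    y3 = (y1 * y2 - A * x1 * x2) * pow((1 - t) % P_MOD, -1, P_MOD) % P_MOD
    return (x3, y3)

# All affine points of the toy curve, found by brute force; (0, 1) is the neutral element.
POINTS = [(x, y) for x in range(P_MOD) for y in range(P_MOD) if on_curve((x, y))]
```

The same function handles $P_1 = P_2$ and $P_1 \neq P_2$, which is what "unified" means here: no case distinction such as the one in (1) is needed.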
Montgomery curves (Montgomery, 1987). We consider Montgomery elliptic curves which have the following form:
\[ E : by^2 = x^3 + ax^2 + x \quad \text{with } (a,b) \in \mathbb{F}_p^2. \]
Montgomery noticed in (Montgomery, 1987) that for $Q = (x_Q,y_Q)$ the x-coordinate of $2Q$ depends only on $x_Q$:
\[ x_{2Q} = \frac{(x_Q^2 - 1)^2}{4x_Q(x_Q^2 + ax_Q + 1)}. \tag{3} \]
Moreover, Montgomery noticed that given two points $Q = (x_Q,y_Q)$ and $P = (x_P,y_P)$, if we know the difference $R = Q - P$, then we can compute the x-coordinate of $Q + P$ as follows:
\[ x_{Q+P} = \frac{(x_Qx_P - 1)^2}{x_R(x_P - x_Q)^2}. \tag{4} \]
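Formulas (3) and (4) can be checked against ordinary chord and tangent addition on a small Montgomery curve. The following sketch assumes the toy curve $y^2 = x^3 + 6x^2 + x$ over $\mathbb{F}_{101}$; the parameters and the helper names (`inv`, `affine_add`, `x_double`, `x_diff_add`, `POINTS`) are hypothetical and only serve this illustration.

```python
# Toy Montgomery curve (assumed parameters): B*y^2 = x^3 + A*x^2 + x over F_101.
P_MOD, A, B = 101, 6, 1

def inv(v):
    """Modular inverse in F_p (raises ValueError on zero)."""
    return pow(v % P_MOD, -1, P_MOD)

def affine_add(p, q):
    """Reference chord and tangent addition; None is the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p == q:
        lam = (3 * x1 * x1 + 2 * A * x1 + 1) * inv(2 * B * y1) % P_MOD
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P_MOD
    x3 = (B * lam * lam - A - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def x_double(xq):
    """Equation (3): x-coordinate of 2Q from x_Q only."""
    return (xq * xq - 1) ** 2 * inv(4 * xq * (xq * xq + A * xq + 1)) % P_MOD

def x_diff_add(xq, xp, xr):
    """Equation (4): x-coordinate of Q + P from x_Q, x_P and x_R with R = Q - P."""
    return (xq * xp - 1) ** 2 * inv(xr * (xp - xq) ** 2) % P_MOD

# All affine points of the toy curve, found by brute force.
POINTS = [(x, y) for x in range(P_MOD) for y in range(P_MOD)
          if (B * y * y - (x ** 3 + A * x * x + x)) % P_MOD == 0]
```

Because neither `x_double` nor `x_diff_add` ever touches a y-coordinate, a chain of such operations yields only the x-coordinate of the result, which is the source of the sign ambiguity handled in Subsection 3.2.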
Montgomery curves and twisted Edwards curves are birationally equivalent. A conversion of the point coordinates between these two curves requires only a few field operations.
The formulas in (1), (2) and (3) involve a field inversion, which is generally a costly operation. Alternative coordinate systems have been proposed in the case of Weierstrass, Montgomery, twisted Edwards and Jacobi quartic curves in order to avoid field inversions and reduce the number of field multiplications:
Extended coordinate system, where a quadruplet $(X,Y,Z,T)$ represents the point $x = X/Z$, $y = Y/Z$ with $T/Z = xy$.
Regular projective coordinates, where a triplet $(X,Y,Z)$ represents the point $x = X/Z$, $y = Y/Z$.
Inverted coordinates, where a triplet $(X,Y,Z)$ corresponds to the point $x = Z/X$, $y = Z/Y$.
The costs of a doubling, an addition and a mixed addition in these coordinate systems are given in Table 1.
2.2 Operations in Binary Elliptic
Curves
An elliptic curve over a binary field is generally defined by the following short Weierstrass equation:
\[ E : y^2 + xy = x^3 + ax^2 + b \quad \text{with } a,b \in \mathbb{F}_{2^m}. \tag{5} \]
The formulas for addition, doubling and halving on $E(\mathbb{F}_{2^m})$ are as follows:
Addition and doubling. We consider two points $P_1 = (x_1,y_1)$ and $P_2 = (x_2,y_2)$ on the curve $E(\mathbb{F}_{2^m})$. The coordinates $(x_3,y_3)$ of $P_3 = P_1 + P_2$ are given by the following formula:
\[ x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = (x_1 + x_3)\lambda + x_3 + y_1, \]
\[ \text{where } \lambda = \begin{cases} \dfrac{y_1 + y_2}{x_1 + x_2} & \text{if } P_1 \neq P_2, \\[2mm] \dfrac{y_1}{x_1} + x_1 & \text{if } P_1 = P_2. \end{cases} \tag{6} \]
Point halving. In (Knudsen, 1999), Knudsen noticed that if a point $Q = (u, v)$ on $E(\mathbb{F}_{2^m})$ is of odd order, then the point $P = \frac{1}{2}Q$ in the subgroup
ParallelApproachesforEfficientScalarMultiplicationoverEllipticCurve
203
Table 2: Complexity of curve operations in $E(\mathbb{F}_{2^m})$ when $a = 1$.
Coordinates | Doubling | Halving | Mixed-addition | Doubling-mixed-addition | Addition
Affine | 2M + I | 1M + 1SqRt + 1QS | - | 4M + 2I | 2M + I
Lopez-Dahab | 4M + 5S | - | 9M + 4S | 13M + 4S | 13M + 9S
Kim-Kim | 4M + 5S | - | 8M + 4S | 12M + 9S | 12M + 4S
Lambda proj. | 4M + 4S | - | 8M + 2S | 10M + 6S | 11M + 4S
generated by $Q$ is well defined. Knudsen also noticed that the coordinates of $P$ can be computed efficiently from the coordinates of $Q$. Indeed, since $Q = 2P$, we can use (6) to relate the coordinates of $P = (x, y)$ to those of $Q = (u,v)$ as follows:
\[ \lambda = x + y/x \tag{7} \]
\[ u = \lambda^2 + \lambda + a \tag{8} \]
\[ v = x^2 + u(\lambda + 1) \tag{9} \]
Knudsen proposed to first solve the quadratic equation (8) in $\lambda$, then to compute $x = \sqrt{v + u(\lambda + 1)}$ from (9) and finally to compute $y = \lambda x + x^2$ from (7). A point halving on $E(\mathbb{F}_{2^m})$ thus has a cost of 2M plus one square root (SqRt) and one quadratic solver (QS). It becomes more efficient if $P$ and $Q$ are represented in lambda coordinates $P = (x_P, \lambda_P = \frac{y_P}{x_P} + x_P)$ and $Q = (x_Q, \lambda_Q = \frac{y_Q}{x_Q} + x_Q)$ since, in this case, a point halving requires only 1M + 1SqRt + 1QS.
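For readers who want the intermediate step, the relations (7)-(9) follow from the doubling case of (6) by setting $P_1 = P_2 = (x, y)$ and $P_3 = (u, v)$, and by using the fact that $x + x = 0$ in characteristic 2. The derivation below is a reconstruction from formula (6) and is not quoted from (Knudsen, 1999).

```latex
\begin{aligned}
\lambda &= \frac{y}{x} + x
  &&\text{(doubling slope of (6), hence } y = \lambda x + x^2\text{)} \\
u &= \lambda^2 + \lambda + x + x + a = \lambda^2 + \lambda + a
  &&\text{(since } x + x = 0\text{)} \\
v &= (x + u)\lambda + u + y \\
  &= x\lambda + u\lambda + u + \lambda x + x^2 \\
  &= x^2 + u(\lambda + 1)
  &&\text{(since } x\lambda + \lambda x = 0\text{)}
\end{aligned}
```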
The following projective coordinate systems are proposed in the literature in order to efficiently implement point operations in $E(\mathbb{F}_{2^m})$:
Lopez-Dahab coordinate system (López and Dahab, 1998). A point is given by a triplet $(X,Y,Z)$ corresponding to the affine point $x = X/Z$, $y = Y/Z^2$.
Kim-Kim coordinate system (Kim and Kim, ). A point is given by a quadruplet $(X,Y,Z,T)$, where $T = Z^2$, which corresponds to the affine point $x = X/Z$, $y = Y/T$.
Lambda projective coordinate system (Oliveira et al., 2014). A point is represented by a triplet $(X,L,Z)$ such that $x = X/Z$, $\lambda = L/Z = y/x + x$ and $y = (L/Z + X/Z)X/Z$.
For explicit point addition and doubling formulas in these projective coordinate systems the reader may refer to (Hankerson et al., 2004; Kim and Kim, ; Oliveira et al., 2014). In Table 2 we report the corresponding cost of each point operation on the curve when $a = 1$. We do not report the complexity of a point halving in projective coordinate systems since, in this case, the quadratic solver requires a field inversion, which makes the halving inefficient. We can notice that lambda coordinates provide the most efficient curve operations.
2.3 Scalar Multiplication Algorithms
Double-and-add scalar multiplication. A scalar multiplication $kP$ is generally performed with a sequence of doublings and additions. The scalar $k$ is first recoded with the $\mathrm{NAF}_w$ recoding as $k = \sum_{i=0}^{\ell} k_i 2^i$ where $k_i$ is 0 or an odd integer in $[-2^{w-1}, 2^{w-1}]$ and $w$ is the window size. The $2^{w-2}$ points $k_iP$ for $k_i$ odd in $[0, 2^{w-1}]$ are computed at the beginning of the algorithm and then $R = kP$ is computed through a sequence of doublings and additions $R \leftarrow 2R + k_iP$ for $i = \ell, \ell - 1, \ldots, 0$. This method is reported in Algorithm 1. The complexity of this approach is, on average, $\ell + 1$ doublings and $\ell/(w + 1) + 2^{w-2} - 1$ additions (see (Hankerson et al., 2004) for a detailed analysis).
Algorithm 1: Double-and-add.
Require: $P \in E(\mathbb{F}_{2^m})$ and a scalar $k \in [0, N - 1]$ where $N$ is the order of $P$.
Ensure: $Q = k \cdot P$
1: Compute $\mathrm{NAF}_w(k) = \sum_{i=0}^{\ell} k_i 2^i$
2: Compute $T[i] = i \cdot P$ for all odd positive integers $i \in [0, 2^{w-1}]$
3: $Q \leftarrow \mathcal{O}$
4: for $i$ from $\ell$ downto 0 do
5:   $Q \leftarrow 2 \cdot Q + \mathrm{sign}(k_i)\,T[|k_i|]$
6: end for
7: return $(Q)$
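Algorithm 1 can be sketched in Python as follows. The recoding routine `naf_w` follows the usual width-$w$ NAF construction (an assumption, since the paper only refers to (Hankerson et al., 2004) for it), and the group is passed in through the hypothetical callbacks `add`, `neg` and `zero`, so that the same code runs on any additive group; the tests below use the integers modulo a prime as a stand-in for the curve.

```python
def naf_w(k, w):
    """Width-w NAF digits of k >= 0, least significant first.

    Each digit is 0 or an odd integer d with |d| < 2^(w-1), and sum(d_i * 2^i) == k.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)              # signed residue of k modulo 2^w
            if d >= (1 << (w - 1)):
                d -= (1 << w)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def double_and_add(k, P, w, add, neg, zero):
    """Left-to-right double-and-add of Algorithm 1 on a generic additive group."""
    digits = naf_w(k, w)
    # Step 2: precompute T[i] = i*P for odd i < 2^(w-1).
    two_p = add(P, P)
    T = {1: P}
    for i in range(3, 1 << (w - 1), 2):
        T[i] = add(T[i - 2], two_p)
    # Steps 3-6: scan digits from the most significant one.
    Q = zero
    for d in reversed(digits):
        Q = add(Q, Q)
        if d > 0:
            Q = add(Q, T[d])
        elif d < 0:
            Q = add(Q, neg(T[-d]))
    return Q
```

With `add = lambda a, b: (a + b) % N` and `neg = lambda a: -a % N`, the call `double_and_add(k, P, w, add, neg, 0)` returns `k * P % N`, which makes the recoding easy to test independently of any curve arithmetic.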
Halve-and-add scalar multiplication. In the case of binary elliptic curves, the halving operation makes it possible to use a halve-and-add approach instead of a double-and-add approach. We first recode the scalar as $k = \sum_{i=0}^{\ell} k'_i 2^{-i}$ where $k'_i$ is 0 or an odd integer in $[-2^{w-1}, 2^{w-1}]$. Then the scalar multiplication $kP$ is computed by first setting $R \leftarrow P$ and $Q_j \leftarrow \mathcal{O}$ for $j \in \{1, 3, \ldots, 2^{w-1} - 1\}$. Then we perform the sequence of halvings and additions shown in steps 4-9 of Algorithm 2, and at the end the result is $Q = \sum_{j \in \{1,3,\ldots,2^{w-1}-1\}} jQ_j$. This method is depicted in Algorithm 2.
The post-computation in the return statement (Step 10) is computed with the technique credited to Knuth: $Q_i \leftarrow Q_i + Q_{i+2}$ for $i$ from $2^{w-1} - 3$ to 1, then
SECRYPT2015-InternationalConferenceonSecurityandCryptography
204
$Q$ is given by $Q \leftarrow Q_1 + 2\sum_{j \in \{3,\ldots,2^{w-1}-1\}} Q_j$. The total complexity of the halve-and-add scalar multiplication is on average $2^{w-1} + \ell/(w + 1)$ point additions, $\ell - 1$ halvings and one doubling. Using the costs of Table 2, we obtain the following complexity in terms of field operations:
\[ \#\mathrm{Op.} = \ell(M + SqRt + QS) + (8M + 4S)\left(\ell/(w + 1) + 2^{w-1}\right) + 4M + 4S \tag{10} \]
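The post-computation credited to Knuth replaces the multiplications by $j$ with additions and a single doubling. A minimal sketch, assuming the accumulators are held in a dictionary indexed by the odd integers $1, 3, \ldots, 2^{w-1}-1$ and that the group law is supplied as a hypothetical `add` callback (plain integers are used in the tests), could look like this:

```python
def weighted_sum(Q, w, add):
    """Return sum of j*Q[j] over odd j in {1, 3, ..., 2^(w-1)-1} using additions only."""
    top = (1 << (w - 1)) - 1
    Q = dict(Q)                                # work on a copy of the accumulators
    # Suffix sums: Q[i] <- Q[i] + Q[i+2] for i = 2^(w-1)-3 down to 1.
    for i in range(top - 2, 0, -2):
        Q[i] = add(Q[i], Q[i + 2])
    # tail = Q[3] + Q[5] + ... + Q[top] (after the suffix-sum pass).
    tail = None
    for j in range(3, top + 1, 2):
        tail = Q[j] if tail is None else add(tail, Q[j])
    if tail is None:                           # w = 2: only Q[1] exists
        return Q[1]
    return add(Q[1], add(tail, tail))          # Q[1] + 2 * tail, one doubling
```

After the suffix-sum pass, $Q_1$ holds the sum of all accumulators and each $Q_j$ with $j \geq 3$ holds the sum of the accumulators of index at least $j$, so $Q_1 + 2(Q_3 + Q_5 + \cdots)$ counts the original $Q_j$ exactly $j$ times.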
Algorithm 2: Halve-and-add.
Require: $P \in E(\mathbb{F}_{2^m})$ of odd order $N$, a scalar $k \in [0, N - 1]$, $\ell = \lceil \log_2(N) \rceil + 1$ and a window size $w$.
Ensure: $Q = k \cdot P$.
1: Recode $k = \sum_{i=0}^{\ell} k'_i 2^{-i}$ where $k'_i$ is 0 or an odd integer in $[-2^{w-1}, 2^{w-1}]$
2: for $j \in J = \{1, 3, \ldots, 2^{w-1} - 1\}$ do $Q_j \leftarrow \mathcal{O}$
3: $R \leftarrow P$
4: for $i$ from 0 to $\ell$ do
5:   if $k'_i \neq 0$ then
6:     $Q_{|k'_i|} \leftarrow Q_{|k'_i|} + \mathrm{sign}(k'_i)R$
7:   end if
8:   $R \leftarrow R/2$
9: end for
10: return $Q \leftarrow \sum_{j \in J} jQ_j$
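The data flow of Algorithm 2 can be exercised without any binary-field arithmetic by replacing the curve subgroup with the integers modulo an odd $N$, where halving is multiplication by $2^{-1} \bmod N$. The sketch below makes that substitution explicitly, and it obtains the digits $k'_i$ by recoding $2^t k \bmod N$ in width-$w$ NAF and reversing the digits, which is the usual construction but an assumption as far as this paper is concerned. The names `naf_w`, `halve_and_add` and `t` are illustrative.

```python
def naf_w(k, w):
    """Width-w NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        d = 0
        if k & 1:
            d = k % (1 << w)
            if d >= (1 << (w - 1)):
                d -= (1 << w)
            k -= d
        digits.append(d)
        k >>= 1
    return digits

def halve_and_add(k, P, w, N):
    """Halve-and-add in the toy group Z_N (N odd); returns k*P mod N."""
    t = N.bit_length()
    # Step 1: k = sum k'_i 2^(-i) mod N, obtained from the NAF_w of 2^t * k mod N.
    d = naf_w((k << t) % N, w)
    d += [0] * (t + 1 - len(d))              # pad to t + 1 digits
    k_prime = d[::-1]                        # k'_i = d_(t-i)
    half = pow(2, -1, N)                     # halving = multiplication by 1/2 mod N
    # Steps 2-3: one accumulator per odd j, and R <- P.
    Q = {j: 0 for j in range(1, 1 << (w - 1), 2)}
    R = P % N
    # Steps 4-9: add or subtract R into the accumulator |k'_i|, then halve R.
    for ki in k_prime:
        if ki > 0:
            Q[ki] = (Q[ki] + R) % N
        elif ki < 0:
            Q[-ki] = (Q[-ki] - R) % N
        R = R * half % N
    # Step 10: Q = sum of j * Q_j (computed directly here for brevity).
    return sum(j * q for j, q in Q.items()) % N
```

Note that the point $R$ is the only value that is halved and that the accumulators $Q_j$ only ever receive additions, which is why affine or lambda coordinates for $R$ combined with mixed additions into the $Q_j$ fit this algorithm well.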
Parallel (double,halve)-and-add scalar multiplication. In the case of binary elliptic curves, the halve-and-add and double-and-add methods can be used in parallel in order to speed up the computation of the scalar multiplication. The scalar $k$ is recoded as
\[ k = \underbrace{\left(\sum_{i=0}^{s} k'_i 2^i\right)}_{k'} + \underbrace{\left(\sum_{i=1}^{\ell-s} k''_i 2^{-i}\right)}_{k''}. \]
Then the computation can be split into two threads: one double-and-add thread computing $Q' = k'P$ and one halve-and-add thread computing $Q'' = k''P$; the result is obtained with a final addition $Q = Q' + Q''$. For further details on this method the reader may refer to (Taverne et al., 2011).
3 PROPOSED PARALLEL
SCALAR MULTIPLICATION
We consider the case of curves where the paralleliza-
tion based on GLV (Gallant et al., 2001) is not ap-
plicable. This is the case of the NIST curves B233,
B409 and this is also the case of twisted Edwards
curve Curve25519 (Bernstein, 2006).
3.1 Two-thread Parallelization over $E(\mathbb{F}_p)$
The proposed approach to parallelize part of the computations involved in the double-and-add scalar multiplication is the following: we split the scalar into two parts $k = k_1 + k_2 2^s$ with $k_1 < 2^s$, and then we divide the computation of $kP = k_1P + 2^sk_2P$ into two threads:
Thread1 computes $Q_1 = k_1P$ using a double-and-add approach of length $s$.
Thread2 computes $2^sk_2P$ by first computing $k_2P$ using the double-and-add algorithm and then performing a sequence of doublings $2(k_2P), 4(k_2P), \ldots, 2^s(k_2P)$; it finally outputs $Q_2 = 2^s(k_2P)$.
At the end, the two points $Q_1$ and $Q_2$ are added and $Q = Q_2 + Q_1$ is output. This method is depicted in Fig. 1.
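The split described above can be sketched with two worker threads. The code below is only meant to show the data flow: it uses a plain binary double-and-add (no NAF recoding), a generic group passed through the hypothetical callbacks `add` and `zero`, and Python's `concurrent.futures`, none of which reflects the C implementation timed in Section 4. In CPython the global interpreter lock also prevents any real speed-up for this kind of workload.

```python
from concurrent.futures import ThreadPoolExecutor

def scalar_mul(k, P, add, zero):
    """Plain left-to-right binary double-and-add on a generic additive group."""
    Q = zero
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q

def split_scalar_mul(k, P, s, add, zero):
    """Compute kP as k1*P + 2^s*(k2*P) with k = k1 + k2*2^s, one thread per part."""
    k1, k2 = k & ((1 << s) - 1), k >> s

    def upper_part():
        # Thread2: k2*P with double-and-add, followed by s extra doublings.
        Q2 = scalar_mul(k2, P, add, zero)
        for _ in range(s):
            Q2 = add(Q2, Q2)
        return Q2

    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(scalar_mul, k1, P, add, zero)   # Thread1: Q1 = k1*P
        f2 = pool.submit(upper_part)                     # Thread2: Q2 = 2^s*k2*P
        return add(f1.result(), f2.result())             # final addition Q1 + Q2
```

The choice of `s` is what balances the two threads: Thread2 always performs the $s$ extra doublings, so a balanced split puts more than half of the scalar bits in the lower part, as quantified in the complexity analysis.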
Figure 1: Two-thread parallelization of the double-and-add approach. Thread1 computes $Q_1 = k_1P$ with double-and-add; Thread2 computes $k_2P$ with double-and-add and then $Q_2 = 2^sk_2P$ with $s$ doublings; the two threads are joined by the final addition $Q_1 + Q_2$.
Complexity. The proposed parallelization is optimal if the computation loads of the two threads are the same. This means that
\[ \underbrace{\frac{s}{w+1}A + sD}_{\text{Thread1}} = \underbrace{\frac{\ell - s}{w+1}A + \ell D}_{\text{Thread2}} \]
where $A$ and $D$ represent an addition and a doubling, respectively. This implies that, for a balanced computation load, the split $s$ of the scalar $k$ has to be as follows:
\[ s = \frac{A + (w+1)D}{2A + (w+1)D}\,\ell. \]
In other words, the proposed parallelization reduces the computation time to a fraction $\alpha = \frac{A+(w+1)D}{2A+(w+1)D}$ of that of the non-parallelized double-and-add.
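The expression of the balanced split follows from the load equation in three short steps; the derivation is spelled out here for convenience, using only the symbols $A$, $D$, $w$, $s$ and $\ell$ defined above.

```latex
\begin{aligned}
\frac{s}{w+1}A + sD &= \frac{\ell - s}{w+1}A + \ell D \\
s\left(\frac{2A}{w+1} + D\right) &= \ell\left(\frac{A}{w+1} + D\right) \\
s &= \frac{A + (w+1)D}{2A + (w+1)D}\,\ell = \alpha\,\ell
\end{aligned}
```

Since the sequential cost is $\ell\left(\frac{A}{w+1} + D\right)$ and the cost of Thread1 is $s\left(\frac{A}{w+1} + D\right)$, the ratio of the parallel to the sequential time is exactly $s/\ell = \alpha$.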
We use the complexity of the curve operations given in Table 1 to derive explicit values of the ratio $\alpha = \frac{A+(w+1)D}{2A+(w+1)D}$ for the three cases $w = 2$, 3 and 4. For the sake of simplicity, we assume that $M_a$, $M_{a1}$ and $M_d$
ParallelApproachesforEfficientScalarMultiplicationoverEllipticCurve
205
are negligible compared to M and that S = 0.8M. We report the resulting values of $\alpha$ in Table 3.
Table 3: Estimated values for the ratio $\alpha$.
Curve form | w = 2 | w = 3 | w = 4
Weierstrass | 0.74 | 0.77 | 0.80
Twisted Edwards | 0.75 | 0.79 | 0.81
Jacobi quartic | 0.76 | 0.79 | 0.82
Table 3 shows that the execution time is expected to be reduced by about 25% for $w = 2$, and by about 20% in the two other cases.
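The ratio $\alpha$ is easy to evaluate numerically from the operation counts of Table 1. The sketch below assumes that $A$ is the mixed addition, that S = 0.8M and that the constant multiplications are neglected, as stated in the text; the paper does not say which rounding it used, so the values produced here agree with Table 3 only to within about 0.02. The names `S`, `COSTS` and `alpha` are illustrative.

```python
S = 0.8  # cost of a squaring, in multiplications (assumption stated in the text)

# (mixed addition A, doubling D) in multiplications, from Table 1,
# neglecting M_a, M_d, M_2d and M_a1.
COSTS = {
    "Weierstrass": (8 + 3 * S, 4 + 4 * S),
    "Twisted Edwards": (8 + 1 * S, 3 + 4 * S),
    "Jacobi quartic": (6 + 3 * S, 3 + 4 * S),
}

def alpha(A, D, w):
    """Balanced split ratio s / l = (A + (w+1) D) / (2A + (w+1) D)."""
    return (A + (w + 1) * D) / (2 * A + (w + 1) * D)

if __name__ == "__main__":
    for name, (A, D) in COSTS.items():
        print(name, [round(alpha(A, D, w), 3) for w in (2, 3, 4)])
```

The ratio increases with $w$ because a larger window reduces the number of additions, which are the only part of the work that the split shares between the threads without overhead; the $s$ extra doublings of Thread2 then weigh more.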
3.2 Optimized Two-thread Parallelization over $E(\mathbb{F}_p)$
We present in this subsection an optimized version of the approach of Subsection 3.1 when the curve is a twisted Edwards curve. A twisted Edwards curve can alternatively be put in Montgomery form, and we can then take advantage of the efficient doubling on a Montgomery curve: from Table 1, a doubling on a Montgomery curve requires only 2M + 2S. Specifically, we propose to perform the sequence of $s$ doublings of Thread2, in the parallelization of Subsection 3.1, with the Montgomery doubling. This results in the following optimized parallelization:
Thread1 computes $Q_1 = k_1P$ using a double-and-add approach of length $s$ on the twisted Edwards curve.
Thread2 first computes the coordinates of $P$ on the Montgomery curve. Then it computes $P_s = 2^sP$ on the Montgomery curve with a sequence of $s$ doublings. Then it computes the coordinates of $P_s$ on the twisted Edwards curve. Finally it computes $k_2P_s = 2^sk_2P$ using the double-and-add algorithm on the twisted Edwards curve.
This modified two-thread parallelization induces a problem: we do not know the coordinate $y_s$ of $P_s = 2^sP$ at the end of the sequence of doublings on the Montgomery curve. Indeed, the Montgomery doubling formula computes only the $x$ coordinate of a point (or, equivalently, the $X$ and $Z$ projective coordinates). We can compute $\pm y_s$ by solving the quadratic equation in $y$ given by the curve equation, but we do not know the sign of the solution, i.e., we are left with the two possible values $\pm y_s$. We propose to deal with this problem as follows:
We pick an arbitrary sign for $y_s$ and we let Thread2 compute $k_2 \cdot (\pm P_s)$.
We use a right-to-left double-and-add scalar multiplication in Thread1. At some point, Thread1 computes $P_s = 2^sP$, and at this point we know the correct sign for $y_s$.
When the two threads are joined, knowing the correct sign for $y_s$ enables us to pick the correct value of $Q_2 = 2^sk_2P$, and then we can compute $k_1P + 2^sk_2P$.
This proposed optimized two-thread parallelization is depicted in Fig. 2.
Complexity. We denote by $D_M$ the cost of a Montgomery doubling, by $A_E$ and $D_E$ the respective costs of the addition and doubling on a twisted Edwards curve, and by $A'_E$ the cost of a non-mixed addition (involved in the right-to-left double-and-add). Then the work loads of the two threads are equal if the following identity holds:
\[ \underbrace{\frac{s}{w+1}A'_E + sD_E}_{\text{Thread1}} = \underbrace{\frac{\ell - s}{w+1}A_E + (\ell - s)D_E + sD_M}_{\text{Thread2}}. \]
This implies that
\[ s = \underbrace{\frac{A_E + (w+1)D_E}{A'_E + A_E + (w+1)(2D_E - D_M)}}_{\alpha}\,\ell. \]
With the complexity of the curve operations given in Table 1 we derive explicit values for the ratio $\alpha$: for $w = 2$ we obtain $\alpha = 0.62$, for $w = 3$ we have $\alpha = 0.63$ and for $w = 4$ we have $\alpha = 0.64$.
In other words, with the proposed optimization, the theoretical saving is around 35% of the computation time compared to the non-parallelized version.
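The ratio of the optimized variant can be evaluated in the same way. The sketch below assumes S = 0.8M, takes $A_E$ and $A'_E$ as the mixed and full additions in inverted twisted Edwards coordinates and $D_M$ as the Montgomery doubling of Table 1, and neglects the constant multiplications; under these assumptions the computed values are within about 0.01 of the figures quoted above. The names `A_E`, `A_E_FULL`, `D_E`, `D_M` and `alpha_opt` are illustrative.

```python
S = 0.8                      # squaring cost in multiplications (assumption from Section 3.1)
A_E = 8 + 1 * S              # mixed addition, inverted twisted Edwards coordinates
A_E_FULL = 9 + 1 * S         # non-mixed addition A'_E used by the right-to-left method
D_E = 3 + 4 * S              # twisted Edwards doubling
D_M = 2 + 2 * S              # Montgomery doubling

def alpha_opt(w):
    """Balanced split ratio of the optimized two-thread approach."""
    return (A_E + (w + 1) * D_E) / (A_E_FULL + A_E + (w + 1) * (2 * D_E - D_M))

if __name__ == "__main__":
    for w in (2, 3, 4):
        print(w, round(alpha_opt(w), 3))
```

Compared with Subsection 3.1, the only change in the denominator is that one of the two $D_E$ terms is reduced by $D_M$: the $s$ extra doublings of Thread2 now replace, rather than follow, cheaper operations, which is where the additional saving comes from.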
3.3 Three-thread Parallelization over $E(\mathbb{F}_{2^m})$
Our goal in this subsection is to increase the level of parallelism of the two-thread (double,halve)-and-add scalar multiplication in $E(\mathbb{F}_{2^m})$ (reviewed in Subsection 2.3). To reach this goal, we recode and split the scalar $k$ as follows:
\[ k = \underbrace{\sum_{i=0}^{s'} k'_i 2^i}_{k'} + \underbrace{\sum_{i=1}^{s} k''_i 2^{-i}}_{k''} + 2^{-s}\underbrace{\left(\sum_{i=1}^{\ell - s' - s} k'''_i 2^{-i}\right)}_{k'''} \]
where $k'_i$, $k''_i$ and $k'''_i$ are $\mathrm{NAF}_w$ digits, i.e., 0 or odd integers in $[-2^{w-1}, 2^{w-1}]$.
We split the computation of the scalar multiplication into three threads:
Thread1 computing $k'P$ using a double-and-add method of length $s'$.
SECRYPT2015-InternationalConferenceonSecurityandCryptography
206
Figure 2: Optimized two-thread parallelization of the double-and-add approach. Thread1 computes $Q_1 = k_1P$ on the twisted Edwards curve with a right-to-left double-and-add, which also yields the $y_s$ coordinate of $2^sP$; Thread2 converts $P$ to the Montgomery curve, computes $2^sP$ with $s$ Montgomery doublings, recovers $\pm y_s$, converts back to the twisted Edwards curve and computes $\pm 2^sk_2P$ with a left-to-right double-and-add; after picking the correct sign, the result is $Q = Q_1 + Q_2$ with $Q_2 = 2^sk_2P$.
Thread2 computing $k''P$ using a halve-and-add method of length $s$.
Thread3 computing $2^{-s}k'''P$ by performing a halve-and-add scalar multiplication followed by a sequence of $s$ halvings to obtain $2^{-s}k'''P$.
This approach is depicted in Fig. 3.
Figure 3: Three-thread version of the (double,halve)-and-add approach. Thread1 computes $Q' = k'P$ with double-and-add; Thread2 computes $Q'' = k''P$ with halve-and-add; Thread3 computes $k'''P$ with halve-and-add and then $Q''' = 2^{-s}k'''P$ with $s$ halvings; the threads are joined by the additions $Q' + Q'' + Q'''$.
Complexity. Let us evaluate the reduction ratio of the computations in this case. The work loads of the three threads are well balanced if the following equations hold:
\[ \frac{s}{w+1}A + sH = \frac{s'}{w+1}A + s'D, \]
\[ \frac{s}{w+1}A + sH = \frac{\ell - s' - s}{w+1}A + (\ell - s')H, \]
where $A$ represents an addition, $D$ a doubling and $H$ a halving on the curve $E(\mathbb{F}_{2^m})$. We solve the above equations and we obtain that the work load is well balanced when $s = \alpha\ell$ where
\[ \alpha = \frac{(A + (w+1)H)(A + (w+1)D)}{(2A + (w+1)H)(A + (w+1)D) + (A + (w+1)H)^2}. \]
The resulting values of the ratio $\alpha$ are given below for the cases $w = 2$, 3 and 4:
for $w = 2$ we have $\alpha \cong 0.46$,
for $w = 3$ we have $\alpha \cong 0.37$,
for $w = 4$ we have $\alpha \cong 0.37$.
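The closed form of $\alpha$ can be recovered from the two balance equations by eliminating $s'$. The steps below are a reconstruction using the shorthand $a = A/(w+1)$, which is introduced here only for readability.

```latex
\begin{aligned}
s(a + H) &= s'(a + D)
  &&\Rightarrow\quad s' = s\,\frac{a + H}{a + D}, \\
s(a + H) &= (\ell - s')(a + H) - s\,a
  &&\Rightarrow\quad s(2a + H) + s'(a + H) = \ell\,(a + H), \\
s\left[(2a + H) + \frac{(a + H)^2}{a + D}\right] &= \ell\,(a + H)
  &&\Rightarrow\quad
  s = \frac{(a + H)(a + D)}{(2a + H)(a + D) + (a + H)^2}\,\ell .
\end{aligned}
```

Multiplying the numerator and the denominator by $(w+1)^2$ gives the expression of $\alpha$ in terms of $A$, $D$ and $H$ stated above.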
4 IMPLEMENTATION RESULTS
In this section, we present our experimental results for the parallel approaches of Section 3.
The platform used for the experiments is a Dell Optiplex 990 running Ubuntu 12.04. The processor is an Intel Core i7-2600 Sandy Bridge at 3.4 GHz, which has four physical cores. Our code is written in C and compiled with gcc 4.6.3. The timings are obtained with turbo mode and hyper-threading deactivated, as recommended in (Bernstein and Lange, 2012).
4.1 Implementations for $E(\mathbb{F}_p)$
We consider the twisted Edwards curve Curve25519 introduced by Bernstein in (Bernstein, 2006), defined over the prime field $\mathbb{F}_p$ with $p = 2^{255} - 19$. For field operations, we reuse the publicly available code of Adam Langley (Langley, 2008). In this code, a field element is stored in an array of five 64-bit words, each word containing 51 bits of the 255-bit field element. This allows a better management of carries in field addition and subtraction operations. The multiplications and squarings are performed with the schoolbook method. Squaring is optimized with the usual trick which reduces the number of word multiplications. The reduction modulo $p = 2^{255} - 19$ consists in multiplying by 19 the bits above the lower 255 bits and then adding the result to the lower 255 bits. For the inversion of a field element we use the extended Euclidean algorithm with the lower level function of the
ParallelApproachesforEfficientScalarMultiplicationoverEllipticCurve
207
GMP library (gmp, ). Curve operations simply follow the formulas provided in (hyp, ) corresponding to inverted projective coordinates on twisted Edwards curves.
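The five-limb representation and the reduction modulo $2^{255} - 19$ described above can be modelled with Python integers. This sketch is an illustration of the idea only and is not Langley's code: it accumulates the schoolbook product in a single big integer instead of handling carries between 64-bit words, and the names `to_limbs`, `from_limbs`, `reduce_wide` and `mul` are hypothetical.

```python
P25519 = 2 ** 255 - 19
MASK51 = (1 << 51) - 1

def to_limbs(x):
    """Split a 255-bit integer into five 51-bit limbs, least significant first."""
    return [(x >> (51 * i)) & MASK51 for i in range(5)]

def from_limbs(limbs):
    """Recombine the limbs into one integer."""
    return sum(v << (51 * i) for i, v in enumerate(limbs))

def reduce_wide(z):
    """Reduce z modulo p using 2^255 = 19 (mod p): fold 19 * (high part) into the low part."""
    while z >> 255:
        z = (z & ((1 << 255) - 1)) + 19 * (z >> 255)
    return z if z < P25519 else z - P25519

def mul(a_limbs, b_limbs):
    """Schoolbook 5 x 5 limb multiplication followed by the folding reduction."""
    acc = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            acc += (ai * bj) << (51 * (i + j))
    return to_limbs(reduce_wide(acc))
```

Storing only 51 bits per 64-bit word leaves 13 spare bits in each limb, so several additions can be chained before a carry has to be propagated, which is the "better management of carries" mentioned in the text.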
4.2 Implementations for $E(\mathbb{F}_{2^m})$
Our implementations deal with the NIST curves B233 and B409 defined over the fields $\mathbb{F}_{2^{233}} = \mathbb{F}_2[x]/(x^{233} + x^{74} + 1)$ and $\mathbb{F}_{2^{409}} = \mathbb{F}_2[x]/(x^{409} + x^{87} + 1)$, respectively. For a field multiplication, we apply a small number of recursions of the Karatsuba algorithm, which breaks the $m$-bit polynomial multiplication into several 64-bit polynomial multiplications. Such 64-bit multiplications are computed with the PCLMUL instruction, available on Intel Core i7 processors. Due to the special form of the irreducible polynomials, the reduction is done with a small number of shifts and bitwise XORs on 64-bit words. We compute the field inversion with the Itoh-Tsujii algorithm, that is, a sequence of field multiplications and multisquarings performed with look-up tables. For field squaring, square root and quadratic solver (needed in halvings), we also use a look-up table method, which is the fastest way according to our tests. For the curve operations, we use the projective lambda coordinates with the corresponding formulas provided in (Oliveira et al., 2014).
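The structure of the binary field multiplication (Karatsuba recursion down to 64-bit carry-less products, then reduction by shifts and XORs) can be sketched with Python integers used as bit vectors. The function `clmul` below is a software stand-in for the carry-less multiplication instruction, and the recursion depth, the 256-bit operand size and all function names are assumptions of this sketch, not details of the authors' C code.

```python
def clmul(a, b):
    """Carry-less product of two polynomials over F_2 stored as integer bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, bits=256, word=64):
    """Karatsuba over F_2[x]: three half-size products instead of four."""
    if bits <= word:
        return clmul(a, b)        # stands in for one 64 x 64 carry-less multiplication
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1, b0, b1 = a & mask, a >> h, b & mask, b >> h
    lo = karatsuba(a0, b0, h, word)
    hi = karatsuba(a1, b1, h, word)
    mid = karatsuba(a0 ^ a1, b0 ^ b1, h, word) ^ lo ^ hi
    return lo ^ (mid << h) ^ (hi << (2 * h))

def reduce_b233(c):
    """Reduce modulo x^233 + x^74 + 1 using x^233 = x^74 + 1."""
    while c.bit_length() > 233:
        t = c >> 233                                   # part of degree >= 233
        c = (c & ((1 << 233) - 1)) ^ (t << 74) ^ t     # fold it back with two XORs
    return c

def mul_b233(a, b):
    """Field multiplication in F_2[x]/(x^233 + x^74 + 1)."""
    return reduce_b233(karatsuba(a, b))
```

With 256-bit operands, two levels of recursion lead to nine 64-bit products instead of the sixteen of the schoolbook method, at the price of a few extra XORs; the trinomial makes the reduction cost negligible in comparison.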
4.3 Timing Results for the Proposed
Parallel Approaches
Table 4 reports the timings obtained for the three parallel approaches discussed in Section 3. We also provide the timings of the two-thread parallel (double,halve)-and-add approach with $w = 4$ for B233 and B409 and the timings of the non-parallelized double-and-add approach with $w = 2$, 3 and 4 for $E(\mathbb{F}_p)$. For each parallel scalar multiplication we give the split value $s$ (and $s'$ for the three-thread case). Additionally, we provide timings found in the literature for the same processor and for similar curves and fields.
Concerning the curve B233, the proposed parallelization does not show any speed-up compared to the two-thread (double,halve)-and-add approach. This could be explained by the cost induced by thread management. On the other hand, the approach is clearly effective for the curve B409: it even shows a timing which is better than all timings found in the literature for the (double,halve)-and-add approach (cf. Section 2).
In the case of Curve25519, the proposed optimizations behave as expected: the two-thread approach with $w = 2$
Table 4: Timings (in $10^3$ clock-cycles (CC)) of parallel approaches over $E(\mathbb{F}_{2^m})$ and $E(\mathbb{F}_p)$.
Source | Curve | Method | NAF size w | #CC/10^3 | split s | split s' | nb of cores
proposed | B233 | three-thread | 4 | 106 | 110 | 83 | 3
our code | B233 | (db,hv)-&-add | 4 | 104 | 98 | - | 2
Taverne et al. | B233 | (db,hv)-&-add | 4 | 100 | - | - | 2
Negre et al. | B233 | (db,hv)-&-add | 4 | 117 | - | - | 2
proposed | B409 | three-thread | 4 | 303 | 187 | 143 | 3
our code | B409 | (db,hv)-&-add | 4 | 338 | 175 | - | 2
Taverne et al. | B409 | (db,hv)-&-add | 4 | 349 | - | - | 2
Negre et al. | B409 | (db,hv)-&-add | 4 | 452 | - | - | 2
proposed | C25519 | two-thread | 2 | 186 | 185 | - | 2
proposed | C25519 | opt-two-thd | 2 | 180 | 168 | - | 2
our code | C25519 | db-&-add | 4 | 239 | - | - | 1
our code | C25519 | db-&-add | 3 | 219 | - | - | 1
our code | C25519 | db-&-add | 2 | 221 | - | - | 1
Langley (*) | C25519 | Montg. ladder | - | 229 | - | - | 1
Bernstein | C25519 | Montg. ladder | - | 194 | - | - | 1
Hamburg | Mtg251 | Montg. ladder | - | 153 | - | - | 1
(*) Compiled and run on our platform.
is 14% faster than the double-and-add with $w = 2$, and the optimized two-thread approach has a speed-up of 17%. The speed-up is smaller than the one predicted by the value of $\alpha$. This might be due to thread management and to the penalty of the costly square-root computation in the case of the optimized two-thread approach. Our approach compares favorably with the codes of Langley and Bernstein, but not with the timings of Hamburg. However, the approach of Hamburg involves a smaller field and also a smaller key length.
5 CONCLUSION
We have presented in this paper parallel approaches to speed up the scalar multiplication in $E(\mathbb{F}_{2^m})$ and $E(\mathbb{F}_p)$. The proposed parallelizations split the scalar into two or three parts. Each part of the scalar multiplication is then performed in parallel, the upper part requiring an additional sequence of doublings or halvings. These approaches have been implemented on an Intel Core i7 and the resulting timings show that the proposed parallelizations are effective for the NIST curve B409 and for the twisted Edwards curve Curve25519 defined over $\mathbb{F}_p$ with $p = 2^{255} - 19$.
REFERENCES
Explicit formula database. http://www.hyperelliptic.org/EFD/.
SECRYPT2015-InternationalConferenceonSecurityandCryptography
208
The GNU Multiple Precision Arithmetic Library (GMP).
http://gmplib.org/.
Bernstein, D. (2006). Curve25519: New Diffie-Hellman
Speed Records. In PKC 2006, volume 3958 of LNCS,
pages 207–228.
Bernstein, D. and Lange, T. (2012). eBACS:
ECRYPT Benchmarking of Cryptographic Systems.
http://bench.cr.yp.to/. accessed May 25th, 14.
Gallant, R., Lambert, R., and Vanstone, S. (2001). Faster
Point Multiplication on Elliptic Curves with Efficient
Endomorphisms. In CRYPTO 2001, volume 2139 of
LNCS, pages 190–200.
Hamburg, M. Fast and compact elliptic-curve cryptogra-
phy. Cryptology ePrint Archive, Report 2012/309.
http://eprint.iacr.org/.
Hankerson, D., Menezes, A., and Vanstone, S. (2004).
Guide to Elliptic Curve Cryptography. Springer.
Hisil, H., Wong, K. K.-H., Carter, G., and Dawson, E.
(2008). Twisted Edwards Curves Revisited. In ASIACRYPT, pages 326–343.
Kim, K. and Kim, S. A New Method for Speeding Up Arith-
metic on Elliptic Curves over Binary Fields. Technical
report. http://eprint.iacr.org/.
Knudsen, E. W. (1999). Elliptic Scalar Multiplication Using
Point Halving. In ASIACRYPT’99, volume 1716 of
LNCS, pages 135–149.
Langley, A. (2008). C25519 code. http://code.google.com/p/curve25519-donna/.
Longa, P. and Sica, F. (2014). Four-Dimensional Gallant-
Lambert-Vanstone Scalar Multiplication. J. Cryptol-
ogy, 27(2):248–283.
López, J. and Dahab, R. (1998). Improved Algorithms for
Elliptic Curve Arithmetic in GF($2^n$). In SAC'98, volume 1556 of LNCS, pages 201–212. Springer.
Montgomery, P. (1987). Speeding up the Pollard and elliptic curve methods of factorization. Math. of Comp.,
48:243–264.
Nègre, C. and Robert, J.-M. (2013). Impact of Optimized
Field Operations ab, ac and ab + cd in Scalar Multiplication over Binary Elliptic Curve. In AFRICACRYPT,
pages 279–296.
Oliveira, T., López, J., Aranha, D., and Rodríguez-Henríquez, F. (2014). Two is the fastest prime: lambda
coordinates for binary elliptic curves. J. Crypt. Eng.,
4(1):3–17.
P. Gallagher, D. D. and Furlani, C. (2009). Digital Signature
Standard (DSS). In FIPS Publications, volume FIPS
186-3, page 93. NIST.
Taverne, J., Faz-Hernández, A., Aranha, D. F., Rodríguez-Henríquez, F., Hankerson, D., and López, J. (2011).
Speeding Scalar Multiplication over Binary Elliptic
Curves using the New Carry-Less Multiplication Instruction. J. Crypt. Eng., 1(3):187–199.
ParallelApproachesforEfficientScalarMultiplicationoverEllipticCurve
209