Parallel Approaches for Efficient Scalar Multiplication over Elliptic Curve

Christophe Negre and Jean-Marc Robert

Equipe DALI, Université de Perpignan, Perpignan, France

LIRMM, Univ. de Montpellier, CNRS, Montpellier, France

Keywords:

Elliptic Curve, Parallel Implementation, Scalar Multiplication, Prime Field, Binary Field.

Abstract:

This paper deals with the parallel implementation of scalar multiplication over an elliptic curve. We present parallel approaches which split the scalar into two parts for $E(\mathbb{F}_p)$ or three parts for $E(\mathbb{F}_{2^m})$ and perform in parallel the scalar multiplication with each part of the scalar. We present timing results of these approaches implemented on an Intel Core i7 for the NIST binary curves B233 and B409 and for the twisted Edwards curve Curve25519 (Bernstein, 2006). For the curves B409 and Curve25519, the proposed approaches reduce the computation time of the scalar multiplication by at least 10%.

1 INTRODUCTION

Elliptic curve cryptographic protocols generally in-

volve one or two scalar multiplications on the curve.

A scalar multiplication consists in computing kP =

P + ··· + P where k is an integer of several hundreds

of bits and P is a point on the curve. This is generally

performed through a sequence of hundreds of dou-

blings and additions on the curve using the so-called

double-and-add algorithm. It is thus a time consum-

ing part of the cryptographic protocols.

Current and future processors in personal computers and embedded devices will include an increasing number of cores, enabling more parallelism. The implementation of scalar multiplication can be adapted to these new platforms. A recent interesting

work of Taverne et al. (Taverne et al., 2011) on bi-

nary elliptic curves showed that a curve scalar multi-

plication can be parallelized through a (double,halve)-

and-add approach resulting in an interesting speed-

up. GLV (Gallant et al., 2001) is another popular

approach to perform a scalar multiplication in par-

allel fashion (Longa and Sica, 2014; Oliveira et al.,

2014), but unfortunately GLV cannot be used on stan-

dard curves (P. Gallagher and Furlani, 2009).

We explore in this paper an alternative parallel ap-

proach. We split the scalar k in an upper and lower

part, and we compute scalar multiplication of each

part in parallel. This requires an additional sequence

of doublings or halvings to ﬁnish the upper part com-

putations before adding the two computed points. We

have experimented with this approach, and the resulting timings show an interesting speed-up for the NIST curve B409 and also for the twisted Edwards curve Curve25519 (Bernstein, 2006).

The remainder of the paper is organized as follows: in Section 2 we review basic results on elliptic

curves over binary and prime ﬁelds. In Section 3 we

present our approaches for the parallelization of scalar

multiplication. In Section 4 we provide experimental

results and we end the paper with some concluding

remarks in Section 5.

2 REVIEW OF ELLIPTIC CURVE SCALAR MULTIPLICATION

An elliptic curve over a finite field $\mathbb{F}_q$ is the set of points $(x, y) \in \mathbb{F}_q^2$ satisfying a smooth equation of degree 3 plus a point at infinity $\mathcal{O}$. A commutative group

law can be deﬁned on this set of points using the chord

and tangent method. This provides doubling and ad-

dition formulas consisting in a number of ﬁeld opera-

tions on the coordinates of the points. The complexity

of these formulas can be reduced by choosing an appropriate curve equation and system of coordinates. In the remainder of this section, we review curve equations and point operation formulas for elliptic curves defined over binary fields and prime fields. We also review classical algorithms for scalar multiplication.


Negre C. and Robert J..

Parallel Approaches for Efﬁcient Scalar Multiplication over Elliptic Curve.

DOI: 10.5220/0005512502020209

In Proceedings of the 12th International Conference on Security and Cryptography (SECRYPT-2015), pages 202-209

ISBN: 978-989-758-117-5

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

Table 1: Complexity of curve operations in $E(\mathbb{F}_p)$.

| Curve form | Coordinates | Doubling | Mixed addition | Addition |
|---|---|---|---|---|
| Weierstrass ($a = -3$) | Jacobian | $4M + 4S$ | $8M + 3S$ | $12M + 4S$ |
| Twisted Edwards | Inverted | $3M + 4S + M_a + M_d$ | $8M + 1S + M_a + M_d$ | $9M + 1S + M_a + M_d$ |
| Jacobi quartic | XXYZZ | $3M + 4S$ | $6M + 3S + M_{a-1}$ | $7M + 4S + M_{a-1}$ |
| Montgomery | Projective | $2M + M_{(a+2)/4} + 2S$ | - | $4M + 2S$ |

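2.1 Point Operations in $E(\mathbb{F}_p)$

An elliptic curve $E(\mathbb{F}_p)$ over a prime field $\mathbb{F}_p$ is generally defined by a short Weierstrass equation:

$$E : y^2 = x^3 + ax + b \quad \text{with } (a, b) \in \mathbb{F}_p^2.$$

In this case addition and doubling formulas given by the chord and tangent method are as follows: let $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$ and $P_3 = (x_3, y_3)$ be three points on $E(\mathbb{F}_p)$ such that $P_3 = P_1 + P_2$, then we have:

$$\begin{cases} x_3 = \lambda^2 - x_1 - x_2 \\ y_3 = \lambda(x_1 - x_3) - y_1 \end{cases} \quad \text{where} \quad \begin{cases} \lambda = \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P_1 \neq P_2, \\[2mm] \lambda = \dfrac{3x_1^2 + a}{2y_1} & \text{if } P_1 = P_2. \end{cases} \tag{1}$$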
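As an illustration of formula (1), the following minimal Python sketch implements the affine chord-and-tangent law. The toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, the point $(5, 1)$ and the function name `ec_add` are assumptions chosen for illustration only; `None` stands for the point at infinity, and the modular inverse relies on `pow(x, -1, p)` (Python 3.8 or later).

```python
# Toy parameters (illustrative assumption): y^2 = x^3 + 2x + 2 over F_17.
P_MOD, A = 17, 2

def ec_add(p1, p2):
    """Add two affine points with formula (1); None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # P2 = -P1 (also covers doubling a point with y = 0)
    if p1 == p2:
        # Tangent slope: (3*x1^2 + a) / (2*y1)
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        # Chord slope: (y2 - y1) / (x2 - x1)
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

P = (5, 1)
P2 = ec_add(P, P)    # doubling, gives (6, 3)
P3 = ec_add(P2, P)   # general addition, gives (10, 6)
```

Both slopes require one field inversion, which is what the projective coordinate systems discussed later in this section avoid.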

In the sequel we will use the two following alter-

native elliptic curve equations:

• Twisted Edwards (Hisil et al., 2008). The twisted Edwards elliptic curve equation is as follows:

$$ax^2 + y^2 = 1 + dx^2y^2$$

where $a$ and $d$ are two elements in $\mathbb{F}_p$. In this case there is a unified addition and doubling formula: if $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ are in $E(\mathbb{F}_p)$, then the affine coordinates $(x_3, y_3)$ of $P_3 = P_1 + P_2$ are as follows:

$$x_3 = \frac{x_1y_2 + y_1x_2}{1 + dx_1x_2y_1y_2}, \qquad y_3 = \frac{y_1y_2 - ax_1x_2}{1 - dx_1x_2y_1y_2}. \tag{2}$$

• Montgomery curves (Montgomery, 1987). We consider Montgomery elliptic curves which have the following form:

$$E : by^2 = x^3 + ax^2 + x \quad \text{with } (a, b) \in \mathbb{F}_p^2.$$

Montgomery noticed in (Montgomery, 1987) that for $Q = (x_Q, y_Q)$ the $x$-coordinate of $2Q$ depends only on $x_Q$:

$$x_{2Q} = \frac{(x_Q^2 - 1)^2}{4x_Q(x_Q^2 + ax_Q + 1)}. \tag{3}$$

Moreover, Montgomery noticed that given two points $Q = (x_Q, y_Q)$ and $P = (x_P, y_P)$, and if we know the difference $R = Q - P$, then we can compute $Q + P$ as follows:

$$x_{Q+P} = \frac{(x_Qx_P - 1)^2}{x_R(x_Q - x_P)^2}. \tag{4}$$
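The unified law (2) can be checked on a small example. The sketch below is only an illustration: the toy curve $x^2 + y^2 = 1 + 2x^2y^2$ over $\mathbb{F}_{13}$, the point $(4, 4)$ and the names `ed_add` and `on_curve` are assumptions, not parameters used in the paper. Here $a = 1$ is a square and $d = 2$ is a non-square modulo 13, a setting in which the denominators of (2) never vanish.

```python
# Toy parameters (illustrative assumption): x^2 + y^2 = 1 + 2 x^2 y^2 over F_13.
P_MOD, A, D = 13, 1, 2

def on_curve(pt):
    """Check the twisted Edwards equation a*x^2 + y^2 = 1 + d*x^2*y^2."""
    x, y = pt
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P_MOD == 0

def ed_add(p1, p2):
    """Unified addition/doubling of formula (2) in affine coordinates."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % P_MOD
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % P_MOD, -1, P_MOD) % P_MOD
    y3 = (y1 * y2 - A * x1 * x2) * pow((1 - t) % P_MOD, -1, P_MOD) % P_MOD
    return (x3, y3)

NEUTRAL = (0, 1)          # neutral element of the group law
P = (4, 4)
TWO_P = ed_add(P, P)      # the same formula handles doubling, gives (1, 0)
```

No case distinction between addition and doubling is needed, which is the property referred to as "unified" above.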

Montgomery curves and twisted Edwards curves are

isomorphic. A conversion of the point coordinates be-

tween these two curves requires only a few ﬁeld op-

erations.
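Formulas (3) and (4) only involve $x$-coordinates, which can be illustrated with a short Python sketch. It assumes the Montgomery form of Curve25519 ($a = 486662$, $p = 2^{255} - 19$, base point with $x = 9$); the function names are hypothetical and the affine inversions are kept for readability, whereas an actual implementation would use projective $(X, Z)$ coordinates.

```python
P_MOD = 2**255 - 19
A = 486662  # Montgomery coefficient a of Curve25519 (assumed example curve)

def inv(z):
    """Modular inverse in F_p."""
    return pow(z % P_MOD, -1, P_MOD)

def x_double(xq):
    """x-coordinate of 2Q from the x-coordinate of Q, formula (3)."""
    num = (xq * xq - 1) ** 2
    den = 4 * xq * (xq * xq + A * xq + 1)
    return num * inv(den) % P_MOD

def x_diff_add(xq, xp, xr):
    """x-coordinate of Q + P from x(Q), x(P) and x(R) with R = Q - P, formula (4)."""
    num = (xq * xp - 1) ** 2
    den = xr * (xq - xp) ** 2
    return num * inv(den) % P_MOD

x1 = 9                          # x(P)
x2 = x_double(x1)               # x(2P)
x3 = x_diff_add(x2, x1, x1)     # x(3P) = x(2P + P), difference P
x4 = x_diff_add(x3, x1, x2)     # x(4P) = x(3P + P), difference 2P
```

A simple consistency check is that `x4` equals `x_double(x2)`, since both are the $x$-coordinate of $4P$.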

The formulas in (1), (2) and (3) involve a ﬁeld in-

version which is generally a costly operation. Alter-

native coordinate systems have been proposed in the

case of Weierstrass, Montgomery, twisted Edwards

and Jacobi quartic in order to avoid ﬁeld inversions

and reduce the number of ﬁeld multiplications:

• Extended coordinate system, where a quadruplet $(X, Y, Z, T)$ represents the point $x = X/Z$, $y = Y/Z$ with $T/Z = xy$.
• Regular projective coordinates, where a triplet $(X, Y, Z)$ represents the point $x = X/Z$, $y = Y/Z$.
• Inverted coordinates, where a triplet $(X, Y, Z)$ corresponds to the point $x = Z/X$ and $y = Z/Y$.

The costs of a doubling, an addition and a mixed addition in these coordinate systems are given in Table 1.

2.2 Operations in Binary Elliptic Curves

An elliptic curve over a binary field is generally defined by the following short Weierstrass equation:

$$E : y^2 + xy = x^3 + ax^2 + b \quad \text{with } a, b \in \mathbb{F}_{2^m}. \tag{5}$$

The formulas for addition, doubling and halving on $E(\mathbb{F}_{2^m})$ are as follows:

• Addition and doubling. We consider two points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ on the curve $E(\mathbb{F}_{2^m})$. The coordinates $(x_3, y_3)$ of $P_3 = P_1 + P_2$ are given by the following formula:

$$\begin{cases} x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \\ y_3 = (x_1 + x_3)\lambda + x_3 + y_1 \end{cases} \quad \text{where } \lambda = \begin{cases} \dfrac{y_1 + y_2}{x_1 + x_2} & \text{if } P_1 \neq P_2, \\[2mm] \dfrac{y_1}{x_1} + x_1 & \text{if } P_1 = P_2. \end{cases} \tag{6}$$

• Point halving. In (Knudsen, 1999), Knudsen noticed that if a point $Q = (u, v)$ on $E(\mathbb{F}_{2^m})$ is of odd order, then the point $P = \frac{1}{2}Q$ in the subgroup generated by $Q$ is well defined. Knudsen also noticed that the coordinates of $P$ can be computed efficiently in terms of the coordinates of $Q$. Indeed, since $Q = 2P$, we can use (6) to derive the coordinates of $P = (x, y)$ in terms of $Q = (u, v)$ as follows:

$$\lambda = x + y/x \tag{7}$$
$$u = \lambda^2 + \lambda + a \tag{8}$$
$$v = x^2 + u(\lambda + 1) \tag{9}$$

Knudsen proposed to first solve the quadratic equation (8) in $\lambda$, then to compute $x = \sqrt{v + u(\lambda + 1)}$ from (9) and finally to compute $y = \lambda x + x^2$ from (7). A point halving on $E(\mathbb{F}_{2^m})$ thus has a cost of $2M$ plus one square root (SqRt) and one quadratic solver (QS). It becomes more efficient if $P$ and $Q$ are represented in lambda coordinates $P = (x_P, \lambda_P)$ and $Q = (x_Q, \lambda_Q)$ since, in this case, a point halving requires only $1M + 1\mathrm{SqRt} + 1\mathrm{QS}$.

Table 2: Complexity of curve operations in $E(\mathbb{F}_{2^m})$ when $a = 1$.

| Coordinates | Doubling | Halving | Mixed-addition | Doubling-mixed-addition | Addition |
|---|---|---|---|---|---|
| Affine | 2M + I | 1M + 1SqRt + 1QS | - | 4M + 2I | 2M + I |
| Lopez-Dahab | 4M + 5S | - | 9M + 4S | 13M + 4S | 13M + 9S |
| Kim-Kim | 4M + 5S | - | 8M + 4S | 12M + 9S | 12M + 4S |
| Lambda proj. | 4M + 4S | - | 8M + 2S | 10M + 6S | 11M + 4S |

The following projective coordinate systems are proposed in the literature in order to efficiently implement point operations in $E(\mathbb{F}_{2^m})$:
• Lopez-Dahab coordinate system (López and Dahab, 1998). A point is given by a triplet $(X, Y, Z)$ corresponding to the affine point $x = X/Z$, $y = Y/Z^2$.
• Kim-Kim coordinate system (Kim and Kim, ). A point is given by a quadruplet $(X, Y, Z, T)$, where $T = Z^2$, which corresponds to the affine point $x = X/Z$, $y = Y/T$.
• Lambda projective coordinate system (Oliveira et al., 2014). A point is represented by a triplet $(X, L, Z)$ such that $x = X/Z$, $\lambda = L/Z = y/x + x$ and $y = (L/Z + X/Z)X/Z$.

For explicit point addition and doubling formulas in

these projective coordinate systems the reader may

refer to (Hankerson et al., 2004; Kim and Kim, ;

Oliveira et al., 2014). In Table 2 we report the cor-

responding cost of each point operation on the curve

when a = 1. We do not report the complexity of a

point halving in projective coordinate systems, since

in this case the quadratic solver requires a ﬁeld inver-

sion and makes the halving inefﬁcient. We can no-

tice that Lambda coordinates provide the most efﬁ-

cient curve operations.

2.3 Scalar Multiplication Algorithms

Double-and-add scalar multiplication. A scalar multiplication $kP$ is generally performed with a sequence of doublings and additions. The scalar $k$ is first recoded with the NAF$_w$ recoding as $k = \sum_{i=0}^{\ell} k_i 2^i$ where $k_i$ is 0 or an odd integer in $[-2^{w-1}, 2^{w-1}]$ and $w$ is the window size. The $2^{w-2}$ points $k_iP$ for $k_i$ odd in $[0, 2^{w-1}]$ are computed at the beginning of the algorithm and then $R = kP$ is computed through a sequence of doublings and additions $R \leftarrow 2R + k_iP$ for $i = \ell, \ell - 1, \ldots, 0$. This method is reported in Algorithm 1. The complexity of this approach is, on average, $\ell + 1$ doublings and $\ell/(w + 1) + 2^{w-2} - 1$ additions (see (Hankerson et al., 2004) for a detailed analysis).

Algorithm 1: Double-and-add.
Require: $P \in E(\mathbb{F}_q)$ and a scalar $k \in [0, N - 1]$ where $N$ is the order of $P$.
Ensure: $Q = k \cdot P$
1: Compute NAF$_w(k) = \sum_{i=0}^{\ell} k_i 2^i$
2: Compute $T[i] = i \cdot P$ for all odd positive integers $i \in [0, 2^{w-1}]$
3: $Q \leftarrow \mathcal{O}$
4: for $i$ from $\ell$ downto 0 do
5:   $Q \leftarrow 2 \cdot Q + \mathrm{sign}(k_i)\,T[|k_i|]$
6: end for
7: return ($Q$)
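Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the timings: it works in affine coordinates on the toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ (an assumed example), the names `naf_w`, `ec_add`, `ec_neg` and `scalar_mult` are hypothetical, and the recoding is the usual width-$w$ NAF with odd digits of absolute value below $2^{w-1}$.

```python
P_MOD, A = 17, 2  # toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative assumption)

def ec_add(p1, p2):
    """Affine chord-and-tangent addition; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def ec_neg(p):
    return None if p is None else (p[0], -p[1] % P_MOD)

def naf_w(k, w):
    """Width-w NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)          # signed residue modulo 2^w
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def scalar_mult(k, point, w=3):
    """Left-to-right double-and-add on the NAF_w digits (Algorithm 1)."""
    two_p = ec_add(point, point)
    table = {1: point}                         # T[i] = i*P for odd i < 2^(w-1)
    for i in range(3, 1 << (w - 1), 2):
        table[i] = ec_add(table[i - 2], two_p)
    q = None
    for d in reversed(naf_w(k, w)):
        q = ec_add(q, q)                       # Q <- 2Q
        if d > 0:
            q = ec_add(q, table[d])            # Q <- Q + T[|k_i|]
        elif d < 0:
            q = ec_add(q, ec_neg(table[-d]))   # Q <- Q - T[|k_i|]
    return q
```

For example, `scalar_mult(7, (5, 1), w=3)` processes the digits of $7 = 2^3 - 1$, so it performs a single subtraction instead of the three additions required by the plain binary expansion.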

Halve-and-add scalar multiplication. In the case of binary elliptic curves, the halving operation makes it possible to use a halve-and-add approach instead of a double-and-add approach. We first recode the scalar as $k = \sum_{i=0}^{\ell} k_i 2^{-i}$ where $k_i$ is 0 or an odd integer in $[-2^{w-1}, 2^{w-1}]$. Then the scalar multiplication $kP$ is computed by first setting $R \leftarrow P$ and $Q_j \leftarrow \mathcal{O}$ for $j \in \{1, 3, \ldots, 2^{w-1} - 1\}$. Then we perform the sequence of halvings and additions shown in steps 4-9 of Algorithm 2, and at the end the result is $Q = \sum_{j \in \{1, 3, \ldots, 2^{w-1} - 1\}} jQ_j$. This method is depicted in Algorithm 2.

The post-computation in the return statement (Step 10) is computed with the technique credited to Knuth: $Q_i \leftarrow Q_i + Q_{i+2}$ for $i$ from $2^{w-1} - 3$ down to 1, then $Q$ is given by $Q \leftarrow Q_1 + 2\sum_{j \in \{3, \ldots, 2^{w-1} - 1\}} Q_j$. The total complexity of the halve-and-add scalar multiplication is on average $2^{w-1} + \ell/(w + 1)$ point additions, $\ell - 1$ halvings and one doubling. Using the costs of Table 2, we obtain the following complexity in terms of field operations:

$$\#\mathrm{Op.} = \ell(M + \mathrm{SqRt} + \mathrm{QS}) + (8M + 4S)\left(\frac{\ell}{w + 1} + 2^{w-1}\right) + 4M + 4S. \tag{10}$$

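Algorithm 2: Halve-and-add.
Require: $P \in E(\mathbb{F}_{2^m})$ of odd order $N$, a scalar $k \in [0, N - 1]$, $\ell = \lceil \log_2(N) \rceil + 1$ and $w$ a window size.
Ensure: $Q = k \cdot P$.
1: Recode $k = \sum_{i=0}^{\ell} k_i 2^{-i}$ where $k_i$ is 0 or an odd integer in $[-2^{w-1}, 2^{w-1}]$
2: for $j \in J = \{1, 3, \ldots, 2^{w-1} - 1\}$ do $Q_j \leftarrow \mathcal{O}$
3: $R \leftarrow P$
4: for $i$ from 0 to $\ell$ do
5:   if $k_i \neq 0$ then
6:     $Q_{|k_i|} \leftarrow Q_{|k_i|} + \mathrm{sign}(k_i)R$
7:   end if
8:   $R \leftarrow R/2$
9: end for
10: return $Q \leftarrow \sum_{j \in J} jQ_j$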
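The post-computation of Step 10 can be illustrated independently of the curve arithmetic. In the sketch below, plain integers stand in for the accumulators $Q_j$ (an assumption made so that the result can be checked against $\sum_j jQ_j$ directly); the function name `knuth_sum` is hypothetical. Only additions and one final doubling are used, as in the technique credited to Knuth.

```python
def knuth_sum(acc, w):
    """Compute sum(j * acc[j]) over odd j in {1, 3, ..., 2^(w-1) - 1}
    using only additions and one doubling.

    acc maps each odd index j to a group element; integers are used here
    as stand-ins for curve points.
    """
    top = (1 << (w - 1)) - 1
    q = dict(acc)
    # Q_i <- Q_i + Q_{i+2} for i from 2^(w-1) - 3 down to 1:
    # afterwards q[i] holds the sum of all original Q_j with j >= i.
    for i in range(top - 2, 0, -2):
        q[i] = q[i] + q[i + 2]
    # Q <- Q_1 + 2 * (Q_3 + Q_5 + ... + Q_top)
    tail = 0
    for j in range(3, top + 1, 2):
        tail = tail + q[j]
    return q[1] + (tail + tail)

buckets = {1: 10, 3: 20, 5: 30, 7: 40}   # w = 4, so J = {1, 3, 5, 7}
result = knuth_sum(buckets, 4)            # 1*10 + 3*20 + 5*30 + 7*40 = 500
```

With curve points, each `+` would be a point addition and `tail + tail` a point doubling, which matches the "one doubling" counted in the complexity of the halve-and-add method.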

Parallel (double,halve)-and-add scalar multiplication. In the case of binary elliptic curves, the halve-and-add and double-and-add methods can be used in parallel in order to speed up the computation of the scalar multiplication. The scalar $k$ is recoded as

$$k = \underbrace{\sum_{i=0}^{s} k'_i 2^i}_{k'} + \underbrace{\sum_{i=1}^{\ell-s} k''_i 2^{-i}}_{k''}.$$

Then the computation can be split into two threads: one double-and-add thread computing $Q' = k'P$ and one halve-and-add thread computing $Q'' = k''P$; the result is obtained with a final addition $Q = Q' + Q''$. For further details on this method the reader may refer to (Taverne et al., 2011).

3 PROPOSED PARALLEL SCALAR MULTIPLICATION

We consider the case of curves where the paralleliza-

tion based on GLV (Gallant et al., 2001) is not ap-

plicable. This is the case of the NIST curves B233,

B409 and this is also the case of twisted Edwards

curve Curve25519 (Bernstein, 2006).

3.1 Two-thread Parallelization over $E(\mathbb{F}_p)$

The proposed approach to parallelize part of the computations involved in double-and-add scalar multiplication is the following: we split the scalar into two parts $k = k_1 + 2^sk_2$ with $k_1 < 2^s$ and then we divide the computation of $kP = k_1P + 2^sk_2P$ into two threads:
• Thread1 computes $Q_1 = k_1P$ using a double-and-add approach of length $s$.
• Thread2 computes $2^sk_2P$ by first computing $k_2P$ using the double-and-add algorithm and then a sequence of doublings $2(k_2P), 4(k_2P), \ldots, 2^s(k_2P)$, and finally it outputs $Q_2 = 2^s(k_2P)$.
At the end, the two points $Q_1$ and $Q_2$ are added and then $Q = Q_1 + Q_2$ is output. This method is depicted in Fig. 1.

[Figure 1: Thread1 computes $Q_1 = k_1P$ with double-and-add ($\frac{s}{w+1}$ additions and $s - 1$ doublings). Thread2 computes $k_2P$ with double-and-add ($\frac{\ell-s}{w+1}$ additions and $\ell - s$ doublings), then $Q_2 = 2^sk_2P$ with $s$ doublings. After thread joining, the final addition gives $Q_1 + Q_2$.]

Figure 1: Two-thread parallelization of double-and-add approach.
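The two-thread split can be sketched as follows in Python. This is a structural illustration only, under several assumptions: the toy Weierstrass curve $y^2 = x^3 + 2x - 8$ over $\mathbb{F}_p$ with $p = 2^{255} - 19$ (chosen so that $(3, 5)$ lies on it), a plain binary double-and-add instead of NAF$_w$, affine coordinates, and hypothetical function names. Python threads share the interpreter lock, so no speed-up should be expected from this sketch; it only shows how the work is divided and joined.

```python
from concurrent.futures import ThreadPoolExecutor

P_MOD = 2**255 - 19
A = 2             # toy curve y^2 = x^3 + 2x - 8 over F_p (illustrative assumption)
G = (3, 5)        # 5^2 = 3^3 + 2*3 - 8, so G lies on the toy curve

def ec_add(p1, p2):
    """Affine chord-and-tangent addition; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def double_and_add(k, point):
    """Plain left-to-right binary double-and-add."""
    q = None
    for bit in bin(k)[2:]:
        q = ec_add(q, q)
        if bit == "1":
            q = ec_add(q, point)
    return q

def thread1(k1, point):
    """Lower part: Q1 = k1 * P."""
    return double_and_add(k1, point)

def thread2(k2, point, s):
    """Upper part: k2 * P followed by s doublings, giving Q2 = 2^s * k2 * P."""
    q = double_and_add(k2, point)
    for _ in range(s):
        q = ec_add(q, q)
    return q

def parallel_scalar_mult(k, point, s):
    """Split k = k1 + 2^s * k2, run both parts concurrently, then add."""
    k1, k2 = k & ((1 << s) - 1), k >> s
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(thread1, k1, point)
        f2 = pool.submit(thread2, k2, point, s)
        return ec_add(f1.result(), f2.result())
```

Choosing `s` larger than half the bit length of `k` compensates for the `s` extra doublings of the second thread; the balanced value is derived in the complexity analysis that follows.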

Complexity. The proposed parallelization is optimal if the computation loads of the two threads are the same. This means that

$$\underbrace{\frac{s}{w+1}A + sD}_{\text{Thread1}} \sim \underbrace{\frac{\ell-s}{w+1}A + \ell D}_{\text{Thread2}}$$

where $A$ and $D$ represent an addition and a doubling, respectively. This implies that, for a balanced computation load, the split $s$ of the scalar $k$ has to be as follows:

$$\frac{s}{\ell} \sim \frac{A + (w+1)D}{2A + (w+1)D}.$$

In other words, the proposed parallelization scales the computation time by a ratio of $\alpha = \frac{A+(w+1)D}{2A+(w+1)D}$. We use the complexity of the curve operations given in Table 1 to derive explicit values of the ratio $\alpha$ for the three cases $w = 2, 3$ and 4. For the sake of simplicity, we assume that the multiplications by curve constants in Table 1 (such as $M_{a-1}$) are negligible compared to $M$ and that $S = 0.8M$. We report the resulting values of $\alpha$ in Table 3.

Table 3: Estimated values for the ratio $\alpha$.

| Curve form | $w = 2$ | $w = 3$ | $w = 4$ |
|---|---|---|---|
| Weierstrass | 0.74 | 0.77 | 0.80 |
| Twisted Edwards | 0.75 | 0.79 | 0.81 |
| Jacobi quartic | 0.76 | 0.79 | 0.82 |

Table 3 shows that the execution time is expected to be reduced by about 25% for $w = 2$; for the two other cases the improvement should be around 20%.
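The ratio $\alpha$ is easy to evaluate numerically. The following Python sketch is a rough check under explicit assumptions: $A$ is taken to be the mixed-addition cost of Table 1, constant multiplications are ignored, $S = 0.8M$, and the names `COSTS`, `alpha` and `balanced_split` are hypothetical. With these choices the computed values land within about 0.02 of those in Table 3; the small differences may come from details of the cost model that are not spelled out here.

```python
S = 0.8  # cost of one squaring, in units of one multiplication M

# (doubling D, mixed addition A) in units of M, from Table 1,
# with multiplications by curve constants neglected.
COSTS = {
    "Weierstrass": (4 + 4 * S, 8 + 3 * S),
    "Twisted Edwards": (3 + 4 * S, 8 + 1 * S),
    "Jacobi quartic": (3 + 4 * S, 6 + 3 * S),
}

def alpha(A, D, w):
    """Ratio s / ell that balances the two threads."""
    return (A + (w + 1) * D) / (2 * A + (w + 1) * D)

def balanced_split(ell, A, D, w):
    """Split position s (in bits) for a scalar of ell bits."""
    return round(alpha(A, D, w) * ell)

def thread_loads(ell, s, A, D, w):
    """Estimated loads of Thread1 and Thread2 for a given split s."""
    load1 = s / (w + 1) * A + s * D
    load2 = (ell - s) / (w + 1) * A + ell * D
    return load1, load2

for name, (D, A) in COSTS.items():
    print(name, [round(alpha(A, D, w), 3) for w in (2, 3, 4)])
```

`thread_loads` can be used to confirm that the two loads coincide when `s` is set to `alpha(A, D, w) * ell`.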

3.2 Optimized Two-thread Parallelization over $E(\mathbb{F}_p)$

We present in this subsection an optimized version of the approach of Subsection 3.1 when the curve is a twisted Edwards curve. A twisted Edwards curve can alternatively be set in a Montgomery form, and then we can use the efficiency of the doubling on a Montgomery curve: from Table 1 a doubling on a Montgomery curve requires only $2M + 2S$. Specifically, we propose to perform the sequence of $s$ doublings of Thread2, in the parallelization of Subsection 3.1, with the Montgomery doubling. This results in the following optimized parallelization:
• Thread1 computes $Q_1 = k_1P$ using a double-and-add approach of length $s$ on the twisted Edwards curve.
• Thread2 first computes the coordinates of $P$ in the Montgomery curve. Then it computes $P_s = 2^sP$ on the Montgomery curve with a sequence of $s$ doublings. Then it computes the coordinates of $P_s$ in the twisted Edwards curve. Finally it computes $Q_2 = k_2P_s$ using the double-and-add algorithm on the twisted Edwards curve.

This modified two-thread parallelization induces a problem: we do not know the coordinate $y_{P_s}$ of $P_s = 2^sP$ at the end of the sequence of doublings in the Montgomery curve. Indeed the Montgomery doubling formula computes only the $x$ coordinate of a point (or, equivalently, the $X$ and $Z$ projective coordinates). We can compute $\pm y_{P_s}$ by solving a quadratic equation in $y$ given by the curve equation. But we do not know the sign of the solution of the quadratic equation, i.e., we are left with the two possible values $\pm y_{P_s}$. We propose to deal with this problem as follows:
• We pick an arbitrary sign for $y_{P_s}$ and we let Thread2 compute $k_2 \cdot (\pm P_s)$.
• We use a right-to-left double-and-add scalar multiplication in Thread1. At some point, Thread1 computes $P_s = 2^sP$ and then at this point we know the correct sign for $y_{P_s}$.
When the two threads are joined, knowing the correct sign for $y_{P_s}$ enables us to pick the correct value of $Q_2 = 2^sk_2P$ and then we can compute $kP = k_1P + 2^sk_2P$.

This proposed optimized two-thread parallelization is depicted in Fig. 2.

Complexity. We denote $D_M$ the cost of a Montgomery doubling, $A_E$ and $D_E$ the respective costs of the addition and doubling on a twisted Edwards curve, and $A'_E$ the cost of a non-mixed addition (involved in the right-to-left double-and-add). Then the work loads of the two threads are equal if the following identity holds

$$\underbrace{\frac{s}{w+1}A'_E + sD_E}_{\text{Thread1}} \sim \underbrace{\frac{\ell-s}{w+1}A_E + (\ell - s)D_E + sD_M}_{\text{Thread2}}.$$

This implies that

$$\frac{s}{\ell} \sim \underbrace{\frac{A_E + (w+1)D_E}{A'_E + A_E + (w+1)(2D_E - D_M)}}_{\alpha}.$$

With the complexity of the curve operations given in Table 1 we derive explicit values for the ratio $\alpha$: for $w = 2$ we obtain $\alpha = 0.62$, for $w = 3$ we have $\alpha = 0.63$ and for $w = 4$ we have $\alpha = 0.64$.

In other words, with the proposed optimization, the theoretical saving is around 35% of the computation time compared to the non-parallelized version.
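The balance condition of the optimized variant can be checked numerically with a short Python sketch. The costs below are read from Table 1 with $S = 0.8M$ and constant multiplications neglected; the variable names (`D_E`, `A_E`, `A_FULL`, `D_M`) and the function names are assumptions introduced for this illustration. Under these assumptions the computed ratios come out roughly 0.01 below the values quoted above.

```python
S = 0.8               # cost of one squaring, in units of M

D_E = 3 + 4 * S       # twisted Edwards doubling (inverted coordinates)
A_E = 8 + 1 * S       # twisted Edwards mixed addition (Thread2)
A_FULL = 9 + 1 * S    # twisted Edwards non-mixed addition (right-to-left, Thread1)
D_M = 2 + 2 * S       # Montgomery doubling

def alpha_opt(w):
    """Ratio s / ell balancing the two threads of the optimized variant."""
    return (A_E + (w + 1) * D_E) / (A_FULL + A_E + (w + 1) * (2 * D_E - D_M))

def loads_opt(ell, s, w):
    """Estimated loads of Thread1 and Thread2 for a split s."""
    load1 = s / (w + 1) * A_FULL + s * D_E
    load2 = (ell - s) / (w + 1) * A_E + (ell - s) * D_E + s * D_M
    return load1, load2

for w in (2, 3, 4):
    print(w, round(alpha_opt(w), 3))
```

Because the Montgomery doublings are cheaper than the Edwards ones, the balanced split moves towards the middle of the scalar compared with Subsection 3.1.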

[Figure 2: Thread1 computes $Q_1 = k_1P$ on the twisted Edwards curve with right-to-left double-and-add ($\frac{s}{w+1}$ additions and $s - 1$ doublings). Thread2 converts $P$ to the Montgomery curve, computes $2^sP$ with $s$ Montgomery doublings, converts back to the twisted Edwards curve with recovery of $\pm y$, then computes $Q_2 = \pm 2^sk_2P$ with left-to-right double-and-add ($\frac{\ell-s}{w+1}$ additions and $\ell - s$ doublings). After joining, the correct sign is picked and $Q = Q_1 + Q_2$ is computed.]

Figure 2: Optimized two-thread parallelization of double-and-add approach.

3.3 Three-thread Parallelization over $E(\mathbb{F}_{2^m})$

Our goal in this subsection is to increase the level of parallelism of the two-thread (double,halve)-and-add scalar multiplication in $E(\mathbb{F}_{2^m})$ (reviewed in Subsection 2.3). To reach this goal, we recode and split the scalar $k$ as follows:

$$k = \underbrace{\sum_{i=0}^{s'} k'_i 2^i}_{k'} + \underbrace{\sum_{i=1}^{s} k''_i 2^{-i}}_{k''} + 2^{-s}\underbrace{\sum_{i=1}^{\ell-s-s'} k'''_i 2^{-i}}_{k'''}$$

where the digits $k'_i$, $k''_i$ and $k'''_i$ are in $\{\pm 1, \pm 3, \pm 5, \ldots, \pm 2^{w-2} - 3\}$.

We split the computation of the scalar multiplication into three threads:
• Thread1 computing $k'P$ using a double-and-add method of length $s'$.
• Thread2 computing $k''P$ using a halve-and-add method of length $s$.
• Thread3 computing $2^{-s}k'''P$ by performing a halve-and-add scalar multiplication followed by a sequence of $s$ halvings to obtain $2^{-s}k'''P$.
This approach is depicted in Fig. 3.

[Figure 3: Thread1 computes $Q' = k'P$ with double-and-add ($\frac{s'}{w+1}$ additions and $s'$ doublings). Thread2 computes $Q'' = k''P$ with halve-and-add ($\frac{s}{w+1}$ additions and $s$ halvings). Thread3 computes $k'''P$ with halve-and-add ($\frac{\ell-s-s'}{w+1}$ additions and $\ell - s - s'$ halvings), then $Q''' = 2^{-s}k'''P$ with $s$ halvings. After joining, the result is $Q' + Q'' + Q'''$.]

Figure 3: Three-thread version of the (double,halve)-and-add approach.

Complexity. Let us evaluate the reduction ratio of the computations in this case. The work loads of the three threads are well balanced if the following equations hold

$$\frac{s}{w+1}A + sH \sim \frac{s'}{w+1}A + s'D,$$
$$\frac{s}{w+1}A + sH \sim \frac{\ell-s-s'}{w+1}A + (\ell - s')H,$$

where $A$ represents an addition, $D$ a doubling and $H$ a halving on the curve $E(\mathbb{F}_{2^m})$. We solve the above equations and we obtain that the work load is well balanced when $s = \alpha\ell$ where

$$\alpha = \frac{(A + (w+1)H)(A + (w+1)D)}{(2A + (w+1)H)(A + (w+1)D) + (A + (w+1)H)^2}.$$

The resulting values for the ratio $\alpha$ are given below in the cases $w = 2, 3$ and 4:
• for $w = 2$ we have $\alpha \sim 0.46$,
• for $w = 3$ we have $\alpha \sim 0.37$,
• for $w = 4$ we have $\alpha \sim 0.37$.
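The closed form for $\alpha$ can be verified against the two balance equations with a few lines of Python. The cost values passed in the example (`A = 10`, `D = 7`, `H = 3`, in units of one field multiplication) are placeholders chosen for illustration, not measurements from the paper, and the function names are hypothetical. The relation $s' = s\,(A + (w+1)H)/(A + (w+1)D)$ used below follows from the first balance equation.

```python
def three_thread_split(ell, A, D, H, w):
    """Return (s, s_prime) balancing the three threads for an ell-bit scalar."""
    a_h = A + (w + 1) * H
    a_d = A + (w + 1) * D
    alpha = a_h * a_d / ((2 * A + (w + 1) * H) * a_d + a_h ** 2)
    s = alpha * ell
    s_prime = s * a_h / a_d      # from the first balance equation
    return s, s_prime

def three_thread_loads(ell, s, s_prime, A, D, H, w):
    """Estimated work load of each thread."""
    t1 = s_prime / (w + 1) * A + s_prime * D                   # double-and-add
    t2 = s / (w + 1) * A + s * H                               # halve-and-add
    t3 = (ell - s - s_prime) / (w + 1) * A + (ell - s_prime) * H
    return t1, t2, t3

# Placeholder costs (assumption): A = 10M, D = 7M, H = 3M, window w = 4.
s, s_prime = three_thread_split(409, 10.0, 7.0, 3.0, 4)
t1, t2, t3 = three_thread_loads(409, s, s_prime, 10.0, 7.0, 3.0, 4)
```

With these placeholder costs the three loads `t1`, `t2` and `t3` coincide up to rounding, which is the property the formula for $\alpha$ is meant to guarantee.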

4 IMPLEMENTATION RESULTS

In this section, we present our experimental results for the parallel approaches of Section 3.

The platform used for the experiments is a Dell Optiplex 990 running Ubuntu 12.04. The processor is an Intel Core i7-2600 Sandy Bridge at 3.4 GHz, which has four physical cores. Our code is written in C and compiled with gcc 4.6.3. The timings are obtained with turbo mode and hyper-threading deactivated, as recommended in (Bernstein and Lange, 2012).

4.1 Implementations for $E(\mathbb{F}_p)$

We consider the twisted Edwards curve Curve25519 introduced by Bernstein in (Bernstein, 2006), defined over the prime field $\mathbb{F}_p$ with $p = 2^{255} - 19$. For field operations, we reuse the publicly available code of Adam Langley (Langley, 2008). In this code, a field element is stored in an array of five 64-bit words, each word containing 51 bits of the 255-bit field element. This allows a better management of carries in field addition and subtraction operations. The multiplications and squarings are performed with the schoolbook method. Squaring is optimized with the usual trick which reduces the number of word multiplications. The reduction modulo $p = 2^{255} - 19$ consists in multiplying the 255 most significant bits by 19 and then adding the result to the lower 255 bits. For the inversion of a field element we use the extended Euclidean algorithm with the low-level functions of the GMP library (gmp, ). Curve operations simply follow the formulas provided in (hyp, ) corresponding to inverted projective coordinates in twisted Edwards curves.
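The limb representation and the reduction modulo $2^{255} - 19$ described above can be illustrated with Python integers. This sketch is not Langley's code: the function names are hypothetical, and arbitrary-precision integers are used where the C code works with 64-bit words. It relies only on the congruence $2^{255} \equiv 19 \pmod p$.

```python
P = 2**255 - 19
LIMB_BITS = 51
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x):
    """Split a 255-bit integer into five 51-bit limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(5)]

def from_limbs(limbs):
    """Recombine limbs; limbs may temporarily exceed 51 bits."""
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))

def add_limbs(a, b):
    """Limb-wise addition with no carry propagation: a sum of two 51-bit
    limbs needs only 52 bits, which leaves headroom in a 64-bit word."""
    return [x + y for x, y in zip(a, b)]

def reduce_mod_p(z):
    """Reduce a product z < 2^510 modulo p by folding the high bits,
    using 2^255 = 19 (mod p)."""
    low_mask = (1 << 255) - 1
    z = (z & low_mask) + 19 * (z >> 255)   # first fold
    z = (z & low_mask) + 19 * (z >> 255)   # second fold absorbs the overflow
    return z - P if z >= P else z

a, b = P - 1, 2**254 + 12345
product = reduce_mod_p(a * b)
```

Deferring carries in `add_limbs` is what the 51-bit layout buys: several additions can be chained before any limb needs to be normalized.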

4.2 Implementations for $E(\mathbb{F}_{2^m})$

Our implementations deal with the NIST curves B233 and B409 defined over the fields $\mathbb{F}_{2^{233}} = \mathbb{F}_2[x]/(x^{233} + x^{74} + 1)$ and $\mathbb{F}_{2^{409}} = \mathbb{F}_2[x]/(x^{409} + x^{87} + 1)$, respectively. For a field multiplication, we apply a small number of recursions of the Karatsuba algorithm, which breaks the $m$-bit polynomial multiplication into several 64-bit polynomial multiplications. Such 64-bit multiplications are computed with the PCLMUL instruction, available on Intel Core i7 processors. Due to the special form of the irreducible polynomials, the reduction is done with a small number of shifts and bitwise XORs on 64-bit words. We compute the field inversion with the Itoh-Tsujii algorithm, that is, a sequence of field multiplications and multisquarings performed with look-up tables. For field squaring, square root and quadratic solver (needed in halvings), we also use a look-up table method, which is the fastest way according to our tests. For the curve operations, we use the projective lambda coordinates with the corresponding formulas provided in (Oliveira et al., 2014).
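The multiplication strategy described above can be mimicked in Python, with integers encoding binary polynomials (bit $i$ is the coefficient of $x^i$). This is an illustrative sketch under assumptions: `clmul` is a bit-by-bit stand-in for the 64-bit carry-less multiplication instruction, the reduction is a simple bit-serial loop rather than the word-level shifts and XORs of an optimized implementation, and all function names are hypothetical.

```python
M = 233
F_POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1

def clmul(a, b):
    """Carry-less (polynomial) product over F_2, bit by bit.
    Stands in for the hardware instruction on blocks of at most 64 bits."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, bits):
    """Karatsuba product of two polynomials of fewer than `bits` coefficients."""
    if bits <= 64:
        return clmul(a, b)
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba(a0, b0, h)
    hi = karatsuba(a1, b1, bits - h)
    # In characteristic 2, additions and subtractions are both XOR.
    mid = karatsuba(a0 ^ a1, b0 ^ b1, bits - h) ^ lo ^ hi
    return lo ^ (mid << h) ^ (hi << (2 * h))

def reduce_b233(c):
    """Reduce modulo x^233 + x^74 + 1 by cancelling high bits one at a time."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F_POLY << (i - M)
    return c

def field_mul(a, b):
    """Multiplication in F_{2^233}."""
    return reduce_b233(karatsuba(a, b, M))
```

Each Karatsuba level replaces four half-size products by three, at the price of a few extra XORs, which is why only a small number of recursion levels is applied before falling back to the 64-bit primitive.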

4.3 Timing Results for the Proposed Parallel Approaches

Table 4 reports the timings obtained for the three parallel approaches discussed in Section 3. We also provide the timings of the two-thread parallel (double,halve)-and-add approach with $w = 4$ for B233 and B409 and the timings of the non-parallelized double-and-add approach with $w = 2, 3$ and 4 for $E(\mathbb{F}_p)$. For each parallel scalar multiplication we give the split value $s$ (and $s'$ for the three-thread case). Additionally we provide timings found in the literature on the same processor and for similar curves and fields.

Table 4: Timings (in $10^3$ clock-cycles (CC)) of parallel approaches over $E(\mathbb{F}_{2^m})$ and $E(\mathbb{F}_p)$.

| Source | Curve | Method | NAF size $w$ | #CC | Split $s$ | Split $s'$ | Nb. cores |
|---|---|---|---|---|---|---|---|
| proposed | B233 | three-thread | 4 | 106 | 110 | 83 | 3 |
| our code | B233 | (db,hv)-&-add | 4 | 104 | 98 | - | 2 |
| Taverne et al. | B233 | (db,hv)-&-add | 4 | 100 | - | - | 2 |
| Negre et al. | B233 | (db,hv)-&-add | 4 | 117 | - | - | 2 |
| proposed | B409 | three-thread | 4 | 303 | 187 | 143 | 3 |
| our code | B409 | (db,hv)-&-add | 4 | 338 | 175 | - | 2 |
| Taverne et al. | B409 | (db,hv)-&-add | 4 | 349 | - | - | 2 |
| Negre et al. | B409 | (db,hv)-&-add | 4 | 452 | - | - | 2 |
| proposed | C25519 | two-thread | 2 | 186 | 185 | - | 2 |
| proposed | C25519 | opt-two-thd | 2 | 180 | 168 | - | 2 |
| our code | C25519 | db-&-add | 4 | 239 | - | - | 1 |
| our code | C25519 | db-&-add | 3 | 219 | - | - | 1 |
| our code | C25519 | db-&-add | 2 | 221 | - | - | 1 |
| Langley (*) | C25519 | Montg. ladder | - | 229 | - | - | 1 |
| Bernstein | C25519 | Montg. ladder | - | 194 | - | - | 1 |
| Hamburg | Mtg251 | Montg. ladder | - | 153 | - | - | 1 |

(*) Compiled and run on our platform.

Concerning the curve B233, the proposed parallelization does not show any speed-up compared to the two-thread (double,halve)-and-add approach. This could be explained by the cost induced by the thread management. On the other hand, the approach is clearly effective for the curve B409: it even shows a timing which is better than all timings found in the literature for the (double,halve)-and-add approach (cf. Section 2).

In the case of Curve25519, the proposed optimizations behave as expected: the two-thread approach with $w = 2$ is 14% faster than the double-and-add with $w = 2$, and the optimized two-thread approach has a speed-up of 17%. The speed-up is smaller than the one predicted by the value of $\alpha$. This might be due to the thread management and to the penalty of the costly square-root computation in the case of the optimized two-thread approach. Our approach compares favorably with the codes of Langley and Bernstein, but it does not compare favorably with the timings of Hamburg. However, the approach of Hamburg involves a smaller field and also a smaller key length.

5 CONCLUSION

We have presented in this paper parallel approaches to speed up the scalar multiplication in $E(\mathbb{F}_p)$ and $E(\mathbb{F}_{2^m})$. The proposed parallelizations split the scalar into two or three parts. Then each part of the scalar multiplication is performed in parallel, the upper part requiring an additional sequence of doublings or halvings. These approaches have been implemented on an Intel Core i7 and the resulting timings show that the proposed parallelizations are effective for the NIST curve B409 and for the twisted Edwards curve Curve25519 defined over $\mathbb{F}_p$ with $p = 2^{255} - 19$.

REFERENCES

Explicit formula database. http://www.hyperelliptic.org/EFD/.

The GNU Multiple Precision Arithmetic Library (GMP). http://gmplib.org/.

Bernstein, D. (2006). Curve25519: New Diffie-Hellman Speed Records. In PKC 2006, volume 3958 of LNCS, pages 207–228.

Bernstein, D. and Lange, T. (2012). eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://bench.cr.yp.to/. accessed May 25th, 14.

Gallant, R., Lambert, R., and Vanstone, S. (2001). Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms. In CRYPTO 2001, volume 2139 of LNCS, pages 190–200.

Hamburg, M. Fast and compact elliptic-curve cryptography. Cryptology ePrint Archive, Report 2012/309. http://eprint.iacr.org/.

Hankerson, D., Menezes, A., and Vanstone, S. (2004). Guide to Elliptic Curve Cryptography. Springer.

Hisil, H., Wong, K. K.-H., Carter, G., and Dawson, E. (2008). Twisted Edwards Curves Revisited. In ASIACRYPT, pages 326–343.

Kim, K. and Kim, S. A New Method for Speeding Up Arithmetic on Elliptic Curves over Binary Fields. Technical report. http://eprint.iacr.org/.

Knudsen, E. W. (1999). Elliptic Scalar Multiplication Using Point Halving. In ASIACRYPT'99, volume 1716 of LNCS, pages 135–149.

Langley, A. (2008). C25519 code. http://code.google.com/p/curve25519-donna/.

Longa, P. and Sica, F. (2014). Four-Dimensional Gallant-Lambert-Vanstone Scalar Multiplication. J. Cryptology, 27(2):248–283.

López, J. and Dahab, R. (1998). Improved Algorithms for Elliptic Curve Arithmetic in GF(2^n). In SAC'98, volume 1556 of LNCS, pages 201–212. Springer.

Montgomery, P. (1987). Speeding up the Pollard and elliptic curve methods of factorization. Math. of Comp., 48:263–264.

Negre, C. and Robert, J.-M. (2013). Impact of Optimized Field Operations ab, ac and ab + cd in Scalar Multiplication over Binary Elliptic Curve. In AFRICACRYPT, pages 279–296.

Oliveira, T., López, J., Aranha, D., and Rodríguez-Henríquez, F. (2014). Two is the fastest prime: lambda coordinates for binary elliptic curves. J. Crypt. Eng., 4(1):3–17.

P. Gallagher, D. D. and Furlani, C. (2009). Digital Signature Standard (DSS). In FIPS Publications, volume FIPS 186-3, page 93. NIST.

Taverne, J., Faz-Hernández, A., Aranha, D. F., Rodríguez-Henríquez, F., Hankerson, D., and López, J. (2011). Speeding Scalar Multiplication over Binary Elliptic Curves using the New Carry-Less Multiplication Instruction. J. Crypt. Eng., 1(3):187–199.