Speeding Up the Computation of Elliptic Curve Scalar Multiplication based on CRT and DRM

Mohammad Anagreh (1,2), Eero Vainikko and Peeter Laud

(1) Institute of Computer Science, University of Tartu, J. Liivi 2, Tartu, Estonia

(2) Cybernetica, Mäealuse 2/1, Tallinn, Estonia

Keywords:

ECC, Parallel Computing, CRT, DRM.

Abstract:

In this paper, we study parallel implementations of elliptic curve scalar multiplication over prime fields using signed binary representations. Our implementation speeds up the calculation of scalar multiplication in comparison with the standard case. We introduce parallel algorithms for computing elliptic curve scalar multiplication based on representing the scalar by the Complementary Recoding Technique (CRT) and the Direct Recoding Method (DRM). Implementations of both proposed algorithms show speed-ups of up to 60% compared with the execution time of their sequential counterparts. We find that ECC-DRM is faster than ECC-CRT in both the parallel and the sequential versions.

1 INTRODUCTION

Elliptic curve cryptosystems (ECC) were independently proposed by Koblitz (Koblitz, 1987) and Miller (Miller, 1986). They are widely used in many cryptographic primitives and protocols such as asymmetric encryption, digital signatures and key exchange. One of the most important advantages of ECC is its suitability for settings with limited memory resources, such as portable devices, because it uses shorter keys. ECC offers a high level of security with shorter key sizes than other existing algorithms such as RSA (Rivest et al., 1978). The minimum ECC key size of 160 bits has the same security level as a standard RSA key size of 1024 bits (Gura et al., 2004). Computing the scalar multiplication is an expensive operation in the elliptic curve cryptosystem. Elliptic curve scalar multiplication is the operation of successively adding a point on an elliptic curve to itself d times: Q = dP, where P = (x, y) is a given point on the elliptic curve. The multiplication algorithms typically work on the binary representation of d. Therefore, many researchers have focused on speeding up scalar multiplication, either by proposing new related algorithms such as signed binary representations, or by improving the calculation method itself, for example through parallel computation. The Hamming weight (HW) of a (signed) binary representation of d is the number of non-zero digits in it. The number of doubling operations in an elliptic curve scalar multiplication depends on the length n of the binary representation of d, while the number of additions depends on its Hamming weight.

Reducing the number of non-zero digits in the representation of the scalar d reduces the number of addition operations in the ECC scalar multiplication. Therefore, a representation with lower HW is preferred. Several researchers have proposed methods to convert the binary representation to a signed binary representation in order to reduce the Hamming weight of the representation of d. These representations include the Mutual Opposite Form (MOF) (Okeya et al., 2004), the Joint Sparse Form (JSF) (Solinas, 2001) and the Non-Adjacent Form (NAF) (Booth, 1951). In this paper, we consider the Complementary Recoding Technique (CRT) (Balasubramaniam and Kathikeyan, 2007), which was enhanced by the Direct Recoding Method (DRM) (HK and Sanghi, 2010) and other methods (Huang et al., 2010). On the other hand, several methods have been proposed to accelerate the ECC scalar multiplication by parallel computing (Azarderakhsh and Reyhani-Masoleh, 2015) (Asif and Kong, 2017) (Gutub, 2010).

In this paper, we propose algorithms that accelerate the computation of elliptic curve scalar multiplication by parallelizing the scalar multiplication algorithm. The proposed algorithms combine the Add-Subtract scalar multiplication algorithm with the transformation of the scalar d from the binary representation to a signed binary representation. One of our algorithms makes use of the Complementary Recoding Technique (CRT), while the other one is based on the Direct Recoding Method (DRM). For both representations, we consider different ways of scheduling the computation on two processors. Our implementation of the two algorithms shows that the proposed methods are faster than the sequential calculation of the ECC scalar multiplication.

Anagreh, M., Vainikko, E. and Laud, P. Speeding Up the Computation of Elliptic Curve Scalar Multiplication based on CRT and DRM. DOI: 10.5220/0009129501760184. In Proceedings of the 6th International Conference on Information Systems Security and Privacy (ICISSP 2020), pages 176-184. ISBN: 978-989-758-399-5; ISSN: 2184-4356

This paper is organized as follows: Section 2 briefly presents the preliminaries. Section 3 reviews related work, and Section 4 presents the proposed algorithms. Section 5 describes the experiments and presents the results. The last section concludes the paper and discusses future work.

2 PRELIMINARIES

2.1 Elliptic Curves over Prime Fields $F_p$

In this paper, we focus on curves over prime fields $F_p$. These curves are defined through the cubic equation given in Equation (1), with Cartesian coordinate variables (x, y) and coefficients (a, b) as elements of $F_p$. All the values can be considered integers that are computed modulo the prime number p. The cubic equation with coefficients (a, b) and variables (x, y) for the elliptic curves over $F_p$ is the following:

$y^2 = (x^3 + ax + b) \bmod p$    (1)

Let the points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ be on the elliptic curve over $F_p$ defined by the coefficients (a, b). In addition, let $O$ be the point at infinity. The rules for the addition operation on the EC are as follows:

$P + O = P$    (2)

Given points P and Q, if $x_1 = x_2$ and $y_1 = -y_2$, then

$P + Q = O$    (3)

In general, R = Q + P, where the result $R = (x_3, y_3)$ is defined as follows:

$x_3 = \lambda^2 - x_1 - x_2 \bmod p$    (4)

$y_3 = \lambda(x_1 - x_3) - y_1 \bmod p$    (5)

$\lambda = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} \bmod p, & \text{if } P \neq Q \\ \dfrac{3x_1^2 + a}{2y_1} \bmod p, & \text{if } P = Q \end{cases}$    (6)

In summary, for any two points P, Q on a given elliptic curve, there are two main operations. The operation R = P + Q when P ≠ Q is called point addition, and R = 2P is called point doubling. The addition operation has 5 sub-operations: 2 squarings, 2 multiplications and 1 inversion. Consequently, for a non-negative integer d, it is possible to define the scalar point multiplication Q = dP on the elliptic curve through the application of doubling and adding operations, as illustrated in Figure 1.

Figure 1: Adding and doubling points on EC.
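The addition rules in Equations (1) to (6) can be sketched in a few lines of Python. The toy parameters (p = 97, a = 2, b = 3), the sample point (3, 6) and the names `on_curve` and `ec_add` are illustrative assumptions, not values taken from the paper; `None` stands for the point at infinity O, and the three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
# Affine point arithmetic on y^2 = x^3 + a*x + b over F_p.
# Toy parameters chosen for illustration only (not from the paper).
P_MOD, A, B = 97, 2, 3


def on_curve(pt):
    """Return True if pt satisfies Equation (1); None is the point at infinity."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def ec_add(p1, p2):
    """Return p1 + p2 following Equations (2) to (6)."""
    if p1 is None:                      # O + Q = Q
        return p2
    if p2 is None:                      # P + O = P, Equation (2)
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                     # P + (-P) = O, Equation (3)
    if p1 == p2:                        # point doubling
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                               # point addition
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD          # Equation (4)
    y3 = (lam * (x1 - x3) - y1) % P_MOD         # Equation (5)
    return (x3, y3)


if __name__ == "__main__":
    P = (3, 6)
    print(on_curve(P), ec_add(P, P))
```

The modular inverse plays the role of the division in Equation (6); it is the single inversion counted among the sub-operations of a point addition.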

2.2 Signed Binary Representation

A signed binary representation of d is a vector $(d_0, \ldots, d_{n-1})$, where $\sum_{i=0}^{n-1} d_i 2^i = d$ and each $d_i$ is an element of {−1, 0, 1}. Aiming to reduce the Hamming weight of the representation, a number of different signed binary representations have been proposed, including MOF (Okeya et al., 2004), NAF (Booth, 1951), CRT (Balasubramaniam and Kathikeyan, 2007), DRM (HK and Sanghi, 2010) and others.

In the following, we denote $\bar{1} = -1$, $\bar{0} = 0$, and $\overline{-1} = 1$.

2.2.1 Complementary Recoding Technique (CRT)

CRT is one of the techniques to convert a number to a canonical signed binary representation that reduces the Hamming weight (Balasubramaniam and Kathikeyan, 2007). If d denotes an n-bit integer, as well as its (usual) binary representation, then its CRT representation is $\sum_{i=0}^{n-1} d_i 2^i = (100\ldots0)_{(n+1)\text{ bits}} - \bar{d} - 1$, where $\bar{d} = 2^n - 1 - d$ denotes the binary complement of d. This conversion is very simple and efficient, and has low time complexity in comparison with other methods (HK and Sanghi, 2010).

Example 1: Let d = 7327; its binary representation is $(1110010011111)_2$. Converting the binary representation to a signed binary representation by applying CRT gives $d = \sum_{i=0}^{n-1} d_i 2^i = (100\ldots0)_{(n+1)\text{ bits}} - \bar{d} - 1 = (10000000000000)_2 - (0001101100000)_2 - 1 = (1000\bar{1}\bar{1}0\bar{1}\bar{1}0000\bar{1})$. Indeed, converting the signed binary representation to decimal, we get $(1000\bar{1}\bar{1}0\bar{1}\bar{1}0000\bar{1}) = 8192 - 512 - 256 - 64 - 32 - 1 = 7327 = d$.

The Hamming weight of the binary representation of 7327 is 9, while the Hamming weight of the signed binary representation using CRT is 6. A smaller Hamming weight reduces the number of operations needed to calculate the EC scalar multiplication.
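The CRT formula can be tried out with a short Python sketch. The function names `crt_recode` and `signed_value` are hypothetical, and digits are returned least significant first. One step is an assumption of this sketch rather than something stated in the text: when d is even, the trailing −1 lands on a position that already holds $\bar{1}$, and the sketch resolves the resulting −2 by borrowing into the next position.

```python
def crt_recode(d):
    """Signed digits of d (least significant first) via d = 2^n - comp - 1."""
    if d <= 0:
        raise ValueError("d must be a positive integer")
    n = d.bit_length()
    comp = (1 << n) - 1 - d                 # binary complement of d on n bits
    # Digits of 2^n - comp: -1 wherever comp has a 1, and +1 at position n.
    digits = [-((comp >> i) & 1) for i in range(n)] + [1]
    digits[0] -= 1                          # the trailing "- 1" of the formula
    # Assumption of this sketch: a digit -2 at position i is rewritten as
    # 0 at position i and an extra -1 at position i + 1.
    for i in range(n):
        if digits[i] == -2:
            digits[i] = 0
            digits[i + 1] -= 1
    return digits


def signed_value(digits):
    """Integer value of a signed-digit vector given least significant first."""
    return sum(di * (1 << i) for i, di in enumerate(digits))


if __name__ == "__main__":
    rep = crt_recode(7327)
    print(rep[::-1])                        # most significant digit first
    print(signed_value(rep), sum(1 for x in rep if x != 0))
```

For odd d, such as the value 7327 from Example 1, no borrow occurs and the digits follow the formula directly.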

2.2.2 Direct Recoding Method (DRM)

DRM is another method for converting to a signed binary representation (HK and Sanghi, 2010). This method is based on CRT but has lower time complexity, because it uses only the single operation of bitwise subtraction with $0 - 1 = \bar{1}$. Also, DRM generally results in a smaller Hamming weight than CRT (HK and Sanghi, 2010).

The procedure to convert d to the signed binary representation using DRM is the following. Let p be the integer satisfying $2^p \geq d > 2^{p-1}$ (here p is an exponent, not the field prime). Then $d = (2^p)_2 - (2^p - d)_2$, where the subtraction of the bit 1 from the bit 0 results in $\bar{1}$.

Example 2: Let d = 248. The binary representation of d is $(11111000)_2$. The binary representation is converted to the signed binary representation by applying DRM as follows: $2^8 = (100000000)_2$ and $(2^8 - 248) = (1000)_2$. Then

$d = (100000000)_2 - (1000)_2 = (10000\bar{1}000)$

Indeed, converting the signed binary representation $(10000\bar{1}000)$ obtained by applying DRM to decimal gives d = 256 − 8 = 248. The Hamming weight of the binary representation of 248 is 5, while the Hamming weight of the signed binary representation using DRM is 2. So, the conversion brings savings during the calculation of the EC scalar multiplication. Note that the signed binary representation of 248 using CRT is (1000001111), which has the Hamming weight 5.
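The DRM procedure reduces to one subtraction and a digit-wise negation, which the following Python sketch illustrates. The function name `drm_recode` and the least-significant-first digit order are assumptions made for this example.

```python
def drm_recode(d):
    """Signed digits of d (least significant first) via d = 2^p - (2^p - d)."""
    if d <= 0:
        raise ValueError("d must be a positive integer")
    p = (d - 1).bit_length()        # smallest p with 2^p >= d > 2^(p-1)
    m = (1 << p) - d                # the amount subtracted from 2^p
    # Each 1-bit of m becomes the digit -1; the leading digit at position p is +1.
    return [-((m >> i) & 1) for i in range(p)] + [1]


if __name__ == "__main__":
    rep = drm_recode(248)
    print(rep[::-1])                                   # most significant first
    print(sum(di * (1 << i) for i, di in enumerate(rep)))
```

For d = 248 the sketch yields one digit 1 at position 8 and one digit −1 at position 3, matching the representation derived in Example 2.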

2.3 ECC Scalar Multiplication

Scalar multiplication is one of the main operations in ECC. It is built up from two main operations: the addition of points and the doubling of a point. The scalar d is an integer that has to be represented in (signed) binary. The occurrence of a bit 1 in the representation corresponds to the operation of adding two points. There are approximately n/2 such additions in a scalar multiplication. On the other hand, the number of doubling operations is n − 1. In the case of a signed binary representation, the third digit, $\bar{1}$, is processed by a subtraction operation. Algorithm 1 is the Adding-Subtracting Scalar Multiplication Algorithm, which computes the elliptic curve scalar multiplication for a scalar $d = \sum_{i=0}^{n-1} d_i 2^i$ represented either in binary ($d_i \in \{0, 1\}$) or in signed binary ($d_i \in \{\bar{1}, 0, 1\}$).

Algorithm 1: Adding-Subtracting Scalar Multiplication.

Data: Point P on the EC, a string of signed bits $(d_0, \ldots, d_{n-1})$
Result: Q = dP
begin
    Q ← O, R ← P
    for i = 0 to n − 1 do
        if ($d_i$ = 1) then
            Q ← Q + R
        else if ($d_i$ = $\bar{1}$) then
            Q ← Q − R
        end
        R ← 2R
    end
    return Q
end
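Algorithm 1 can be written generically over any additive group, as in the Python sketch below. To keep the block short and checkable, the usage example substitutes the integers under addition for the curve group; this stand-in, the function name `add_sub_scalar_mult` and its parameters are assumptions of the sketch. On a real curve, `add` would be point addition, `neg` point negation and `zero` the point at infinity.

```python
def add_sub_scalar_mult(digits, point, zero, add, neg):
    """Right-to-left add/subtract scalar multiplication (Algorithm 1).

    digits: signed digits in {-1, 0, 1}, least significant first.
    add, neg, zero: the group operations and identity element.
    """
    q, r = zero, point
    for di in digits:
        if di == 1:
            q = add(q, r)           # Q <- Q + R
        elif di == -1:
            q = add(q, neg(r))      # Q <- Q - R
        r = add(r, r)               # R <- 2R, executed for every digit
    return q


if __name__ == "__main__":
    # Integers under addition stand in for the elliptic curve group here.
    drm_digits_248 = [0, 0, 0, -1, 0, 0, 0, 0, 1]      # 256 - 8
    result = add_sub_scalar_mult(drm_digits_248, 5, 0,
                                 lambda a, b: a + b, lambda a: -a)
    print(result)                   # 248 * 5
```

The doubling runs once per digit, while additions and subtractions run only for non-zero digits, which is why a lower Hamming weight saves work.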

The example below shows how to find the ECC scalar multiplication for a small scalar d.

Example 3: Finding the ECC scalar multiplication for d = 115.

First, convert the integer d to binary, so $d = (1110011)_2$.

Then, find the ECC scalar multiplication based on the scalar d from right to left, as illustrated in Figure 2.

Figure 2: Finding ECC Scalar Multiplication.

3 RELATED WORK

Many researchers have worked on improving ECC by speeding up the scalar multiplication. Scalar multiplication can be improved by enhancing existing algorithms or by proposing new ones. Using a signed binary representation of the scalar is an efficient way to reduce the number of non-zero bits in the key. The Hamming weight plays a major role, because it determines the number of addition operations in the scalar multiplication.

In 1951, Booth proposed a new scalar representation called the signed binary representation. There are many methods to represent integers in signed binary, such as NAF, JSF, and MOF. In 2003, Chang et al. (Chang et al., 2003) proposed a new method to compute general multiplication, which is the result of using NAF, MOF, and JSF. Different researchers have proposed methods to calculate the scalar multiplication in parallel using the binary or a signed binary representation.

Anagreh et al. (Anagreh et al., 2014) proposed a parallel method to compute scalar multiplication based on the Mutual Opposite Form (MOF). They derived a new algorithm that combines the Adding-Subtracting Scalar Multiplication Algorithm and MOF. They used two processors to perform the parallel calculation: the method computes the doubling operations on one processor and the adding operations on the other processor at the same time. The proposed method computes the scalar multiplication without performing the MOF conversion. Instead, it performs a comparison operation on the given bit-string d to decide whether the second processor has to add or subtract the doubled point in the case of the non-zero digits $\{1, \bar{1}\}$. The proposed method achieves a speed-up of 90% over the sequential version of the ECC scalar multiplication with MOF.

Negre and Robert (Negre and Robert, 2015) proposed a new parallel approach for computing the scalar multiplication. They split the NAF-based scalar multiplication into two parts for the prime field $F_p$ and three parts for the binary field $F_{2^m}$. In their method, the doublings and the additions (or subtractions) are performed in separate threads. In the case of prime fields, the operations of the scalar multiplication are split into two parts, based on representing d as $d = k_1 + 2^t k_2$. The first part, $Q_1 = k_1 P$, is performed in the first thread. The second part, $Q_2 = 2^t k_2 P$, is performed in the second thread. The scalar multiplication is then given by $Q = Q_1 + Q_2$: the two points $Q_1$ and $Q_2$ are added to obtain Q. The proposed method reduced the computation time of the scalar multiplication by at least 10%.

Robert (Robert, 2014) proposed a software implementation for computing the ECC scalar multiplication. The method uses two threads to perform the parallel calculation, and four threads for various elliptic curves over the prime field $F_p$. Two algorithms are used in this work: Double-and-add and Half-and-add. The doubling operations are placed in one thread (the producer), while the addition and subtraction operations are placed in another thread (the consumer). A single mutex is used at the beginning of the computation, in order to avoid mutex synchronization as much as possible. The goal of the mutex is to keep the consumer inactive at the beginning of the processing while the producer performs the doubling operations. The method shows some violations of the read-after-write dependency. Such a memory violation may happen because the first batch of points, produced before the mutex is released, is too small, or because of a long sequence of zeros in the binary or NAF representation of the scalar. The results show an error rate below 1%, which is still not acceptable. To eliminate this problem, a variable in global memory is used as a loop counter. This adds an extra operation to the scheme, which has an impact on the execution time of the parallel version. The NAF conversion is not part of the parallel section. The results show an improvement of up to 15% in comparison with the sequential version.

Phalakarn et al. (Phalakarn et al., 2018) proposed a new representation for right-to-left parallel elliptic curve scalar multiplication. Their mathematical model reduces the calculation time of the ECC scalar multiplication. The authors proposed algorithms that generate the representations, which reduce the execution time of the scheme. Three processors are used to perform the whole calculation. Two processors perform the doublings of P and Q. The third processor performs the addition operations using two binary representations m and n. The issue of communication between the processors in the model is still open, and it may increase the running time because it is an extra operation.

Anagreh et al. (Anagreh et al., 2019) introduced an algorithm to compute the ECC scalar multiplication based on the NAF representation. They used two processors to perform the whole calculation in parallel. The first processor performs the doubling operations, while the second processor performs the NAF conversion and the addition (or subtraction) operations at the same time. Shared memory is used to transmit the doubled points from the first processor to the second processor. The second processor performs the NAF conversion before starting the addition or subtraction operations. This method eliminates the use of mutexes, as the consumption of doubled points by the second processor will not overtake their production by the first processor. The results show a speed-up of 60% over the standard version of the ECC calculation based on NAF.


4 PARALLEL ALGORITHM

Reducing the execution time of the scalar multiplication by applying an efficient method is desirable.

In this work, we propose and compare two parallel algorithms to calculate the scalar multiplication based on signed binary representations. We derive both algorithms by combining the Add-Subtract Scalar Multiplication Algorithm with a method for converting the scalar from the binary representation to a signed binary representation; the conversion methods are CRT and DRM. The first algorithm is based on circular buffers, and the second is based on the delayed consumption of doubled points. The first algorithm optimizes the inter-processor communication costs, while the second algorithm optimizes the synchronization costs.

4.1 Algorithm based on Circular Buffers

In our first parallel algorithm, we use a circular buffer to transmit the processed data between the two processors. The circular buffer resides in shared memory, which both processors can access at any time for reading and writing. Processor-1 writes the doubled point R and the corresponding digit $d_i$ of the scalar to a specific location in the circular buffer. Processor-2 reads the doubled point R and the digit $d_i$ from the circular buffer to perform the addition or subtraction operation. The circular buffer has two pointers, front and rear, to organize the reading and writing operations. In each iteration, writing takes place at the location pointed to by the front pointer, $Push_{front}$, and reading takes place at the location pointed to by the rear pointer, $Pull_{rear}$, where front > rear for all writing and reading operations in the scheme. Keeping this order is essential to avoid any corruption of the calculation. In addition, we use two predicates, is-full() and is-not-empty(), to check the state of the circular buffer before a write or a read. If the circular buffer is full, Processor-1 keeps cycling without performing any operation until there is an empty location in the circular buffer, and then writes the point and the digit to that location. The second predicate is used by Processor-2 before performing the addition or subtraction operations. The number of writing operations performed by Processor-1 depends on the number of bits n in the scalar d, while the number of reading operations performed by Processor-2 depends on the number of non-zero digits $\{1, \bar{1}\}$ in the scalar d.

A task decomposition strategy is applied in our parallel implementation of the scalar multiplication, Algorithm 2. We use two processors to perform the multiplication. Processor-1 is responsible for performing three subtasks; see the Processor-1 section of Algorithm 2.

Algorithm 2: Parallel Scalar Multiplication based on circular buffers and signed binary representations.

Data: Integer d, point P on the EC
Result: Q = dP, based on a signed binary representation
begin
    Processor 1: signed binary conversion, doubling operations
    begin
        R ← P
        REP = Convert_to_signed_binary(d)
        for i = 0 to n − 1 do
            repeat
            until ¬ buffer_is_full()
            if $REP_i$ ≠ 0 then
                Push(R, $REP_i$)
            end
            R ← 2R
        end
        Push(0, 0)
    end
    Processor 2: addition operations
    begin
        Q ← O
        repeat
            if buffer_is_not_empty() then
                Pull(R, $d_i$)
                if $d_i$ = 1 then
                    Q ← Q + R
                else if $d_i$ = $\bar{1}$ then
                    Q ← Q − R
                end
            end
        until $d_i$ = 0
        return Q
    end
end
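The producer/consumer structure of Algorithm 2 can be illustrated with Python threads. This is a sketch of the data flow only, under several assumptions: a bounded `queue.Queue` stands in for the circular buffer, its blocking `put` and `get` replace the busy-wait checks on is-full() and is-not-empty(), the integers under addition stand in for the curve group in the usage example, and the name `buffered_scalar_mult` is hypothetical. Because of Python's global interpreter lock, the sketch does not reproduce the speed-ups measured with the C++ implementation.

```python
import queue
import threading


def buffered_scalar_mult(digits, point, zero, add, neg, capacity=8):
    """Two-thread scalar multiplication with a bounded buffer (Algorithm 2)."""
    buf = queue.Queue(maxsize=capacity)     # stand-in for the circular buffer
    result = []

    def processor_1():
        r = point
        for di in digits:                   # digits: least significant first
            if di != 0:
                buf.put((r, di))            # blocks while the buffer is full
            r = add(r, r)                   # R <- 2R for every digit
        buf.put((None, 0))                  # sentinel pair: no more points

    def processor_2():
        q = zero
        while True:
            r, di = buf.get()               # blocks while the buffer is empty
            if di == 0:
                break                       # sentinel received
            q = add(q, r) if di == 1 else add(q, neg(r))
        result.append(q)

    threads = [threading.Thread(target=processor_1),
               threading.Thread(target=processor_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    digits = [0, 0, 0, -1, 0, 0, 0, 0, 1]   # DRM digits of 248
    print(buffered_scalar_mult(digits, 5, 0, lambda a, b: a + b, lambda a: -a))
```

Only points that belong to non-zero digits cross the buffer, so the traffic between the two threads is proportional to the Hamming weight of the representation.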

The first task is the conversion of the scalar d to a signed binary representation with digits $\{\bar{1}, 0, 1\}$, using one of the conversion algorithms discussed in Sec. 2.2. In our experiments, we have considered the CRT and DRM representations. The second task is calculating the doubling operations on the elliptic curve, based on the number of bits n in the scalar d, where the point P = (x, y) on the elliptic curve is given. The doubling function is called n times, where n is the number of digits in the signed binary representation, regardless of whether the digit is 1, 0 or $\bar{1}$. The last task performed by Processor-1 is writing the doubled point R and the digit $REP_i$ to an empty location in the circular buffer. As explained above, the circular buffer is in shared memory, and both processors can access the shared data for reading or writing. To indicate that no more points will be pushed into the buffer, Processor-1 finishes by pushing the pair (0, 0).

Processor-2 is responsible for performing three subtasks as well; see the Processor-2 section of Algorithm 2. The first task is reading the doubled point R and the digit $d_i$ from the circular buffer. Note that each doubled point has a specific digit $d_i$, and the two are stored together in the circular buffer to keep the sequence of the doubling operations $P, 2P, 4P, 8P, \ldots, 2^{n-1}P$. The second task is performing the addition or subtraction operations based on the non-zero digits of the scalar d. If the digit $d_i$ is 1, Processor-2 has to perform the addition operation. If the digit $d_i$ is $\bar{1}$, Processor-2 has to perform the subtraction operation, which is the third task of Processor-2. The result of each addition or subtraction is saved in the accumulator Q, which holds the final result of the EC scalar multiplication. The circular buffer organizes the transmission of data between the two processors in the whole scheme. The data to be transmitted from Processor-1 to Processor-2 is located in shared memory: Processor-1 writes to the circular buffer, while Processor-2 reads the stored data from it. Every time Processor-1 is about to write to the circular buffer, it has to check that the buffer is not full and that there is an empty location for the doubled point R and the digit $d_i$.

If the circular buffer is full, Processor-1 has to keep looping until a location becomes available. Processor-2 has to check every time that Processor-1 has stored new data in the circular buffer. It then reads the data and performs the addition or subtraction operation according to the digit $d_i$.

4.2 Algorithm based on Delayed Consumption

Compared to Alg. 2, the proposed Algorithm 3 moves the task of doing the signed binary conversion of d from Processor-1 to Processor-2. Hence, Processor-1 only computes the point doublings. These are stored in the array $R = (R_0, \ldots, R_{n-1})$, which has to be kept in shared memory. All the points of R are computed regardless of whether the corresponding digit $d_i$ is zero or non-zero.

Processor-2 reads the elements of R and either adds them to or subtracts them from the accumulated value Q, according to the signed binary representation of the scalar d. Processor-2 performs the signed binary conversion first, while Processor-1 performs the doubling operations and saves the points $R_i$ in shared memory. Once Processor-2 has finished the conversion, it starts reading the points $R_i$ to perform the addition and subtraction operations.

Algorithm 3: Parallel Scalar Multiplication based on the delayed consumption of doubled points.

Data: Integer d, point P on the EC
Result: Q = dP, based on a signed binary representation
begin
    Processor 1: doubling operations
    begin
        $R_0$ ← P
        for i = 1 to n − 1 do
            $R_i$ ← 2$R_{i-1}$
        end
    end
    Processor 2: signed binary conversion, addition operations
    begin
        Q ← O
        $(d_0, \ldots, d_{n-1})$ = Convert_to_signed_binary(d)
        for i = 0 to n − 1 do
            if $d_i$ = 1 then
                Q ← Q + $R_i$
            else if $d_i$ = $\bar{1}$ then
                Q ← Q − $R_i$
            end
        end
        return Q
    end
end
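The delayed-consumption scheme of Algorithm 3 can be sketched with two Python threads and a shared list. Several parts are assumptions of this sketch: the semaphore is added as a safety net so that a point is never read before it is written (the scheme described in the text relies on the conversion delay instead), the shared array is sized as the bit length of d plus one so that it fits both CRT and DRM digit vectors, DRM is used as the default recoding, and the integers under addition stand in for the curve group in the usage example. All function names are hypothetical.

```python
import threading


def drm_recode(d):
    """DRM digits of d, least significant first: d = 2^p - (2^p - d)."""
    p = (d - 1).bit_length()
    m = (1 << p) - d
    return [-((m >> i) & 1) for i in range(p)] + [1]


def delayed_scalar_mult(d, point, zero, add, neg, recode=drm_recode):
    """Two-thread scalar multiplication with a shared array (Algorithm 3)."""
    n = d.bit_length() + 1              # slots for R_0 .. R_{n-1}
    shared = [None] * n                 # shared array of doubled points
    ready = threading.Semaphore(0)      # counts the points written so far
    result = []

    def processor_1():
        r = point
        for i in range(n):
            shared[i] = r               # R_i = 2^i * P
            ready.release()             # signal that R_i is available
            r = add(r, r)

    def processor_2():
        digits = recode(d)              # conversion runs while doublings proceed
        q = zero
        for i, di in enumerate(digits):
            ready.acquire()             # wait until R_i has been written
            if di == 1:
                q = add(q, shared[i])
            elif di == -1:
                q = add(q, neg(shared[i]))
        result.append(q)

    threads = [threading.Thread(target=processor_1),
               threading.Thread(target=processor_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(delayed_scalar_mult(248, 5, 0, lambda a, b: a + b, lambda a: -a))
```

Unlike the buffered variant, every doubled point is written to shared memory, and the consumer simply skips the entries whose digit is zero.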

Again, in our experiments, we have considered

both the CRT and DRM conversion methods in order

to compute a signed binary representation of d.


5 EXPERIMENTAL EVALUATION

5.1 Algorithm based on Circular Buffers

In summary, the proposed method derives a new algorithm that combines two components: Add-Subtract Scalar Multiplication and a method that produces a signed binary representation of the scalar. The resulting algorithm, given in Algorithm 2, is then executed in parallel. We implemented the algorithm with either the CRT or the DRM method, in two versions of the code, parallel and sequential. The evaluation compares the parallel and sequential versions for both the CRT and the DRM method.

As with almost all parallel applications, it is important to produce the best sequential code before starting to parallelize it. A task decomposition strategy is used to divide the work between two processors. Both the sequential and the parallel code are written in Visual C++ .NET. We use the OpenMP library supported by Visual C++ .NET to write the parallel section of the parallel version. In addition, we use the ttmath library for C++ to define big integers (1024 bits or more). We tested both versions (parallel and sequential) on an Intel Core i5 7th-generation machine running Windows 10. We ran each key size 10 times and took the average execution time for each key size, as shown in Figures 3 and 4.

In the implementation, we tested six different key sizes for both algorithms in both the parallel and the sequential case: 160, 192, 224, 256, 384 and 521 bits. We randomly generated a big integer for each key size used in the implementation. The same number is used in both the parallel and the sequential version, and it determines the number of addition and subtraction operations.

Figure 3: Execution time for Algorithm 2 using CRT.

The execution times of the serial and parallel versions for the different ECC key sizes are shown in the figures. For the CRT encoding method, the difference between the serial and the parallel time is large for the key sizes of 521, 192 and 160 bits, as shown in Figure 3. The speed-up reaches 60% in comparison with the serial version for the same key size.

Figure 4: Execution time for Algorithm 2 using DRM.

For the DRM encoding method, the difference between the serial and the parallel execution time is significant for the 192-bit and 160-bit key sizes. The speed-up is 60% in comparison with the execution time of the serial version for the same key size.

Each test uses a randomly generated key to perform the scalar multiplication. For each key size, the same key is used in both the parallel and the serial version.

Figure 5: Speed up and Efﬁciency for CRT.

The number of non-zero digits in the key affects the calculation of the ECC scalar multiplication. Each occurrence of the digit 1 or $\bar{1}$ means that Processor-2 performs an addition or a subtraction operation. The average number of non-zero digits in the key is around 50% or less because of the use of the signed binary representation.

Figure 6: Speed up and Efﬁciency for DRM.


The execution time of one addition (or subtraction) operation is around two and a half times the execution time of a doubling operation, so an addition is much more costly than a doubling. Therefore, even though non-zero digits make up only around 50% of the key, Processor-1 does not necessarily do more work than Processor-2. Note that one addition operation has 5 sub-operations, namely 2 squarings, 2 multiplications and 1 inversion, which makes addition an expensive operation in comparison with doubling. Therefore, performing the scalar multiplication by this method provides some degree of load balancing. The efficiency of the whole calculation for the different key sizes is around 70% to 80%; see Figures 5 and 6.
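The load-balancing argument and the reported efficiency can be related by a rough back-of-the-envelope model. The model assumes the usual definitions of speed-up and efficiency, which the text does not state explicitly, and it ignores communication and conversion costs. The symbols $T_{seq}$, $T_{par}$, $T_D$, $T_A$, $W_1$, $W_2$ and $h$ (the number of non-zero digits) are introduced here for illustration only.

```latex
% Assumed standard definitions for two processors
S = \frac{T_{seq}}{T_{par}}, \qquad E = \frac{S}{2}

% Idealized work per processor for an n-digit scalar with h non-zero digits,
% using the measured ratio T_A \approx 2.5\, T_D
W_1 \approx n\, T_D, \qquad W_2 \approx h\, T_A \approx 2.5\, h\, T_D

% With h \approx n/2 the two workloads are of comparable size
W_2 \approx 1.25\, n\, T_D
```

Under these assumptions, an efficiency of 70% to 80% on two processors corresponds to the parallel version running about 1.4 to 1.6 times as fast as the serial one, which is consistent with a speed-up of up to 60%.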

5.2 Algorithm based on Delayed Consumption

In Alg. 2, Processor-1 has to find a signed binary representation of d and perform the doubling operations. Then, for the non-zero digits of the signed binary representation, it saves the doubled points in the circular buffer. The doubled points saved in the circular buffer are read by Processor-2 to perform the addition (or subtraction) operations. In Alg. 3, the signed binary representation is computed by Processor-2. In this method, Processor-1 has to perform the doubling operations and save all doubled points in shared memory, no matter whether the corresponding digit of the signed binary representation is zero or non-zero. The number of writing operations to shared memory is the same as the length of the scalar d. Processor-2 has to read all saved points from shared memory. It also has to decide whether the doubled point should be added or subtracted, based on the CRT or DRM representation. If the digit is 1, it performs the addition operation; if the digit is −1, it performs the subtraction operation; and if the digit is 0, it drops the point and keeps reading. Figure 7 shows the benchmarking results of Alg. 3 for both the CRT and the DRM implementation. In general, the results show that the second algorithm is less efficient than the first, especially for small key sizes.

Figure 7: Execution time for the second method.

Figure 8: Execution time for CRT and DRM.

DRM is a low-cost operation in comparison with CRT and other conversion methods. The time complexity of the conversion by DRM is lower than that of the conversion by CRT. In addition, the number of non-zero bits in the signed binary representation produced by DRM is smaller than in the representations produced by CRT and other standard methods. As mentioned above in Example 2, the Hamming weight of the DRM representation is 2, which is less than the Hamming weight of the CRT representation. A lower Hamming weight reduces the time needed to compute the ECC scalar multiplication in comparison with other representations. Figure 5 shows the difference in execution time between DRM and CRT for both the serial and parallel versions. Over all key sizes, and in both the serial and parallel versions, scalar multiplication based on the DRM representation is faster than scalar multiplication based on the CRT representation.
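The link between Hamming weight and cost can be illustrated with a small sketch: in a right-to-left method, every non-zero digit triggers one point addition or subtraction, so a representation with fewer non-zero digits needs fewer such operations. The recoding below is a simplified stand-in based on the identity d = 2^k - (2^k - d), not the paper's exact CRT or DRM procedure. The scalar 123 and all function names are chosen for illustration only.

```python
def binary_digits(d):
    """Plain binary digits of d, least significant first."""
    return [(d >> i) & 1 for i in range(d.bit_length())]


def complement_digits(d):
    """Signed digits from d = 2^k - c, c = 2^k - d (illustrative recoding)."""
    k = d.bit_length()
    c = (1 << k) - d
    return [-((c >> i) & 1) for i in range(k)] + [1]


def hamming_weight(digits):
    """Number of non-zero digits, i.e. additions/subtractions required."""
    return sum(1 for x in digits if x != 0)


d = 0b1111011  # 123
print(hamming_weight(binary_digits(d)))      # 6
print(hamming_weight(complement_digits(d)))  # 3
```

For this scalar the signed form needs three additions or subtractions instead of six, at the cost of one extra doubling for the additional leading digit.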

6 CONCLUSION

In this work, we proposed two algorithms for computing the ECC scalar multiplication: the first is based on the CRT representation and the second on the DRM representation. We also proposed a parallel algorithm that performs the calculation of each of the two algorithms separately using two processors. The results show speed-ups of up to 60% in comparison with the serial version for both algorithms. We also presented the difference in execution time between DRM and CRT. Future work includes using three threads when the number of non-zero bits in the key is higher than usual, which makes the point additions more costly; the third thread will help to reduce the execution time in this case.

REFERENCES

Anagreh, M., Samsudin, A., and Omar, M. A. (2014). Parallel method for computing elliptic curve scalar multiplication based on MOF. In Int. Arab J. Inf. Technol., 11(6).

Anagreh, M., Vainikko, E., and Laud, P. (2019). Accelerate performance for elliptic curve scalar multiplication based on NAF by parallel computing. In ICISSP 2019 - 5th International Conference on Information Systems Security and Privacy. SciTePress.

Asif, S. and Kong, Y. (2017). Highly parallel modular multiplier for elliptic curve cryptography in residue number system. In Circuits, Systems, and Signal Processing, 26(6).

Azarderakhsh, R. and Reyhani-Masoleh, A. (2015). Parallel and high-speed computations of elliptic curve cryptography using hybrid-double multipliers. In IEEE Transactions on Parallel and Distributed Systems, 26(6).

Balasubramaniam, P. and Karthikeyan, E. (2007). Elliptic curve scalar multiplication algorithm using complementary recoding. In Applied Mathematics and Computation, 1(190).

Booth, A. (1951). A signed binary multiplication technique. In Journal of Applied Mathematics, 4.

Chang, C. C., Kuo, Y. T., and Lin, C. H. (2003). Fast algorithms for common-multiplicand multiplication and exponentiation by performing complements. In 17th International Conference on Advanced Information Networking and Applications, pages 807–811. IEEE.

Gura, N., Patel, A., Wander, A., Eberle, H., and Shantz, S. C. (2004). Comparing elliptic curve cryptography and RSA on 8-bit CPUs. In International Workshop on Cryptographic Hardware and Embedded Systems, pages 119–132. Springer.

Gutub, A. (2010). Remodeling of elliptic curve cryptography scalar multiplication architecture using parallel Jacobian coordinate system. In International Journal of Computer Science and Security (IJCSS), 4(4).

HK, P. and Sanghi, M. (2010). Speeding up computation of scalar multiplication in elliptic curve cryptosystem. In International Journal on Computer Science and Engineering, 4(2).

Huang, X., Shah, P. G., and D, S. (2010). Minimizing hamming weight based on 1's complement of binary numbers over GF(2^m). In 2010 The 12th International Conference on Advanced Communication Technology (ICACT), volume 2, pages 1226–1230. IEEE.

Koblitz, N. (1987). Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209.

Miller, V. (1986). Use of elliptic curves in cryptography. In Conference on the Theory and Application of Cryptographic Techniques, number 108 in LNCS, pages 417–426, Berlin, Heidelberg. Springer.

Negre, C. and Robert, J.-M. (2015). Parallel approaches for efficient scalar multiplication over elliptic curve. In SECRYPT: International Conference on Security and Cryptography, pages 202–209. IEEE.

Okeya, K., Schmidt-Samoa, K., Spahn, C., and Takagi, T. (2004). Signed binary representations revisited. In Annual International Cryptology Conference, pages 123–139. Springer.

Phalakarn, K., Phalakarn, K., and Suppakitpaisarn, V. (2018). Optimal representation for right-to-left parallel scalar and multi-scalar point multiplication. In International Journal of Networking and Computing, 8(2).

Rivest, R., Shamir, A., and Adleman, L. (1978). A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2).

Robert, J.-M. (2014). Parallelized software implementation of elliptic curve scalar multiplication. In International Conference on Information Security and Cryptology, pages 445–262. Springer.

Solinas, J. (2001). Low-weight binary representations for pairs of integers. Technical report CORR 2001-41, Center for Applied Cryptographic Research, University of Waterloo, Canada.

ICISSP 2020 - 6th International Conference on Information Systems Security and Privacy
