LOW AREA SCALABLE MONTGOMERY INVERSION OVER GF(2^m)

Mohamed N. Hassan and Mohammed Benaissa

Electronic and Electrical Department, University of Sheffield, Mappin Street, Sheffield, U.K.

Keywords: Cryptography, Finite Fields, ECC over GF(2^m), Montgomery Inversion, FPGA.

Abstract: In this work, an improved algorithm for Montgomery modular inversion over GF(2^m) is proposed. Moreover, a novel scalable hardware architecture for the proposed algorithm is presented, which is parameterizable and amenable to interfacing to special purpose processors such as microcontrollers. The architecture supports operations over finite fields GF(2^m) for m ≤ 571 without the need to reconfigure the hardware. The results show that this work can be exploited to construct low resource elliptic curve cryptosystems (ECC).

1 INTRODUCTION

Since their introduction by Miller and, independently, Koblitz in 1985 (D. Hankerson, 2004), elliptic curve cryptosystems have been considered the best compromise between the required security and the attainable performance for many resource-constrained security systems. Scalability versus performance, in particular for low resource applications, has always been a challenging trade-off in ECC hardware implementations. The efficiency of this trade-off depends significantly on the efficient implementation and scalability of the modular arithmetic of the underlying field. The computation of the modular inversion is the most challenging operation from this perspective, which motivates the work presented in this paper.

In the literature, several algorithms for computing the multiplicative inverse in GF(2^m) have been proposed (D. Hankerson, 2004). Some of them are based on repeated modular multiplication, such as those derived from Fermat's little theorem. In contrast, others apply the greatest common divisor (GCD) algorithm, which has many variants. All these variants can compute the modular inverse in about 2m iterations. However, the Montgomery inversion algorithm (B. Kaliski, 1995) offers better performance and can perform the inversion in less than 2m iterations. Consequently, this work investigates Montgomery modular inversion and develops algorithmic modifications that reduce the hardware complexity whilst offering scalable and parameterized inversion with a low area architecture on FPGAs.
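As a minimal sketch of the multiplication-based approach mentioned above, the following Python computes an inverse through Fermat's little theorem, using a^(2^m - 2) = a^(-1) in GF(2^m). The function names, the encoding of polynomials as Python integers (bit i holds the coefficient of x^i) and the small example field GF(2^8) with p(x) = x^8 + x^4 + x^3 + x + 1 are assumptions chosen for illustration; they are not taken from the paper.

```python
def gf2m_mul(a, b, p, m):
    """Multiply two GF(2^m) elements (bit-vectors) modulo the polynomial p."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= p               # reduce when the degree reaches m
    return result


def gf2m_inv_fermat(a, p, m):
    """Return a^(2^m - 2) mod p, which equals a^-1 for nonzero a."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result = 1
    exponent = (1 << m) - 2
    base = a
    while exponent:              # square-and-multiply over the exponent bits
        if exponent & 1:
            result = gf2m_mul(result, base, p, m)
        base = gf2m_mul(base, base, p, m)
        exponent >>= 1
    return result


P_EXAMPLE = 0x11B                # x^8 + x^4 + x^3 + x + 1 (assumed example)
inv = gf2m_inv_fermat(0x53, P_EXAMPLE, 8)
print(hex(inv), gf2m_mul(0x53, inv, P_EXAMPLE, 8))
```

Every step of the loop is a full field multiplication, which illustrates why GCD-style methods that only shift and XOR are of interest for low area hardware.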

A modified Montgomery inversion algorithm is therefore proposed and implemented on the smallest and lowest cost Xilinx FPGA. The architecture is parameterized to support variable word lengths and has been implemented with 8, 16, 32 and 64 bit word lengths for finite field lengths m = 163 and m = 571. The results obtained show that the 32-bit data path designs are the best compromise between the low area requirements and the practical performance in terms of throughput (2.7 Mbps, i.e. 4.63 kbit/s per slice, for m = 163).

This paper is organized as follows: Section 2 presents the theoretical background of ECC over GF(2^m). Section 3 gives an overview of Montgomery modular inversion. The proposed improved algorithm is presented in Section 4. The description of the circuit operation and the FPGA implementation are given in Section 5. Finally, Section 6 shows the performance and results of the implementation on a state of the art FPGA.

2 ELLIPTIC CURVE ARITHMETIC OVER GF(2^m)

Briefly, a cryptosystem based on an elliptic curve E over the finite field GF(2^m) is mainly used for encipherment of a point P by a key K such that Q = K·P, where E is the curve

$y^2 + xy = x^3 + ax^2 + b$    (1)


N. Hassan M. and Benaissa M. (2008).

LOW AREA SCALABLE MONTGOMERY INVERSION OVER GF(2m).

In Proceedings of the International Conference on Security and Cryptography, pages 363-367

DOI: 10.5220/0001923503630367

 SciTePress

This operation is called scalar multiplication (N. Koblitz, 1984). In practice, P is a point that lies on the curve E, or equivalently the data to be encrypted. Multiplying P by K can be achieved by many methods, e.g. double-and-add or shift-and-add. This operation dominates the execution time. Both Q and P must satisfy equation (1), which represents the elliptic curve E over GF(2^m), where a, b ∈ GF(2^m). Equation (1) is called the simplified Weierstrass equation over the finite field GF(2^m) with characteristic 2 in affine (Euclidean) coordinates.

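For two points P1 and P2 ∈ E(GF(2^m)):

Point addition, P1 + P2 = (x3, y3):
Let P1 = (x1, y1), P2 = (x2, y2) and P1 + P2 = (x3, y3), where P1 ≠ ±P2 and P1, P2, (P1 + P2) ∈ E(GF(2^m)). Then x3 and y3 are given by equation (2).

Point doubling:
Let P1 = (x1, y1) ∈ E(GF(2^m)), where P1 ≠ −P1, and 2P1 = (x3, y3). Then x3 is given by equation (2) with x2 = x1, and y3 by equation (3).

In both cases λ is defined as

$\lambda = \begin{cases} \dfrac{y_1 + y_2}{x_1 + x_2}, & P_1 \neq P_2 \\ x_1 + \dfrac{y_1}{x_1}, & P_1 = P_2 \end{cases}$

Thus, we can observe from equations (2) and (3), through λ, that one inversion is involved in both point addition and point doubling over the elliptic curve E.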
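The role of the single inversion in point addition and point doubling can be sketched in Python as follows. This is an illustrative sketch only: the tiny field GF(2^4) with p(x) = x^4 + x + 1, the curve coefficients `A` and `B`, the Fermat-based `gf_inv` helper and all function names are assumptions made here, not parameters from the paper. The point at infinity is represented by `None`.

```python
M = 4
POLY = 0b10011            # x^4 + x + 1 (assumed example field GF(2^4))
A, B = 0b0011, 0b0001     # assumed example curve coefficients, B != 0


def gf_mul(a, b):
    """Multiply two field elements modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def gf_inv(a):
    """Inverse as a^(2^M - 2); any field inverter could be used instead."""
    r = 1
    for _ in range(M - 1):
        a = gf_mul(a, a)      # a^(2^i) for i = 1 .. M-1
        r = gf_mul(r, a)
    return r


def on_curve(P):
    """Check y^2 + xy = x^3 + A x^2 + B for an affine point."""
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def ec_add(P, Q):
    """Affine addition or doubling; exactly one gf_inv call per operation."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and y2 == (x1 ^ y1):          # Q = -P, result is infinity
        return None
    if P == Q:                                # doubling
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = gf_mul(lam, lam) ^ lam ^ A
        y3 = gf_mul(x1, x1) ^ gf_mul(lam, x3) ^ x3
    else:                                     # addition
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


# Enumerate the affine points by brute force (feasible only for a toy field).
points = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve((x, y))]
P, Q = points[1], points[2]
print(P, Q, ec_add(P, Q), ec_add(P, P))
```

Because each group operation calls the inverter once, a scalar multiplication in affine coordinates performs one inversion per doubling and per addition, which is why the cost of inversion matters.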



3 MONTGOMERY INVERSION AND ITS VARIANT

Based on the extended binary algorithm and the Montgomery trick for computing modular multiplication (P. L. Montgomery, 1985), B. Kaliski was the first to propose the Montgomery inverse algorithm for a given irreducible polynomial p(x) and for any element a(x) ∈ GF(p) or GF(2^m). The Montgomery inverse of an element a(x) is defined as

$\mathrm{MonInv}(a(x)) = a^{-1}(x) \cdot 2^{m} \bmod p(x)$

B. Kaliski proposed two phases to compute the inverse of a(x). The first phase computes the GCD of a(x) and p(x) and, concurrently, counts the number of halvings L. This phase produces the partial Montgomery inverse

$\mathrm{Par.MonInv}(a(x)) = a^{-1}(x) \cdot 2^{L} \bmod p(x)$

The number L is in the range $m \le L \le 2m$.

Then, the second phase performs L − m right shifts on the partial Montgomery inverse to produce the final inverse in the Montgomery domain. The number of right or left shifts applied to the output of the first phase can be adjusted to obtain the inverse in the required domain (Montgomery or residue). The Montgomery inversion algorithm uses four vectors to hold the intermediate values between successive iterations. Although Kaliski's algorithm is simple, it has no fixed number of iterations, which makes it difficult to map into hardware efficiently. M. Shieh, J. Chen and C. Ming (M. Shieh, 2006) developed a modification to Kaliski's algorithm, shown in Algorithm 1, in which only one phase is required. This improvement eliminates the data dependency between the first phase and the second phase. Moreover, it also avoids the zero comparison operation required by the original algorithm.
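The two-phase structure described above can be sketched in Python as follows. This is a hedged illustration over GF(2^m), not Kaliski's published listing: polynomials are encoded as Python integers, x plays the role of 2, the test `u > v` is taken as an integer comparison of the bit-vectors, r and s are reduced at every step, and all function names are assumptions made here.

```python
def mulx(r, p, m):
    """Return r * x mod p for a polynomial r of degree < m."""
    r <<= 1
    if (r >> m) & 1:
        r ^= p
    return r


def divx(r, p):
    """Return r / x mod p (p must have constant term 1)."""
    if r & 1:
        r ^= p                        # same value mod p, now divisible by x
    return r >> 1


def partial_montgomery_inverse(a, p, m):
    """Phase 1: return (a^-1 * x^L mod p, L) for nonzero a."""
    u, v, r, s, L = p, a, 0, 1, 0
    while v != 0:                     # data-dependent zero test
        if (u & 1) == 0:
            u, s = u >> 1, mulx(s, p, m)
        elif (v & 1) == 0:
            v, r = v >> 1, mulx(r, p, m)
        elif u > v:
            u, r, s = (u ^ v) >> 1, r ^ s, mulx(s, p, m)
        else:
            v, s, r = (u ^ v) >> 1, r ^ s, mulx(r, p, m)
        L += 1
    return r, L


def montgomery_inverse_two_phase(a, p, m):
    """Phase 2: move the exponent from L to m by shifting."""
    r, L = partial_montgomery_inverse(a, p, m)
    for _ in range(L - m):            # L > m: right shifts (divide by x)
        r = divx(r, p)
    for _ in range(m - L):            # defensive: not expected for m <= L
        r = mulx(r, p, m)
    return r


M, P = 8, 0x11B                       # assumed example field GF(2^8)
print(hex(montgomery_inverse_two_phase(0x53, P, M)))
```

The number of loop iterations in phase 1 depends on the operand, which is the property that makes a fixed-latency hardware mapping awkward.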

Algorithm 1.
Input: a(x)·x^m, p(x)
Output: a^{-1}(x)·2^m mod p(x)
Initialize: u ← p, v ← a, r ← 0, s ← 1, L ← 0
For L ← 0 : 2m
{ If (u is even) then
      u ← u/2, s ← 2s mod p(x)
  else if (v is even) then
      v ← v/2, r ← 2r mod p(x)
  else if (u > v) then
      u ← (u ⊕ v)/2, r ← r + s, s ← 2s mod p(x)
  else
      v ← (u ⊕ v)/2, s ← r + s, r ← 2r mod p(x)
  L ← L + 1 }
return r;
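The single-phase loop of Algorithm 1 can be sketched in Python as below. The sketch makes several assumptions that are not in the paper: polynomials are Python integers, the test `u > v` is an integer comparison of the bit-vectors, the loop runs exactly 2m times, and the names `mulx`, `gf2m_mul` and `montgomery_inverse_single_phase` are illustrative. For a nonzero input the loop leaves (input)^-1 · x^(2m) mod p(x) in r, so an input pre-scaled by x^m yields a^-1 · x^m.

```python
def mulx(r, p, m):
    """Return r * x mod p(x); r has degree < m, p has degree m."""
    r <<= 1
    if (r >> m) & 1:
        r ^= p
    return r


def gf2m_mul(a, b, p, m):
    """Field multiplication, used here only to check the result."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        b >>= 1
        a = mulx(a, p, m)
    return acc


def montgomery_inverse_single_phase(a, p, m):
    """Run the four-branch loop for exactly 2m iterations and return r."""
    u, v, r, s = p, a, 0, 1
    for _ in range(2 * m):
        if (u & 1) == 0:                    # u is even
            u, s = u >> 1, mulx(s, p, m)
        elif (v & 1) == 0:                  # v is even (also once v == 0)
            v, r = v >> 1, mulx(r, p, m)
        elif u > v:                         # assumed: integer comparison
            u, r, s = (u ^ v) >> 1, r ^ s, mulx(s, p, m)
        else:
            v, s, r = (u ^ v) >> 1, r ^ s, mulx(r, p, m)
    return r


M, P = 8, 0x11B                 # assumed example field GF(2^8)
X_M = P ^ (1 << M)              # x^m mod p(x)
a = 0x53
a_mont = gf2m_mul(a, X_M, P, M)             # a(x) * x^m mod p(x)
inv_mont = montgomery_inverse_single_phase(a_mont, P, M)
print(gf2m_mul(a, inv_mont, P, M) == X_M)   # a * (a^-1 * x^m) == x^m ?
```

Once v reaches zero the second branch keeps doubling r, so no zero test and no second phase are needed; the price is that all 2m iterations are always executed.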

4 MODIFIED MONTGOMERY MODULAR INVERSE ALGORITHM

In this section, based on Algorithm 1, an improvement to the Montgomery inversion algorithm over GF(2^m) is presented. Kim and Hong (C. H. Kim, 2003) introduced a development based on a modified version of the binary extended greatest common divisor algorithm (BGCD). Their algorithm is suitable for realizing compact and fast inverters over GF(2^m). Simply, they replaced the degree comparison employed by the BGCD with a counter and a state indicator bit. By applying the same idea to Algorithm 1 we can define a new modified version of the Montgomery inversion algorithm, Algorithm 2 below, which saves the m XORs used to perform the degree comparison. Moreover, these m XORs lie on the critical path of the data path. Hence, we obtain savings not only in terms of area but also in terms of reducing the delays caused by the degree comparison in Algorithm 1.

Equations (2) and (3) of Section 2:

$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a$
$y_3 = \lambda (x_1 + x_3) + x_3 + y_1$    (2)

$y_3 = x_1^2 + \lambda x_3 + x_3$    (3)

SECRYPT 2008 - International Conference on Security and Cryptography

Algorithm 2 proceeds as follows. At the beginning, the counter, the state bit and the vectors u, v, s and r are initialized. Thus, we have u > v at the beginning, which means that the degree of u has to be decremented according to the BGCD algorithm. Further, at the start of the algorithm the value of u_0, the least significant bit of u, always equals 1.

Algorithm 2.
Input: a(x)·x^m, p(x)
Output: r = a^{-1}(x)·2^m mod p(x)
Initialize: u ← p, v ← a, r ← 0, s ← 1, L ← 0, State ← 0, Count ← 0
For L ← 1 : 2m
{ If (state = 0) then
      if (u is even) then
          { u ← u/2, s ← 2s mod p(x)
            if (count = 0) then
                count = count + 1; state = 1
            end if }
      else if (v is even) then
          { v ← v/2, r ← 2r mod p(x), count = count + 1 }
      else
          { u ← (u ⊕ v)/2, r ← r + s, s ← 2s mod p(x),
            count = count − 1
            if (count = 0) then
                count = count + 1; state = 1
            end if }
  else if (state = 1) then
      if (u is even) then
          { u ← u/2, s ← 2s mod p(x), count = count + 1 }
      else if (v is even) then
          { v ← v/2, r ← 2r mod p(x), count = count − 1
            if (count = 0) then
                count = count + 1; state = 0
            end if }
      else
          { v ← (u ⊕ v)/2, s ← r + s, r ← 2r mod p(x),
            count = count − 1
            if (count = 0) then
                count = count + 1; state = 0
            end if }
}
return r;

There are two possible conditions for the vector v. If v_0 = 1, then in the second iteration the counter will be incremented by one and the state bit will equal one. The degree of u is decreased by XORing u and v, dividing the result by 2 and saving it in u. In parallel, vector s is XORed with r and vector s is doubled; the results of the two operations are stored in r and s respectively. The other possible condition is v_0 = 0, so the vector v is even. In this case the counter is incremented by one but the state bit remains zero. Next, the vector v is divided by two and the vector r is doubled. Accordingly, the state bit = 0 and the counter > 0. For state bit = 1, if u_0 = 1 and v_0 = 1, the degree of v is greater than the degree of u, so the degree of v has to be reduced. Thus, the vector v is XORed with u, divided by 2 and the result is stored back in v. In parallel, vector r is XORed with s and vector r is doubled; the results of the two operations are stored in s and r respectively and the counter value is decremented by one. If the value of the counter becomes zero, the state bit is set to zero; otherwise it remains one. The algorithm continues in this way, following the steps of Algorithm 2, for 2m iterations. After 2m iterations, the value of the vector u converges to one. Meanwhile, the values of the vectors v and s converge to zero. Finally, the inverse of the vector a(x), represented in the Montgomery domain, is the value in the vector r.
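The idea of replacing the comparison of u and v with a counter can be illustrated with a simplified sketch. The Python below is not a transcription of Algorithm 2: as an assumption made here for brevity, it merges the unsigned counter and the state bit into one signed counter `d` that tracks the difference between upper bounds on the degrees of u and v (initially m and m − 1). The sign of `d` then selects the branch that the `u > v` test selects in Algorithm 1. The function names and the example field are likewise illustrative.

```python
def mulx(r, p, m):
    """Return r * x mod p(x); r has degree < m, p has degree m."""
    r <<= 1
    if (r >> m) & 1:
        r ^= p
    return r


def gf2m_mul(a, b, p, m):
    """Field multiplication, used here only to check the result."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        b >>= 1
        a = mulx(a, p, m)
    return acc


def montgomery_inverse_counter(a, p, m):
    """Return a^-1 * x^(2m) mod p for nonzero a, without comparing u and v."""
    u, v, r, s = p, a, 0, 1
    d = 1                                   # degree bound of u minus that of v
    for _ in range(2 * m):
        if (u & 1) == 0:                    # u even: its degree bound drops
            u, s, d = u >> 1, mulx(s, p, m), d - 1
        elif (v & 1) == 0:                  # v even: its degree bound drops
            v, r, d = v >> 1, mulx(r, p, m), d + 1
        elif d > 0:                         # u has the larger bound: reduce u
            u, r, s, d = (u ^ v) >> 1, r ^ s, mulx(s, p, m), d - 1
        else:                               # otherwise reduce v
            v, s, r, d = (u ^ v) >> 1, r ^ s, mulx(r, p, m), d + 1
    return r


M, P = 8, 0x11B                             # assumed example field GF(2^8)
X_M = P ^ (1 << M)                          # x^m mod p(x)
X_2M = gf2m_mul(X_M, X_M, P, M)             # x^(2m) mod p(x)
out = montgomery_inverse_counter(0x53, P, M)
print(gf2m_mul(0x53, out, P, M) == X_2M)    # a * (a^-1 * x^(2m)) == x^(2m) ?
```

The branch decision needs only the two least significant bits and the sign of a small counter, rather than a comparison across all m bits of u and v, which is the kind of saving the section attributes to the counter and state bit.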

5 CIRCUIT DESIGN

Figure 1 depicts the new architecture for the Montgomery inverter. The data path consists of two blocks, namely the u-v block and the s-r block. The first computes the intermediate values of the vectors u and v, and the second computes in parallel the intermediate values of the vectors s and r. A control block handles the interfacing to the dual block RAMs (DBRAM), the decisions required by the algorithm and the operations necessary for computing the inverse (shifting, reduction, checking the even/not-even condition, etc.).

As shown in Figure 2 and Figure 3, both the u-v and s-r blocks have a DBRAM that holds the vectors u, v, s and r. The DBRAM in each block is addressed by a counter controlled by the control block. The counters are scalable and accommodate addressing the DBRAM up to a memory depth of 2*((m − m mod Word-Length)/Word-Length + 1), where m is the length of the vector a(x). The u-v and s-r blocks contain two shifting units: in the u-v block the unit shifts right, whereas in the s-r block it shifts left. Both units load the word to be shifted, store the most significant digit (MSD) for the left shift unit or the least significant digit (LSD) for the right shift unit so that it can be added to the next word, shift left or right by the corresponding number of shift counts, and then write the shifted word to the DBRAM port. The reduction unit is designed to be parameterized and scalable to accommodate finite fields for m ≤ 571 in addition to different data path widths. NIST recommended reduction polynomials (NIST, 2000) are used to implement the reduction unit as they are designed to provide both security and high performance.
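The memory depth expression and the word-serial shifting described above can be sketched in Python. The sketch is a software model under stated assumptions: words are stored least significant first, only one-bit shifts are modelled (the hardware shifts by a shift count), and the helper names are hypothetical rather than taken from the design.

```python
def words_per_vector(m, w):
    """Words needed per vector: (m - m mod w)/w + 1, as in the depth formula."""
    return (m - m % w) // w + 1


def dbram_depth(m, w):
    """Depth of one DBRAM holding two vectors (u and v, or s and r)."""
    return 2 * words_per_vector(m, w)


def to_words(value, n, w):
    """Split an integer into n words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(n)]


def from_words(words, w):
    """Inverse of to_words."""
    return sum(word << (i * w) for i, word in enumerate(words))


def shift_right_words(words, w):
    """One-bit right shift, top word first, carrying each saved LSD downward."""
    out, carry = list(words), 0
    for i in reversed(range(len(out))):
        lsd = out[i] & 1
        out[i] = (out[i] >> 1) | (carry << (w - 1))
        carry = lsd
    return out


def shift_left_words(words, w):
    """One-bit left shift, bottom word first, carrying each saved MSD upward."""
    out, carry, mask = list(words), 0, (1 << w) - 1
    for i in range(len(out)):
        msd = (out[i] >> (w - 1)) & 1
        out[i] = ((out[i] << 1) & mask) | carry
        carry = msd
    return out


for m in (163, 571):
    for w in (8, 16, 32, 64):
        print(m, w, words_per_vector(m, w), dbram_depth(m, w))
```

For example, m = 163 with a 32-bit word gives 6 words per vector and a depth of 12, while m = 571 gives 18 words and a depth of 36, so the same datapath covers both fields by changing only the address range.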

6 CONCLUSIONS AND RESULTS

The proposed modified algorithm for Montgomery inversion has been fully modelled in VHDL and implemented on the smallest and lowest cost chip available from the Xilinx Spartan III family (XC3S50). The proposed architecture is parameterized in order to support variable word lengths. A scalable architecture has been implemented with 8, 16, 32 and 64 bit word lengths. Tables 1 and 2 show the implementation results for the different widths after place and route for finite field lengths m = 163 and m = 571. As expected, the control block and counters dominate the critical path of the design. Thus, increasing the operand size has a lesser effect on the working frequency. The results show that the 32-bit data path designs are the best compromise between the low area requirements and the practical performance in terms of throughput (2.7 Mbps, i.e. 4.63 kbit/s per slice, for m = 163). Further, the proposed architecture with low hardware resources is expected to yield correspondingly lower power budgets and therefore would be suited to low resource ECC implementations.
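The throughput per area figures can be recomputed from the reported throughput and slice counts. The short Python sketch below does this for the m = 163 results; the data are copied from Table 1, while the function and variable names are illustrative assumptions.

```python
# (data path width in bits, area in slices, throughput in Mbps) for m = 163
TABLE_1 = [(8, 319, 0.87), (16, 439, 1.63), (32, 583, 2.7), (64, 697, 5.2)]


def throughput_per_slice_kbps(throughput_mbps, slices):
    """Convert Mbps to kbit/s and divide by the number of slices."""
    return throughput_mbps * 1000.0 / slices


ratios = [round(throughput_per_slice_kbps(mbps, slices), 2)
          for _, slices, mbps in TABLE_1]
for (width, _, _), ratio in zip(TABLE_1, ratios):
    print(width, ratio)
```

The recomputed values agree with the last column of Table 1 to within rounding, which confirms that the column is expressed in kbit/s per slice.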

Table 1: FPGA implementation results for different data path widths on Spartan III XC3S50, assuming m = 163.

Data path width   Look-up tables   Area (slices)   Freq. (MHz)   Throughput (Mbps)   Throughput/area (kbit/s per slice)
8                 596              319             96            0.87                2.73
16                796              439             94.417        1.63                3.72
32                1005             583             85.096        2.7                 4.63
64                1247             697             82.066        5.2                 7.5

Table 2: FPGA implementation results for different data path widths on Spartan III XC3S50, assuming m = 571.

Data path width   Look-up tables   Area (slices)   Freq. (MHz)   Throughput (Mbps)   Throughput/area (kbit/s per slice)
8                 596              319             96            0.256               0.8
16                796              439             94.417        0.53                1.2
32                1005             583             85.096        0.9                 1.55
64                1247             697             82.066        1.74                2.5

Figure 1: Montgomery inverse basic building block.

Figure 2: Montgomery inverse s-r block.

Figure 3: Montgomery inverse u-v block.

[Figures 1-3: block diagrams, not reproduced. Recoverable labels: Control Block with Up Counter and Address Down Counter; inputs a(x), p(x), Field, Start; output Inverse ready; u-v Block with D-BRAM (ports u and v), u XOR v, Shift right, Shift count, even/not even Compare; s-r Block with D-BRAM (ports s and r), Shift left, Shift count, Reduction Unit.]


REFERENCES

D. Hankerson, A. Menezes and S. Vanstone, "Guide to Elliptic Curve Cryptography", Springer-Verlag, 2004.

N. Koblitz, "Introduction to Elliptic Curves and Modular Forms", Graduate Texts in Mathematics, Vol. 97, Springer, 1984.

P. L. Montgomery, "Modular Multiplication without Trial Division", Mathematics of Computation, Vol. 44, April 1985.

B. Kaliski, "The Montgomery Inverse and its Applications", IEEE Transactions on Computers, Vol. 44, No. 8, August 1995.

NIST, "Recommended Elliptic Curves for Federal Government Use", available at http://csrc.nist.gov/encryption/, 2000.

M. Shieh, J. Chen and C. Ming, "High-Speed Design of Montgomery Inverse Algorithm over GF(2^m)", IEICE Trans. Fundamentals, Vol. E89-A, February 2006.

C. H. Kim, S. Kwon, J. J. Kim and C. P. Hong, "A Compact and Fast Division Architecture for a Finite Field GF(2^m)", ICCSA 2003, LNCS 2667, pp. 855-864, 2003.
