Algorithm 2 below, which will save m XORs used
to perform the degree comparison. Moreover, these m
XORs lie on the critical path of the data path. Hence,
the savings are not only in area but also in the delay
caused by the degree comparison in Algorithm 1.
Algorithm 2 proceeds as follows.
At the beginning, the counter, the state bit, and the
vectors u, v, s and r are initialized. With this
initialization, u > v at the beginning, which means
that the degree of u has to be decremented according
to the BGCD algorithm. Further, at the start of the
algorithm the value of u_0 always equals 1.
Algorithm 2.
Input: a(x).x^m, p(x)
Output: r = a^{-1}(x).2^m mod p(x)
Initialize: u ← p, v ← a, r ← 0, s ← 1, L ← 0,
            State ← 0, Count ← 0
For L ← 1 : 2m
{ if (state = 0) then
    if (u is even) then
      { u ← u/2, s ← 2s mod p(x)
        if (count = 0) then
          count = count + 1; state = 1
        end if }
    else if (v is even) then
      { v ← v/2, r ← 2r mod p(x), count = count + 1 }
    else
      { u ← (u ⊕ v)/2, r ← r + s, s ← 2s mod p(x),
        count = count - 1
        if (count = 0) then
          count = count + 1; state = 1
        end if }
  else if (state = 1) then
    if (u is even) then
      { u ← u/2, s ← 2s mod p(x), count = count + 1 }
    else if (v is even) then
      { v ← v/2, r ← 2r mod p(x), count = count - 1
        if (count = 0) then
          count = count + 1; state = 0
        end if }
    else
      { v ← (u ⊕ v)/2, s ← r + s, r ← 2r mod p(x),
        count = count - 1
        if (count = 0) then
          count = count + 1; state = 0
        end if }
}
return r;
There are two possible conditions for the vector v.
If v_0 = 1, then in the second iteration the counter
will be incremented by one and the state bit will
equal one. The degree of u is decreased by XORing u
and v, dividing the value by 2 and saving the result
in u. In parallel, vector s is XORed with r and vector
s is doubled; the results of the two operations are
stored in r and s respectively. The other possible
condition is v_0 = 0, so the vector v is even. In this
case the counter is incremented by one but the state
bit remains zero. Next, the vector v is divided by two
and the vector r is doubled. Accordingly, the state
bit is 0 and the counter is greater than 0.
When the state bit is 1 and u_0 = 1 and v_0 = 1, the
degree of v is greater than that of u, so the degree
of v has to be reduced. Thus, the vector v is XORed
with u, divided by 2, and the result is stored back
in v. In parallel, vector r is XORed with s and vector
r is doubled; the results of the two operations are
stored in s and r respectively, and the counter is
decremented by one. If the counter becomes zero, the
state bit is set to zero; otherwise it remains one.
The algorithm continues in this manner, following the
steps of Algorithm 2, for 2m iterations. After 2m
iterations, the vector u converges to one. Meanwhile,
the vectors v and s converge to zero. Finally, the
inverse of the vector a(x), represented in the
Montgomery domain, is the value in the vector r.
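One way to see why r ends up holding the inverse is to track two relations that each of the four update steps of Algorithm 2 preserves: a.r ≡ u.x^k and a.s ≡ v.x^k (mod p(x)), where k counts the iterations performed so far. The Python sketch below checks these relations on small fields. It is not the authors' implementation: polynomials are stored as integers (bit i is the coefficient of x^i), all helper names are hypothetical, the fields are chosen only for illustration, and the counter/state logic is replaced by a parity-only choice, since the relations do not depend on which vector is reduced.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as integers."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def polymod(a, p):
    """Remainder of a modulo p in GF(2)[x]."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def mulmod(a, b, p):
    return polymod(clmul(a, b), p)


def double_mod(t, p):
    """2t mod p(x): shift left once, subtract p if the degree reached m."""
    t <<= 1
    if t.bit_length() == p.bit_length():
        t ^= p
    return t


# The four update steps of Algorithm 2 (simultaneous assignments).
def halve_u(u, v, r, s, p):
    return u >> 1, v, r, double_mod(s, p)


def halve_v(u, v, r, s, p):
    return u, v >> 1, double_mod(r, p), s


def reduce_u(u, v, r, s, p):
    return (u ^ v) >> 1, v, r ^ s, double_mod(s, p)


def reduce_v(u, v, r, s, p):
    return u, (u ^ v) >> 1, double_mod(r, p), r ^ s


def next_step(u, v, r, s, p, k):
    """Assumption: pick a legal step from parity alone, not from count/state."""
    if u & 1 == 0:
        return halve_u(u, v, r, s, p)
    if v & 1 == 0:
        return halve_v(u, v, r, s, p)
    step = reduce_u if k % 2 == 0 else reduce_v
    return step(u, v, r, s, p)


def relations_hold(a, u, v, r, s, k, p):
    xk = polymod(1 << k, p)  # x^k mod p(x)
    return (mulmod(a, r, p) == mulmod(u, xk, p)
            and mulmod(a, s, p) == mulmod(v, xk, p))


def check(a, p, iterations):
    """Start from u=p, v=a, r=0, s=1 and test the relations after every step."""
    u, v, r, s = p, a, 0, 1
    for k in range(iterations + 1):
        if not relations_hold(a, u, v, r, s, k, p):
            return False
        u, v, r, s = next_step(u, v, r, s, p, k)
    return True


if __name__ == "__main__":
    P = 0b1011  # x^3 + x + 1, illustrative only
    assert all(check(a, P, 6) for a in range(1, 8))
```

Once u has reached 1 and k = 2m, the first relation reads a.r ≡ x^{2m} (mod p(x)). If the input is taken to be a(x).x^m, this leaves r = a^{-1}(x).x^m, which is the inverse in the Montgomery domain.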
5 CIRCUIT DESIGN
Figure 1 depicts the new architecture for the
Montgomery inverter. The data path consists of two
blocks, namely the u-v block and the s-r block. The
first computes the intermediate values of vectors u
and v, and the second computes, in parallel, the
intermediate values of vectors s and r. A control
block handles the interface to the dual block RAMs
(DBRAM), the decisions required by the algorithm, and
the operations necessary for computing the inverse
(shifting, reduction, checking the even/odd
condition, etc.).
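The division of labour described above can be sketched in Python as three small functions: a control decision, a u-v block that only shifts right, and an s-r block that only shifts left with reduction. This is an illustrative model rather than the circuit itself: vectors are held as whole integers instead of words in a DBRAM, the names Op, control, uv_block and sr_block are hypothetical, and the counter and state-bit updates of Algorithm 2 are left out (the state bit is simply passed in).

```python
from enum import Enum


class Op(Enum):
    HALVE_U = 0   # u is even
    HALVE_V = 1   # u is odd, v is even
    REDUCE_U = 2  # both odd, state = 0
    REDUCE_V = 3  # both odd, state = 1


def control(u, v, state):
    """Control decision: inspects only the bits u_0, v_0 and the state bit."""
    if u & 1 == 0:
        return Op.HALVE_U
    if v & 1 == 0:
        return Op.HALVE_V
    return Op.REDUCE_U if state == 0 else Op.REDUCE_V


def double_mod(t, p):
    """2t mod p(x): shift left once, subtract p if the degree reached m."""
    t <<= 1
    if t.bit_length() == p.bit_length():
        t ^= p
    return t


def uv_block(op, u, v):
    """u-v block: XOR and right shift only."""
    if op is Op.HALVE_U:
        return u >> 1, v
    if op is Op.HALVE_V:
        return u, v >> 1
    if op is Op.REDUCE_U:
        return (u ^ v) >> 1, v
    return u, (u ^ v) >> 1


def sr_block(op, r, s, p):
    """s-r block: XOR and left shift with reduction modulo p(x)."""
    if op is Op.HALVE_U:
        return r, double_mod(s, p)
    if op is Op.HALVE_V:
        return double_mod(r, p), s
    if op is Op.REDUCE_U:
        return r ^ s, double_mod(s, p)
    return double_mod(r, p), r ^ s


if __name__ == "__main__":
    p = 0b1011                      # x^3 + x + 1, illustrative only
    u, v, r, s = p, 0b101, 0, 1
    op = control(u, v, state=0)     # both odd, state 0
    u, v = uv_block(op, u, v)       # the two blocks share only `op`
    r, s = sr_block(op, r, s, p)
    print(op.name, bin(u), bin(v), bin(r), bin(s))
```

Because uv_block and sr_block read disjoint vectors and share only the operation code, the two updates do not depend on each other within an iteration, which is what allows the two blocks to work in parallel.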
As shown in Figure 2 and Figure 3, both the u-v and
s-r blocks have a DBRAM that holds the vectors u, v,
s, and r. The DBRAM in each block is addressed by a
counter controlled by the control block. The counters
are scalable and can address the DBRAM up to a memory
depth of
2*((m - m mod Word-Length)/Word-Length + 1),
where m is the length of the vector a(x). For
instance, with m = 163 and a word length of 32 bits,
this gives 2*((163 - 3)/32 + 1) = 12. Both u-v and
s-r blocks have two shifting units. In the u-v block,
the shifting unit shifts right, whereas in the s-r
block the unit shifts left. Both units load the word
to be shifted, store the most significant digit (MSD)
for the left shift unit or the least significant
digit (LSD) for the right shift unit so that it can
be added to the next word, shift left or right by the
corresponding number of shift counts, and then write
the shifted word to the DBRAM port. The Reduction
unit is designed to be parameterized and scalable to
accommodate finite fields up to m = 571
in addition to different data path widths. NIST
recommended reduction polynomials (NIST, 2000)