SIMD-based Implementations of Eta Pairing Over Finite Fields of Small Characteristics

Anup Kr. Bhattacharya, Abhijit Das, Dipanwita Roychowdhury, Bhargav Bellur and Aravind Iyer

Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur 721302, India

General Motors Technical Centre India, India Science Lab, Bangalore 560066, India

Keywords:

Supersingular Elliptic Curves, Eta Pairing, Software Implementation, SIMD, SSE Intrinsics.

Abstract:

Eta pairing on supersingular elliptic curves defined over fields of characteristics two and three is a popular and practical variant of pairing used in many cryptographic protocols. In this paper, we study SIMD-based implementations of eta pairing over these fields. Our implementations use standard SIMD-based vectorization techniques which we call horizontal and vertical vectorization. To the best of our knowledge, we are the first to study vertical vectorization in the context of curves over fields of small characteristics. Our experimentation using SSE2 intrinsics reveals that vertical vectorization outperforms horizontal vectorization.

1 INTRODUCTION

Pairings over algebraic curves are extensively used (Boneh and Franklin, 2001; Joux, 2004; Boneh et al., 2004) in designing cryptographic protocols. There are two advantages of using pairings in these protocols. First, some new functionalities are realized using pairings (Boneh and Franklin, 2001; Joux, 2004). Second, many other protocols (Boneh et al., 2004) achieve small signature sizes at the same security level.

Miller's algorithm (Miller, 2004) is an efficient way to compute pairings. Tate and Weil pairings are the two main variants of pairing functions on elliptic curves, with Tate pairing computation being significantly faster than Weil pairing computation for small fields. In the last few years, many variants of Tate pairing (Barreto et al., 2007; Hess et al., 2006; Lee et al., 2009) have been proposed to reduce the computational complexity of Tate pairing substantially. Eta pairing (Barreto et al., 2007) is one such variant defined for supersingular curves. Some pairing-friendly families (Freeman et al., 2010) of curves are defined over prime fields and over fields of characteristics two and three. Vercauteren (2010) proposes the concept of optimal pairing, which gives lower bounds on the number of Miller iterations required to compute pairing.

There have been many attempts to compute pairing faster. Barreto et al. (2002) propose many simplifications of Tate-pairing algorithms. Final exponentiation is one time-consuming step in pairing computation, and Scott et al. (2009) propose elegant methods to reduce its complexity. Ahmadi et al. (2007) and Granger et al. (2005) describe efficient implementations of arithmetic in fields of characteristic three for faster pairing computation. Multi-core implementations of Tate pairing are reported in (Beuchat et al., 2009; Aranha et al., 2010). Beuchat et al. (2009) provide an estimate of the optimal number of cores needed to compute pairing in a multi-core environment.

Many low-end processors are released with SIMD facilities which provide scope for parallelization in resource-constrained applications. SIMD-based implementations of pairing are reported in (Beuchat et al., 2009; Aranha et al., 2010; Hankerson et al., 2008). All these data-parallel implementations vectorize individual pairing computations, and vary in their approaches to exploiting different SIMD intrinsics in order to speed up the underlying field arithmetic. This technique is known as horizontal vectorization.

The other SIMD-based vectorization technique, vertical vectorization, has also been used for efficient implementations. Montgomery (1991) applies vertical vectorization to the Elliptic Curve Method (ECM) for factoring integers. For RSA implementations using SSE2 intrinsics, Page and Smart (2004) use two SIMD-based techniques called inter-operation and intra-operation parallelism. Grabher et al. (2008) propose digit slicing to reduce carry-handling overhead in the implementation of ate pairing over Barreto-Naehrig curves defined over prime fields. Implementation results with both inter-pairing and intra-pairing parallelism techniques are provided, and a number of implementation strategies are discussed.

Kr. Bhattacharya A., Das A., Roychowdhury D., Bellur B. and Iyer A.
SIMD-based Implementations of Eta Pairing Over Finite Fields of Small Characteristics.
DOI: 10.5220/0004023000940101
In Proceedings of the International Conference on Security and Cryptography (SECRYPT-2012), pages 94-101
ISBN: 978-989-8565-24-2
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Intuitively, so long as different instances of some computation follow nearly the same sequence of basic CPU operations, parallelizing multiple instances (vertical vectorization) would be more effective than parallelizing each such instance individually (horizontal vectorization). Computation of eta pairing on curves over fields of small characteristics appears to be an ideal setting for vertical vectorization. This is particularly relevant because all the parties in a standard elliptic-curve-based protocol typically use the same curve and the same base point (unlike in RSA where different entities use different public moduli).

Each of the two vectorization models (horizontal and vertical) has its own domains of applicability. Even in the case of pairing computation, vertical vectorization does not outperform horizontal vectorization in every step. For example, comb-based multiplication (López and Dahab, 2000) of field elements is expected to be more efficient under vertical vectorization than under horizontal vectorization. In contrast, modular reduction using polynomial division seems to favour horizontal vectorization, since the number of steps in the division loop and also the shift amounts depend heavily on the operands. This problem can be bypassed by using defining polynomials with a small number of non-zero coefficients (like trinomials or pentanomials). However, computing inverses by the extended polynomial gcd algorithm cannot be similarly tackled. Moreover, vertical vectorization is prone to more cache misses than horizontal vectorization and even than a non-SIMD implementation. The effects of cache misses are rather pronounced for algorithms based upon lookup tables (like comb methods).

Despite all these potential challenges, vertical vectorization may be helpful in certain cryptographic operations. Our experimentation with SSE2 intrinsics reveals that this is the case for eta pairing on supersingular curves over fields of characteristics two and three. More precisely, horizontal vectorization leads to a speedup of 15–20% over the non-SIMD implementation. Vertical vectorization, on the other hand, yields an additional 15–25% speedup. In short, the validation of the effectiveness of vertical vectorization in pairing computations is the main technical contribution of this paper.

The rest of the paper is organized as follows. Section 2 reviews the notion of pairing, and lists the algorithms used to implement field and curve arithmetic. Section 3 describes horizontal and vertical vectorization techniques. We also intuitively explain which of the basic operations are likely to benefit more from vertical vectorization than from horizontal vectorization. Our experimental results are tabulated in Section 4. We conclude the paper in Section 5 after highlighting some potential areas of future research.

2 BACKGROUND ON ETA PAIRING

In this section, we briefly describe the standard algorithms that we have used for implementing arithmetic in extension fields of characteristics two and three. We subsequently state Miller's algorithm for the computation of eta pairing on supersingular curves over these fields.

2.1 Eta Pairing in a Field of Characteristic Two

We implemented eta pairing over the supersingular elliptic curve $y^2 + y = x^3 + x$ defined over the binary field $\mathbb{F}_{2^{1223}}$, represented as an extension of $\mathbb{F}_2$ by the irreducible polynomial $x^{1223} + x^{255} + 1$. An element of $\mathbb{F}_{2^{1223}}$ is packed into an array of 64-bit words. The basic operations on such elements are done as follows.

• Addition: We perform word-level XOR to add multiple coefficients together.

• Multiplication: Computing products $c = ab$ in the field is costly, but is the operation needed most often in Miller's algorithm. Comb-based multiplication (López and Dahab, 2000) with four-bit windows is used in our implementations.

• Inverse: We use the extended Euclidean gcd algorithm for polynomials to compute the inverse of an element in the binary field.

• Square: We use a precomputed table of square values for all possible 8-bit inputs.

• Square Root: The input element is written as $a(x^2) + x\,b(x^2)$. Its square root is computed as $a(x) + x^{1/2}\,b(x)$, where $x^{1/2} = x^{612} + x^{128}$ (multiplying the defining relation $x^{1223} + x^{255} + 1 = 0$ by $x$ gives $x = x^{1224} + x^{256} = (x^{612} + x^{128})^2$).

• Reduction: Since the irreducible polynomial defining $\mathbb{F}_{2^{1223}}$ has only a few non-zero coefficients, we use a fast reduction algorithm (as in (Scott, 2007)) for computing remainders modulo this polynomial.
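The comb-based multiplication named in the list above can be sketched as follows. This is a minimal illustration of a left-to-right comb with four-bit windows on 64-bit words, not the authors' code; the names `gf2_mul_comb`, `gf2_mul_ref` and `NW` are hypothetical, and the product is left unreduced (2·NW words). The bit-serial routine is included only as a cross-check.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NW 20   /* 64-bit words per field element: ceil(1223/64) */

/* c (2*NW words) = a * b over F_2[x], without modular reduction.
   Left-to-right comb method with 4-bit windows. */
static void gf2_mul_comb(uint64_t c[2 * NW],
                         const uint64_t a[NW], const uint64_t b[NW])
{
    uint64_t tab[16][NW + 1];   /* tab[u] = u(x) * b(x) for deg u <= 3 */

    memset(tab, 0, sizeof tab);
    for (int u = 1; u < 16; u++) {
        if (u & 1) {            /* odd u: tab[u] = tab[u-1] + b */
            for (int j = 0; j < NW; j++)
                tab[u][j] = tab[u - 1][j] ^ b[j];
            tab[u][NW] = tab[u - 1][NW];
        } else {                /* even u: tab[u] = x * tab[u/2] */
            uint64_t carry = 0;
            for (int j = 0; j <= NW; j++) {
                uint64_t w = tab[u / 2][j];
                tab[u][j] = (w << 1) | carry;
                carry = w >> 63;
            }
        }
    }

    memset(c, 0, 2 * NW * sizeof c[0]);
    for (int k = 15; k >= 0; k--) {          /* 16 nibbles per word */
        for (int j = 0; j < NW; j++) {
            unsigned u = (unsigned)(a[j] >> (4 * k)) & 0xF;
            for (int i = 0; i <= NW; i++)
                c[j + i] ^= tab[u][i];       /* add x^(64j) * tab[u] */
        }
        if (k > 0) {                         /* c <- c * x^4 */
            for (int j = 2 * NW - 1; j > 0; j--)
                c[j] = (c[j] << 4) | (c[j - 1] >> 60);
            c[0] <<= 4;
        }
    }
}

/* Bit-serial schoolbook product, used only as a cross-check. */
static void gf2_mul_ref(uint64_t c[2 * NW],
                        const uint64_t a[NW], const uint64_t b[NW])
{
    memset(c, 0, 2 * NW * sizeof c[0]);
    for (int i = 0; i < 64 * NW; i++) {
        if (!((a[i / 64] >> (i % 64)) & 1)) continue;
        int ws = i / 64, bs = i % 64;
        for (int j = 0; j < NW; j++) {
            c[ws + j] ^= b[j] << bs;
            if (bs) c[ws + j + 1] ^= b[j] >> (64 - bs);
        }
    }
}
```

The table `tab` depends only on `b`, which is why several products sharing one operand can reuse a single precomputation. The inner loops consist of XORs and bit-level shifts only, the operation mix discussed for vectorization in Section 3.3.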

The embedding degree for the supersingular curve stated above is four. So we need to work in the field $\mathbb{F}_{(2^{1223})^4}$. This field is represented as a tower of two quadratic extensions over $\mathbb{F}_{2^{1223}}$. The basis for this extension is given by $(1, u, v, uv)$, where $g(u) = u^2 + u + 1$ is the irreducible polynomial for the first extension, and $h(v) = v^2 + v + u$ defines the second extension. The distortion map is given by $\psi(x, y) = (x + u^2,\ y + xu + v)$.

Addition in $\mathbb{F}_{(2^{1223})^4}$ uses the standard word-wise XOR operation on elements of $\mathbb{F}_{2^{1223}}$. Multiplication in $\mathbb{F}_{(2^{1223})^4}$ can be computed by six multiplications in the field $\mathbb{F}_{2^{1223}}$ (Hankerson et al., 2008).

Algorithm 1 describes the computation of eta pairing $\eta_T$. This is an implementation (Hankerson et al., 2008) of Miller's algorithm for the supersingular curve $E : y^2 + y = x^3 + x$ under the above representation of $\mathbb{F}_{2^{1223}}$ and $\mathbb{F}_{(2^{1223})^4}$. Here, the point $P \in E(\mathbb{F}_{2^{1223}})$ on the curve has prime order $r$. $Q$ too is a point with both coordinates from $\mathbb{F}_{2^{1223}}$. The distortion map is applied to $Q$; Algorithm 1 does not explicitly show this map. The output of the algorithm is an element of $\mu_r$, the order-$r$ subgroup of $\mathbb{F}^*_{(2^{1223})^4}$.

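Algorithm 1: Eta Pairing Algorithm for a Field of Characteristic Two.

Input: $P = (x_P, y_P),\ Q = (x_Q, y_Q) \in E(\mathbb{F}_{2^{1223}})[r]$
Output: $\eta_T(P, Q) \in \mu_r$
$T \leftarrow x_P + 1$
$f \leftarrow T \cdot (x_P + x_Q + 1) + y_P + y_Q + (T + x_Q)u + v$
for $i = 1$ to $612$ do
    $T \leftarrow x_P$
    $x_P \leftarrow \sqrt{x_P}$, $y_P \leftarrow \sqrt{y_P}$
    $g \leftarrow T \cdot (x_P + x_Q) + y_P + y_Q + x_P + 1 + (T + x_Q)u + v$
    $f \leftarrow f \cdot g$
    $x_Q \leftarrow x_Q^2$, $y_Q \leftarrow y_Q^2$
end for
return $f^{(q^2-1)(q-\sqrt{2q}+1)}$, where $q = 2^{1223}$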

The complexity of Algorithm 1 is dominated by the 612 iterations (called Miller iterations), and by the exponentiation to the power $(q^2-1)(q-\sqrt{2q}+1)$ (referred to as the final exponentiation). In each Miller iteration, two square roots, two squares, and seven multiplications are performed in the field $\mathbb{F}_{2^{1223}}$. In the entire Miller loop, $2 \times 612 = 1224$ square roots and 1224 squares are computed, and the number of multiplications is $7 \times 612 = 4284$. Evidently, the large number of multiplications occupies the major portion of the total computation time. Each multiplication in $\mathbb{F}_{(2^{1223})^4}$ (computation of $f \cdot g$) is carried out by six multiplications in $\mathbb{F}_{2^{1223}}$. In these six multiplications, three variables appear as one of the two operands. Therefore, only three precomputations (instead of six) are sufficient for performing all these six multiplications by the López-Dahab method. For characteristic-three fields, such a trick is proposed in (Takahashi et al., 2007). The final exponentiation is computed using the Frobenius endomorphism (Scott et al., 2009; Hankerson et al., 2008), so this operation takes only a small fraction of the total computation time.

2.2 Eta Pairing in a Field of Characteristic Three

The irreducible polynomial $x^{509} - x^{318} - x^{191} + x^{127} + 1$ defines the extension field $\mathbb{F}_{3^{509}}$. The curve $y^2 = x^3 - x + 1$ defined over this field is used. Each element of the extension field is represented using two bit vectors (Smart et al., 2002). The basic operations on these elements are implemented as follows.

• Addition and Subtraction: We use the formulas given in (Kawahara et al., 2008).

• Multiplication: Comb-based multiplication (Ahmadi et al., 2007) with two-bit windows is used in our implementations.

• Inverse: We use the extended Euclidean gcd algorithm for polynomials to compute the inverse of an element in the field.

• Cube: We use a precomputed table of cube values for all possible 8-bit inputs.

• Cube Root: The input element is first written as $a(x^3) + x\,b(x^3) + x^2\,c(x^3)$. Its cube root is computed as $a(x) + x^{1/3}\,b(x) + x^{2/3}\,c(x)$, where $x^{1/3} = x^{467} + x^{361} - x^{276} + x^{255} + x^{170} + x^{85}$, and $x^{2/3} = -x^{234} + x^{128} - x^{43}$ (Barreto, 2004; Ahmadi et al., 2007). We have not used the cube-root-friendly representation of $\mathbb{F}_{3^{509}}$ prescribed in (Ahmadi and Rodriguez-Henriquez, 2010).

• Reduction: We use a fast reduction algorithm (Scott, 2007) for computing remainders modulo the irreducible polynomial.
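To illustrate how addition and subtraction in characteristic three reduce to bit-wise operations on two bit vectors, the following sketch uses one known six-operation formula. The coefficient encoding, the `f3_word` layout and the function names are assumptions of this sketch; it is not claimed to be the exact formula of (Kawahara et al., 2008).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding: for each coefficient, the bit pair (hi, lo) is
   (0,0) for 0, (0,1) for 1 and (1,0) for 2; (1,1) never occurs. */
typedef struct { uint64_t hi, lo; } f3_word;   /* 64 coefficients */

/* c = a + b coefficient-wise in F_3, using three ORs and three XORs
   per 64 coefficients. */
static void f3_add(f3_word *c, const f3_word *a, const f3_word *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t t  = (a[i].lo | b[i].hi) ^ (a[i].hi | b[i].lo);
        uint64_t lo = (a[i].hi | b[i].hi) ^ t;
        uint64_t hi = (a[i].lo | b[i].lo) ^ t;
        c[i].lo = lo;
        c[i].hi = hi;
    }
}

/* c = a - b: negation in F_3 swaps the two bit vectors of b. */
static void f3_sub(f3_word *c, const f3_word *a, const f3_word *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        f3_word nb = { b[i].lo, b[i].hi };   /* hi <- lo, lo <- hi */
        f3_add(&c[i], &a[i], &nb, 1);
    }
}
```

No carries cross coefficient boundaries, so the same formula applies unchanged to 128-bit SIMD registers, which is consistent with the later observation that both vectorization models suit addition and subtraction.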

The embedding degree in this case is six, so we need to work in the field $\mathbb{F}_{(3^{509})^6}$. A tower of extensions over $\mathbb{F}_{3^{509}}$ is again used to represent $\mathbb{F}_{(3^{509})^6}$. The first extension is cubic, and is defined by the irreducible polynomial $u^3 - u - 1$. The second extension is quadratic, and is defined by $v^2 + 1$. The basis of $\mathbb{F}_{(3^{509})^6}$ over $\mathbb{F}_{3^{509}}$ is, therefore, $(1, u, u^2, v, uv, u^2v)$. The distortion map in this case is $\psi(x, y) = (u - x,\ yv)$. For multiplying two elements of $\mathbb{F}_{(3^{509})^6}$, we have used 18 multiplications in $\mathbb{F}_{3^{509}}$ (Kerins et al., 2005). The method of (Gorla et al., 2007), which uses only 15 such multiplications, is not implemented.

Algorithm 2 describes the computation of eta pairing (Beuchat et al., 2009) in the case of characteristic three. $P$ and $Q$ are points with both coordinates from $\mathbb{F}_{3^{509}}$. The distortion map is applied to $Q$; Algorithm 2 does not show this map explicitly. The order of $P$ is a prime $r$, and $\mu_r$ is the order-$r$ subgroup of $\mathbb{F}^*_{(3^{509})^6}$.

Algorithm 2: Eta Pairing Algorithm for a Field of Characteristic Three.

Input: $P = (x_P, y_P),\ Q = (x_Q, y_Q) \in E(\mathbb{F}_{3^{509}})[r]$
Output: $\eta_T(P, Q) \in \mu_r$
$x_P \leftarrow \sqrt[3]{x_P} + 1$
$y_P \leftarrow -\sqrt[3]{y_P}$
$t \leftarrow x_P + x_Q$
$R \leftarrow -(y_P t - y_Q v - y_P u)(-t^2 + y_P y_Q v - tu - u^2)$
$X_P[0] \leftarrow x_P$, $Y_P[0] \leftarrow y_P$
$X_Q[0] \leftarrow x_Q$, $Y_Q[0] \leftarrow y_Q$
for $i = 1$ to $254$ do
    $X_P[i] \leftarrow \sqrt[3]{X_P[i-1]}$
    $X_Q[i] \leftarrow X_Q[i-1]^3$
    $Y_P[i] \leftarrow \sqrt[3]{Y_P[i-1]}$
    $Y_Q[i] \leftarrow Y_Q[i-1]^3$
end for
for $i = 1$ to $127$ do
    $t \leftarrow X_P[2i-1] + X_Q[2i-1]$
    $w \leftarrow Y_P[2i-1]\,Y_Q[2i-1]$
    $t' \leftarrow X_P[2i] + X_Q[2i]$
    $w' \leftarrow Y_P[2i]\,Y_Q[2i]$
    $S \leftarrow (-t^2 + wv - tu - u^2)(-t'^2 + w'v - t'u - u^2)$
    $R \leftarrow R \cdot S$
end for
return $R^{(q^3-1)(q+1)(q+\sqrt{3q}+1)}$, where $q = 3^{509}$

The first for loop of Algorithm 2 is a precomputation loop. The second for loop implements the Miller iterations. The final exponentiation in the last line uses the Frobenius endomorphism (Scott et al., 2009; Hankerson et al., 2008). The most time-consuming operations involved in Algorithm 2 are 508 cubes, 508 cube roots and 3556 multiplications in the field $\mathbb{F}_{3^{509}}$ (given that one multiplication in $\mathbb{F}_{(3^{509})^6}$ is implemented by 18 multiplications in $\mathbb{F}_{3^{509}}$). The final exponentiation again does not incur a major computation overhead in Algorithm 2.

3 HORIZONTAL AND VERTICAL VECTORIZATION

Many modern CPUs, even in desktop machines, support a set of data-parallel instructions operating on SIMD registers. For example, Intel has been releasing SIMD-enabled CPUs since 1999 (Microsoft, 2010). As of now, most vendors provide support for 128-bit SIMD registers and parallel operations on 8-, 16-, 32- and 64-bit data. We work with Intel's SSE2 instructions. Since we use 64-bit words for packing of data, using these SIMD intrinsics can lead to a speedup of nearly two. In practice, we expect less speedup for various reasons. First, not all steps in a computation possess inherent data parallelism. Second, the input and output values are usually available in chunks of machine words which are 32 or 64 bits in size. Before the use of an SIMD instruction, one needs to pack data stored in normal registers or memory locations into SIMD registers. Likewise, after using an SIMD instruction, one needs to unpack the content of an SIMD register. Frequent conversion of data between scalar and vector forms may be costly. Finally, if the algorithm is memory-intensive, SIMD features do not help much.

We use SIMD-based vectorization techniques for the computation of eta pairing. These vectorization techniques provide speedup by reducing the overheads due to packing and unpacking. We study two common SIMD-based vectorization techniques called horizontal and vertical vectorization. Though vertical vectorization is capable of reducing data-conversion overheads substantially, it encounters an increased memory overhead in terms of cache misses. Experimental results of eta pairing computation over fields of characteristics two and three validate the claim that vertical vectorization achieves better performance gains compared to horizontal vectorization.

3.1 Horizontal Vectorization

Figure 1 explains the working of horizontal vectorization. One single operation ⋆ between two multi-word operands is to be performed. Machine words of individual operands are first packed into SIMD registers, and one SIMD instruction for ⋆ is used to compute the output in a single SIMD register. The result stored in the output SIMD register can further be used in the remaining computations. With 64-bit words packed two at a time in 128-bit SIMD registers, this use of SIMD instructions is expected to let the operation finish in half as many clock cycles as are needed with normal 64-bit registers.

As an example, consider operands a and b each stored in an array of sixteen 64-bit words. Suppose that we need to compute the bit-wise XOR of a and b, and store the result in c. A usual 64-bit implementation calls for sixteen invocations of the CPU instruction for XOR. SIMD-based XOR handles 128 bits of the operands in one CPU instruction, and finishes after only eight invocations of this instruction. The output array c of SIMD registers is available in the packed format required in future data-parallel operations in which c is an input.
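The XOR example above can be sketched with SSE2 intrinsics as follows. The function names and the little-endian word order inside each register are assumptions of this sketch; only `_mm_xor_si128`, `_mm_loadu_si128` and `_mm_storeu_si128` from `<emmintrin.h>` are used.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

#define NWORDS 16        /* sixteen 64-bit words per operand */

/* Scalar version: sixteen 64-bit XOR instructions. */
static void xor_scalar(uint64_t c[NWORDS],
                       const uint64_t a[NWORDS], const uint64_t b[NWORDS])
{
    for (int i = 0; i < NWORDS; i++)
        c[i] = a[i] ^ b[i];
}

/* Packing: two consecutive 64-bit words go into one 128-bit register. */
static void pack_words(__m128i r[NWORDS / 2], const uint64_t w[NWORDS])
{
    for (int i = 0; i < NWORDS / 2; i++)
        r[i] = _mm_loadu_si128((const __m128i *)(w + 2 * i));
}

/* Unpacking: the inverse of pack_words. */
static void unpack_words(uint64_t w[NWORDS], const __m128i r[NWORDS / 2])
{
    for (int i = 0; i < NWORDS / 2; i++)
        _mm_storeu_si128((__m128i *)(w + 2 * i), r[i]);
}

/* Horizontal SIMD version: eight 128-bit XOR instructions on operands
   that are already in packed form; the result stays packed. */
static void xor_sse2(__m128i c[NWORDS / 2],
                     const __m128i a[NWORDS / 2], const __m128i b[NWORDS / 2])
{
    for (int i = 0; i < NWORDS / 2; i++)
        c[i] = _mm_xor_si128(a[i], b[i]);
}
```

If the operands are kept in packed form between operations, `pack_words` and `unpack_words` are needed only at the boundaries of a computation, and the XOR itself costs half as many instructions.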

[Figure 1: Horizontal vectorization. The words of Operand 1 and Operand 2 of a single input instance are packed into SIMD registers, and one SIMD operation ⋆ per register pair produces the result.]

There are, however, situations where horizontal vectorization requires unpacking of data after a CPU instruction. Consider the unary left-shift operation on an array $a$ of sixteen 64-bit words. Let us index the words of $a$ as $a_1, \ldots, a_{16}$. The words $a_{2i-1}$ and $a_{2i}$ are packed into an SIMD register $R_i$. Currently, SIMD intrinsics do not provide facilities for shifting $R_i$ as a 128-bit value by an arbitrary number of bits (only byte-level shifts are allowed). What we instead obtain in the output SIMD register is a 128-bit value in which both the 64-bit components are individually left-shifted. The void created in the shifted version of $a_{2i-1}$ needs to be filled by the most significant bits of the pre-shift value of $a_{2i}$. More frustratingly, the void created in $a_{2i}$ by the shift needs to be filled by the most significant bits of the pre-shift value of $a_{2i+1}$, which is a 64-bit member of a separate SIMD register $R_{i+1}$. The other 64-bit member $a_{2i+2}$ packed in $R_{i+1}$ must not interfere with the shifted value of $R_i$. Masking out $a_{2i+2}$ from $R_{i+1}$ eats up a clock cycle, and is, in principle, the same as unpacking. To sum up, horizontal vectorization may result in frequent scalar-to-vector and vector-to-scalar conversions, and suffer from huge packing and unpacking overheads.
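The extra work described above can be made concrete with a sketch of a multi-word left shift on horizontally packed data. The word order is an assumption of this sketch (word 0 is least significant, register i holds word 2i in its low lane and word 2i+1 in its high lane), so the direction of the indices is reversed with respect to the text; `shl_horizontal` is a hypothetical name. Byte-level shifts play the role of the masking step.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* In-place left shift by n bits (0 < n < 64) of a multi-word value
   stored horizontally in nregs 128-bit registers. */
static void shl_horizontal(__m128i r[], int nregs, int n)
{
    assert(n > 0 && n < 64);
    __m128i sl = _mm_cvtsi32_si128(n);        /* left-shift count  */
    __m128i sr = _mm_cvtsi32_si128(64 - n);   /* right-shift count */

    for (int i = nregs - 1; i >= 0; i--) {
        /* Top n bits of each 64-bit lane of this register. */
        __m128i carry = _mm_srl_epi64(r[i], sr);
        /* The carry of the low lane must move into the high lane:
           a byte-level shift of the whole register by 8 bytes. */
        __m128i cross = _mm_slli_si128(carry, 8);
        __m128i out = _mm_or_si128(_mm_sll_epi64(r[i], sl), cross);
        if (i > 0) {
            /* The low lane also needs the carry out of the high lane
               of the previous register; the other lane is discarded
               by a byte-level right shift. */
            __m128i prev = _mm_srl_epi64(r[i - 1], sr);
            out = _mm_or_si128(out, _mm_srli_si128(prev, 8));
        }
        r[i] = out;
    }
}
```

Per register, the sketch needs three bit-level shifts, two byte-level shifts and two ORs, compared with two shifts and one OR per word in scalar code, so the expected factor-of-two gain is largely lost.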

3.2 Vertical Vectorization

Vertical vectorization works as shown in Figure 2. Two instances of the same operation are carried out on two different sets of data. Data of matching operands from the two instances are packed into SIMD registers, and the same sequence of operations is performed on these registers using SIMD intrinsics. For each SIMD register, half of its data pertains to one instance, and the remaining half to the other instance. After each SIMD instruction, half of the output SIMD register contains the result for the first instance, and the remaining half the result for the second instance. Thus, data from two separate instances are maintained in 64-bit formats in these SIMD registers throughout a sequence of operations. When the sequence is completed, data from the final output SIMD registers are unpacked into the respective 64-bit storage outputs for the two instances.

[Figure 2: Vertical vectorization. Matching words of Operand 1 and Operand 2 from Instance 1 and Instance 2 are packed into SIMD registers, and each SIMD operation ⋆ produces the corresponding words of Result 1 and Result 2.]

The advantage of this vectorization technique is that it adapts naturally to any situation where two identical sequences of operations are performed on two separate sets of data. The algorithm does not need to possess inherent data parallelism. However, the sequence of operations must be identical (or nearly identical) on the two sets of data. Finally, a computation using vertical vectorization does not require data conversion after every SIMD operation in the CPU, that is, the potentially excessive packing and unpacking overheads associated with horizontal vectorization are largely eliminated.

Let us now explain how vertical vectorization gets rid of the unpacking requirement after a left-shift operation. Suppose that two operands $a$ and $b$ need to be left-shifted individually by the same number of bits. The $i$-th words $a_i$ and $b_i$ are packed in an SIMD register $R_i$. First, a suitably right-shifted version of $R_i$ is stored in another SIMD register $S_i$. After that, $R_i$ is left-shifted by a single SIMD instruction, causing both $a_i$ and $b_i$ to be left-shifted individually. This shifted SIMD register is then XOR-ed with the SIMD register $S_{i+1}$. The individual 64-bit words $a_{i+1}$ and $b_{i+1}$ are not needed in the unpacked form.
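A minimal sketch of this vertical left shift with SSE2 intrinsics is given below. It assumes that word 0 is the least significant word (so the neighbouring register is `r[i-1]` rather than the one with index i+1 in the text) and that operand a occupies the low lane and operand b the high lane; the names `pack_pair` and `shl_vertical` are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Pack word i of a (low lane) and word i of b (high lane) together. */
static void pack_pair(__m128i r[], const uint64_t a[], const uint64_t b[],
                      int nwords)
{
    for (int i = 0; i < nwords; i++)
        r[i] = _mm_set_epi64x((long long)b[i], (long long)a[i]);
}

/* Shift both packed operands left by n bits (0 < n < 64) in place. */
static void shl_vertical(__m128i r[], int nwords, int n)
{
    assert(n > 0 && n < 64);
    __m128i sl = _mm_cvtsi32_si128(n);        /* left-shift count  */
    __m128i sr = _mm_cvtsi32_si128(64 - n);   /* right-shift count */

    for (int i = nwords - 1; i > 0; i--) {
        /* S: bits carried out of the next lower word of a and of b. */
        __m128i s = _mm_srl_epi64(r[i - 1], sr);
        /* Shift both lanes and fill the voids in one XOR; nothing
           crosses from one lane to the other. */
        r[i] = _mm_xor_si128(_mm_sll_epi64(r[i], sl), s);
    }
    r[0] = _mm_sll_epi64(r[0], sl);
}
```

Each packed word costs two SIMD shifts and one XOR, the same instruction count as a scalar shift of a single operand, while serving two operands at once.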

3.3 Vectorization of Eta Pairing

Eta pairing on supersingular curves defined over fields of characteristics two and three can be computed using bit-wise operations only (that is, no arithmetic operations are needed). More precisely, only the XOR, OR, AND, and the left- and right-shift operations on 64-bit words are required. As explained earlier, both horizontal and vertical vectorizations behave gracefully for the XOR, OR and AND operations. In contrast, shift operations are efficient with vertical vectorization only. Therefore, the presence and importance of shift operations largely determine the relative performance of the two vectorization methods. We now study each individual field operation (in $\mathbb{F}_{2^{1223}}$ or $\mathbb{F}_{3^{509}}$) in this respect.

• Addition/Subtraction: Only XOR, OR and AND operations are needed to carry out addition and subtraction of two elements in both types of fields. So both the vectorization models are suitable for these operations.

• Multiplication (without Reduction): We use comb-based multiplication algorithms in which both left- and right-shift operations play a crucial role. Consequently, multiplication should be faster under vertical vectorization than under horizontal vectorization.

• Square/Cube (without Reduction): Since we have used precomputations in eight-bit chunks, byte-level shifts suffice, that is, both models of vectorization are efficient for these operations.

• Modular Reduction: Reduction using the chosen irreducible polynomials calls for bit-level shift operations, so vertical vectorization is favoured.

• Square-root/Cube-root with Modular Reduction: Extraction of the polynomials $a$, $b$ (and $c$ for characteristic three), and multiplication by $x^{1/2}$ (or $x^{1/3}$ and $x^{2/3}$) involve several shift operations. So vertical vectorization seems to be the better choice.

• Inverse: The extended Euclidean algorithm is problematic for both horizontal and vertical vectorization models. On the one hand, bit-level shifts impair the performance of horizontal vectorization. On the other hand, the sequence of operations in a gcd calculation depends heavily on the operands, rendering vertical vectorization infeasible to implement. We, therefore, use only non-SIMD implementations for the inverse operation.

Multiplication (with modular reduction) happens to be the most frequent operation in Algorithms 1 and 2. Vertical vectorization is, therefore, expected to outperform horizontal vectorization for these algorithms.
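To show why modular reduction depends on bit-level shifts, here is a word-level sketch of reduction modulo the trinomial $x^{1223} + x^{255} + 1$ of Section 2.1. It is a generic folding routine written for this illustration, not necessarily the algorithm of (Scott, 2007); the names `reduce_1223`, `reduce_ref` and `NW` are hypothetical, and the bit-serial routine serves only as a cross-check.

```c
#include <assert.h>
#include <stdint.h>

#define NW 20   /* 64-bit words per reduced element: ceil(1223/64) */

/* Reduce c (2*NW words, in place) modulo x^1223 + x^255 + 1.
   Since x^1223 = x^255 + 1, a bit at position p >= 1223 folds onto
   positions p - 1223 and p - 968. For a word with index i we have
   64*i - 1223 = 64*(i-20) + 57 and 64*i - 968 = 64*(i-16) + 56. */
static void reduce_1223(uint64_t c[2 * NW])
{
    for (int i = 2 * NW - 1; i >= NW; i--) {
        uint64_t t = c[i];
        c[i] = 0;
        c[i - 20] ^= t << 57;
        c[i - 19] ^= t >> 7;
        c[i - 16] ^= t << 56;
        c[i - 15] ^= t >> 8;
    }
    /* Word NW-1 still holds bits 1223..1279 above its low 7 bits. */
    uint64_t t = c[NW - 1] >> 7;
    c[NW - 1] &= 0x7F;
    c[0] ^= t;          /* positions p - 1223 */
    c[3] ^= t << 63;    /* positions p - 968 = 255 + (p - 1223) */
    c[4] ^= t >> 1;
}

/* Bit-serial reference reduction, used only as a cross-check. */
static void reduce_ref(uint64_t c[2 * NW])
{
    for (int p = 64 * 2 * NW - 1; p >= 1223; p--) {
        if ((c[p / 64] >> (p % 64)) & 1) {
            c[p / 64] ^= 1ULL << (p % 64);
            c[(p - 1223) / 64] ^= 1ULL << ((p - 1223) % 64);
            c[(p - 968) / 64] ^= 1ULL << ((p - 968) % 64);
        }
    }
}
```

The shift amounts (57, 7, 56, 8, 63, 1) are fixed by the polynomial and do not depend on the operand, so the same instruction sequence applies to two instances at once; none of them is a multiple of eight, so byte-level SIMD shifts do not suffice.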

4 EXPERIMENTAL RESULTS

We implement eta pairing on supersingular elliptic curves defined over fields of characteristics two and three. SSE2 intrinsics of an Intel Xeon E5410 2.33 GHz processor are used in our 64-bit C implementations. We measure the timing of field operations and pairing computation with the -O2 optimization flag of the gcc 4.3.2 compiler. The timing results are reported in clock cycles. For non-SIMD and horizontal-SIMD implementations, the timings correspond to the execution of one field operation or one eta-pairing computation. For the vertical-SIMD implementation, two operations are performed in parallel. The times obtained by our implementation are divided by two in the tables below in order to indicate the average time per operation. This is done to make the results directly comparable with the results from the non-SIMD and horizontal-SIMD implementations. We use gprof and valgrind to profile our program. Special care is taken to minimize cache misses (Drepper, 2007).

Table 1: Timing for field operations in $\mathbb{F}_{2^{1223}}$ (clock cycles).

Mode        Add     Mult*      Sqr*     Sqrt*
Non-SIMD    44.5    16098.2    471.3    2831
SIMD (H)    21.2    13019.4    534.8    3051.2
SIMD (V)    21.7    9587.9     445.9    2253.7
Ref 1       -       8200       600      500
Ref 2       -       5438.4     480      748.8
Ref 3       -       4030       160      166

* Including modular reduction
Ref 1: (Hankerson et al., 2008)
Ref 2: (Beuchat et al., 2009)
Ref 3: (Aranha et al., 2010)

Table 2: Timing for field operations in $\mathbb{F}_{3^{509}}$ (clock cycles).

Mode        Add     Mult*      Cube*     Cube root*
Non-SIMD    38.1    14277.6    946.8     5769.9
SIMD (H)    22.2    12071.5    1045.3    5833.5
SIMD (V)    19.1    10002      785.8     4130.1
Ref 1       -       7700       900       1200
Ref 2       -       4128       900       974.4

* Including modular reduction
Ref 1: (Hankerson et al., 2008)
Ref 2: (Beuchat et al., 2009)

Tables 1 and 2 summarize the average computation times of basic field operations in $\mathbb{F}_{2^{1223}}$ and $\mathbb{F}_{3^{509}}$. For the addition and multiplication operations, both types of SIMD-based implementations perform better than the non-SIMD implementation. For the square, square-root, cube and cube-root operations, the performance of the horizontal implementation is slightly poorer than that of the non-SIMD implementation, whereas the performance of the vertical implementation is noticeably better than that of the non-SIMD implementation. The experimental results tally with our theoretical observations discussed in Section 3.3. That is, field operations involving bit-level shifts significantly benefit from the vertical model of vectorization. In particular, the time of each multiplication operation can be reduced by 10–20% using horizontal vectorization. For vertical vectorization, this reduction is in the range 30–40%.

In Table 3, we mention the average times for computing one eta pairing for non-SIMD, horizontal-SIMD and vertical-SIMD implementations. The speedup figures tabulated are with respect to the non-SIMD implementation. Vertical vectorization is seen to significantly outperform both non-SIMD and horizontal-SIMD implementations.

Table 3: Times for computing one eta pairing (in millions of clock cycles).

Mode        Characteristic    Time     Speedup
Non-SIMD    2                 75.2     -
Non-SIMD    3                 59.4     -
SIMD (H)    2                 62.2     17.3%
SIMD (H)    3                 50.8     14.5%
SIMD (V)    2                 41.9     44.3%
SIMD (V)    3                 42.8     27.9%
Ref 1       2                 39       -
Ref 1       3                 33       -
Ref 2       2                 26.86    -
Ref 2       3                 22.01    -
Ref 3       2                 18.76    -

Ref 1: (Hankerson et al., 2008)
Ref 2: (Beuchat et al., 2009)
Ref 3: (Aranha et al., 2010)
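The speedup percentages of Table 3 appear consistent with the relative reduction in running time with respect to the non-SIMD implementation. The formula below is inferred from the tabulated numbers rather than stated in the text, and the symbols $T_{\text{non-SIMD}}$ and $T_{\text{SIMD}}$ are introduced here only for illustration.

```latex
\[
\text{speedup} = \frac{T_{\text{non-SIMD}} - T_{\text{SIMD}}}{T_{\text{non-SIMD}}},
\qquad
\frac{75.2 - 62.2}{75.2} \approx 17.3\%,
\quad
\frac{75.2 - 41.9}{75.2} \approx 44.3\%,
\quad
\frac{59.4 - 50.8}{59.4} \approx 14.5\%,
\quad
\frac{59.4 - 42.8}{59.4} \approx 27.9\%.
\]
```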

In Tables 1–3, we also mention other reported implementation results on finite-field arithmetic and eta-pairing computation. Our implementations are slower than these three reported implementations. These works make use of higher SSE features, whereas we have used only SSE2 intrinsics. The papers (Beuchat et al., 2009; Aranha et al., 2010) also report multi-threaded implementations of eta pairing, which our work does not deal with. The timings given in the above tables correspond to single threads only. The main objective of our work is to compare horizontal vectorization with vertical vectorization for the implementation of eta pairing over fields of characteristics two and three. To this extent, our experimental results, although slower than the best reported implementations, appear to have served our objectives. How vertical vectorization performs for the implementations of (Hankerson et al., 2008; Beuchat et al., 2009; Aranha et al., 2010) remains an open research problem.

5 CONCLUSIONS

In this paper, we focus on efficient SIMD-based software implementations of eta pairing on supersingular elliptic curves over fields of characteristics two and three. Horizontal and vertical vectorization techniques are studied, and our implementation results establish the superiority of vertical vectorization over horizontal vectorization in the context of eta pairing computations over fields of small characteristics. Some possible extensions of our work are stated now.

• We have studied the two vectorization models for bit-wise operations only. It is unclear how the two models compare when arithmetic operations are involved. For example, pairing on elliptic curves defined over prime fields heavily uses multiple-precision integer arithmetic. Other types of pairing (on ordinary curves) and even other cryptographic primitives (like DSA in prime fields and RSA under common moduli) also require integer arithmetic. Managing carries and borrows during addition and subtraction stands in the way of effective vectorization. Multiplication poses a more potent threat to data-parallelism ideas.

• We have used only the SSE2 intrinsics set, chiefly because of its wide availability. It is worthwhile to investigate the impacts of exploiting additional features supplied by higher SSE versions. One may also use other intrinsics sets (like IBM's AltiVec and AMD's 3DNow!) to compare the two models of vectorization.

• GPUs, although not as common and cheap as SIMD registers, constitute an emerging platform for vectorizing pairing computations. Distributing computations among GPU threads and the associated memory-management issues (like packing operands in arrays instead of SIMD registers) are substantially different from the type of experiments we have reported in this paper. The maximum parallelizability of eta pairing under horizontal vectorization in the context of GPUs is limited by the size of the operands. In contrast, vertical vectorization imposes no such restrictions, and is capable of making eta pairing arbitrarily parallelizable, at least in theory. A practical validation of this claim is another area that merits research attention.

REFERENCES

Ahmadi, O., Hankerson, D., and Menezes, A. (2007). Software Implementation of Arithmetic in $\mathbb{F}_{3^m}$. In International Workshop on the Arithmetic of Finite Fields (WAIFI 2007), pages 85–102.

Ahmadi, O. and Rodriguez-Henriquez, F. (2010). Low Complexity Cubing and Cube Root Computation over $\mathbb{F}_{3^m}$ in Polynomial Basis. IEEE Transactions on Computers, 59:1297–1308.

Aranha, D. F., López, J., and Hankerson, D. (2010). High-Speed Parallel Software Implementation of the $\eta_T$ Pairing. In CT-RSA 2010, pages 89–105.

Barreto, P. S. L. M. (2004). A Note on Efficient Computation of Cube Roots in Characteristic 3. In IACR Eprint Archive. http://eprint.iacr.org/2004/305.

SECRYPT2012-InternationalConferenceonSecurityandCryptography

100

Barreto, P. S. L. M., Galbraith, S. D., Ó hÉigeartaigh, C., and Scott, M. (2007). Efficient Pairing Computation on Supersingular Abelian Varieties. Designs, Codes and Cryptography, 42(3):239–271.

Barreto, P. S. L. M., Kim, H. Y., Lynn, B., and Scott, M. (2002). Efficient Algorithms for Pairing-Based Cryptosystems. In CRYPTO 2002, pages 354–368.

Beuchat, J.-L., López-Trejo, E., Martínez-Ramos, L., Mitsunari, S., and Rodríguez-Henríquez, F. (2009). Multi-core Implementation of the Tate Pairing over Supersingular Elliptic Curves. In Cryptology and Network Security, pages 413–432.

Boneh, D. and Franklin, M. K. (2001). Identity-Based Encryption from the Weil Pairing. In CRYPTO 2001, pages 213–229.

Boneh, D., Lynn, B., and Shacham, H. (2004). Short Signatures from the Weil Pairing. Journal of Cryptology, 17:297–319.

Drepper, U. (2007). What Every Programmer Should Know About Memory. http://lwn.net/Articles/250967/.

Freeman, D., Scott, M., and Teske, E. (2010). A Taxonomy of Pairing-Friendly Elliptic Curves. Journal of Cryptology, 23:224–280.

Gorla, E., Puttmann, C., and Shokrollahi, J. (2007). Explicit Formulas for Efficient Multiplication in $\mathbb{F}_{3^{6m}}$. In SAC, pages 173–183. http://portal.acm.org/citation.cfm?id=1784881.1784893.

Grabher, P., Großschädl, J., and Page, D. (2008). On Software Parallel Implementation of Cryptographic Pairings. In SAC, pages 35–50.

Granger, R., Page, D., and Stam, M. (2005). Hardware and Software Normal Basis Arithmetic for Pairing-Based Cryptography in Characteristic Three. IEEE Transactions on Computers, 54(7):852–860.

Hankerson, D., Menezes, A., and Scott, M. (2008). Software Implementation of Pairings. In Identity Based Cryptography, pages 188–206. IOS Press.

Hess, F., Smart, N. P., and Vercauteren, F. (2006). The Eta Pairing Revisited. IEEE Transactions on Information Theory, 52(10):4595–4602.

Joux, A. (2004). A One Round Protocol for Tripartite Diffie-Hellman. Journal of Cryptology, 17:263–276.

Kawahara, Y., Aoki, K., and Takagi, T. (2008). Faster Implementation of $\eta_T$ Pairing over GF($3^m$) Using Minimum Number of Logical Instructions for GF(3)-Addition. In Pairing, pages 282–296.

Kerins, T., Marnane, W. P., Popovici, E. M., Barreto, P. S. L. M., and Brazil, S. P. (2005). Efficient Hardware For The Tate Pairing Calculation In Characteristic Three. In CHES, pages 412–426.

Lee, E., Lee, H. S., and Park, C. M. (2009). Efficient and Generalized Pairing Computation on Abelian Varieties. IEEE Transactions on Information Theory, 55:1793–1803.

López, J. and Dahab, R. (2000). High Speed Software Implementation in $\mathbb{F}_{2^m}$. In Indocrypt 2000, LNCS, pages 93–102.

Microsoft (2010). MMX, SSE, and SSE2 Intrinsics. http://msdn.microsoft.com/en-us/library/y0dh78ez.

Miller, V. (2004). The Weil Pairing and Its Efficient Calculation. Journal of Cryptology, 17:235–261.

Montgomery, P. L. (1991). Vectorization of the Elliptic Curve Method. ACM.

Page, D. and Smart, N. P. (2004). Parallel Cryptographic Arithmetic Using a Redundant Montgomery Representation. IEEE Transactions on Computers, 53:1474–1482.

Scott, M. (2007). Optimal Irreducible Polynomials for GF($2^m$) Arithmetic. In IACR Eprint Archive. http://eprint.iacr.org/2007/192.

Scott, M., Benger, N., Charlemagne, M., Perez, L. J. D., and Kachisa, E. J. (2009). On the Final Exponentiation for Calculating Pairings on Ordinary Elliptic Curves. In Pairing-Based Cryptography - Pairing 2009, LNCS, pages 78–88.

Smart, N. P., Harrison, K., and Page, D. (2002). Software Implementation of Finite Fields of Characteristic Three. LMS Journal of Computation and Mathematics, 5:181–193.

Takahashi, G., Hoshino, F., and Kobayashi, T. (2007). Efficient GF($3^m$) Multiplication Algorithm for $\eta_T$ Pairing. In IACR Eprint Archive. http://eprint.iacr.org/2007/463.

Vercauteren, F. (2010). Optimal Pairings. IEEE Transactions on Information Theory, 56:455–461.

SIMD-basedImplementationsofEtaPairingOverFiniteFieldsofSmallCharacteristics

101