of the MixBytes computation to take more advantage
of the 3 available ALUs in current Intel processors by
minimizing the dependency chains. Also, future CPU
features like AVX will provide another opportunity
to increase the performance, especially for the larger
variant Grøstl-512.
ACKNOWLEDGEMENTS
The authors thank Krystian Matusiewicz for useful
discussions and for fine-tuning the AES-NI imple-
mentations. This work was supported in part by the
European Commission through the ICT Programme
under Contract ICT-2007-216646 ECRYPT II, by the
Austrian Science Fund (FWF), project P21936, and by
the IAP Programme P6/26 BCRYPT of the Belgian
State (Belgian Science Policy).
REFERENCES
Atmel (2003). 8-bit AVR Microcontroller with 16K
Bytes In-System Programmable Flash. ATmega163.
Retrieved December 21, 2010, from
http://www.atmel.com/dyn/resources/prod_documents/doc1142.pdf.
Benadjila, R., Billet, O., Gueron, S., and Robshaw, M.
(2009). The Intel AES Instructions Set and the SHA-3
Candidates. Retrieved December 22, 2010, from
http://crypto.rd.francetelecom.com/ECHO/sha3/AES/.
Çalik, Ç. (2010). Multi-stream and Constant-time
SHA-3 Implementations. NIST hash function
mailing list. Retrieved May 03, 2010, from
http://www.metu.edu.tr/~ccalik/software.html#sha3.
Fog, A. (2010). Instruction tables - Lists of instruction la-
tencies, throughputs and microoperation breakdowns
for Intel, AMD and VIA CPUs. Retrieved December
22, 2010, from http://www.agner.org/optimize/.
Fouque, P.-A., Stern, J., and Zimmer, S. (2009). Cryptanal-
ysis of Tweaked Versions of SMASH and Reparation.
In Avanzi, R., Keliher, L., and Sica, F., editors, Se-
lected Areas in Cryptography 2008, Proceedings, vol-
ume 5381 of LNCS, pages 136–150. Springer.
Gauravaram, P., Knudsen, L. R., Matusiewicz, K., Mendel,
F., Rechberger, C., Schläffer, M., and Thomsen, S. S.
(2011). Grøstl – a SHA-3 candidate. Submission
to NIST (Round 3). Retrieved May 03, 2010, from
http://www.groestl.info.
Gueron, S. and Intel Corp. (2010). Intel® Advanced
Encryption Standard (AES) Instructions
Set. Retrieved December 21, 2010, from
http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set/.
Hamburg, M. (2009). Accelerating AES with Vector Per-
mute Instructions. In Clavier, C. and Gaj, K., editors,
CHES, volume 5747 of LNCS, pages 18–32. Springer.
Intel Corp. (1996). Using MMX™ Instructions to
Transpose a Matrix. Retrieved July 12, 2011, from
ftp://download.intel.com/ids/mmx/MMX_App_Transpose_Matrix.pdf.
Intel Corp. (2010). Intel® 64 and IA-32 Architectures
Software Developer's Manual. Retrieved December 21, 2010, from
http://www.intel.com/products/processor/manuals/.
National Institute of Standards and Technology (2001).
FIPS PUB 197, Advanced Encryption Standard
(AES). Federal Information Processing Standards
Publication 197, U.S. Department of Commerce.
National Institute of Standards and Technology (2007).
Cryptographic Hash Project. Available online at
http://www.nist.gov/hash-competition.
Roland, G. A. (2009). Efficient Implementation of
the Grøstl-256 Hash Function on an ATmega163
Microcontroller. Retrieved May 03, 2010, from
http://groestl.info.
APPENDIX
The following tables show how the message is loaded,
transposed, XORed to the chaining value and stored
in XMM registers for the byte slice implementation
of Grøstl-256. We use a sequence of PUNPCK and PSHUFB
instructions to get the required formats.
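As a rough illustration of the two instruction families named above, the following Python sketch emulates the byte-level behaviour of PUNPCKLBW, PUNPCKHBW and PSHUFB on 16-byte values. It is a model for reasoning about data layouts, not the implementation used in the paper. Index 0 is taken to be the least significant byte of the register, and the function names are chosen here for illustration only.

```python
def punpcklbw(dst, src):
    """Interleave the low 8 bytes of dst and src (dst byte first)."""
    return bytes(b for pair in zip(dst[:8], src[:8]) for b in pair)


def punpckhbw(dst, src):
    """Interleave the high 8 bytes of dst and src (dst byte first)."""
    return bytes(b for pair in zip(dst[8:], src[8:]) for b in pair)


def pshufb(dst, mask):
    """Result byte i is dst[mask[i] & 0x0F], or 0 if bit 7 of mask[i] is set."""
    return bytes(0 if m & 0x80 else dst[m & 0x0F] for m in mask)


if __name__ == "__main__":
    a = bytes(range(16))
    b = bytes(range(16, 32))
    print(list(punpcklbw(a, b)))
    print(list(pshufb(a, bytes(reversed(range(16))))))
```

Interleaving at byte, word, doubleword and quadword granularity is the usual building block for register-level transposition, while a byte shuffle can apply an arbitrary fixed permutation within one register.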
First, the message block bytes $M_{ij}$ are loaded into
4 XMM registers (see Table 5). Note that in Grøstl
the message is loaded in column ordering format.
Hence, the message needs to be transposed to get two
rows of the $M_{ij}$ in one XMM register (see Table 6).
The chaining value is kept in the same format. Then,
the initial XOR is computed to get $P_{ij} = H_{ij} \oplus M_{ij}$
(see Table 7).
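The load, transpose and initial XOR steps can be sketched in plain Python, with each XMM register modelled as a 16-byte value. This is a minimal model under stated assumptions: message byte k is placed at row k mod 8, column k div 8 (column ordering), and register r is assumed to hold row 2r in its low half and row 2r+1 in its high half. The exact in-register byte order of Table 6 is not reproduced here, and the function names are hypothetical.

```python
def load_message(block):
    """Split a 64-byte block into four 16-byte registers (as in Table 5)."""
    assert len(block) == 64
    return [block[16 * r:16 * r + 16] for r in range(4)]


def transpose_two_rows(regs):
    """Regroup column-ordered bytes so register r holds rows 2r and 2r+1.

    The order of the two rows inside a register is an assumption.
    """
    flat = b"".join(regs)  # flat[8 * col + row] is the state byte (row, col)
    return [
        bytes(flat[8 * col + row] for row in (2 * r, 2 * r + 1) for col in range(8))
        for r in range(4)
    ]


def xor_regs(a, b):
    """Register-wise XOR, modelling one PXOR per register pair."""
    return [bytes(x ^ y for x, y in zip(ra, rb)) for ra, rb in zip(a, b)]


if __name__ == "__main__":
    m = transpose_two_rows(load_message(bytes(range(64))))
    h = [bytes(16)] * 4          # placeholder chaining value, not the real IV
    p = xor_regs(h, m)           # P = H xor M
    print(list(p[0]))
```

Because the chaining value is kept in the transposed format, only the message has to be transposed before the XOR.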
To get one row of P and Q in one XMM register,
we need to reorder and transpose both $P_{ij}$ and $M_{ij}$
again (see Table 8 and Table 9). This format is used
throughout all 10 rounds of Grøstl-256, and we transpose
back to the chaining value format to compute the
final XOR of P and Q and the feed-forward.
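The second reordering and the final step can be modelled in the same way. The sketch below assumes that register i holds row i of P in its low half and row i of Q in its high half; the actual byte order of Tables 8 and 9 may differ. The round computations are omitted entirely, so the code only illustrates the format conversion, the XOR of P and Q, and the feed-forward. All function names are hypothetical.

```python
def split_rows(two_row_regs):
    """Return the 8 rows (8 bytes each) held in four two-row registers."""
    return [reg[h:h + 8] for reg in two_row_regs for h in (0, 8)]


def pack_p_q(p_regs, q_regs):
    """Build 8 registers: row i of P (low half) next to row i of Q (high half)."""
    return [p + q for p, q in zip(split_rows(p_regs), split_rows(q_regs))]


def final_xor_and_feed_forward(pq_regs, h_regs):
    """Return to the chaining value format and compute P xor Q xor H."""
    rows = [bytes(a ^ b for a, b in zip(r[:8], r[8:])) for r in pq_regs]
    two_rows = [rows[2 * i] + rows[2 * i + 1] for i in range(4)]
    return [bytes(x ^ y for x, y in zip(t, h)) for t, h in zip(two_rows, h_regs)]


if __name__ == "__main__":
    p = [bytes(range(16 * i, 16 * i + 16)) for i in range(4)]  # stand-in for P
    q = [bytes(16)] * 4                                        # stand-in for Q
    pq = pack_p_q(p, q)           # format used during the rounds
    h = [bytes(16)] * 4           # placeholder chaining value
    print(final_xor_and_feed_forward(pq, h) == p)
```

Keeping one row of P and one row of Q side by side lets a single 128-bit instruction process both permutations at once, which is the motivation for this layout.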
Table 5: Loading the message block into XMM0-XMM3.
XMM3 XMM2 XMM1 XMM0
M48 M32 M16 M0
M49 M33 M17 M1
M50 M34 M18 M2
M51 M35 M19 M3
M52 M36 M20 M4
M53 M37 M21 M5
M54 M38 M22 M6
M55 M39 M23 M7
M56 M40 M24 M8
M57 M41 M25 M9
M58 M42 M26 M10
M59 M43 M27 M11
M60 M44 M28 M12
M61 M45 M29 M13
M62 M46 M30 M14
M63 M47 M31 M15
SECRYPT 2011 - International Conference on Security and Cryptography