Design and Implementation of an Espionage Network for

Cache-based Side Channel Attacks on AES

Bholanath Roy

, Ravi Prakash Giri

, Ashokkumar C.

and Bernard Menezes

Department of Computer Science, Indian Institute of Technology - Bombay, Mumbai, India

Department of Computer Science, Shri Mata Vaishno Devi University, Jammu, India

Keywords: Side Channel Attacks, AES, Caches, Lookup Tables, Spy Process, Victim Process.

Abstract: We design and implement the espionage infrastructure to launch a cache-based side channel attack on AES.

This includes a spy controller and a ring of spy threads with associated analytic capabilities – all hosted on a

single server. By causing the victim process (which repeatedly performs AES encryptions) to be interrupted,

the spy threads capture the victim’s footprints in the cache memory where the lookup tables reside.

Preliminary results indicate that our setup can deduce the encryption key in fewer than 30 encryptions and

with far fewer victim interruptions compared to previous work. Moreover, this approach can be easily

adapted to work on diverse hardware/OS platforms and on different versions of OpenSSL.

1 INTRODUCTION

Through much of the history of cryptography,

attacks on cryptographic algorithms have focused on

cracking hard mathematical problems such as the

factorization of very large integers (which are the

product of two very large primes) and the discrete

logarithm problem (Menezes et al., 1996). More

recently, however, side channel attacks have gained

prominence. These attacks leak sensitive

information through physical channels such as

power, timing, etc. and, typically, are specific to the

actual implementation of the algorithm (Brumley

and Boneh, 2005). An important class of timing

attacks are those based on obtaining measurements

from cache memory systems.

The Advanced Encryption Standard (AES)

(Daemen and Rijmen, 2002) a relatively new

algorithm for secret key cryptography, is now

ubiquitously supported on servers, browsers, etc.

Almost all software implementations of AES

including the widely used cryptographic library,

OpenSSL, make extensive use of table lookups in

lieu of time-consuming mathematical field

operations. Cache-based side channel attacks aim to

retrieve the key of a victim performing AES by

exploiting the fact that access times to different

levels of the memory hierarchy are different.

One possible attack scenario involves a victim

process running on behalf of a data storage service

provider who securely stores documents from

multiple clients and furnishes them on request after

due authentication. The same key or set of keys is

used to encrypt documents from different clients

prior to storage. The attacker or spy shares the same

core as the victim. The spy process flushes out all

the cache lines containing the AES tables – these are

used by the victim in encrypting documents prior to

storage. When CPU control returns to the victim, it

brings in some of the evicted line(s). When control

returns back to the spy, it determines which of the

evicted lines were fetched by the victim by

measuring the time to access them. Information on

which lines of cache were accessed by the victim is

critical in deducing the encryption key.

Cache-based side channel attacks belong to one

of three categories. Timing-driven attacks measure

the time to complete an encryption. Trace-driven

attacks make use of when and in what order specific

lines of cache are accessed in the course of an

encryption. Finally, access-driven attacks need

information only about which lines of cache have

been accessed, not the precise order. Two of the

most successful access-driven attacks (Tromer et al.,

2010), (Gullasch et al., 2011) assume that the victim

and spy are co-located on the same core.

Tromer et al., (2010) assume that the spy is able

to monitor the state of the cache containing the AES

tables after each encryption performed by the victim.

However, this may not always be practically

441

Roy B., Prakash Giri R., C. A. and Menezes B..

Design and Implementation of an Espionage Network for Cache-based Side Channel Attacks on AES.

DOI: 10.5220/0005576804410447

In Proceedings of the 12th International Conference on Security and Cryptography (SECRYPT-2015), pages 441-447

ISBN: 978-989-758-117-5

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

feasible. Moreover, several hundred encryptions are

required to compute the complete AES key.

Gullasch et al., (2011) use hundreds of spy threads

to monitor the cache. The completely fair (process)

scheduler (CFS) in Linux guarantees that each spy

thread and the victim process get the same aggregate

CPU time in steady state. Also, the sleep-wakeup

routines of the spy threads are carefully controlled

resulting in preemption of the victim when a

sleeping thread wakes up. This, together with the

scheduler’s “fairness guarantees”, ensures that the

victim is preempted after it makes only a single table

access. Their fine-grained approach requires a neural

network to handle the large number of false

positives. Moreover, the slowdown caused by

frequent interruptions of the victim may arouse

suspicion.

The goal of our work is the design and

implementation of an espionage network with

associated analytic capabilities that retrieve the AES

key using fewer encryptions and also fewer

interruptions to the victim process. We aim for both

simplicity and versatility. Earlier attacks may work

on only specific OS versions or with specific

versions of OpenSSL. Further, hardware prefetching

(Hennessy and Patterson, 2012) (implemented on

many modern processors) may render those attacks

unsuccessful. To the extent possible, we seek to

demonstrate successful attacks on diverse computing

platforms and with different OpenSSL versions.

This paper is organized as follows: Section 2

summarizes related work. Section 3 contains a brief

introduction to AES implementation using lookup

tables and the cryptanalytic aspects of the attack. In

Section 4 we present the design and implementation

of our espionage network. A preliminary analysis of

the success of our approach is presented in Section 5

while Section 6 concludes the paper.

2 RELATED WORK

Software implementations of AES based on lookup

tables were first exploited by Bernstein (2005). They

report the extraction of a complete AES key by

exploiting the timing dependencies of encryptions

caused by cache on a Pentium-III machine.

Although their attack is generic and portable, it

needs 2

.

encryptions and sample timing

measurements with known key in an identical

configuration of target server.

Tsunoo et al., (2003) demonstrated a timing-

driven cache attack on DES. They focused on

overall hit ratio during encryption and performed the

attack by exploiting the correlation between cache

hits and encryption time. A similar approach was

used by Bonneau and Mironov (2006) where they

emphasized individual cache collisions during

encryption instead of overall hit ratio. Although the

attack by Bonneau et al. was a considerable

improvement over previous work by Tsunoo et al.

(2003), it still requires 2



timing samples.

Osvik et al., (2006) proposed an access-driven

cache attack where they introduced the Prime and

Probe technique. In the Prime phase, the attacker

fills cache with its own data before encryption

begins. During encryption, the victim evicts some of

the attacker's data from cache in order to load

lookup table entries. In the Probe phase, the attacker

calculates reloading time of its data and finds cache

misses corresponding to those lines where the victim

loaded lookup table entries. In the synchronous

version of their attack, 300 encryptions were

required to recover the 128 bit AES key on

Athlon64 system and in the asynchronous attack,

45.7 bits of information about the key were

effectively retrieved.

The ability to detect whether a cache line has

been evicted or not was further exploited by Neve

and Seifert (2007). They designed an improved

access-driven cache attack on the last round of AES

on single-threaded processors. However the

practicality of their attack was not clear due to

insufficient system and OS kernel version details.

Gullasch et al., (2011) proposed an efficient

access driven cache attack when attacker and victim

use a shared crypto library. The spy process first

flushes the AES lookup tables from all levels of

cache and interrupts the victim process after

allowing it a single lookup table access. After every

interrupt, it calculates the reload time to find which

memory line is accessed by the victim. This

information is further processed using a neural

network to remove noise in order to retrieve the

AES key.

Weiß et al., (2012) used Bernstein's timing attack

on AES running inside an ARM Cortex-A8 single

core system in a virtualized environment to extract

the AES encryption key. Irazoqui, Inci, Eisenbarth

and Sunar (2014a) performed Bernstein's cache

based timing attack in a virtualized environment to

recover the AES secret key from co-resident VM

with 2



encryptions. Later Irazoqui et al., (2014b)

used a Flush + Reload technique and recovered the

AES secret key with 2



encryptions.

SECRYPT2015-InternationalConferenceonSecurityandCryptography

442

3 PRELIMINARIES

We first summarize the software implementation of

AES and then outline the Two Round Attack used in

this paper.

A. AES Summary

AES is a symmetric key algorithm standardized by

the U.S. National Institute of Standards and

Technology (NIST) in 2001. Its popularity is due to

simplicity in its implementation yet it is resistant to

various attacks including linear and differential

cryptanalysis. The full description of AES cipher is

provided in (Daemen and Rijmen 2002). Here, we

briefly summarize only the relevant aspects of its

software implementation.

AES is a substitution-permutation network. It

supports a key size of 128, 192 or 256 bits and block

size = 128 bits. A round function is repeated a fixed

number of times (10 for key size of 128 bits) to

convert 128 bits of plaintext to 128 bits of

ciphertext. The 16 byte input is expressed as a 4×4

array of bytes. Each round involves four steps –

Byte Substitution, Row Shift, Column Mixing and a

round key operation. The round operations are

defined using algebraic operations over the field

2



. The original 16-byte secret key 



,..,





is used to derive 10 different round keys to be used

in the round key operation of each round.

In a software implementation, field operations

are replaced by relatively inexpensive table lookups

thereby speeding encryption and decryption. In the

versions of OpenSSL targeted in this paper, five

tables are employed (each of size 1KB). Only four

table lookups and four XORs are involved in

computing each of the four columns per round. Each

table, 



,0    4, is accessed using an 8 bit

index resulting in a 32-bit output.

Given a 16-byte plaintext 



,...,



 ,

encryption proceeds by computing a 16-byte

intermediate state 

















,...,









 at each

round . The initial state 



is computed using

Equation 5. The first 9 rounds are computed by

updating the intermediate state using the following

set of equations, for 0,...,8.











,









,









,









←















⨁













⨁













⨁













⨁









(1)











,









,









,









←















⨁













⨁













⨁













⨁









(2)











,









,









,









←















⨁













⨁













⨁













⨁









(3)











,









,









,









←















⨁













⨁













⨁













⨁









(4)

Here 









refers to the 



column vector of the



1





round key expressed as a 4x4 array.

Finally to compute the last round, equations similar

to the above are used except that table 



is used

instead of 



,...,



. The output of the last round is

the ciphertext. The change of lookup tables in the

last round (for  = 10) is due to the absence of the

Column Mixing step.

B. Attack Overview

The granularity of cache access is a block or line

which is 64 bytes in most of our target machines.

Each entry in a table is 4 bytes, so there are 16

entries in each block. Each table contains 256

entries, so it occupies 16 blocks. The first four bits

of an 8-bit table index identify a line within the table

while the last four bits specify the position of the

entry within the line. Thus the first four bits of a

table index are leaked if the attacker can determine

which line of the cache was accessed.

The access-driven cache timing attack described

by Osvik et al., (2006) assumes that the attacker

provides several blocks of plaintext to be encrypted

by the victim during the course of the attack. It

involves two steps. The First Round Attack exploits

the table indices accessed in the first round which

are simply















⨁



,015

(5)

Thus the first four bits of each of the 16 bytes of the

AES key may be derived from the high-order nibble

of the corresponding plaintext and the corresponding

line number of the AES table.

The Second Round Attack seeks to obtain the

low-order nibble of each byte of the key. These can

be deduced from the set of equations in Table 1

which involve only four accesses – one each from





,



,



and 



respectively. These equations are

further used in Section 5.

DesignandImplementationofanEspionageNetworkforCache-basedSideChannelAttacksonAES

443

Table 1: Equations used in the Second Round Attack.



















⨁ 





⨁







⨁





⨁ 2 ⦁ 







⨁





⨁3⦁







⨁





⨁









⨁



(6)



















⨁





⨁2⦁







⨁





⨁3⦁







⨁





⨁







⨁





⨁









⨁



⨁



(7)











 2⦁







⨁





⨁3⦁







⨁





⨁







⨁





⨁







⨁





⨁









⨁



⨁



⨁



⨁1

(8)











 3⦁







⨁





⨁ 







⨁





⨁







⨁





⨁2⦁







⨁





⨁









⨁



⨁



⨁



⨁



(9)

4 THE ESPIONAGE

INFRASTRUCTURE

Our espionage infrastructure (Figure 1) comprises an

Espionage Network and the Centre for Advanced

Analytics (CAA). The former includes a Spy

Controller (SC) and a Spy Ring. The SC runs on one

CPU core while the ring of spy threads runs on

another core together with the victim.

Figure 1: The Espionage Infrastructure.

Our goal is to cause the execution of spy threads

and the victim (V) to be interleaved as shown in

Figure 2 (in steady state). We refer to an execution

instance of V as a run. During each run, V accesses

the AES tables and the next spy thread that is

scheduled attempts to determine which lines of the

table were accessed in the preceding run of V. The

default time slice (or quantum) assigned by the OS

to a process is large enough to accommodate tens of

thousands of cache accesses. But if V is given this

full time slice it would perform hundreds of

encryptions (each encryption involves 160 table

accesses) thus making it impossible to obtain any

meaningful information about the encryption key.

Figure 2: Execution timeline of spy threads and victim.

The CFS scheduler employed in many Linux

versions uses a calculation based on virtual runtimes

to ensure that the aggregate CPU times allocated to

all processes and threads are nearly equal. In Figure

2, the sum of the CPU times allocated to V is equal

to that of the times given to each of the spy threads.

If the number of threads is  and they execute in

round robin fashion, then each run of V is roughly of

duration / where  is the uninterrupted time

allocated to any running thread.

The task of a spy thread is to measure the access

times of each of the cache lines containing the AES

tables and then flush the tables from all levels of

cache. It then signals the SC through a shared

boolean variable, finished, that its task is complete.

Finally it waits for an amount of time δ



before

blocking on cond. At this point, all spy threads are in

the blocked state and the OS resumes execution of

Spy Thread (i):

while(true)

wait (cond)

for each cacheLine containing AES

tables

if(accessTime[cacheLine]<THRESHOLD)

isAccessed[cacheLine] = true

clflush(cacheLine)

finished = true

delay loop // time=δ



The SC continuously polls the finished flag. When it

finds that finished has been set, it waits for time δ



and then wakes up the spy thread that has waited the

longest.

Spy Controller :

while (true)

while(finished ≠ true)

delay loop // time=δ



signal(nextThread)

finished = false

Our experiments were performed on Intel(R) Core-

i5 2540M, 2.60GHz processor running Debian Kali

Linux 1.1.0, 64bit, kernel versions 3.14.5 and 3.18

SECRYPT2015-InternationalConferenceonSecurityandCryptography

444

using the C implementation of AES in OpenSSL

0.9.8a. This version of OpenSSL uses a separate

table for the last round of encryption. The core-i5

has 3-level cache architecture. The L1 cache is

32KB (8-way associative), L2 cache is 256KB (8-

way associative) and L3 cache is 3MB (12-way

associative). Each CPU core has private L1 and L2

caches whereas L3 is shared among different CPU

cores.

The distribution of cache hit and miss times as

measured by us were clearly separated as shown in

Figure 3 (32 - 68 ticks for a cache hit and 200 – 260

ticks for a cache miss). Based on these

measurements we set the threshold for hit/miss

determination = 100 ticks.

Figure 3: Distribution of cache Hit and Miss Times.

Three design parameters can be tweaked at the

discretion of the SC. These are the number of

threads and the two delay parameters, δ



and δ



. As

the number of threads increases, the execution time

of V during each run decreases and V would make

fewer table accesses. This is indeed the case as

shown in Figures 4 and 5. The number of distinct

accesses per run is concentrated between 28 and 37

with 10 spy threads but the range decreases to 18-27

for the case of 40 spy threads.

The delay at the SC, δ



, was designed to defer

waking a spy thread. This was necessitated by the

fact that occasionally multiple threads would get

executed in sequence without any intervening run of

V. Then, when V got scheduled, it computed several

encryptions and synchronization between the

espionage network and V was lost. Finally, the delay

in the spy thread, δ



, was intended to delay the start

of a run by V and so decrease the number of table

accesses made by V if indeed that was necessary.

The set of accessed lines in each table is

communicated to the CAA. The latter analyzes the

results and derives the AES key. Depending on the

“quality” of input it receives, the CAA may

recommend a change of design parameters to the

SC.

Figure 4: # Accesses per run (# Spy Threads =10).

Figure 5: # Accesses per run (# Spy Threads =40).

In the next section, we perform some back of the

envelope calculations to demonstrate that the AES

key on our set-up can be deduced with results from

fewer than 30 encryptions.

5 ANALYSIS

To retrieve a given AES key, we performed

successive encryptions on random plaintexts with

that key. Experimental results on the setup described

in Section 4 indicate that there are almost always

two consecutive runs containing accesses to 



(see

Table 2). Let these be denoted R1 and R2 and let the

run following R2 be R3. The former are useful

synchronization points signaling that an encryption

is complete and a new one has begun (or is about to

begin). If R2 also has accesses to Tables 



–



then those would certainly include accesses made in

round 1 of the new encryption. Our experimental

results with 40 threads in the Spy Ring show that,

collectively, R2 and R3 access around 7-10 distinct

DesignandImplementationofanEspionageNetworkforCache-basedSideChannelAttacksonAES

445

lines in each of 



,



,



and 



Table 2: Sample number of distinct accesses per table in

consecutive runs.

Let 

,

denote the subset of distinct line numbers

in Table 



accessed in the first two runs of

Encryption  . Also, let ̂ denote the high order

nibble of. We next show how to obtain 





as part

of the First Round Attack.

We associate a variable,, (initialized to 0)

with each of the 16 possible values of 



. For each

encryption,1,2,...,, we compute the possible

values of ⊕



where ∈

,

and increment

by 1 the score of . The with the highest score is

the true value of



. Indeed, given highly accurate

measurements by the spies, the true value of 



will

end up with a perfect score of  after the 

encryptions. On the other hand, the probability that

any other value of  gets this score is

only

∏





,









. For 

,

~8and  9, this works

out to 2

-9

. In a similar manner, we can obtain the

first nibble of each of the 16 bytes of the AES key.

To obtain the second nibble of each byte of the

AES key, we use the Second Round Attack

(Equations 6-9 in Section 3). Consider the 16-bit

integer, , obtained by concatenating candidate

values for the low-order nibbles of



, 



, 



and





. Again, we associate a variable, , with

each of the 2

possible values of  and initialize

these to 0. For an encryption  and corresponding

plaintext we do the following. For each possible

value of , we obtain the value of 







from Equation

6. We then check to see whether







∈

,

. If so,

we increment by 1 the score associated with . We

repeat this for 20-30 encryptions. The  with the

perfect score would reveal the true values of the

low-order nibbles of each of 



, 



, 



and 



Performing an analysis similar to that for the First

Round Attack leads us to conclude that it is highly

improbable that any other candidate comes close to

the winner. Finally, the rest of the key bits may be

obtained using Equations 7, 8 and 9.

6 CONCLUSIONS

There are two parts of this work. The first part

which is heavily experimental can be summarized

by the following modified lyrics of an evergreen

song (Synchronicity, 1983).

. . .

Every move you make

. . .

Every step you take

. . .

Every single tick

Every line you pick

I’ll be watching you.

We have been able to accurately monitor cache hits

and misses of the AES lookup tables with our

espionage network. We have completed a

preliminary analysis of the results and feel that the

heuristics to be implemented by the CAA (the

second part of this work) will obtain the keys with

around 12 to 30 encryptions. Though we have

completed our experiments on a specific platform,

we will next investigate the success of our attack on

other platforms and other versions of OpenSSL.

REFERENCES

Bernstein, D. J. (2005). Cache-timing attacks on aes.

Bonneau, J. and Mironov, I. (2006). Cache-collision

timing attacks against aes. In Cryptographic

Hardware and Embedded Systems-CHES 2006, pages

201–215. Springer.

Brumley, D. and Boneh, D. (2005). Remote timing attacks

are practical. Computer Networks, 48(5):701–716.

Daemen, J. and Rijmen, V. (2002). The design of

Rijndael: AES-the advanced encryption standard.

Springer Science & Business Media.

Gullasch, D., Bangerter, E., and Krenn, S. (2011). Cache

games–bringing access-based cache attacks on aes to

practice. In Security and Privacy (SP), 2011 IEEE

Symposium on, pages 490–505. IEEE.

Hennessy, J. L. and Patterson, D. A. (2012). Computer

architecture: a quantitative approach. Elsevier.

Irazoqui, G., Inci, M. S., Eisenbarth, T., and Sunar, B.

(2014a). Fine grain cross-vm attacks on xen and

vmware. In Big Data and Cloud Computing

(BdCloud), 2014 IEEE Fourth International

Conference on, pages 737–744. IEEE.

Irazoqui, G., Inci, M. S., Eisenbarth, T., and Sunar, B.

SECRYPT2015-InternationalConferenceonSecurityandCryptography

446

(2014b). Wait a minute! a fast, cross-vm attack on aes.

In Research in Attacks, Intrusions and Defenses, pages

299–319. Springer.

Menezes, A. J., Van Oorschot, P. C., and Vanstone, S. A.

(1996). Handbook of applied cryptography. CRC

press

Neve, M. and Seifert, J.-P. (2007). Advances on

accessdriven cache attacks on aes. In Selected Areas in

Cryptography, pages 147–162. Springer.

Osvik, D. A., Shamir, A., and Tromer, E. (2006). Cache

attacks and countermeasures: the case of aes. In Topics

in Cryptology–CT-RSA 2006, pages 1–20. Springer.

Synchronicity, (1983)

http://www.songfacts.com/detail.php?id=548.

Tromer, E., Osvik, D. A., and Shamir, A. (2010). Efficient

cache attacks on aes, and countermeasures. Journal of

Cryptology, 23(1):37–71.

Tsunoo, Y., Saito, T., Suzaki, T., Shigeri, M., and

Miyauchi, H. (2003). Cryptanalysis of des

implemented on computers with cache. In

Cryptographic Hardware and Embedded Systems-

CHES 2003, pages 62–76. Springer.

Weiss, M., Heinz, B., and Stumpf, F. (2012). A cache

timing attack on aes in virtualization environments. In

Financial Cryptography and Data Security, pages

314–328. Springer.

DesignandImplementationofanEspionageNetworkforCache-basedSideChannelAttacksonAES

447