A CASE STUDY OF ENERGY-EFFICIENT LOOP INSTRUCTION CACHE DESIGN FOR EMBEDDED MULTITASKING SYSTEMS

Ji Gu and Tohru Ishihara

Department of Communications and Computer Engineering, Kyoto University, Kyoto 606-8501, Japan

Keywords:

Energy Efﬁciency, Optimization Techniques, Microprocessor, Multitasking, Embedded Systems.

Abstract:

Microprocessors increasingly execute multiple tasks as modern embedded applications grow more complex. Shared by multiple tasks, a conventional on-chip L1 instruction cache (I-cache) usually suffers a high miss ratio due to inter- and intra-task interference, and it is the most energy-consuming component of the processor chip. This paper presents a power-efficient loop instruction cache design for multitasking embedded applications. The technique is two-fold: it significantly reduces L1 I-cache accesses to save energy, and it reduces the I-cache misses caused by task interference. Experiments on a case study show that our scheme reduces energy consumption in the I-cache hierarchy by 36.5% and reduces I-cache misses by 6.0% to 18.3%, depending on the frequency of context switches in the multitasking system.

1 INTRODUCTION

Microprocessors are widely used in the computing systems of our daily life, from desktop PCs, workstations, and servers to portable consumer-electronics devices such as PDAs, mobile phones, MP3/video players, and digital cameras. In step with the shrinking feature size of deep submicron process technology, power dissipation in processors is increasing exponentially. It has been reported that microprocessors used in data centers globally consume 1.5% of the worldwide energy (Pan, 2009). Low power is therefore particularly important in the design of embedded systems such as portable and handheld electronic devices. These systems are mostly battery-driven, so reducing power consumption prolongs the lifetime of batteries that have limited energy resources.

As on-chip caches occupy an increasingly large share of the die, power consumption in these components accounts for a dominant portion of the overall processor energy. The work in (Dally et al., 2008) investigates the breakdown of microprocessor power and concludes that energy consumption in caches can amount to almost 70% of the total energy dissipated in the processor chip. The cache components are therefore good targets for energy optimization.

State-of-the-art embedded applications tend to incorporate multiple tasks running on a single processor in order to fully exploit the computing resources. In a multitasking environment, several tasks run concurrently and share all resources of the processor. Concurrent execution is achieved by allocating a time slice to each task and executing the tasks at intervals. When a task yields its time slice or is preempted by another task with higher priority, a context switch is performed, which involves saving the state of the current (preempted) task and retrieving the state of the next (preempting) task. The execution state of a task includes the program counter (PC) value, the stack pointer (SP), the register file (RF), and the program code and data in the caches. Saving the values of the PC, SP, and RF incurs only a small cost, so it can be done during the context switch. The cache state of a task, however, may be rather large and thus infeasible to save or retrieve during a context switch. Therefore, when a cache is shared between several tasks, it may suffer significant interference as the code and data of one task are frequently overwritten by other tasks, which leads to a large number of cache misses. Since cache misses result in accesses to the lower-level memory, which incur even more energy consumption and longer delay, multitasking interference is problematic for both the energy consumption and the performance of the system.
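The interference effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the direct-mapped cache with one-word lines, the two synthetic loop traces, and the function names `count_misses` and `interleave` are all hypothetical and are not taken from the paper's platform.

```python
def count_misses(trace, num_lines):
    """Count misses of a direct-mapped cache with one-word lines."""
    lines = [None] * num_lines
    misses = 0
    for addr in trace:
        idx = addr % num_lines
        if lines[idx] != addr:      # cold miss or conflict miss
            misses += 1
            lines[idx] = addr
    return misses


def interleave(task_a, task_b, slice_len):
    """Alternate between two traces every slice_len accesses (a context switch)."""
    out = []
    for start in range(0, max(len(task_a), len(task_b)), slice_len):
        out.extend(task_a[start:start + slice_len])
        out.extend(task_b[start:start + slice_len])
    return out


# Two hypothetical tasks, each running an 8-word loop 100 times.
# Addresses 64..71 map to the same lines as 0..7 in a 64-line cache.
loop_a = [addr for _ in range(100) for addr in range(0, 8)]
loop_b = [addr for _ in range(100) for addr in range(64, 72)]

alone = count_misses(loop_a, 64) + count_misses(loop_b, 64)
coarse = count_misses(interleave(loop_a, loop_b, 80), 64)   # switch every 10 iterations
fine = count_misses(interleave(loop_a, loop_b, 8), 64)      # switch every iteration
print(alone, coarse, fine)
```

In this toy setting each task alone only suffers cold misses, while interleaving the two tasks makes each one evict the other's loop, and the miss count grows as the time slice shrinks.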

To improve the energy efficiency of multitasking systems, existing schemes attempt to reduce multitasking interference by allocating tasks to different partitions of the shared cache (Reddy and Petrov, 2010) (Paul and Petrov, 2011) or scratch-pad memory (Gauthier et al., 2010). While these approaches are effective in reducing cache interference for low energy, the design complexity is very high and most of them


Gu J. and Ishihara T..

A CASE STUDY OF ENERGY-EFFICIENT LOOP INSTRUCTION CACHE DESIGN FOR EMBEDDED MULTITASKING SYSTEMS.

DOI: 10.5220/0003951001970202

In Proceedings of the 1st International Conference on Smart Grids and Green IT Systems (SMARTGREENS-2012), pages 197-202

ISBN: 978-989-8565-09-9

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

require extensive design space exploration at design time. For example, to decide the optimal cache partition size for an individual task, a large space of cache configurations needs to be searched. As more and more tasks are involved in a single multitasking application, such a scheme becomes rather time-consuming and thus infeasible in practice.

In this paper, we propose an energy-efficient Partitioned Loop Instruction Cache (PLIC) design to reduce the energy consumption of the shared cache in embedded multitasking systems. The proposed PLIC is motivated by previous work (Gu and Guo, 2010) that targets the energy efficiency of single-task systems, and it is improved in this paper for multitasking systems. The partitioned loop instruction cache can be shared by multiple tasks without any interference. Compared with I-cache partitioning for multitasking applications (Paul and Petrov, 2011), partitioning the loop instruction cache has lower design complexity yet can effectively reduce task interference in the shared I-cache. With our PLIC design, accesses to the shared cache in the multitasking system are significantly reduced, so cache misses due to interference are effectively reduced as well. As a consequence, the design reduces the energy consumption and improves the performance of multitasking systems.

2 DESIGN OF PLIC

The proposed PLIC is designed for multitasking embedded processors and is based on the loop instruction cache (Gu and Guo, 2010) for single-task processors. In order to be shared by multiple tasks and to handle the context switches of multitasking applications, the proposed PLIC extends and improves the loop instruction cache design as follows:

1. PLIC introduces a partitioned structure of the loop instruction cache and a hardware-based task state table to manage the sharing of the PLIC among multiple tasks. With the proposed architecture, different tasks can exploit the PLIC without interfering with each other during their execution.

2. Instead of caching the decoded instructions, our PLIC design caches the encoded (original) instructions. The size of a decoded instruction is highly machine dependent and can be several times larger than that of the encoded instruction, which may result in a large PLIC for the multitasking application. Hence, as a smaller cache is more energy-efficient, we opt for a PLIC architecture that caches encoded instructions.

2.1 PLIC Architecture

The PLIC proposed in this paper is an extra level of small loop cache designed for multitasking processors. Fig. 1 (a) shows the PLIC design in a pipelined processor. For the sake of simplicity, only three typical pipeline stages, IF (instruction fetch), ID (instruction decode), and EX (execution), are shown in the figure. The PLIC component is placed at the ID stage and controlled by special instructions at the software level.

The architecture of the PLIC is shown in Fig. 1 (b). The PLIC takes inputs from the IF, ID, and EX pipeline stages at the hardware level and inputs from the OS at the software level. While inputs from the pipeline are received during task execution, inputs from the OS are received only when a context switch occurs. The PLIC for the multitasking processor is composed of five components: a task state table, a PLIC index table, a PLIC branch target table, the branch address logic, and the local PC logic.

The task state table stores the information of the PLIC cache partition allocated to each task. It also has a field for the local program counter (L-PC) of each task, which is used to save the execution state of the preempted task and retrieve the state of the preempting task during a context switch. The PLIC index table and the PLIC branch target table are the components that cache the loop instructions, and each of them is partitioned for the multitasking application. When a task uses the PLIC during its execution, only its allocated partition of each table is used. The size of the partition allocated to each task is decided at compile time, based on static loop profiling of the multitasking application. This partition information is loaded into the task state table at the beginning of program execution.
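A minimal software sketch of this bookkeeping is given below. The paper's task state table holds a task ID, a partition ID, and an L-PC; the `base` and `size` fields, the contiguous allocation policy, and the names `TaskStateEntry`, `allocate_partitions`, and `physical_index` are assumptions made here purely for illustration.

```python
from dataclasses import dataclass

PIT_ENTRIES = 128   # PLIC index table size used in the paper's experiments


@dataclass
class TaskStateEntry:
    task_id: int
    base: int       # hypothetical: first index-table entry of the partition
    size: int       # hypothetical: number of entries in the partition
    l_pc: int = 0   # local program counter saved at a context switch


def allocate_partitions(loop_entries, total=PIT_ENTRIES):
    """Give each task a contiguous, non-overlapping range of table entries."""
    table = {}
    base = 0
    for task_id, size in loop_entries.items():
        if base + size > total:
            raise ValueError("profiled loops do not fit in the table")
        table[task_id] = TaskStateEntry(task_id, base, size)
        base += size
    return table


def physical_index(entry, l_pc):
    """Map a task-local L-PC to an entry inside that task's own partition."""
    if not 0 <= l_pc < entry.size:
        raise IndexError("L-PC outside the task's partition")
    return entry.base + l_pc


# Partition sizes as they might come out of static loop profiling (made-up values).
table = allocate_partitions({0: 32, 1: 64, 2: 16})
print(physical_index(table[1], 0), physical_index(table[2], 15))
```

Because every lookup is confined to the running task's own range, one task can never overwrite entries cached by another, which is the property the partitioning is meant to provide.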

The PLIC branch target table is used to cache the flow control instructions (such as jump/branch) of a loop, and the PLIC index table is used for the other loop instructions. These two tables, together with the branch address logic, the local PC logic, and the four special instructions (slp, elp, brb, and brf, as shown in Fig. 1(b)), handle the complex flow control of loop execution once the loop has been cached in the PLIC during the first loop iteration. A description of this loop execution control can be found in (Gu and Guo, 2010) and is not discussed in detail here. This paper focuses on the PLIC operation for context switches, which is the feature specific to multitasking systems and is elaborated in the following section.

SMARTGREENS2012-1stInternationalConferenceonSmartGridsandGreenITSystems

198

[Figure 1 diagram. Panel (a): the PLIC beside the ID and EX pipeline stages, with load and stall signals. Panel (b): Task State Table (task ID, partition ID, L-PC), PLIC Index Table, PLIC Branch Target Table, Branch Address Logic, Local PC Logic, and a PLIC controller handling slp, elp, brb, and brf, with inputs from IF, ID, EX, and the OS/task scheduler.]

Figure 1: The proposed PLIC design based on (Gu and Guo, 2010): (a) PLIC in a processor pipeline, (b) The architecture of

PLIC.

2.2 Context Switch and PLIC Operation

In multitasking systems, a context switch occurs when a task yields its time slice or is preempted by another task with higher priority. That is, only one task is running in the system at any point in time, both before and after a context switch.

When no context switch is taking place, the PLIC is operated dynamically by the running task, using the allocated partition. In this case, the PLIC operation can be classified into three states: non-loop execution, first loop iteration, and following loop iterations (the second to the last). In non-loop execution, the PLIC is not activated and instructions are fetched from the I-cache. In the first loop iteration, the instructions are fetched from the I-cache and loaded into the PLIC at the same time. During the following loop iterations, instructions are fetched from the PLIC only.
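The three operation states can be sketched as a small state-to-action mapping. This is an illustrative model, not the paper's hardware: it assumes a single loop that fits entirely in the task's partition and is never left early, and the names `PlicState`, `fetch_plan`, and `icache_fetches` are hypothetical.

```python
from enum import Enum


class PlicState(Enum):
    NON_LOOP = 0
    FIRST_ITERATION = 1
    FOLLOWING_ITERATION = 2


def fetch_plan(state):
    """Return (instruction source, whether the PLIC is filled) for a state."""
    if state is PlicState.NON_LOOP:
        return ("I-cache", False)   # PLIC is not activated
    if state is PlicState.FIRST_ITERATION:
        return ("I-cache", True)    # fetch from I-cache and load the PLIC
    return ("PLIC", False)          # later iterations bypass the I-cache


def icache_fetches(non_loop_len, loop_len, iterations, with_plic):
    """Count I-cache fetches for straight-line code followed by one loop."""
    fetches = non_loop_len
    for i in range(iterations):
        state = PlicState.FIRST_ITERATION if i == 0 else PlicState.FOLLOWING_ITERATION
        source = fetch_plan(state)[0] if with_plic else "I-cache"
        if source == "I-cache":
            fetches += loop_len
    return fetches


# Made-up program: 100 non-loop instructions, then a 20-instruction loop run 50 times.
print(icache_fetches(100, 20, 50, with_plic=False))
print(icache_fetches(100, 20, 50, with_plic=True))
```

Only the non-loop code and the first iteration reach the I-cache when the PLIC is enabled, which is where the access filtering comes from.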

When a context switch occurs, the state of the preempted task needs to be saved first, and then the state of the preempting task needs to be retrieved for execution. In the proposed PLIC-based multitasking system, apart from the conventional context switch operations (i.e., saving the PC, register values, etc.), the PLIC operation state (non-loop execution, first loop iteration, or following loop iteration, as described above) of the current task also needs to be saved for future recovery. For a context switch, the PLIC operation depends on whether the preempted task is using the PLIC (Case 1) or not (Case 2).

• Case 1. Apart from performing the conventional context switch and saving the PLIC operation state, the local PC (L-PC) of the preempted task needs to be saved into the PLIC task state table for future recovery.

• Case 2. The conventional context switch at the OS level is performed, and the PLIC operation state (non-loop execution) of the preempted task is saved.

After all states of the preempted task have been saved, the preempting task can be put into execution based on its retrieved states.
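The two cases can be sketched in software as follows. This is a simplified model under assumptions: the string state names, the `PlicContext` record, and the `TaskStateTable.context_switch` method are hypothetical, and the conventional saving of the PC and registers is left out.

```python
from dataclasses import dataclass, replace


@dataclass
class PlicContext:
    state: str = "non-loop"   # or "first-iteration" / "following-iteration"
    l_pc: int = 0             # local PC; meaningful only while the PLIC is in use


class TaskStateTable:
    """Hypothetical software model of the PLIC state kept per task."""

    def __init__(self):
        self._saved = {}

    def context_switch(self, preempted_id, running, preempting_id):
        """Save the preempted task's PLIC context and return the preempting task's."""
        if running.state != "non-loop":
            # Case 1: the preempted task is using the PLIC, so keep its L-PC.
            self._saved[preempted_id] = replace(running)
        else:
            # Case 2: only the operation state (non-loop execution) is recorded.
            self._saved[preempted_id] = PlicContext(state="non-loop")
        # A task that has never run starts in non-loop execution.
        return replace(self._saved.get(preempting_id, PlicContext()))


tst = TaskStateTable()
# Task 0 is preempted in the middle of a cached loop; task 1 starts fresh.
ctx1 = tst.context_switch(0, PlicContext("following-iteration", 17), 1)
# Later, task 1 is preempted outside any loop and task 0 resumes.
ctx0 = tst.context_switch(1, ctx1, 0)
print(ctx0)
```

When task 0 resumes, it gets back both its operation state and its L-PC, so it can continue fetching from its own partition without reloading the loop.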

3 EXPERIMENTAL RESULTS

3.1 Experimental Setup

We applied our partitioned loop instruction cache (PLIC) design to a multitasking application, as a case study, to evaluate the energy efficiency of our scheme in multitasking embedded systems. The multitasking application is composed of five benchmarks used as tasks (adpcm, jpeg, rawdaudio, sha, and stringsearch) from the MiBench (Guthaus et al., 2001) and Powerstone (Scott et al., 1998) suites, which are widely used in the embedded domains of telecommunication, image processing, audio/video coding, and security.

We use a multiprocessor system with a task sched-

uler to emulate the multitasking system since there

ACASESTUDYOFENERGY-EFFICIENTLOOPINSTRUCTIONCACHEDESIGNFOREMBEDDED

MULTITASKINGSYSTEMS

199

[Figure 2 diagram labels: I-cache, PLIC, Task Scheduler.]

Figure 2: Experimental platform.

is no OS supporting context switching available to us. The platform is shown in Fig. 2. It is composed of five processor cores, five main memory components, the shared L1 I-cache, the PLIC design, the task scheduler, and some switching logic controlled by the task scheduler.

Each task is stored in one of the five memory components after compilation. At any time, only one processor is running, executing the task from its corresponding memory. A context switch is performed by the task scheduler by clock-gating the active processor and putting another processor into execution, which is similar to saving the state of the preempted task and retrieving the state of the preempting task. During a context switch, the task scheduler also sends a signal and the ID of the preempting task to the PLIC, which uses them to update the task state table and perform the context switch in the PLIC cache (see Section 2.2). Since the L1 I-cache and the PLIC are shared by the processors, their behavior is the same as in a multitasking uniprocessor.

The processor cores in our platform are homogeneous and based on SimpleScalar PISA (Burger and Austin, 1997). We use VHDL to specify the platform at the RTL level for simulation. In this paper, we focus on the I-cache of the multitasking processor, and the memory hierarchy is assumed to have two levels: the I-cache and main memory. The task scheduler uses a round-robin scheme for task scheduling, with switching intervals of 5K, 10K, and 20K clock cycles.
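A fixed-interval round-robin scheduler of this kind can be sketched as below. The function name `round_robin_schedule`, the `first_task` parameter, and the 30K-cycle run length are assumptions for illustration; only the five tasks and the 5K, 10K, and 20K intervals come from the setup above.

```python
def round_robin_schedule(num_tasks, interval, total_cycles, first_task=0):
    """Return (start_cycle, task_id) pairs for a fixed-interval round-robin run."""
    schedule = []
    task = first_task
    for start in range(0, total_cycles, interval):
        schedule.append((start, task))
        task = (task + 1) % num_tasks   # next task in circular order
    return schedule


# Five tasks over a made-up 30K-cycle window, for each switching interval.
for interval in (5_000, 10_000, 20_000):
    slices = round_robin_schedule(5, interval, 30_000)
    print(interval, len(slices) - 1, slices)
```

The second printed number is the count of context switches in the window, which shows how a shorter interval translates into more frequent switching.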

The size of our PLIC design is determined by the three tables discussed in Section 2.1. We use a PLIC index table (PIT) of 128 entries and a PLIC branch target table (PBTT) of 32 entries for partitioning. The task state table (TST) is set to 8 entries, which is large enough for the 5 tasks in our experiment. The parameter settings of our platform are given in Table 1.

Table 1: System settings.

Parameter        Setting
Processor        PISA RISC processor, 6-stage single pipeline
Instr. width     64 bits
PLIC             PIT: 128 x (1+64) bits
                 PBTT: 32 x (6+6) bits
                 TST: 8 x (3+3+6) bits
I-cache          8 KB, 2-way, 32 B block, 1-cycle latency
Memory           64 MB SDRAM, 30 cycles
Task scheduling  Round-robin; 5K, 10K, and 20K cycle intervals
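As a quick check of how small the PLIC is, the table dimensions in Table 1 can be added up. The comparison against the I-cache below counts only the 8 KB data array and ignores tags and other overhead, which is a simplifying assumption made here, not a figure from the paper.

```python
# Storage bits of each PLIC table, taken from Table 1.
PIT_BITS = 128 * (1 + 64)     # PLIC index table
PBTT_BITS = 32 * (6 + 6)      # PLIC branch target table
TST_BITS = 8 * (3 + 3 + 6)    # task state table

plic_bits = PIT_BITS + PBTT_BITS + TST_BITS
plic_bytes = plic_bits / 8

# Assumption: compare against the I-cache data array only (8 KB), ignoring tags.
icache_data_bits = 8 * 1024 * 8
ratio = plic_bits / icache_data_bits

print(PIT_BITS, PBTT_BITS, TST_BITS)
print(plic_bits, plic_bytes, round(ratio, 3))
```

Under this assumption the three tables together need 8800 bits (1100 bytes), roughly 13% of the I-cache data array.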

3.2 Reduction of I-cache Access and Miss

We first investigate how much the I-cache accesses and misses can be reduced by the proposed PLIC design for the multitasking application, since this reduction contributes to the energy savings in the system. Fig. 3 reports the reduction rate of I-cache accesses and misses over the baseline system, which does not have a PLIC. The results are given for three switching intervals (5K, 10K, and 20K cycles) and for different task scheduling orders (i.e., with the system starting from different tasks).

I-cache accesses can be reduced by 50.9%, and this reduction is independent of the scheduling policy of the multitasking system. This is because tasks sharing the partitioned PLIC do not interfere with each other, so the hit rate of the PLIC is identical for every run. In contrast, the frequency of context switches has a large impact on the task interference in the I-cache. With a higher switching frequency, the number of I-cache misses becomes larger and more miss reduction can be achieved by our scheme. Overall, our PLIC design reduces I-cache misses by 6.0% to 18.3%.

3.3 Energy Savings

To evaluate the energy efﬁciency of our scheme,

we calculate energy consumption in the memory

hierarchy of the baseline architecture and the PLIC

design. We use the following energy model for the

calculation:

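$$\text{Baseline}_{\text{energy}} = \text{I-cache}_{\text{access}} \times \text{I-cache}_{\text{energy/access}} + \text{I-cache}_{\text{miss}} \times \text{Memory}_{\text{energy/access}}$$

$$\text{PLIC}_{\text{energy}} = \text{I-cache}_{\text{access}} \times \text{I-cache}_{\text{energy/access}} + \text{I-cache}_{\text{miss}} \times \text{Memory}_{\text{energy/access}} + (\text{PLIC}_{\text{write}} + \text{PLIC}_{\text{access}}) \times \text{PLIC}_{\text{energy/access}}$$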

where the values of $\text{I-cache}_{\text{access}}$, $\text{I-cache}_{\text{miss}}$, $\text{PLIC}_{\text{write}}$, and $\text{PLIC}_{\text{access}}$ are obtained by simulating the application on our VHDL model. The energy consumption per access of the I-cache and main memory is calculated with CACTI 5.1 (Thoziyoor et al., 2008) using 65nm process technology. For our PLIC cache, the PLIC index table can be implemented as a tagless direct-mapped cache with a cache line of one instruction word, so we also use CACTI to calculate its energy. The other two tables (the PLIC branch target table and the task state table) can be implemented as register files. Therefore, these two tables and the other control logic of the PLIC are synthesized in Synopsys Design Compiler to obtain their energy values, also using the 65nm process technology. The energy consumption of the caches and memory is given in Table 2.

Figure 3: Reduction of I-cache accesses and misses.

Table 2: Energy consumption per access.

Component   Energy/Access [pJ]
PLIC        56.3
I-cache     227.3
Memory      6332.5
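The energy model and the per-access values in Table 2 can be combined in a few lines of code. The access and miss counts used in the example are made up for illustration and are not the paper's simulation data; the function names `baseline_energy` and `plic_energy` are likewise hypothetical.

```python
# Energy per access in pJ, from Table 2.
E_PLIC = 56.3
E_ICACHE = 227.3
E_MEMORY = 6332.5


def baseline_energy(icache_access, icache_miss):
    """Energy (pJ) of the I-cache hierarchy without a PLIC."""
    return icache_access * E_ICACHE + icache_miss * E_MEMORY


def plic_energy(icache_access, icache_miss, plic_write, plic_access):
    """Energy (pJ) of the I-cache hierarchy with a PLIC in front of the I-cache."""
    return (icache_access * E_ICACHE
            + icache_miss * E_MEMORY
            + (plic_write + plic_access) * E_PLIC)


# Hypothetical counts: one million fetches, about half of them served by the PLIC.
base = baseline_energy(icache_access=1_000_000, icache_miss=10_000)
with_plic = plic_energy(icache_access=491_000, icache_miss=9_000,
                        plic_write=20_000, plic_access=509_000)
print(base, with_plic, with_plic / base)
```

Because a PLIC access costs roughly a quarter of an I-cache access, moving fetches from the I-cache to the PLIC lowers the total even after the extra PLIC writes are counted.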

Fig. 4 reports the energy consumption of our PLIC design normalized to the baseline system. As can be seen, the PLIC reduces energy consumption by 36.5% for the multitasking system. Note that although the reduction of cache misses varies with the task scheduling (see Fig. 3), the energy reduction is almost identical. This is because, even though cache misses increase due to task interference, the miss ratio (cache misses over cache references) is still small in the multitasking application. As a result, the energy reduction due to cache miss reduction accounts for a minor portion of the overall energy savings.

4 RELATED WORK

Several low-power techniques for embedded systems have focused on improving the cache hierarchy for power and energy efficiency. The work in (Malik et al., 2000) makes a set-associative cache behave like a direct-mapped cache so that tag comparisons can be reduced for power efficiency. The way-halting cache design (Zhang et al., 2005) compares a few least significant tag bits first; a mismatch on these bits avoids the full tag comparison and saves power. Ishihara and Fallah (Ishihara and Fallah, 2005) propose a software/hardware co-design of a non-uniform cache architecture, where the cache can have a different number of ways for different cache sets. The unused cache ways are disconnected from the sense amplifiers for power saving. The techniques in (Tang et al., 2002) (Yang and Lee, 2004) (Gu and Guo, 2010) attempt to reduce cache energy by introducing an extra level of tiny, low-power cache structure in front of the conventional cache so that most of the power-expensive cache accesses are filtered.

For multitasking systems, the work in (Reddy and Petrov, 2010) and (Paul and Petrov, 2011) proposes to allocate tasks to different partitions of the cache so that cache misses due to task interference can be reduced. Gauthier et al. (Gauthier et al., 2010) attempt to find the optimal allocation of scratch-pad memory space between tasks, which can effectively reduce accesses to the main memory and hence reduce energy consumption. In contrast, our PLIC design uses a filtering scheme to reduce the power consumption of embedded multitasking systems. By filtering the I-cache accesses, the PLIC also reduces the cache misses caused by task interference. In addition, as our technique is orthogonal to the techniques of (Reddy and Petrov, 2010) (Paul and Petrov, 2011), they can be applied in combination for energy reduction in a multitasking system.

5 CONCLUSIONS

This paper targets multitasking processors and attempts to reduce the energy dissipation of the instruction cache hierarchy. With our proposed partitioned loop instruction cache (PLIC) shared by multiple tasks, a significant number of energy-expensive I-cache accesses can be filtered. Furthermore, filtering I-cache accesses effectively reduces the I-cache misses caused by task interference, and hence reduces the energy and delay of lower-level memory accesses. Experiments on a case study of a multitasking application report an energy reduction of 36.5% with the PLIC design. The reduction in I-cache misses ranges from 6.0% to 18.3%, depending on the frequency of context switches in the multitasking system.

Figure 4: Energy consumption normalized to the baseline design.

ACKNOWLEDGEMENTS

This work is supported by the JSPS NEXT program under grant number GR076.

REFERENCES

Burger, D. C. and Austin, T. M. (1997). The SimpleScalar tool set, version 2.0. Technical Report CS-TR-1997-1342, Department of Computer Science, University of Wisconsin, Madison.

Dally, W. J., Balfour, J., Black-Shaffer, D., Chen, J., Harting, R. C., Parikh, V., Park, J., and Sheffield, D. (2008). Efficient embedded computing. IEEE Computer, 41(7):27–32.

Gauthier, L., Ishihara, T., Takase, H., Tomiyama, H., and Takada, H. (2010). Minimizing inter-task interferences in scratch-pad memory usage for reducing the energy consumption of multi-task systems. In Proceedings of the 2010 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES'10), pages 157–166.

Gu, J. and Guo, H. (2010). Enabling large decoded instruction loop caching for energy-aware embedded processors. In Proceedings of the 2010 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES'10), pages 247–256.

Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. (2001). MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, pages 83–94.

Ishihara, T. and Fallah, F. (2005). A non-uniform cache architecture for low power system design. In Proceedings of the 2005 International Symposium on Low Power Electronics and Design (ISLPED'05), pages 363–368.

Malik, A., Moyer, B., and Cermak, D. (2000). A low power unified cache architecture providing power and performance flexibility. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design (ISLPED'00), pages 241–243.

Pan, D. Z. (2009). Low power design and challenges in nanometer multicore era. In IEEE CAS Melbourne and Victoria University, Invited Talks, August 20, 2009.

Paul, M. and Petrov, P. (2011). Dynamically adaptive I-cache partitioning for energy-efficient embedded multitasking. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(11):2067–2080.

Reddy, R. and Petrov, P. (2010). Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM Transactions on Embedded Computing Systems (TECS), 9(3):16:1–16:35.

Scott, J., Lee, L. H., Arends, J., and Moyer, B. (1998). Designing the low-power M-CORE architecture. In International Symposium on Computer Architecture Power Driven Microarchitecture Workshop, pages 145–150.

Tang, W., Gupta, R., and Nicolau, A. (2002). Power savings in embedded processors through decode filter cache. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'02), pages 443–448.

Thoziyoor, S., Muralimanohar, N., Ahn, J. H., and Jouppi, N. P. (2008). CACTI: An integrated cache and memory access time, cycle time, area, leakage, and dynamic power model. Technical Report HPL-2008-20, HP Laboratories.

Yang, C.-L. and Lee, C.-H. (2004). Hotspot cache: Joint temporal and spatial locality exploitation for icache energy reduction. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 114–119.

Zhang, C., Vahid, F., Yang, J., and Najjar, W. (2005). A way-halting cache for low-energy high-performance systems. ACM Transactions on Architecture and Code Optimization (TACO), 2(1):34–54.

SMARTGREENS2012-1stInternationalConferenceonSmartGridsandGreenITSystems

202