SELF-ADAPTIVE NOC POWER MANAGEMENT WITH DUAL-LEVEL AGENTS

Architecture and Implementation

Syed M. A. H. Jafri (1,2,3), Liang Guang (2,3), Axel Jantsch, Kolin Paul, Ahmed Hemani and Hannu Tenhunen (1,2,3)

Royal Institute of Technology, Stockholm, Sweden

University of Turku, Turku, Finland

Turku Centre for Computer Science, Turku, Finland

Indian Institute of Technology Delhi, Delhi, India

Keywords: DVFS, Agent based Design, Hardware/Software Co-design, Multiprocessor Architectures, Low-power Design.

Abstract:

The architecture and implementation of an adaptive NoC that improves performance and power consumption are presented. On platforms hosting multiple applications, hardware variations and unpredictable workloads make static design-time assignments highly sub-optimal, e.g. in terms of power and performance. As a solution to this problem, adaptive NoCs are designed, which dynamically adapt towards an optimal implementation. This paper addresses the architectural design of adaptive NoCs, which is an essential step towards design automation. The architecture involves two levels of agents: a system-level agent implemented in software on a dedicated general-purpose processor, and local agents implemented as microcontrollers in each network node. The system agent issues specific instructions to perform monitoring and reconfiguration operations, while the local agents operate according to the commands from the system agent. To demonstrate the system architecture, best-effort power management with distributed voltage and frequency scaling is implemented, while meeting run-time execution requirements. Four benchmarks (matrix multiplication, FFT, wavefront, and hiperLAN transmitter) are evaluated on a cycle-accurate RTL-level shared-memory NoC simulator. Power analysis with a 65 nm multi-Vdd library shows a significant reduction in energy consumption (from 21 % to 36 %). Synthesis also shows a minimal area overhead (4 %) of the local agent compared to the original NoC switch.

1 INTRODUCTION

The many-core SoC (System-on-Chip) era has come

due to the constant shrinking of transistor sizes. The

state-of-the-art Network-on-Chip (NoC), as the most

promising on-chip parallel communication architec-

ture, has a large number of integrated components

(Truong et al., 2009; Howard et al., 2011). In terms of technology scaling, thousand-core processors will be feasible in the near future (Asanovic et al., 2006).

The design of massively parallel NoCs is strongly challenged by hardware and workload variations. Hardware variations are generally characterized as PVT (process, voltage, and temperature) variations (Borkar et al., 2003), all of which make hardware performance unstable and unpredictable in terms of functional correctness, timing and power consumption. A worst-case design based on the largest deviations requires a large design margin, leading to low performance or increased power overhead (Rabaey, 2007). In addition to these hardware variations, the run-time workloads on the NoC are also highly diverse and unpredictable at design time.

Scientiﬁc applications, radio processing, media pro-

cessing, and various other types of applications are

expected on future many-core systems, for instance

smartphones (van Berkel, 2009). It is infeasible to

rely solely on design-time conﬁguration for potential

workloads, especially when the mapping and schedul-

ing are likely to be dynamic as well.

M. A. H. Jafri S., Guang L., Jantsch A., Paul K., Hemani A. and Tenhunen H. (2012).
SELF-ADAPTIVE NOC POWER MANAGEMENT WITH DUAL-LEVEL AGENTS - Architecture and Implementation.
In Proceedings of the 2nd International Conference on Pervasive Embedded Computing and Communication Systems, pages 450-458
DOI: 10.5220/0003942204500458
SciTePress

To deal with the hardware variations and unpredictable workloads, future NoCs should be made "Self-Adaptive" (Salehie and Tahvildari, 2009), i.e. able to monitor their status (self-awareness) and reconfigure system parameters (self-adaptation), in order to achieve the expected performance goals or seek possible optimization. Our previous work (Guang et al., 2010) proposes

a hierarchical agent monitoring system architecture for generic parallel embedded systems. It suggests that several levels of controllers be added with hierarchical scope and priorities, to provide both coarse- and fine-grained observability and reconfigurability.

Conceptually, agents are monitoring and reconﬁgu-

ration functions, which can be realized as software,

hardware or a hybrid of both. The conventional NoC

platform, including data, memory and communica-

tion, is considered resources supervised by the agents.

This work presents the essential architectural-

level support for hierarchical agent monitored NoC, in

order to enable the design automation of self-adaptive

systems. In particular, the system agent, which de-

termines the adaptive policy for the whole NoC, is

implemented as a software agent, with speciﬁc in-

structions designed for monitoring and reconﬁgura-

tion operations. The local agents, which monitor and

reconﬁgure the local resources based on the command

from the system agent, are implemented as micro-

controllers for each network node. The communication between agents is implemented on the existing NoC channels. The agents are fully integrated in an RTL-level cycle-accurate NoC simulator with Leon 3 processing elements and distributed shared memory.

The system architecture is demonstrated with

best-effort per-core DVFS (dynamic voltage and fre-

quency scaling), as a representative algorithm for

self-adaptive power and energy management. The whole system is synthesized with a 65 nm multi-Vdd library. Four real benchmarks (matrix multiplication, FFT, wavefront, and hiperLAN transmitter) are run with power and energy measurements to show the effectiveness of the approach. The software and hardware overheads are evaluated to show the scalability of the system architecture.

2 RELATED WORK

Most previous works are focused on speciﬁc algo-

rithms or monitoring goals, including power con-

sumption, thermal management or dependability is-

sues. Works on systematic approaches of generic

monitoring and reconﬁguration architecture are quite

limited.

(Ciordas, 2008) proposes a monitoring-aware system architecture and design flow for NoC. This work focuses on hardware-based probes for transaction debugging and QoS (Quality-of-Service) provision. Our work, however, presents a SW/HW (software/hardware) co-design approach to monitoring and reconfiguration, with services for non-functional design goals, such as power and energy consumption.

Footnote (to the citation of Salehie and Tahvildari, 2009, in Section 1): Although this paper is targeted at self-adaptive software, the general concept applies to the whole system.

(Sylvester et al., 2006) is an early work pre-

senting an adaptive system architecture, ElastIC, for

self-healing many-core SoC. Each core is designed

with observable and tunable parameters, for instance

power monitors. A centralized DAP (diagnostic and

adaptivity processing) unit dynamically tests and re-

configures cores of degraded performance. However, (Sylvester et al., 2006) does not explore in detail the architectural support or the implementation of the system architecture.

A two-level controlling architecture is presented

in (Dafali and Diguet, 2009). A centralized conﬁgu-

ration manager determines the management policies

of the whole network, while each local manager de-

cides the reconﬁguration based on the management

policies. (Dafali and Diguet, 2009) focuses only on the design of a self-adaptive network interface, without a system-scale discussion of power efficiency or dependability.

Our previous work (Guang et al., 2010) proposes the functional overview of the hierarchical agent monitoring design paradigm. This work presents the

instruction-level architectural design and implemen-

tation speciﬁcally for NoC platforms. In particular,

both the system agent and local agents are designed

and implemented (down to the RTL-level) on a con-

crete NoC platform (Section 5.2), while (Guang et al.,

2010) only indicates the general principles of func-

tional partition and implementation manner (software

or hardware) of agents.

(Hoffmann et al., 2010) presents a so-called "heartbeat framework" for an application to notify its performance goals and behaviors to the platform observer, and obtain actual performance. The progression of

the application is symbolized as a heartbeat. By mon-

itoring the intervals between heartbeats, the platform

observer and the application can be aware of the sys-

tem performance. We integrate this application label-

ing approach into our system architecture, where the

system agent monitors the application execution time

by checking the labeled timestamps.

Compared to these existing works, this paper

makes several major contributions:

• Dual-level agent monitoring with SW/HW co-design and synthesis is a scalable approach for many-core systems with various monitoring and reconfiguration services.

• Instruction-level architectural design enables the system architecture to be integrated into the NoC design flow.

• RTL-level full system implementation (digital parts) provides accurate power and area analysis with a 65 nm multi-Vdd library.

Figure 1: System overview of hierarchical agent monitored NoC.
(Figure content: the agent layer consists of the system agent, a dedicated processor in the NoC that only runs agent functions such as mapper, scheduler and power management, and of local agents, microcontrollers wrapped with each node that act as its delegates. The application is labeled with timestamps: Application_start(), monitored_event(), Application_end(). The NoC backbone lies beneath the agent layer.)

3 SYSTEM OVERVIEW

To achieve a self-adaptive NoC, we propose that an intelligence layer be added on top of the conventional NoC system architecture (Fig. 1). The layer is composed of one system agent, which is the general manager of all monitoring and reconfiguration operations, and distributed local agents, which are delegates of the system agent that actuate the operations. In addition, the application is labeled with timestamps to make the agents aware of application progression (Section 4.1). The joint efforts of the system and local agents realize the adaptivity of the system, for example an autonomous tradeoff between power/energy and the timing requirements of the application.

In terms of function, the agent layer is orthogonal to the data computation. The underlying NoC backbone, regardless of the exact implementation (topology, routing, flow control or memory architecture), performs the conventional data communication, while the agent subsystem monitors the computation and communication. The separation of agent services makes the system architecture portable to different NoC platforms, thus improving design efficiency.

Figure 2: Labeling timestamps in the application.
(Figure content: the application labelled with timestamps calls Application_start(), Monitored_event_start(), Monitored_event_end() and Application_end(); each call is implemented as a Memory_write to one of memory_location1 to memory_location4 in the system agent's memory space.)

4 ARCHITECTURAL DESIGN

The functions of system and local agents need to be

implemented as either software instructions, micro-

controllers or hardware components. For software

agents, instructions are needed for monitoring and

reconﬁguration operations. For microcontroller or

hardware-based agents, the micro-architecture to in-

terface with the software agent and the local resources

needs to be designed.

4.1 Application Timestamps

To enable the monitoring of applications, meta-data

needs to be added in the instructions, for instance to

denote the progression of the application. Fig. 2 gives

an example of adding timestamps in the applications.

In particular, the starting and ﬁnishing time of the ap-

plication and the critical sections are labeled with spe-

cial instructions, so that the occurrence of these events

can be monitored by the system agent.

As one alternative, these timestamp-labeling instructions are implemented as memory write instructions. Specific data is written to a memory location of the system agent to signal the occurrence of the event. The allocation of the memory address can be performed during compilation, which is beyond the scope of this paper.
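The memory-write labeling described above can be sketched in C as follows. This is only an illustration under stated assumptions: the array `agent_mailbox`, the `ts_event` slots and the two helper functions are hypothetical names, the array stands in for the system agent's memory space, and writing a cycle count as the "specific data" is an illustrative choice rather than something the paper prescribes.

```c
#include <stdint.h>

/* Hypothetical event slots, one per labelled point in Fig. 2. */
enum ts_event {
    TS_APP_START = 0,
    TS_EVENT_START,
    TS_EVENT_END,
    TS_APP_END,
    TS_NUM_EVENTS
};

/* Stand-in for the system agent's memory space (assumption: on a real
 * platform these would be fixed addresses assigned at compile time). */
static volatile uint32_t agent_mailbox[TS_NUM_EVENTS];

/* Label a timestamp: one memory write into the system agent's space.
 * The written value (here a cycle count) is an illustrative choice. */
void label_timestamp(enum ts_event ev, uint32_t cycle_count)
{
    agent_mailbox[ev] = cycle_count;
}

/* Value the system agent would observe for a given event slot. */
uint32_t read_timestamp(enum ts_event ev)
{
    return agent_mailbox[ev];
}
```

With this layout, the difference between the values read for `TS_APP_END` and `TS_APP_START` gives the application execution time in the same unit as the written counter.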

4.2 System Agent

The system agent works as the “general manager”

for monitoring and reconfiguration services. Depending on the design requirements, the system agent shall

be responsible for task mapping, process scheduling,

run-time power management and fault tolerance. Due

to such diversity, the system agent is implemented as

a dedicated processor in NoC, so that the agent func-

tions can be reloaded dynamically.

Generally speaking, the system agent monitors the progression of applications and the system parameters, and reconfigures the system according to certain adaptive algorithms. In detail, the system agent first checks the start of the application (or of a frame in streaming applications), which is implemented as a blocking memory read. The application labels the timestamp when it starts (Section 4.1). To monitor a certain parameter after the application starts, the system agent first issues a command to check the run-time value of the parameter. The command is written to the memory location of the intended network node, so that the corresponding local agent receives the command. In a similar manner, the system agent can issue any number of parameter-checking commands, which are all implemented as non-blocking memory writes. Afterwards, the system agent waits for the corresponding local agents to report the monitored parameters (as memory writes; Section 4.3). These waiting operations are implemented as blocking reads. When a read completes, the system agent performs reconfiguration based on the run-time parameter value. The waits for multiple parameters run as parallel processes, since the parameters may be returned in any order. When all required monitoring and reconfiguration operations are finished, the system agent waits for the completion of the application. However, if one monitored parameter is the execution time of an application frame, the monitoring operation may finish after the frame ends.

Figure 3: Monitoring and reconfiguration software on system agent.
(Figure content: the software instructions Check_Application_Start(), Check(monitored_parameter1), Check(monitored_parameter2), Reconfiguration1(monitored_parameter1), Reconfiguration2(monitored_parameter2) and Check_Application_End() are implemented as blocking_read(memory_location1), Memory_write(command1), Memory_write(command2), two parallel processes that each perform a blocking_read on the parameter's location followed by the corresponding reconfiguration, and blocking_read(memory_location2).)

Table 1 lists the detailed C instructions (on a Leon

3 processor) on the system agent to implement moni-

toring and power management.
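The waiting side of this flow can be sketched in C as below. This is a simplified single-threaded illustration, not the paper's implementation: `report_slot`, `reports`, `reconfigure` and `poll_reports` are hypothetical names, the slots stand in for memory locations written by the local agents, and the parallel blocking reads are approximated by a polling pass that serves whichever reports have arrived.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES 3

/* Hypothetical report slot that a local agent fills with a memory write. */
struct report_slot {
    volatile bool     valid;
    volatile uint32_t load;
};

static struct report_slot reports[NUM_NODES];
static uint32_t handled_order[NUM_NODES];  /* nodes in the order served */
static int      handled_count;

/* Placeholder reconfiguration: a real agent would issue DVFS commands.
 * Here it only records which node was served. */
void reconfigure(uint32_t node, uint32_t load)
{
    (void)load;
    if (handled_count < NUM_NODES)
        handled_order[handled_count++] = node;
}

/* One polling pass over all slots; a blocking read corresponds to
 * repeating this pass until the awaited slot becomes valid.
 * Returns the number of reports consumed in this pass. */
int poll_reports(void)
{
    int consumed = 0;
    for (uint32_t node = 0; node < NUM_NODES; node++) {
        if (reports[node].valid) {
            reports[node].valid = false;
            reconfigure(node, reports[node].load);
            consumed++;
        }
    }
    return consumed;
}
```

Because each pass serves only the slots that are already valid, reports are handled in arrival order, which mirrors the point that parameters may be returned in any order.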

4.3 Local Agents

Local agents are distributed microcontrollers attached to each network node. They actuate the monitoring and reconfiguration operations as commanded by the system agent. In particular, each local agent, upon receiving a monitoring command from the system agent, reads the required parameters from the local resource (Fig. 4). Similarly, when receiving a reconfiguration command, it actuates the reconfiguration, for instance by setting the power switch and frequency generator. The interfaces to the various parameters for monitoring and reconfiguration are hardwired, so that the network node can be used as a modular component that can be integrated into any NoC system.

Figure 4: Schematics of local agent and its interfaces to system agent and network node.
(Figure content: the system agent sends monitor commands, e.g. get_load, and reconfiguration commands, e.g. DVFS_change, to the local agent microcontroller; through the wrapper, the microcontroller reads the load and other parameters, e.g. packet latency, from the network node and drives Clk_sel and Vol_sel.)

The implementation of local agents as microcontrollers reflects a tradeoff between performance, flexibility and overhead. A software-based agent, i.e. the system agent, offers the greatest flexibility, at the cost of higher operation latencies and larger area overhead. A purely hardware-based monitoring and reconfiguration circuit provides the fastest operation, but changing its function requires reconfiguring the circuit itself. A microcode approach is a suitable tradeoff between software- and hardware-based design (Chen et al., 2010) for local agents, whose operations are strictly based on the commands from the system agent.
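The command-driven behaviour of a local agent can be modelled in C as a small dispatch routine. This is a behavioural sketch only: the command names follow Table 1 and Fig. 4, but the `la_cmd` encoding, the `local_agent` state layout and the function `local_agent_exec` are assumptions made for illustration, and the real agent is a microcontroller rather than C code.

```c
#include <stdint.h>

/* Hypothetical command encoding; names follow Table 1 and Fig. 4. */
enum la_cmd {
    CMD_GET_LOAD,
    CMD_RESET_LOAD,
    CMD_SET_WINDOW,
    CMD_DVFS_CHANGE
};

/* Assumed local state: the traced workload and the hardwired outputs. */
struct local_agent {
    uint32_t load;     /* workload record for the current window */
    uint32_t window;   /* monitoring window in cycles */
    uint8_t  clk_sel;  /* frequency generator selection */
    uint8_t  vol_sel;  /* power switch selection */
};

/* Execute one command from the system agent. Returns the value to be
 * reported back (the load for CMD_GET_LOAD, 0 for all other commands). */
uint32_t local_agent_exec(struct local_agent *la, enum la_cmd cmd,
                          uint32_t arg0, uint32_t arg1)
{
    switch (cmd) {
    case CMD_GET_LOAD:
        return la->load;
    case CMD_RESET_LOAD:
        la->load = 0;
        return 0;
    case CMD_SET_WINDOW:
        la->window = arg0;
        return 0;
    case CMD_DVFS_CHANGE:
        la->clk_sel = (uint8_t)arg0;
        la->vol_sel = (uint8_t)arg1;
        return 0;
    }
    return 0;
}
```

The agent holds no policy of its own in this sketch: every state change is triggered by a command, which matches the description of local agents as delegates of the system agent.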

4.4 Architectural Integration

The agent intelligence layer is the architectural integration of the system agent and the distributed local agents, with a timestamp-labelled application (Fig. 5).

The application programmers specify the times-

tamps of monitored events in the application, for in-

stance the starting/end times of each frame. The sys-

tem designers write software instructions for moni-

toring and reconﬁguration operations with high-level

abstraction. These operations are sent to and imple-

mented by local agents, which are microcontrollers

of each network node. The wrapping of the local

agent and the resource is design speciﬁc. For instance,

if parameters from both the processing element and

the router are needed for the monitoring and recon-

ﬁguration, the local agent is attached to the whole

node. Since monitoring and reconfiguration operations are issued infrequently compared to data communication (Ciordas et al., 2008), we can reuse the existing NoC interconnect for inter-agent communication. In particular, the inputs from a processing element and the local agent share the same router port, with the agent having the higher priority in arbitration.

Table 1: Experimented instructions for monitoring and power management on system agent (a Leon 3 processor).

Instruction                                      Function
wait(memory_location)                            Wait for the occurrence of an event (the application writes the corresponding memory location)
get_load(row, column, switch)                    Check the run-time workload of a particular switch
reset_load(row, column, switch)                  Refresh the workload record of a particular switch
set_window(row, column, switch, windowsize)      Set the monitoring window
set_priority(row, column, switch, priority)      Set the priority of agent command in the network arbitration
DVFS_change(memory_location, clk_sel, vol_sel)   Change the voltage and frequency of a particular switch (denoted by the memory location)

Figure 5: Integrating hierarchical agents as an intelligence layer.
(Figure content: the management software on the system agent issues check(parameter1), check(parameter2), Reconfiguration1(parameter1) and Reconfiguration2(parameter2); commands and monitored data travel as inter-agent communication over the routers; each local agent, wrapped with its processing element and router, checks parameters and reconfigures its node; the application marks Application_start, Start(monitored_event1), End(monitored_event1) and Application_end.)

Due to the SW/HW co-design and modularized ar-

chitectural integration, the agent intelligence layer is

highly scalable. The local agent wrapper can be ap-

plied to any NoC node (or a particular NoC compo-

nent, e.g. router), and be used as a “building brick”

to construct a NoC of any size without incurring ad-

ditional overhead. The software-based system agent,

on the other hand, can be written with various mon-

itoring and reconﬁguration instructions as needed for

the application.

5 SELF-ADAPTIVE POWER MANAGEMENT

To demonstrate the effectiveness and overheads of us-

ing dual-level agents, we have used best-effort per-

core DVFS on the existing NoC platform. Based on

the speciﬁed parameters (e.g. peak load and aver-

age load), the local agents trace run-time system in-

formation. Upon the request of the system agent,

they return the recorded values. Depending upon

the provided information and the application perfor-

mance constraints, the system agent adjusts the volt-

age and/or frequency to optimize the power and en-

ergy consumption.

5.1 Best-effort Per-core DVFS (BEPCD)

The adaptive power management using distributed

DVFS with run-time application performance moni-

toring, abbreviated as BEPCD, is illustrated in Fig.

6. P, S, LT, F and Ts represent processor, switch, low-traffic switch (the switch with the lowest workload), switch frequency and threshold time (the application latency), respectively. The terms inside parentheses represent the function to be performed on the entity to the left (e.g. "P(any) starts?" asks whether any of the processors has started). Simply put, the process is

performed in three steps: (i) the initialization of volt-

age and frequency of each switch and the setting of

application latency requirement, (ii) run-time tracing

of the workload of each switch and the application la-

tency (Section 4.2), (iii) if the latency is lower than

the constraint, DVFS is applied to the switch with the

lowest workload.
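Step (iii) can be sketched in C as a single decision per iteration. The sketch rests on several assumptions: `vf_table`, `level_mhz` and `bepcd_step` are hypothetical names, the ladder of operating points takes the pairs that met timing in Table 2 and pairs each frequency with the lowest supporting voltage, and the revert-on-violation handling described later in the results is left out.

```c
#define NUM_LEVELS 6

/* Operating points that met timing in Table 2, fastest first. Pairing
 * each frequency with the lowest supporting voltage is an assumption. */
static const struct { double volt; unsigned mhz; } vf_table[NUM_LEVELS] = {
    {1.32, 300}, {1.10, 200}, {1.10, 100},
    {1.10, 50},  {1.10, 40},  {1.10, 20}
};

/* Frequency in MHz of a given level (0 is the fastest). */
unsigned level_mhz(unsigned level)
{
    return vf_table[level].mhz;
}

/* One BEPCD iteration: if the application execution time (AET) does not
 * violate the deadline, lower the switch with the smallest peak load by
 * one level. Returns the index of the lowered switch, or -1 if the
 * deadline is violated or every switch is already at the lowest level. */
int bepcd_step(unsigned level[], const unsigned peak_load[], int n,
               unsigned long aet_ns, unsigned long deadline_ns)
{
    int target = -1;

    if (aet_ns > deadline_ns)
        return -1;

    for (int i = 0; i < n; i++) {
        if (level[i] + 1 >= NUM_LEVELS)
            continue;  /* cannot be lowered further */
        if (target < 0 || peak_load[i] < peak_load[target])
            target = i;
    }
    if (target >= 0)
        level[target]++;
    return target;
}
```

The system agent would call such a step once per application frame, after collecting the peak loads from the local agents and the execution time from the timestamps.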

5.2 NoC Infrastructure

An in-house cycle-accurate RTL-level NoC simula-

tor (Jean-Michel Chabloz, 2012) is used for exper-

iments. The simulator is based on Nostrum archi-

tecture (Nostrum, 2011), with X-Y dimension-order

routing and deﬂection routing (when contention is en-

countered). Each processing element is a Leon 3 pro-

cessor as a fully synthesizable IP block. Distributed

shared memory is used, with a distributed memory controller attached to each network node. Each router or processing element can be configured with different voltage and frequency values. Such a module is equipped with its own clock generation unit, which is fully implemented in the simulator. It is also equipped with interfaces (power switches) to connect to different power grids for run-time voltage scaling, as presented in (Truong et al., 2009). The power switch is only emulated as a latency module since it is an analog circuit. A conservative voltage switching delay of 27 ns (Jean-Michel Chabloz, 2012) is set in the simulator, which is larger than the delay measured on a real NoC prototype (less than 20 ns; Truong et al., 2009). To support such fine-grained power islands, the NoC infrastructure adopts globally ratiochronous locally synchronous (GRLS) clocking (Chabloz and Hemani, 2010). In particular, all clocks on the chip run at frequencies that are submultiples of a certain base frequency. This restriction significantly simplifies the implementation of synchronizers with low latency and overhead. Synchronizing registers (4 flip-flops per data line) are used between two different clock regions to reduce metastability, as suggested in (Chabloz and Hemani, 2010).

Figure 6: Per-core DVFS for best-effort power management with run-time performance monitoring.

5.3 Experiment Setup

To identify the voltages and their corresponding supported frequencies, the switches were synthesized using Synopsys Design Compiler for a 65 nm multi-Vdd technology (Table 2). The technology supports voltages from 1.1 V to 1.32 V. The synthesis results reveal that the routers support frequencies of up to 300 MHz at 1.32 V and up to 200 MHz at 1.1 V. Based on GRLS clocking in the NoC platform (Section 5.2), the allowable frequencies are 300, 200, 100, 50, 40, and 20 MHz (exact divisors of 600 MHz, the least common multiple of 300 MHz and 200 MHz).
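The frequency constraint can be checked with a few lines of C. The helper names `gcd_u`, `lcm_u` and `is_submultiple` are hypothetical; the sketch only verifies the arithmetic stated above, namely that 600 MHz is the least common multiple of 300 MHz and 200 MHz and that every allowable frequency divides it exactly.

```c
/* Greatest common divisor by Euclid's algorithm. */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple; dividing first avoids needless overflow. */
unsigned lcm_u(unsigned a, unsigned b)
{
    return a / gcd_u(a, b) * b;
}

/* True if f_mhz is an exact divisor (submultiple) of base_mhz. */
int is_submultiple(unsigned base_mhz, unsigned f_mhz)
{
    return f_mhz != 0 && base_mhz % f_mhz == 0;
}
```

Under this check, 400 MHz is not a submultiple of 600 MHz, whereas 300, 200, 100, 50, 40 and 20 MHz all are.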

Table 2: Voltage frequency pairs.

Voltage (V)   Frequency (MHz)   Timing constraints
1.32          400               violated
1.32          300               met
1.32          200               met
1.1           400               violated
1.1           300               violated
1.1           200               met
1.1           100               met
1.1           50                met
1.1           40                met
1.1           20                met

Four applications (matrix multiplication, FFT,

wavefront, and hiperLAN transmitter) are mapped on

a 3x3 mesh-based NoC. The absence of DSPs in the existing NoC platform prevents us from meeting the deadline (4 µs/frame) of the hiperLAN transmitter. Thus we set the deadline to the minimal latency of the application on the NoC platform (39 µs), obtained when all routers are configured with the highest frequency.

To analyze the power and energy consumption, the

switching activity ﬁles are generated for each applica-

tion from Cadence NCSim. The NoC routers are syn-

thesized using 65 nm multi-Vdd library. The power

analysis is performed by Synopsys design compiler

on the synthesized NoC routers with the generated

switching activity ﬁles.

5.4 Experiment Result

Four benchmarks (matrix multiplication, FFT, wavefront, and hiperLAN) were run with the BEPCD algorithm. Initially, the system agent assigned the maximum frequency (300 MHz) and voltage (1.32 V) to all switches. At each iteration, the application execution time was monitored and, if it did not violate the timing deadline, the next lower voltage/frequency pair from Table 2 was assigned to the lowest-traffic switch (in terms of peak load in a time window of 40 cycles).

Tables 3, 4, 6, and 5 show the energy and power savings of each of the four benchmarks. In the tables, the second column shows the number of the switch that changes its voltage/frequency, followed by "f" or "vf": "f" indicates a frequency change only, while "vf" indicates that both the voltage and frequency change.
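The saving columns are relative reductions with respect to the first iteration, in which all switches run at the maximum voltage and frequency. A minimal C sketch of that computation follows; `saving_percent` is a hypothetical helper name, and reading the columns as (baseline - value) / baseline is an assumption that is consistent with the tabulated power values.

```c
/* Relative saving in percent of `value` with respect to `baseline`. */
double saving_percent(double baseline, double value)
{
    return 100.0 * (baseline - value) / baseline;
}
```

For example, the matrix multiplication powers of 16.35 mW (iteration 1) and 15.84 mW (iteration 2) give a saving of roughly 3.1 %, and the hiperLAN powers of 45.61 mW and 35.90 mW give roughly 21.3 %, close to the values listed in Tables 3 and 5.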

The power and energy trends for each of the four applications are depicted in Figure 7. As a consequence of BEPCD, the NoC quickly iterates towards the minimum power for each application. If the targeted switch is on the critical path, the application execution time (AET), as expected, increases with a decrease in voltage/frequency (iterations 3 to 6 and 7 to 9 in Table 3, iterations 2 and 4 in Table 6). The AET remains unaffected if the switch is not on the critical path (Table 5, iterations 6 to 13 in Table 6). In some situations, memory contention is reduced by the voltage/frequency decrease, so the AET may also decrease (iterations 7 and 10 in Table 3, iterations 7 and 10 in Table 4, and iteration 3 in Table 6).

Table 3: Energy and power savings for matrix multiplication.

Iteration   Switch   Time (ns)   Energy (mJ)   Power (mW)   Energy saving (%)   Power saving (%)
1           -        105834      1.73          16.35        0                   0
2           1vf      105834      1.67          15.84        3.11                3.11
3           3vf      106808      1.26          11.84        26.90               27.56
4           3f       107415      1.21          11.31        29.78               30.82
5           3f       112134      1.25          11.20        27.39               31.46
6           3f       116373      1.27          10.99        26.07               32.76
7           1f       101815      1.11          10.97        35.46               32.91
8           2vf      108774      1.92          16.96        31.11               32.97
9           2f       113100      1.17          10.41        31.97               36.34
10          2f       111134      1.15          10.38        33.32               36.50
11          2f       111467      1.57          10.38        33.12               36.53

Table 4: Energy and power savings for FFT.

Iteration   Switch   Time (ns)   Energy (mJ)   Power (mW)   Energy saving (%)   Power saving (%)
1           -        381615      17.40         45.61        0                   0
2           3vf      381615      15.87         41.60        8.78                8.78
3           3f       381615      15.67         41.07        9.95                9.95
4           3f       381616      14.10         36.96        18.95               18.95
5           1vf      377320      13.66         36.21        21.49               20.59
6           1f       430525      15.54         36.11        10.68               20.83
7           1f       381616      16.69         35.89        21.29               21.29
8           2vf      381616      13.69         35.89        21.29               21.29
9           2f       381154      13.68         35.89        21.39               21.29
10          2f       376549      12.01         31.89        30.99               30.06

BEPCD iterates only as long as the application meets its deadline. To cater for sudden changes in execution time resulting from massive memory contention (iteration 6 in Table 4), the algorithm performs an additional iteration to check whether a further reduction in frequency would reduce the time. If no reduction is observed, the switch is reverted to its original frequency and no further DVFS commands are issued. The plots confirm the significant advantages of our proposed strategy (from 21% to 33% decrease in energy and from 21% to 36% decrease in power consumption).

Table 5: Energy and power savings for HiperLAN.

Iteration   Switch   Time (ns)   Energy (mJ)   Power (mW)   Energy saving (%)   Power saving (%)
1           -        39000       1.77          45.61        0                   0
2           1vf      39000       1.62          41.60        8.78                8.78
3           3vf      39000       1.60          41.07        9.95                9.95
4           3f       39000       1.44          37.06        18.73               18.73
5           3f       39000       1.42          36.54        19.88               19.88
6           3f       39000       1.42          36.42        20.13               20.13
7           1f       39000       1.41          36.21        20.59               20.59
8           -        39000       1.40          35.90        21.29               21.29
9           -        39000       1.40          35.90        21.29               21.29
10          -        39000       1.40          35.90        21.29               21.29
11          -        39000       1.40          35.90        21.29               21.29
12          -        39000       1.40          35.90        21.29               21.29

Table 6: Energy and power savings for wavefront.

Iteration   Switch   Time (ns)   Energy (mJ)   Power (mW)   Energy saving (%)   Power saving (%)
1           -        91970       1.51          16.50        0                   0
2           3vf      110234      1.37          12.50        9.15                31.94
3           3f       106529      1.28          12.03        15.51               37.09
4           3f       110294      1.32          11.97        12.96               37.79
5           -        110294      1.32          11.97        12.96               37.79
6           -        110294      1.32          11.97        12.96               37.79
7           -        110294      1.32          11.97        12.96               37.79
8           -        110294      1.32          11.97        12.96               37.79
9           -        110294      1.32          11.97        12.96               37.79
10          -        110294      1.32          11.97        12.96               37.79

5.5 Overhead Analysis

To evaluate the overhead of the dual-level agent intel-

ligence layer, we need to analyze the area overhead

of microcontroller-based local agent (Fig. 4) and the

instruction overhead of software-based system agent

(Fig. 3).

At 300 MHz and an operating voltage of 1.32 V, Synopsys Design Compiler reports an area of 1459 µm² for each local agent, which is negligible (4 %) compared to the router area (33806 µm²). The local agent does not add any timing overhead as it is not on the critical path of the switch. Concerning the software overhead of the system agent, the BEPCD algorithm only amounts to 279 lines of C code on the Leon 3 processor.
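The quoted area overhead follows directly from the two synthesis figures. As a worked ratio (the symbols A_agent and A_router are introduced here only for illustration):

```latex
\frac{A_{\text{agent}}}{A_{\text{router}}}
  = \frac{1459\ \mu\text{m}^2}{33806\ \mu\text{m}^2}
  \approx 0.043 \approx 4\,\%
```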

The overhead analysis shows that dual-level agent monitoring incurs minimal hardware area overhead and software instruction overhead. Thus the system architecture is scalable to large NoCs with a diversity of monitoring and reconfiguration functions.


Figure 7: Energy and power comparison for (a) matrix mul-

tiplication, (b) FFT, (c) wavefront, and (d) hiperLAN.

6 CONCLUSIONS AND FUTURE WORK

We have presented the design and implementation of a generic and scalable self-adaptive NoC architecture. The system is monitored and reconfigured by dual-level agents with SW/HW co-design and synthesis. The system agent is implemented in software, with high-level instructions tailored for issuing adaptive operations. A local agent is attached to each network node and implemented as a microcontroller. It provides tracing and reconfiguration of the local circuit parameters, based on the run-time adaptation commands from the system agent. The dual-level agents work jointly toward the performance goals of the application, with the monitored events labeled with timestamps. Separating the intelligence layer from the NoC infrastructure makes the approach generic and improves design efficiency. The SW/HW co-design and synthesis effectively reduce the hardware overhead while offering flexibility for adaptive operations.

We demonstrated the effectiveness and the scalability of the system architecture with best-effort dynamic power management using distributed DVFS. In this case study, the application execution time and the run-time workloads of all routers are directly monitored by the agents. The router with the lowest workload is switched to a lower voltage and/or frequency when there is positive slack in the application latency (per frame/stream). The experiments were performed with four benchmarks (matrix multiplication, FFT, wavefront, and hiperLAN transmitter) on a cycle-accurate RTL-level NoC simulator. With a 65 nm multi-Vdd library for synthesis and power analysis, we showed that the adaptive power management saves up to 33% energy and up to 36% power. The hardware overhead of each local agent is only 4% of the router area.
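The best-effort policy summarized above can be illustrated with a minimal C sketch of one decision step, roughly as a system agent might run it once per frame. This is not the BEPCD implementation. The type `router_state_t`, the function `best_effort_step`, the constant `NUM_LEVELS`, and the choice to skip routers already at the lowest level are all assumptions made for illustration. The sketch does nothing when there is no positive slack, since the handling of that case is not described in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of discrete voltage/frequency levels per router. */
#define NUM_LEVELS 3

/* Hypothetical per-router record kept by the system agent. */
typedef struct {
    uint32_t workload; /* monitored run-time workload in the last window */
    uint8_t  level;    /* 0 = highest V/f, NUM_LEVELS - 1 = lowest V/f */
} router_state_t;

/*
 * One best-effort step: if the measured execution time leaves positive
 * slack against the deadline, move the lowest-workload router that can
 * still be scaled down to the next lower V/f level.
 * Returns the index of the router that was scaled, or -1 if none was.
 */
int best_effort_step(router_state_t *routers, size_t n,
                     uint32_t exec_time, uint32_t deadline)
{
    assert(routers != NULL && n > 0);

    /* No positive slack: leave the configuration untouched. */
    if (exec_time >= deadline)
        return -1;

    int best = -1;
    for (size_t i = 0; i < n; ++i) {
        /* Assumption: routers already at the lowest level are skipped. */
        if (routers[i].level >= NUM_LEVELS - 1)
            continue;
        if (best < 0 || routers[i].workload < routers[best].workload)
            best = (int)i;
    }

    if (best >= 0)
        routers[best].level++; /* one step down in voltage/frequency */
    return best;
}
```

In a setup like the one described, the returned index would identify the router whose local agent receives the reconfiguration command. The command format itself is outside the scope of this sketch.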

In future work, we will present a complete design chain for the system architecture, including application mapping and scheduling, followed by run-time monitoring and reconfiguration. The inter-agent communication will also be provided with guaranteed services.

REFERENCES

Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J.,

Husbands, P., Keutzer, K., Patterson, D. A., Plishker,

W. L., Shalf, J., Williams, S. W., and Yelick, K. A.

(2006). The landscape of parallel computing re-

search: A view from berkeley. Technical report,

U.C.Berkeley.

Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Ke-

shavarzi, A., and De, V. (2003). Parameter variations

and impact on circuits and microarchitecture. In Proc.

Design Automation Conference, pages 338–342.

Chabloz, J. M. and Hemani, A. (2010). Distributed dvfs

using rationally-related frequencies and discrete volt-

age levels. In Low-Power Electronics and Design

(ISLPED), 2010 ACM/IEEE International Symposium

on, pages 247–252.

Chen, X., Lu, Z., Jantsch, A., and Chen, S. (2010). Support-

ing distributed shared memory on multi-core network-

on-chips using a dual microcoded controller. In Proc.

Design, Automation & Test in Europe Conf. & Exhibi-

tion (DATE), pages 39–44.

Ciordas, C. (2008). Monitoring-Aware Network-on-Chip

Design. PhD thesis, Eindhoven University of Tech-

nology.

Ciordas, C., Hansson, A., Goossens, K., and Basten, T.

(2008). A monitoring-aware network-on-chip design

flow. J. Syst. Archit., 54:397–410.

Dafali, R. and Diguet, J.-P. (2009). Self-adaptive network

interface (sani): Local component of a noc configuration manager. In Proc. Int. Conf. Reconfigurable Computing and FPGAs ReConFig ’09, pages 296–301.

Guang, L., Nigussie, E., Isoaho, J., Rantala, P., and Ten-

hunen, H. (2010). Interconnection alternatives for hi-

erarchical monitoring communication in parallel socs.

Microprocessors and Microsystems, 34(5):118–128.

Hoffmann, H., Eastep, J., Santambrogio, M. D., Miller,

J. E., and Agarwal, A. (2010). Application heartbeats

for software performance and health. In Proceed-

ings of the 15th ACM SIGPLAN symposium on Prin-

ciples and practice of parallel programming, PPoPP

’10, pages 347–348, New York, NY, USA. ACM.

Howard, J., Dighe, S., Vangal, S. R., Ruhl, G., Borkar,

N., Jain, S., Erraguntla, V., Konow, M., Riepen, M.,


Gries, M., Droege, G., Lund-Larsen, T., Steibl, S.,

Borkar, S., De, V. K., and Van Der Wijngaart, R.

(2011). A 48-core ia-32 processor in 45 nm cmos us-

ing on-die message-passing and dvfs for performance

and power scaling. IEEE Journal of Solid-State Cir-

cuits, 46(1):173–183.

Chabloz, J. M. and Hemani, A. (2012). Scalable Multi-core Architectures, chapter Power Management Architecture in McNoC, pages 55–80. Springer Science+Business Media, LLC.

Nostrum (2011). http://www.ict.kth.se/nostrum/.

Rabaey, J. M. (2007). Scaling the power wall: Revisiting

the low-power design rules. Keynote speech at SoC

07 Symposium.

Salehie, M. and Tahvildari, L. (2009). Self-adaptive soft-

ware: Landscape and research challenges. ACM

Trans. Auton. Adapt. Syst., 4:14:1–14:42.

Sylvester, D., Blaauw, D., and Karl, E. (2006). Elas-

tic: An adaptive self-healing architecture for unpre-

dictable silicon. IEEE Design & Test of Computers,

23(6):484–490.

Truong, D., Cheng, W., Mohsenin, T., Yu, Z., Jacobson,

A., Landge, G., Meeuwsen, M., Watnik, C., Tran, A.,

Xiao, Z., Work, E., Webb, J., Mejia, P., and Baas, B.

(2009). A 167-processor computational platform in

65 nm cmos. IEEE Journal of Solid State Circuits,

44(4):1130–1144.

van Berkel, C. (2009). Multi-core for mobile phones. In

Design, Automation & Test in Europe Conference &

Exhibition, 2009. DATE ’09.
