An Application- and Platform-agnostic Runtime Management
Framework for Multicore Systems
Graeme M. Bragg¹, Charles Leech¹, Domenico Balsamo¹, James J. Davis², Eduardo Wachter¹, Geoff V. Merrett¹, George A. Constantinides² and Bashir M. Al-Hashimi¹
¹ School of Electronics and Computer Science, University of Southampton, SO17 1BJ, U.K.
² Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, U.K.
Keywords: Heterogeneous Systems, Runtime Management, Software Framework.
Abstract:
Heterogeneous multiprocessor systems have increased in complexity to provide both high performance and
energy efficiency for a diverse range of applications. This motivates the need for a standard framework that
enables the management, at runtime, of software applications executing on these processors. This paper
proposes the first fully application- and platform-agnostic framework for runtime management approaches
that control and optimise software applications and hardware resources. This is achieved by separating the
system into three distinct layers connected by an API and cross-layer constructs called knobs and monitors.
The proposed framework also supports the management of applications that are executing concurrently on
heterogeneous platforms. The operation of the proposed framework is experimentally validated using a basic
runtime controller and two heterogeneous platforms, to show how it is application- and platform-agnostic and
easy to use. Furthermore, the management of concurrently executing applications through the framework is
demonstrated. Finally, two recently reported runtime management approaches are implemented to demonstrate
how the framework enables their operation and comparison. The energy and latency overheads introduced by
the framework have been quantified and an open-source implementation has been released (footnote a).
1 INTRODUCTION
The management and control of hardware settings at runtime is crucial to the efficient execution of applications with varying performance requirements on embedded platforms. This has, however, become a non-trivial task for multi-core and heterogeneous embedded systems. In addition, applications have become increasingly dynamic in order to exploit the capabilities of these systems, with adjustable parameters that must be tuned to optimise their behaviour. As a result, the proactive optimisation of application performance and system energy efficiency is a key research challenge. Runtime management is a solution that enables optimisation of, and tradeoff between, quality, application throughput and energy with varying requirements.
One way in which this can be achieved is by the exposure and adaptation of tunable parameters from the application and platform through a consistent framework interface. However, the majority of current frameworks only provide a mechanism to monitor application performance, and do not allow for the simultaneous monitoring and control of hardware components and applications at runtime. Moreover, most existing frameworks do not support heterogeneous platforms, which contain processors with differing capabilities, or the management of concurrent applications.
Footnote a: Available at: https://github.com/PRiME-project/PRiME-Framework
This paper presents the first framework for fully application- and platform-agnostic runtime management that enables the simultaneous control and optimisation of software applications and hardware resources. This is achieved by separating systems into three distinct layers: application, runtime management and device. These layers are connected through cross-layer constructs called knobs and monitors, accessed through a novel application programming interface (API), which enable the flow of information between layers and the control and monitoring of runtime-tunable and -observable parameters.
Bragg, G., Leech, C., Balsamo, D., Davis, J., Wachter, E., Merrett, G., Constantinides, G. and Al-Hashimi, B.
An Application- and Platform-agnostic Runtime Management Framework for Multicore Systems.
DOI: 10.5220/0006939100570066
In Proceedings of the 8th International Joint Conference on Pervasive and Embedded Computing and Communication Systems (PECCS 2018), pages 57-66
ISBN: 978-989-758-322-3
Copyright © 2018 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
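Table 1: Properties of state-of-the-art frameworks for runtime management of applications on multiprocessor systems (✓ = supported, ✗ = not supported).
Framework | App-RTM knobs | App-RTM monitors | Non-temp. monitors | Multiple monitors | RTM-device knobs | RTM-device monitors | Monitor bounds | Hetero. platforms | Open source
----------|---------------|------------------|--------------------|-------------------|------------------|---------------------|----------------|-------------------|------------
Heartbeats (Hoffmann et al., 2010) | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓
PowerDial (Hoffmann et al., 2011) | ✓ | Heartbeats | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗
Heterogeneous Heartbeats (Fleming and Thomas, 2014) | ✗ | Heartbeats | ✗ | ✗ | ✗ | ✓ | ✗ | CPU+FPGA | ✓
ARGO (Gadioli et al., 2015) | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗
AS-RTM (Paone et al., 2014) | ✓ | Heartbeats | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗
PTRADE (Hoffmann et al., 2013) | ✗ | Heartbeats | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗
DRM (Baldassari et al., 2017) | ✗ | Heartbeats | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗
BEEPS (Gaspar et al., 2015) | ✗ | Heartbeats | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗
Proposed | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓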
This reduces the design complexity by enabling the runtime management layer to provide a specific service to the applications, e.g. to meet a performance requirement, whilst meeting optimisation targets by controlling the hardware. The framework’s novel features include:
• The ability to control and monitor applications and hardware simultaneously using a cross-layered approach.
• An API that provides a consistent way in which knobs and monitors are specified and monitored across applications and platforms.
• A mechanism to enable the management of concurrently executing applications and heterogeneous platforms.
Additionally, the framework enables the direct comparison of runtime management approaches and algorithms, which has not previously been possible, and simplifies runtime manager (RTM) development.
In the remainder of this paper, a survey of existing frameworks is carried out to contrast the proposed framework against the state of the art. The proposed framework is experimentally validated with a range of applications and two different types of heterogeneous platform to demonstrate its application- and platform-agnostic properties and to illustrate its ease of use. The management of two concurrently-executing applications is then demonstrated. In addition, two recently reported runtime management approaches, the first based on performance counter-driven control and the second using reinforcement learning, are implemented with the framework to demonstrate how the framework enables their operation and comparison. Finally, the energy and latency overheads of the proposed framework are quantified. An open-source C++ implementation of the framework and API has also been released.
2 RELATED WORK
Various runtime management approaches exist in the literature for optimising system behaviour, whilst satisfying application requirements. These include dynamic voltage and frequency scaling (DVFS) (Das et al., 2014; Wang et al., 2017), per-core power gating (Rahmani et al., 2017), dynamic task mapping and thread migration (Reddy et al., 2017). While RTMs are typically designed to address general challenges, such as energy efficiency or thermal management, they are largely implemented on specific platforms or with specific classes of application, e.g. multimedia (Kim et al., 2017) or image processing (Yang et al., 2015).
In addition, benchmarks are typically used to assess relative performance and measure specific aspects of RTMs and hardware platforms. However, they do not typically expose application requirements (e.g. error or accuracy) in addition to performance
PEC 2018 - International Conference on Pervasive and Embedded Computing
and this can limit the range of optimisation opportunities of runtime management approaches. Furthermore, source code for RTMs is often not released, with limited detail on implementation reported, making reproduction of results a non-trivial task. This prevents the direct comparison of approaches, with several works relying on comparison via Linux governors (Singla et al., 2015; Reddy et al., 2017).
Runtime management can be enhanced by the exposure of dynamic knobs and monitors, which provide a mechanism to communicate with the application and platform. Specifically, knobs allow the tuning of hardware and application parameters by the RTM, while monitors enable the measurement of hardware properties and the observation of application behaviour, including the setting of performance targets by the application (Hoffmann et al., 2011; Fleming and Thomas, 2014; Gadioli et al., 2015; Leech et al., 2018b). In addition, knobs and monitors can be used to explore application-device tradeoffs, such as throughput-power (Hoffmann et al., 2013) and precision-throughput (Sui et al., 2016), and locate optimal operating points for applications (Vassiliadis et al., 2016). However, runtime management lacks portability unless these knobs and monitors are exposed through a consistent interface.
Several frameworks have been proposed in the past to address the challenge of providing such interfaces. Table 1 summarises their features. The most relevant framework is the Heartbeats API (Hoffmann et al., 2010), which provides a standardised interface for single or concurrent applications to communicate their current and target performance to external observers, such as an RTM. The Heartbeats API only allows applications to communicate their throughput (i.e. the heart rate); it therefore does not allow other types of parameters to be exposed, such as accuracy and error (classed as non-temporal monitors in column four of Table 1), and prevents tradeoffs between them. In addition, it does not extend this interface for monitoring or control of device parameters. Most of the frameworks reported in Table 1 are based on the Heartbeats concept and inherit its features, e.g. application monitors (column three).
In order to perform tradeoffs within a single application, multiple monitors of different types must be exposed, e.g. throughput and error. Column five of Table 1 shows that Heartbeats, and most of the frameworks that rely on it, do not support this functionality. In addition, for an application to meet its requirements, a target can be specified with the monitor. However, there is no indication as to whether the target is a maximisation or minimisation objective, as listed in column eight. As a result, these approaches do not allow fully application-agnostic behaviour.
Figure 1: Cross-layer framework and API enabling communication between the application, runtime management and device layers using knobs and monitors. Examples are given for an image filter application on a CPU.
Columns six and seven show that current frameworks only provide partial abstraction of RTM to device communication, and do not include both knobs and monitors to control hardware components at runtime. Moreover, most existing works do not operate on heterogeneous platforms (column nine), which provide both high performance and energy efficiency by combining conventional CPUs with other accelerators. These platforms typically increase the scalability of parallel applications and systems, and therefore they need to be managed by a framework that supports device-agnostic control. One framework supports a heterogeneous platform; however, it has been designed for a specific platform and introduces a hardware dependency in the process (Fleming and Thomas, 2014). This restricts the cross-platform capabilities of current frameworks, meaning that they do not allow current RTM approaches to be portable across multiple platforms.
3 PROPOSED FRAMEWORK
To address the limitations of existing frameworks discussed in Section 2, a framework for application- and platform-agnostic runtime management of heterogeneous systems is presented. Figure 1 shows the proposed framework and how the three layers are connected by novel APIs (App to RTM API and RTM to device API). This provides consistent interfaces from an RTM to both hardware platforms and applications, which enables the design and implementation of application- and platform-agnostic runtime management approaches. As discussed in Section 2, application knobs expose tunable application parameters, e.g. filter precision, while monitors convey information about the behaviour of the applications, e.g. frame rate. Similarly, device knobs expose tunable device parameters while monitors convey information about the status of devices. Exposing knobs and monitors at both the application and device layers enables tradeoffs, e.g. performance-energy or accuracy-temperature, to be explored and exploited by the runtime management layer.
Table 2: Application-to-RTM and RTM-to-device API functions for the proposed framework.
Layer | Construct | Space | Identifier | Input(s) | Output(s) | Description
------|-----------|-------|------------|----------|-----------|------------
app | knob | disc/cont | min | knob, min | - | Update application knob’s minimum allowed value
app | knob | disc/cont | max | knob, max | - | Update application knob’s maximum allowed value
app | knob | disc/cont | get | knob | value | Pull application knob’s current value
app | mon | disc/cont | min | mon, min | - | Update application monitor’s minimum desired value
app | mon | disc/cont | max | mon, max | - | Update application monitor’s maximum desired value
app | mon | disc/cont | weight | mon, weight | - | Update application monitor’s relative importance
app | mon | disc/cont | set | mon, value | - | Push application monitor’s current value
dev | knob | disc/cont | min | knob | min | Pull device knob’s minimum allowed value
dev | knob | disc/cont | max | knob | max | Pull device knob’s maximum allowed value
dev | knob | disc/cont | init | knob | init | Pull device knob’s initial (default) value
dev | knob | disc/cont | type | knob | type | Pull device knob’s type
dev | knob | disc/cont | set | knob, value | - | Push device knob’s current value
dev | mon | disc/cont | type | mon | type | Pull device monitor’s type
dev | mon | disc/cont | get | mon | value | Pull device monitor’s current value and bounds
In addition, the proposed framework facilitates the comparison of existing RTMs as well as the management of concurrently-executing applications and heterogeneous platforms. The remainder of this section provides an overview of the technical concepts of the proposed framework and details of the novel API.
3.1 Framework Concepts
Structure: The separation of the system into the three distinct layers (application, runtime management and device) shown in Figure 1 reduces design complexity and provides flexibility during operation. The application layer comprises any number of software processes, while the device layer includes the hardware and its software drivers. The runtime management layer comprises an RTM responsible for the control and monitoring of the other two layers. This separation ensures portability and cross-compatibility; applications and device drivers only need to be written once to be used with any implemented RTM.
The framework can be viewed hierarchically “downwards” since, as far as knob and monitor control is concerned, applications are masters of the RTM. Applications make calls to the API, controlling the presence and configuration of each knob and monitor. Devices, meanwhile, are the RTM’s slaves since they must respond to requests to set and get knob and monitor values, respectively. Thus, applications “pull” their knob settings from the RTM and “push” monitor updates, while device knobs are pushed from the RTM and monitor values pulled.
Communication: Knobs and monitors, shown in the dashed regions of Figure 1, facilitate communication between the layers. Bounds are attached to both knobs and monitors, in the form of minima and maxima, which allow applications and devices to inform an RTM of targets and constraints. Knob bounds represent a range of allowed values while monitor bounds represent a range of desired values, rather than a single target. An RTM’s primary objective is to ensure that the monitor values of all applications and the device remain within their specified bounds. Beyond this, it is free to optimise any unbounded monitors in order to meet secondary objectives, e.g. to reduce power consumption. Minimal modification of applications is required to expose knobs and monitors through the framework.
The image filtering application shown in Figure 1 provides the option of selecting float or double precision for its numeric operations at runtime. This choice will be controlled by an RTM using an application knob with options {0, 1}. If the same application requires a minimum throughput, e.g. expressed as a frame rate α, an application monitor with this bound can be provided. In this case, the application will periodically update the current frame rate so that the RTM can keep it within the range [α, ∞). On the hardware side, DVFS of the CPU is achieved via a device knob with options {0, 1, ..., 9}, enabling the RTM to switch between ten distinct voltage-frequency pairs.
Finally, to enable thermal management by the RTM,
a temperature sensor is exposed as a device monitor.
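The bound semantics described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the type names `DiscKnob` and `ContMonitor`, the constant `kContMax` and both helper functions are hypothetical and are not taken from the released framework.

```cpp
#include <limits>

// Hypothetical types for illustration; not the framework's actual classes.
struct DiscKnob {
    int min;  // lowest allowed setting
    int max;  // highest allowed setting
    int val;  // current setting chosen by the RTM
};

struct ContMonitor {
    double min;  // lower end of the desired range
    double max;  // upper end of the desired range
    double val;  // most recent value reported for this monitor
};

// Assumed stand-in for an "unbounded" upper end, e.g. the range [alpha, inf).
constexpr double kContMax = std::numeric_limits<double>::max();

// A monitor is satisfied when its value lies inside its desired range.
inline bool within_bounds(const ContMonitor& m) {
    return m.val >= m.min && m.val <= m.max;
}

// Clamp a requested knob setting to the range its owner allows.
inline int clamp_knob(const DiscKnob& k, int requested) {
    if (requested < k.min) return k.min;
    if (requested > k.max) return k.max;
    return requested;
}
```

For the image filter example, a frame-rate monitor with a minimum of α would be `ContMonitor{alpha, kContMax, current_fps}`, and the ten-step DVFS knob would be `DiscKnob{0, 9, initial}`.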
Weights: Individual applications may feature multiple performance objectives with differing priorities. For example, an application aware of both its throughput and accuracy may wish to prioritise the optimisation of one over the other. In the proposed framework, such priorities are expressed with a numeric weight attached to each monitor. These weights instruct the RTM to expend proportional effort in optimising each monitor’s value. In a similar manner, application priority is indicated through attached weights such that, for example, a higher level of performance can be ensured by foreground processes.
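One possible reading of "proportional effort" is that an RTM normalises the weights of an application's monitors into shares that sum to one. The sketch below assumes that interpretation; the function name `effort_shares` is hypothetical and the paper does not prescribe how an RTM should consume weights.

```cpp
#include <cstddef>
#include <vector>

// Convert raw monitor weights into fractional shares that sum to one.
// If no positive weight is given, every share is zero (no preference expressed).
std::vector<double> effort_shares(const std::vector<double>& weights) {
    double total = 0.0;
    for (double w : weights) total += w;

    std::vector<double> shares(weights.size(), 0.0);
    if (total <= 0.0) return shares;

    for (std::size_t i = 0; i < weights.size(); ++i) {
        shares[i] = weights[i] / total;
    }
    return shares;
}
```

With weights of 3.0 for throughput and 1.0 for accuracy, this yields shares of 0.75 and 0.25, which an RTM could use to scale the cost terms of its optimiser.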
Concurrency: Real-world systems commonly execute more than one application concurrently. Due to this, an RTM is required to carefully manage system resources so that each application can meet its performance targets. When considering concurrently executing applications, the framework provides a mechanism to identify and manage them simultaneously, enabling inter-application tradeoffs by the RTM.
Types: Knobs and monitors each have a type selectable from a discrete set of options, e.g. TEMP for a temperature monitor or FREQ for a frequency knob. This represents a compromise between complete agnosticism and the full provision of information. Providing “hints” to the RTM simplifies the process of determining the function of knobs and the properties represented by monitors, e.g. “lower power is better.”
Spaces: All knob and monitor values are expressed in standardised, unitless formats to maintain application and device agnosticism. The proposed framework allows discrete- and continuous-valued versions of each knob and monitor so that appropriate optimisation processes can be used by the RTM. These spaces enable the translation of application-specific information into agnostic sets, as shown in Figure 1 for the ranges of the knobs and monitors. Discrete versions use signed integer values while their continuous counterparts operate using floating-point data.
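As an illustration of translating device-specific values into an agnostic discrete space, the sketch below maps a table of operating frequencies onto the unitless indices that an RTM would see. The class name `FreqKnobSpace` and any frequency values passed to it are assumptions made for this example only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical device-side helper: the RTM only ever sees indices in
// [min(), max()], while the driver translates an index back to a frequency.
class FreqKnobSpace {
public:
    explicit FreqKnobSpace(std::vector<int> mhz) : mhz_(std::move(mhz)) {}

    int min() const { return 0; }
    int max() const { return static_cast<int>(mhz_.size()) - 1; }

    // Translate an agnostic index into a frequency in MHz, clamping
    // out-of-range requests. Returns 0 if the table is empty.
    int to_mhz(int index) const {
        if (mhz_.empty()) return 0;
        if (index < min()) index = min();
        if (index > max()) index = max();
        return mhz_[static_cast<std::size_t>(index)];
    }

private:
    std::vector<int> mhz_;  // device-specific operating points
};
```

A ten-entry table would reproduce the {0, 1, ..., 9} DVFS knob described earlier, without the RTM needing to know the underlying frequencies or units.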
Adaptability: In order to provide maximal flexibility, all bounds and weights are adjustable at runtime, and no restrictions are placed on when updates to these can occur. Most commonly, applications create their knobs and monitors before being executed; however, no limitation is imposed on such events occurring partway through application execution instead. Applications are allowed to be attached to and detached from the framework at any point during runtime. This capability is in contrast to existing frameworks, most of which assume a constant application set, contrary to the typical use of many embedded systems.
3.2 API Specification
The proposed framework is realised through novel API calls that connect the system layers of Figure 1 and enable the exposure of knobs and monitors between them in a consistent manner across applications and hardware platforms. Table 2 illustrates how the API functions are split into application (app) and device (dev) categories, with subcategories for knob (knob) and monitor (mon) interaction. Discrete- (disc) and continuous-valued (cont) versions exist across the API to indicate knob and monitor typology.
The RTM must be made aware of the allowable and desired values for knobs and monitors, respectively, in order to ensure that its optimisations have positive effects. For knobs, functions app_knob_(disc|cont)_(min|max)() facilitate this, letting the application indicate the range in which values can be chosen. Conversely, monitor functions app_mon_(disc|cont)_(min|max|weight)() allow the setting of RTM objectives, with *_min() and *_max() functions indicating desired lower and upper bounds. Where an application requires only a maximum or minimum bound, the other end of the range can be left unbounded using (DISC|CONT)_MIN or (DISC|CONT)_MAX. Intra-application weighting values between 0.0 and CONT_MAX can be used to indicate relative monitor importance to the RTM using *_weight() functions, guiding its optimisations. All of these settings can be updated during application execution if required. Functions app_knob_(disc|cont)_get() and app_mon_(disc|cont)_set() are used by the application to get the current value of a knob from the RTM and set a value for a monitor to the RTM, respectively. The timing of these actions is application-controlled.
Device-layer knobs and monitors are exposed and updated via the RTM-to-device API functions, as shown in the lower half of Table 2. Functions dev_knob_(disc|cont)_(min|max)() are equivalent to their application-layer counterparts, setting ranges of valid values. Additional functions dev_knob_(disc|cont)_(type|init)() return the type of the knob or its initial value, i.e. that from which the RTM starts its exploration. Type-related functions return values from defined sets and are called by the RTM using dev_mon_(disc|cont)_type(). The RTM uses functions dev_knob_(disc|cont)_set() and dev_mon_(disc|cont)_get() for setting device knob values and accessing monitor values and bounds from the device at runtime.
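To make the push and pull directions of the application-to-RTM calls concrete, the following is a minimal in-memory mock. It is a sketch under stated assumptions rather than the released API: the method names echo the identifiers in Table 2, but the class `MockAppApi`, the string keys, the signatures and the two RTM-side helpers (`rtm_choose`, `mon_satisfied`) are all hypothetical.

```cpp
#include <limits>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the application-to-RTM interface.
class MockAppApi {
public:
    // Application side: declare knob ranges and monitor objectives.
    void knob_disc_min(const std::string& knob, int min) { knobs_[knob].min = min; }
    void knob_disc_max(const std::string& knob, int max) { knobs_[knob].max = max; }
    void mon_cont_min(const std::string& mon, double min) { mons_[mon].min = min; }
    void mon_cont_max(const std::string& mon, double max) { mons_[mon].max = max; }
    void mon_cont_weight(const std::string& mon, double w) { mons_[mon].weight = w; }

    // Application side: push a monitor value, pull a knob value.
    void mon_cont_set(const std::string& mon, double value) { mons_[mon].val = value; }
    int knob_disc_get(const std::string& knob) { return knobs_[knob].val; }

    // RTM side of the mock: choose a knob value, clamped to the allowed range.
    void rtm_choose(const std::string& knob, int value) {
        Knob& k = knobs_[knob];
        k.val = value < k.min ? k.min : (value > k.max ? k.max : value);
    }

    // RTM side of the mock: is the monitor inside its desired range?
    bool mon_satisfied(const std::string& mon) {
        const Mon& m = mons_[mon];
        return m.val >= m.min && m.val <= m.max;
    }

private:
    struct Knob { int min = 0; int max = 0; int val = 0; };
    struct Mon {
        double min = std::numeric_limits<double>::lowest();  // unbounded below by default
        double max = std::numeric_limits<double>::max();     // unbounded above by default
        double weight = 1.0;
        double val = 0.0;
    };
    std::map<std::string, Knob> knobs_;
    std::map<std::string, Mon> mons_;
};
```

An image filter could, for instance, declare a "precision" knob with range 0 to 1 and an "fps" monitor with a minimum of 25, push its measured frame rate with `mon_cont_set`, and read its precision setting back with `knob_disc_get` whenever it chooses to.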
4 EVALUATION
In order to demonstrate the capabilities of the framework and validate its operation, a series of experiments have been carried out. Illustrative RTMs were used where appropriate to demonstrate specific concepts. The experimental setup is discussed in Section 4.1, after which the framework’s basic operation and ease of use are exemplified in Section 4.2. Application agnosticism is shown throughout this section while platform agnosticism is demonstrated in Section 4.3 with the same application-RTM pair executing on two different heterogeneous platforms. Support for concurrent applications is shown in Section 4.4, with two different applications executing on one platform. The ability of the framework to enable direct comparison of RTMs is shown in Section 4.5 with two recently reported runtime management approaches. Finally, framework overheads are analysed in Section 4.6.
4.1 Experimental Setup
Two heterogeneous embedded platforms were used
to demonstrate the proposed framework. The
Odroid-XU3 development board, containing an ARM
big.LITTLE architecture with two quad-core CPU
clusters and a GPU, was used to demonstrate the ease
of use of the framework, the direct comparison of
RTMs and to assess overheads. The platform contains
five temperature sensors to monitor the CPU and GPU
and four power sensors to monitor each CPU cluster,
the GPU and memory. Each of these was exposed
to the framework as a device monitor. Three device
knobs were exposed to provide DVFS for each CPU
cluster and the GPU. Table 3 summarises the knobs
and monitors of the Odroid-XU3.
A Cyclone V SoC Development Kit was used to demonstrate platform-agnostic operation of the framework. This platform includes a heterogeneous CPU-FPGA system-on-chip containing two ARM CPUs and FPGA fabric. Using OpenCL, applications can execute on either the CPUs or the FPGA.
Four different applications from the numerical and
multimedia domains were used to demonstrate the
application-agnostic properties of the framework.
4.2 Agnostic Runtime Management
A basic controller was implemented within the runtime management layer to illustrate the use of knobs and monitors for maintaining an application performance target while optimising a given device monitor. Listing 1 shows the code for the controller, which ensures that the value of the application performance monitor remains within its bounds. This is achieved by adjusting the device frequency knob in order to avoid violations of the monitor bounds app_perf.min and app_perf.max (lines 6–9). The optimisation of device temperature (line 11) is the secondary objective and is achieved by decrementing the frequency knob (line 12), trading off excess application performance (lines 12–13).
Table 3: Device-level knobs and monitors for Odroid-XU3.
Const. | Space | Type | For | No.
-------|-------|------|-----|----
knob | disc | FREQ | LITTLE cluster | 1
knob | disc | FREQ | big cluster | 1
knob | disc | FREQ | GPU | 1
mon | cont | POW | Clusters, RAM, GPU, SoC | 5
mon | cont | TEMP | big cores | 4
mon | cont | TEMP | GPU | 1
mon | disc | PMC | LITTLE cores | 16
mon | disc | PMC | big cores | 24
Listing 1: RTM code for agnostic control and monitoring of application and device knobs and monitors.
 1 void rtm::control_loop() {
 2   while (1) {
 3     temp_mon = dev_api.mon_cont_get(temp_mons[2]);
 4     if (apps.size()) {
 5       app_perf = app_mons_cont[0];
 6       if (app_perf.val < app_perf.min) {
 7         if (freq_knob.val < freq_knob.max) {
 8           freq_knob.val++;
 9           dev_api.knob_disc_set(freq_knob, freq_knob.val);
10       }}
11       else if (temp_mon.val > temp_mon.max) {
12         freq_knob.val--;
13         dev_api.knob_disc_set(freq_knob, freq_knob.val);
14 }}}}
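The policy in Listing 1 can be exercised in isolation by writing a single iteration as a free function over plain structs. This is a hedged sketch, not the framework's code: the `Knob` and `Monitor` types and the function `control_step` are hypothetical, and the check against `freq_knob.min` before decrementing is an added assumption that Listing 1 itself does not contain.

```cpp
// Hypothetical plain-data stand-ins for the framework's knob and monitor objects.
struct Knob { int min; int max; int val; };
struct Monitor { double min; double max; double val; };

// One iteration of a Listing 1 style policy.
// Returns the change applied to the frequency knob: +1, -1 or 0.
int control_step(const Monitor& app_perf, const Monitor& temp_mon, Knob& freq_knob) {
    // Primary objective: restore application performance if it is below its bound.
    if (app_perf.val < app_perf.min) {
        if (freq_knob.val < freq_knob.max) {
            ++freq_knob.val;
            return +1;
        }
        return 0;  // already at the highest setting
    }
    // Secondary objective: trade excess performance for lower temperature.
    // The lower-limit check is an assumption added here for safety.
    if (temp_mon.val > temp_mon.max && freq_knob.val > freq_knob.min) {
        --freq_knob.val;
        return -1;
    }
    return 0;
}
```

Calling `control_step` repeatedly with fresh monitor readings reproduces the loop body; in the framework, the knob change would additionally be pushed to the device through the RTM-to-device API.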
The behaviour of this controller is shown in Figure 2 while running a numerical benchmarking application (Whetstone). This benchmark performs numerical functions using integer and floating-point arithmetic. Its performance is measured in thousands of Whetstone instructions per second (KIPS), which is exposed as a continuous monitor with bounds of [2.30, ∞). Initially, the controller set the device frequency to maximum and observed the device temperature. As the temperature increased above the maximum threshold specified by temp_mon.max (80 °C), the controller reduced the frequency until the temperature was below the threshold whilst ensuring that the application performance was higher than app_perf.min. After 50 seconds, the platform reduced its temperature threshold to 60 °C and the RTM reduced the frequency in response until the updated monitor bound was satisfied while still meeting the application throughput requirement.
[Figure 2 plot: three panels against time (s), showing performance (KIPS) with app_perf.min marked, frequency (GHz), and temperature (°C) with temp_mon.max marked.]
Figure 2: Device temperature optimisation under application performance constraints using the controller RTM, including dynamic adjustment of the temperature threshold from 80 to 60 °C.
Figure 3: Design-space exploration of the Jacobi application across the Odroid-XU3 and Cyclone V devices.
This experiment demonstrates the basic operation of the framework and illustrates the dynamic nature of its knobs and monitors. The controller is application- and platform-agnostic as it could operate, without modification, with any application that exposes a performance monitor and any platform that exposes a frequency knob and temperature monitor.
4.3 Platform Agnosticism
The portability of RTMs and applications implemented within the framework is demonstrated in Figure 3, which shows the design-space exploration (DSE) of the same application across two heterogeneous platforms using the same RTM code. A Jacobi iterative solver was used as a case-study application.
The Jacobi method solves the system of N linear equations Ax = b, where A is an N × N matrix and x and b are N × 1 column vectors. If A is decomposed into diagonal and remainder components D and R, under suitable conditions x can be computed iteratively, with later iterations containing more accurate results. The application can operate a tradeoff between the speed of calculation (solves per second) and the accuracy of the result (mean squared error) by adjusting the number of iterations performed and the precision of the data type.
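The standard Jacobi update is x(k+1) = D⁻¹(b − R x(k)), applied element by element. The sketch below shows how the two knobs described above could map onto code: the iteration count is a function argument and the precision is a template parameter (float or double). It is an illustrative implementation, not the application used in the paper; the function names and the choice of a zero initial guess are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Jacobi iteration for Ax = b, starting from x = 0.
// T is the precision knob (float or double); iterations is the iteration knob.
template <typename T>
std::vector<T> jacobi(const std::vector<std::vector<T>>& A,
                      const std::vector<T>& b,
                      int iterations) {
    const std::size_t n = b.size();
    std::vector<T> x(n, T(0));
    std::vector<T> next(n, T(0));
    for (int k = 0; k < iterations; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            T sum = T(0);  // accumulates the off-diagonal (R) contribution
            for (std::size_t j = 0; j < n; ++j) {
                if (j != i) sum += A[i][j] * x[j];
            }
            next[i] = (b[i] - sum) / A[i][i];  // divide by the diagonal (D) entry
        }
        x.swap(next);
    }
    return x;
}

// Mean squared error of a solution against a reference, usable as an accuracy monitor.
template <typename T>
double mean_squared_error(const std::vector<T>& x, const std::vector<double>& ref) {
    if (x.empty()) return 0.0;
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = static_cast<double>(x[i]) - ref[i];
        total += d * d;
    }
    return total / static_cast<double>(x.size());
}
```

For a diagonally dominant system, running more iterations lowers the error at the cost of fewer solves per second, which is the throughput-accuracy tradeoff an RTM would explore through the two knobs.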
Throughput and accuracy were exposed as monitors while iterations to perform and precision were exposed as knobs. The DSE extended to application execution on the heterogeneous components of both platforms, including the GPU on the Odroid and the FPGA on the Cyclone V, in addition to the CPUs. Points in Figure 3 show the resultant throughput and error for each combination of knob values, with blue crosses for the Odroid and green triangles for the Cyclone V. This experiment demonstrates that the same application and RTM code can be used on any platform supported within the proposed framework.
4.4 Concurrency Management
This subsection demonstrates how the framework supports the management of concurrently executing applications. A runtime control algorithm was implemented with a target of keeping the throughput monitor of each application within its bounds, app_perf.min and app_perf.max, while minimising device frequency. The behaviour of this controller is shown in Figure 4, where the execution of two applications is indicated by their throughput over time. The top plot shows a video filtering application and the middle plot shows the Jacobi iterative solver.
[Figure 4 plot: three panels against time (s), showing video filter throughput (frames per second) and Jacobi throughput (solves per second), each with app_perf.min and app_perf.max marked, and frequency (GHz).]
Figure 4: Runtime management of the throughput of two concurrently-executing applications through the framework. The Jacobi application begins execution at 21 seconds and the device frequency is adjusted to compensate.
Initially, the video filter application was the only application executing. As a result, the runtime controller adjusted the CPU frequency to meet the application throughput bounds at the lowest frequency possible. The Jacobi application began its execution after 21 seconds, shortly after which the RTM observed that its throughput was below the desired minimum bound. The throughput of the video filter also decreased due to competition for device resources. To compensate, the controller increased the CPU frequency such that the throughput of both applications returned to within their bounds.
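A controller with this behaviour could be sketched as follows. The exact algorithm used in the experiment is not given in the text, so the policy here is an assumption: raise the frequency if any application is below its minimum bound, otherwise lower it if any application exceeds its maximum bound, otherwise hold. The type and function names are hypothetical.

```cpp
#include <vector>

// Hypothetical plain-data stand-ins for per-application throughput monitors
// and the shared device frequency knob.
struct AppMonitor { double min; double max; double val; };
struct FreqKnob { int min; int max; int val; };

// One iteration of an assumed multi-application policy.
// Returns the change applied to the frequency knob: +1, -1 or 0.
int concurrent_step(const std::vector<AppMonitor>& apps, FreqKnob& freq) {
    bool any_below = false;
    bool any_above = false;
    for (const AppMonitor& a : apps) {
        if (a.val < a.min) any_below = true;
        if (a.val > a.max) any_above = true;
    }
    // Unmet minimum bounds take priority over saving energy.
    if (any_below) {
        if (freq.val < freq.max) { ++freq.val; return +1; }
        return 0;
    }
    // No application is starved; shed excess throughput to lower frequency.
    if (any_above && freq.val > freq.min) {
        --freq.val;
        return -1;
    }
    return 0;
}
```

When a second application attaches and either monitor drops below its minimum, the first branch raises the frequency step by step until both are back within bounds, mirroring the response at 21 seconds in Figure 4.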
4.5 Comparison of RTM Approaches
To demonstrate the framework’s optimisation and
comparative capabilities, two state-of-the-art runtime
management approaches were implemented within
the proposed framework. The first approach, RTM-
A (Reddy et al., 2017), aims to optimise power con-
sumption by monitoring hardware performance coun-
ters to identify opportunities where CPU frequency
can be reduced without impacting application perfor-
mance. The second approach, RTM-B (Maeda-Nunez
et al., 2015), employs reinforcement learning to pre-
dict the frequency that should be selected to meet an
application performance target based on previous ap-
plication behaviour. RTM-A was originally evalua-
ted on the Odroid-XU3 platform using standard ben-
chmarks with a reported mean energy saving of 25%
compared to the Linux Ondemand governor. RTM-B
was evaluated on the BeagleBoard-xM platform using
a video decoder application with a reported mean reduction in energy consumption of 30% when compared to the Ondemand governor.
Figure 5: Mean total energy consumed by the Odroid-XU3 running the video decoder application under the control of each RTM, both with and without the framework (FW). The experiment was repeated 50 times for each RTM.
These two approaches lack portability and direct
comparisons cannot be made due to the different plat-
forms used for experimental validation. Implementa-
tion within the proposed framework allows them to
be directly compared, saving development time and
improving the accuracy of the comparison. To de-
monstrate this, the RTMs were evaluated using an
OpenCV video decoding application on the Odroid-
XU3 platform. The application exposes a continuous
monitor for the frame rate, with a minimum bound of
25 frames per second. The RTMs are directly compa-
red in Figure 5, between bars two and four, showing
that the application consumed a mean total energy of
381 J and 376 J under the control of RTM-A and
RTM-B, respectively. Comparison with the Linux
Ondemand governor (bar five) shows energy savings
of 17.2% and 18.2%, respectively. This demonstrates
that while RTM-B achieves a greater energy saving, it is less than reported in the literature for this specific application and platform pair.
PEC 2018 - International Conference on Pervasive and Embedded Computing
Figure 6: Breakdown of the sources of latency introduced by the framework for communication between the RTM and device layers.
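As a consistency check on the Figure 5 comparison, the short sketch below back-calculates the Ondemand energy implied by each RTM's mean energy and quoted saving. The Ondemand energy is not stated in the text and appears only as bar five of Figure 5, so the value computed here is inferred rather than quoted. The function name is illustrative.

```python
def implied_baseline(rtm_energy_j: float, saving: float) -> float:
    """Baseline energy implied by an RTM's energy and its fractional saving.

    Rearranges saving = 1 - rtm_energy / baseline.
    """
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be a fraction in [0, 1)")
    return rtm_energy_j / (1.0 - saving)


# Mean energies and savings as quoted in the text for RTM-A and RTM-B.
for name, energy_j, saving in [("RTM-A", 381.0, 0.172), ("RTM-B", 376.0, 0.182)]:
    baseline = implied_baseline(energy_j, saving)
    print(f"{name}: implied Ondemand energy of about {baseline:.0f} J")
# Both lines report about 460 J, so the two quoted savings are mutually consistent.
```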
4.6 Overheads
As with any abstraction, the framework introduces an
energy overhead due to the additional computation
required. This overhead can be estimated by com-
paring standalone versions of RTM-A and RTM-B
against their implementations within the framework.
Results of these experiments can be seen in Figure 5
for RTM-A (bars one and two) and for RTM-B (bars
three and four). RTM-A required 19.6 J (5.48%) more
energy, while RTM-B required only 15.2 J (4.23%)
more energy, in the minimum case. The minimum
case was used to minimise the impact of other running processes on the result. Even with this overhead, both RTMs still achieved significant energy savings compared to the Ondemand governor.
The framework also introduces latency overheads
that limit RTM reaction rates. Figure 6 is a visuali-
sation of the steps involved in reading a device moni-
tor inside the framework, from which seven internal
latency sources can be identified. $t_{asm}$, $t_{tx}$ and $t_{diss}$ are the times to assemble, transmit and disassemble a message used for conveying monitor information. $t_{net}$ is the message-passing interface latency and $t_{search}$ is the time to search for and read a monitor.
The latency related to each API call was measu-
red and found to be 80–200 µs, with 40% attributed
to cross-layer communication. For an RTM reading
one device monitor and setting one device knob per
update, this limits the update rate to 1.67 kHz.
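The relationship between latency and update rate can be sketched as follows, assuming the steps of an update run back to back so that the update period is simply the sum of their latencies. A rate of 1.67 kHz corresponds to a period of roughly 600 µs. The text does not break this period down into individual steps, so the three 200 µs values used below are an assumed decomposition that happens to reproduce the figure, not one confirmed by the source. The function names are illustrative.

```python
from typing import Sequence


def max_update_rate_hz(step_latencies_us: Sequence[float]) -> float:
    """Upper bound on the update rate when steps execute sequentially."""
    period_us = sum(step_latencies_us)
    if period_us <= 0:
        raise ValueError("total latency must be positive")
    return 1e6 / period_us


def update_period_us(rate_hz: float) -> float:
    """Update period, in microseconds, for a given update rate."""
    return 1e6 / rate_hz


# Period implied by the reported 1.67 kHz limit.
print(f"{update_period_us(1670):.0f} us")  # prints 599 us

# Assumed decomposition: three worst-case 200 us steps per update.
rate = max_update_rate_hz([200.0, 200.0, 200.0])
print(f"{rate / 1000:.2f} kHz")  # prints 1.67 kHz
```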
5 CONCLUSIONS
This paper has presented a framework that enables application- and platform-agnostic runtime management of concurrently executing applications on heterogeneous multi-core systems. This is achieved by visualising a system as three distinct layers connected by dynamic knobs and monitors that allow a range of tunable parameters and observable metrics to be exposed. Framework operation with concurrent applications has been demonstrated. The framework enables the direct comparison of competing RTM approaches, which was not previously possible, and simplifies RTM development. It also introduces modest overheads, measured at roughly 4.2–5.5% additional energy for the two RTMs evaluated and 80–200 µs of latency per API call, that have limited impact on the operation and performance of RTMs. An open-source C++ implementation is available at https://github.com/PRiME-project/PRiME-Framework.
In addition to the experiments presented in this pa-
per, the framework has been used to explore tempera-
ture variability of a heterogeneous platform for relia-
bility modelling (Tenentes et al., 2017) and to demon-
strate how application knobs and monitors can pro-
vide additional opportunities for system optimisation
(Leech et al., 2018a). Research is ongoing to provide
further validation of the framework and to integrate
additional applications, devices and RTMs.
ACKNOWLEDGEMENTS
This work was supported by the PRiME programme
grant EP/K034448/1 (http://www.prime-project.org)
and EPSRC grant EP/L000563/1.
Data supporting the results presented in this paper are openly available from the University of Southampton repository at https://doi.org/10.5258/SOTON/D0565.
An open-source implementation of the framework can be found at https://github.com/PRiME-project/PRiME-Framework.
The authors would like to thank Joshua M. Levine
and James R. B. Bantock for their role in the initial
development of the PRiME Framework methodology
and API. The authors would like to acknowledge Mo-
hammad Sadegh Dalvandi and Basireddy Karunakar
Reddy for contributions to the experimental results
and the development of runtime algorithms.
REFERENCES
Baldassari, A., Bolchini, C., and Miele, A. (2017). A Dy-
namic Reliability Management Framework for Hete-
rogeneous Multicore Systems. In IEEE Internatio-
nal Symposium on Defect and Fault Tolerance in VLSI
and Nanotechnology Systems.
Das, A., Shafik, R. A., Merrett, G. V., Al-Hashimi, B. M.,
Kumar, A., and Veeravalli, B. (2014). Reinforcement
Learning-based Inter- and Intra-application Thermal
Optimization for Lifetime Improvement of Multicore
Systems. In Design Automation Conference.
Fleming, S. T. and Thomas, D. B. (2014). Heterogeneous
Heartbeats: A Framework for Dynamic Management
of Autonomous SoCs. In International Conference on
Field-Programmable Logic and Applications.
Gadioli, D., Palermo, G., and Silvano, C. (2015). Appli-
cation Autotuning to Support Runtime Adaptivity in
Multicore Architectures. In International Conference
on Embedded Computer Systems: Architectures, Mo-
deling and Simulation.
Gaspar, F., Taniça, L., Tomás, P., Ilic, A., and Sousa, L.
(2015). A Framework for Application-guided Task
Management on Heterogeneous Embedded Systems.
ACM Transactions on Architecture and Code Optimi-
zation, 12(4).
Hoffmann, H., Eastep, J., Santambrogio, M. D., Miller,
J. E., and Agarwal, A. (2010). Application Heartbe-
ats: A Generic Interface for Specifying Program Per-
formance and Goals in Autonomous Computing Envi-
ronments. In International Conference on Autonomic
Computing.
Hoffmann, H., Maggio, M., Santambrogio, M. D., Leva,
A., and Agarwal, A. (2013). A Generalized Software
Framework for Accurate and Efficient Management of
Performance Goals. In International Conference on
Embedded Software.
Hoffmann, H., Sidiroglou, S., Carbin, M., Misailovic, S.,
Agarwal, A., and Rinard, M. (2011). Dynamic Knobs
for Responsive Power-aware Computing. In Interna-
tional Conference on Architectural Support for Pro-
gramming Languages and Operating Systems.
Kim, Y. G., Kim, M., and Chung, S. W. (2017). Enhan-
cing Energy Efficiency of Multimedia Applications in
Heterogeneous Mobile Multi-core Processors. IEEE
Transactions on Computers, 66(11).
Leech, C., Bragg, G. M., Balsamo, D., Wachter, E., Mer-
rett, G. V., and Al-Hashimi, B. M. (2018a). Ap-
plication Control and Monitoring in Heterogeneous
Multiprocessor Systems. In International Symposium
on Reconfigurable Communication-centric Systems-
on-Chip.
Leech, C., Kumar, C., Acharyya, A., Yang, S., Merrett,
G. V., and Al-Hashimi, B. M. (2018b). Runtime per-
formance and power optimization of parallel disparity
estimation on many-core platforms. ACM Transacti-
ons on Embedded Computing Systems, 17(2).
Maeda-Nunez, L. A., Das, A. K., Shafik, R. A., Mer-
rett, G. V., and Al-Hashimi, B. (2015). PoGo: An
Application-specific Adaptive Energy Minimisation
Approach for Embedded Systems. In HiPEAC Workshop on Energy Efficiency with Heterogeneous Computing.
Paone, E., Gadioli, D., Palermo, G., Zaccaria, V., and Sil-
vano, C. (2014). Evaluating Orthogonality between
Application Auto-tuning and Run-time Resource Ma-
nagement for Adaptive OpenCL Applications. In In-
ternational Conference on Application-specific Sys-
tems, Architectures and Processors.
Rahmani, A. M., Haghbayan, M. H., Miele, A., Liljeberg,
P., Jantsch, A., and Tenhunen, H. (2017). Reliability-
aware runtime power management for many-core sy-
stems in the dark silicon era. IEEE Transactions on
Very Large Scale Integration Systems, 25(2).
Reddy, B. K., Singh, A. K., Biswas, D., Merrett, G. V.,
and Al-Hashimi, B. M. (2017). Inter-cluster Thread-
to-core Mapping and DVFS on Heterogeneous Multi-
cores. IEEE Transactions on Multi-scale Computing
Systems, PP(99):1–1.
Singla, G., Kaur, G., Unver, A. K., and Ogras, U. Y. (2015).
Predictive dynamic thermal and power management
for heterogeneous mobile platforms. In Design, Automation & Test in Europe.
Sui, X., Lenharth, A., Fussell, D. S., and Pingali, K. (2016).
Proactive Control of Approximate Programs. In In-
ternational Conference on Architectural Support for
Programming Languages and Operating Systems.
Tenentes, V., Leech, C., Bragg, G. M., Merrett, G., Al-
Hashimi, B. M., Amrouch, H., Henkel, J., and Das,
S. (2017). Hardware and Software Innovations in
Energy-efficient System-reliability Monitoring. In
IEEE International Symposium on Defect and Fault
Tolerance in VLSI and Nanotechnology Systems.
Vassiliadis, V., Chalios, C., Parasyris, K., Antonopoulos,
C. D., Lalis, S., Bellas, N., Vandierendonck, H., and
Nikolopoulos, D. S. (2016). Exploiting Significance
of Computations for Energy-constrained Approximate
Computing. International Journal of Parallel Pro-
gramming, 44(5).
Wang, Z., Tian, Z., Xu, J., Maeda, R. K. V., Li, H., Yang,
P., Wang, Z., Duong, L. H. K., Wang, Z., and Chen,
X. (2017). Modular Reinforcement Learning for Self-
adaptive Energy Efficiency Optimization in Multicore
System. In Asia and South Pacific Design Automation
Conference.
Yang, S., Shafik, R. A., Merrett, G. V., Stott, E., Levine,
J. M., Davis, J., and Al-Hashimi, B. M. (2015). Adap-
tive Energy Minimization of Embedded Heterogene-
ous Systems using Regression-based Learning. In In-
ternational Workshop on Power and Timing Modeling,
Optimization and Simulation.