Automatic Evaluation and Partitioning of Algorithms for Heterogeneous Systems
Simon Heimbach (https://orcid.org/0009-0002-4758-0690) and Stephan Rudolph (https://orcid.org/0009-0006-0773-1713)
Institute of Aircraft Design, University of Stuttgart, Pfaffenwaldring 31, 70569 Stuttgart, Germany
{heimbach, rudolph}@ifb.uni-stuttgart.de
Keywords:
Heterogeneous Computing, Algorithmic Partitioning, Graph-Based Design Language, Code-Generation,
Data-Links.
Abstract:
The ever-growing demand for performance and power efficiency can only be met by multiple specialised compute engines for single tasks, while cost and time-to-market constraints force the development of programmes for a single known micro-controller or of a configuration for an FPGA. With our proposition, an executable logic can be designed in one integral project development effort and then partitioned by an algorithm across different compute engines depending on the user's demand, thus generating a heterogeneous system. The timing evaluation is based not only on different sources like data-sheets, simulation and benchmarks but also on the parallelism offered by FPGAs. With exporters, the code for these different devices can be generated automatically, including communication channels between them to transfer all necessary data. The paper explains the algorithm's fundamentals and demonstrates its benefits using an example algorithm running on a micro-controller paired with an FPGA. This shows that not only the algorithm but also the amount of data processed is crucial for balancing a heterogeneous system.
1 INTRODUCTION
To leverage the unique abilities of every integrated circuit (IC), it is possible to split a programme and its algorithms into tasks optimised for different architectures and to implement data buses between them. Parts of an algorithm that favour sequential execution can run on a controller, whereas parallelisable tasks can be swapped to an FPGA, while the necessary synchronisation of data is done via the buses. Whenever an embedded system consists of more than one chip, with these chips working collaboratively on the same problem, the system is heterogeneous.
With a finely balanced heterogeneous system, a device can greatly improve its overall data throughput and responsiveness while lowering energy consumption and total cost of ownership. If a designed device is too slow for a given algorithm, developers fundamentally have three options: declining the feature, using a faster and therefore more power-consuming and expensive chip, or splitting the algorithm into two or more tasks and running each one on an ideal compute engine.
However, the adoption of heterogeneous systems is hampered by problems in software development. Not only are the software and configurations of different compute engines programmed or described in various programming languages, each tailored to its specific domain; there are also differences between the libraries offered for chips based on the same language. Take C as an example: the SPI interface works very differently on the micro-controllers Atmega (Microchip Technology Inc., 2020, p. 172), PIC18 (Bujor, 2020, p. 2) and STM32 (STMicroelectronics, 2012, p. 403). On top of that, chips from different vendors use different tool-chains. As a result, developers need a broad understanding of different languages and tools to develop for a specific architecture.
At the beginning of the development, the exact algorithms and data-structures are mostly unidentified, and thus the perfect splits between different architectures are also unknown. This is aggravated by the fact that many products are developed as families with cost-sensitive and high-performance options, which often demand several divergent boundaries (Streitferdt et al., 2005). In a hardware-software project, either deep knowledge of the advantages of the available hardware and a clear understanding of the ideal hardware topology is given right from the beginning, or time-consuming benchmarks require rewriting code for different architectures in order to find an 'optimal' combination.
2 BACKGROUND AND RELATED WORK
Industry's aspiration to support different accelerators from a single source for embedded devices, scientific research and massively parallel computing has been continuous over the past years. For static analysis, four main development branches have evolved from these pursuits: cross-compilers, just-in-time compilers (JIT), hardware-abstraction layers (HAL) and Model-Based Systems Engineering (MBSE) software. Dynamic approaches schedule tasks between multiple processors of the same type and can achieve a low-latency, fail-safe system for automotive and aerospace applications. Each proposal offers a solution for aspects of the problems discussed here, but lacks the focus on multiple different architectures in heterogeneous embedded systems and the automated generation of the necessary source code for each target type.
Impulse-C, as an example of a cross-compiler, tries to translate ANSI-C code into RTL for FPGAs. It is often used for image algorithms, as it can significantly reduce the development time, as shown by Xu, Subramanian, Alessio and Hauck (Xu et al., 2010). However, it is not possible to balance software for an embedded system between multiple devices. Also, the website of the original developer, Impulse Accelerated Technologies (https://impulseaccelerated.com/), is unfortunately not available anymore, so the future of Impulse-C is uncertain.
JIT compilers are used on performance computers like notebooks, PCs and servers, in the form of peripheral drivers (e.g. for GPUs) or programming languages (e.g. the JVM). These solutions are developed by the hardware manufacturers to enable users to take full advantage of their products. A more unified approach is offered by AMD with ROCm and by Intel with oneAPI, but these solutions are aimed at high-performance computing and cannot be adopted for embedded systems. Java has a niche role in scientific computing and can also be swapped to GPU and FPGA with an OpenCL core (TornadoVM), but it is not widely used for microelectronics, as the lack of low-level support and the resource-intensive JVM make an implementation on small components unfeasible.
A hardware-abstraction layer (HAL), and likewise drivers, consist of functions that offer access to hardware without the need for knowledge of the exact operations. This enables programming different models of hardware with the same source code, with the ARM ecosystem being a great example. Operating systems and drivers take this one step further and offer interfaces that can be used dynamically. Programmes do not need to be recompiled and can communicate over a standardised format. Many major operating systems are built upon this principle.
All these discussed solutions, however, have in common that the developer has to take care of partitioning the programme into chunks in order to run those on different devices. Automated approaches have also been developed in the past. There has likewise been academic work that focuses on certain aspects of heterogeneous computing, such as compute architecture, memory models and dynamic data transfer.
Lilja (Lilja, 1992) researches splitting a programme into tasks and scheduling those across a high-speed network between two computers, one based on a regular CPU and the other equipped with a vector machine. He shows that this approach can accelerate the execution by more than a thousandfold, but that the gain highly depends on the type of task and the amount of data to be processed and shared across the network.
SymTA/S (Symbolic Timing Analysis for Systems) analyses a design space for distributed workloads on multiprocessor system-on-chip designs (MpSoCs) (Hamann et al., 2006). It evaluates events and allocates them over multiple processor nodes for ideal latency. In a closely related work, tasks have been mapped and scheduled over busses for embedded systems (Ferrandi et al., 2010). In both cases, the targets and potential code-generators are unknown to the authors.
Ali et al. (Ali, 2012) describe in detail the costs of communication between different computers in a network with shared memory. Upon these models, they experiment on the scalability of their approach. Even though they mention CPU and GPU co-compute, it is unclear if and how they manage to run tasks on different types of compute architectures. The issue of memory overhead in distributed computing networks has been the focus of Xie, Chen, Liu, Wei, Li and Li (Xie et al., 2017). In this approach, multiple processors of the same type have been put in a network and their given tasks dynamically scheduled between them.
The work presented in this paper implements a holistic solution with a single-source description of the desired system, an analytical partitioning onto optimal architectures, and exporters to generate the compulsory source code for the chosen devices. A heterogeneous system for embedded devices can automatically be generated. Moreover, users are able to create families of hardware with different architectures to meet the demand for high-performance and budget options with a single design.

Figure 1: Big picture: Operations of an AST modelled by a user (left) are evaluated and split into sub-ASTs, each for its perfect device (partitioning). When a data-movement is split, a link via a bus is automatically injected. For each AST, source code is generated by the corresponding exporter, and the compilation and upload process is triggered.
3 ABSTRACT LOGIC
A generic approach to an embedded system with various devices linked via busses in a single logic is introduced here. A model, consisting of the operations that describe the logic of such a system, can be exported to different architectures without any modification and will behave in the same way on all devices. The user does not need to rewrite code when changing the hardware, but only needs to export it for a different target. In a second step, an algorithm has been implemented to even split the logic between different devices, strictly based on performance metrics and without any manual work from users. Because this approach is abstract, it is called abstract logic.
Commonly used operations for programming embedded systems have been abstracted in a class diagram and can be instantiated and connected by the user. These linked instances are generally called a graph, or in a tighter defined subset an abstract syntax tree (AST for short), like they are used in compilers and lexers. Operations are connected via links like nextop to model a sequential statement, or first to enter a block, e.g. in functions or loops. Each operation can have multiple relations to other instances, for example as arguments. In total, these ASTs represent the whole logic of algorithms and programmes.
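As a rough illustration of such an operation node, consider the following C sketch. This is an editorial reconstruction under assumptions, not the authors' class diagram; only the links nextop and first are named in the paper, all other fields are hypothetical.

#include <stddef.h>

/* Hypothetical sketch of an abstract-logic operation node. */
typedef enum { OP_ASSIGN, OP_ADD, OP_CALL, OP_LOOP, OP_IO } OpKind;

typedef struct Operation {
    OpKind kind;              /* which abstract operation this node models */
    struct Operation *nextop; /* link to the next sequential statement     */
    struct Operation *first;  /* first operation inside a block, e.g. of a
                                 function or loop (NULL if none)           */
    int reads[4];             /* ids of variables consumed as arguments    */
    size_t nreads;
    int writes;               /* id of the variable written, -1 if none    */
} Operation;

Walking nextop from a function's entry point visits the whole linear statement list, while first descends into nested blocks.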
In Figure 1, the big picture of the system of abstract logic is shown in a simplified manner. Developers are able to model their system in a singular AST (left) that can be partitioned by an algorithm into an optimal heterogeneous system based on requirements such as runtime, energy consumption or cost. In the first step (evaluation), all constraints are evaluated and their containing functions are tied to the chosen chip. Such a constraint can be a bound GPIO, for example. In addition, every remaining function of the AST gets benchmarked with regard to the requirement for every available architecture, predicated on publicly available data such as data-sheets, logic derived from formal analysis, and runtime measurements.
The fundamental idea for benchmarking operations in an AST is picked from E. W. Dijkstra's 1959 paper "A Note on Two Problems in Connexion with Graphs" (Dijkstra, 1959). The presented algorithm finds the shortest path between two vertices in a graph structure. A numerical weight representing a distance is added to the edges. When the weights of connected edges are summed up, the total distance between the two vertices is found. The algorithm only ever follows the temporarily shortest path to ensure that the best solution is found. This concept can be used for path-finding and routing.
However, two major aspects differ in this paper:
- Dijkstra added the weights to the edges, emphasising the distance between two vertices. In our approach, the vertices themselves are weighted, accentuating the runtime of an operation.
- Dijkstra took advantage of pursuing the temporarily shortest path, like other greedy algorithms. This is not needed here, as the AST under analysis is linear and all its elements need to be evaluated.
When the algorithm is triggered, the given AST and device topology are fed into the performance algorithm, and every function is analysed with regard to its runtime by penalising all operations in the AST. These penalties are stored in databases and sourced either from data-sheets, runtime analysis or simulation.
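Unlike Dijkstra's shortest-path search, the evaluation here is a plain summation of vertex weights over the linear AST. A minimal sketch of that traversal, assuming the Operation node above and a hypothetical database lookup penalty_for():

#include <stddef.h>

typedef int DeviceId;

/* Stub for the penalty database (values sourced from data-sheets,
 * runtime analysis or simulation). */
extern double penalty_for(OpKind kind, DeviceId dev);

/* Sum the runtime penalties of every operation in a function's AST
 * for one candidate device: the vertices, not the edges, are weighted.
 * Note: loop bodies would additionally be scaled by their trip count,
 * see Section 3.3. */
double estimate_runtime(const Operation *op, DeviceId dev) {
    double total = 0.0;
    for (; op != NULL; op = op->nextop) {
        total += penalty_for(op->kind, dev);
        if (op->first != NULL)          /* descend into nested blocks */
            total += estimate_runtime(op->first, dev);
    }
    return total;
}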
For a full evaluation of the performance on multiple devices, an additional penalty for the necessary data-transfer is added. Only if the sum of the runtimes of all operations and of the data-transfer provides an uplift is the function split from the original AST (partitioning), with a new entry-point linked to an instance of the chosen device to swap to. In the third step (injection), the necessary data-transfers between the chosen devices are automatically generated and embedded into the model.
This rebalanced AST is then handed to the exporters, which generate source code for their respective devices, such as micro-controllers, micro-processors and FPGAs. The injected interfaces allow the exporters to replace in-software function-calls with driver-calls for the respective interfaces, called target-communication (TC). If desired, the corresponding tool-chains are invoked and the devices are programmed automatically. The abstract logic is therefore interchangeable between different architectures and can benefit from their unique advantages.
3.1 Sources for Time Estimation
The time required for an operation can be expressed in two units: time t and clock ticks clks. The data is standardised by multiplying the measured time by the frequency of the executing chip (clks = t · f). Dividing this value by the frequency again yields the correct time span (t = clks / f).
Data-Sheets: The first source of needed clock counts are the data-sheets from the manufacturers of a device. Take Microchip (formerly Atmel) and their AVR family of micro-controllers as an example. The operation CALL needs 4 cycles on an AVR with a 16-bit programme counter (like the Atmega328) (Atmel, 2016, p. 63). When running the chip at a frequency of 1 MHz, the resulting time span is 4 µs. It has been shown that estimating the runtime from the AST results in only a small divergence from the real runtime on the device.
Benchmarks: If an operation consists of many instructions and perhaps even branches, the time span cannot easily be modelled from its underlying instructions. A more practical approach is to benchmark the operation on a device. Inside a loop, the operation is executed multiple times and the time span is measured accordingly. Two drawbacks come with this approach: 1) the measurement itself is only accurate to a certain degree, and 2) different parameters of an operation can lead to different execution times. A well-balanced methodology has to be chosen.
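A minimal sketch of such a measurement loop, assuming a hypothetical free-running cycle counter read_cycles() and the operation under test wrapped in op_under_test(); these helper names are ours, not part of the tool:

#include <stdint.h>

extern uint32_t read_cycles(void);  /* hypothetical cycle counter      */
extern void op_under_test(void);    /* abstract-logic operation to test */

/* Average cycle cost of one operation over n repetitions; averaging
 * amortises, but does not remove, the measurement overhead (caveat 1). */
uint32_t benchmark_op(uint32_t n) {
    uint32_t start = read_cycles();
    for (uint32_t i = 0; i < n; i++)
        op_under_test();
    uint32_t elapsed = read_cycles() - start;
    return elapsed / n;   /* loop overhead is still included */
}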
Assembly: Benchmarking, however, does not function properly on more complex architectures like the Cortex-M4. For a more fine-grained estimation, the generated source code needs to be compiled into assembler code, which is then read back into the algorithm by a lexer. For most assembler mnemonics, a specific number of clocks can be derived from data-sheets; for the few remaining mnemonics, a good estimation can be drawn.
3.2 Parallelism on FPGA
When exporting the AST to an FPGA, finite-state machines (FSM) need to be modelled to execute sequential operations. All operations are tested for their linear independence to achieve maximum parallelism. Independence in this context means that the in-going arguments/variables of an operation are not influenced by the results of the previous operations. Blocks of non-influencing operations can then be put into a single state of the FSM, and the transition to its next state happens when all operations have finished.
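The independence test can be pictured as an intersection check between the variables an operation reads and the variables already written by operations placed in the current FSM state. The following C sketch, reusing the assumed Operation node from above, is illustrative only:

#include <stdbool.h>
#include <stddef.h>

/* True if 'op' reads any variable already written in the current state;
 * if false, 'op' can join the same FSM state and run in parallel. */
bool depends_on(const Operation *op, const int *written, size_t nwritten) {
    for (size_t i = 0; i < op->nreads; i++)
        for (size_t j = 0; j < nwritten; j++)
            if (op->reads[i] == written[j])
                return true;
    return false;
}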
A simple example is given in Figure 2a and Figure 2b. The former represents an AST designed by a user. The variable b is set to 1, then the operation a ← b + c is executed. Next, e is set to 2 and the operation d ← e + f is executed. Finally, there is the addition of the terms a and d, resulting in g.
The VHDL-exporter tries to put as many sequential operations into a single state as possible. In this example, a ← b + c and e ← 2 are independent and are therefore reduced to a single state (see Figure 2b). This results in an FSM that can be executed in four clocks, one clock for each state, and therefore a reduction of one clock, or 20%, over the sequential-only AST.
Operations can even be reordered to optimise parallelism. In this example, the term e ← 2 can be moved ahead of a ← b + c because its result is not used in the latter's arguments (see Figure 3a). Operations that write a result to a variable are tracked to ensure that succeeding operations consuming these variables are generated after the result is stored.
An even more advantageous parallelism can be achieved in collection- or count-controlled loops with constant steps and independent operations. Those are often used in vector or matrix operations like addition, or in traversal search algorithms. Such loops are detected by looking for non-parallelisable operations, such as IO-operations and return-statements, and by calculating the dependence of variables as shown in the previous example. If one of these conditions is met, the loop cannot be parallelised and will be generated as an FSM. Otherwise, a parallel FSM is generated over the operations in the loop, but with a replacement of internal signals with variables.
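As a hypothetical C-level illustration (the concrete loop bodies are ours): the first loop below is element-wise independent and qualifies for parallel generation, whereas the second carries a dependence across iterations and would be exported as a sequential FSM.

#define N 100

/* Parallelisable: every iteration touches only its own element. */
void saturate(int data[N], int limit, int value) {
    for (int i = 0; i < N; i++)
        if (data[i] > limit)
            data[i] = value;
}

/* Not parallelisable: each iteration reads the previous result. */
int accumulate(const int data[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += data[i];   /* loop-carried dependence on 'sum' */
    return sum;
}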
Figure 2: From an AST to an FSM.
(a) Sequential AST as modelled by the user: b ← 1; a ← b + c; e ← 2; d ← e + f; g ← a + d.
(b) Generated FSM by the VHDL-exporter with the second and third operation in parallel: [b ← 1] → [a ← b + c ∥ e ← 2] → [d ← e + f] → [g ← a + d].

Figure 3: Optimised AST to improve parallelism on FPGA.
(a) Reordered sequential AST after optimisation: b ← 1; e ← 2; a ← b + c; d ← e + f; g ← a + d.
(b) Generated FSM by the VHDL-exporter of the optimised AST with the independent operations in parallel: [b ← 1 ∥ e ← 2] → [a ← b + c ∥ d ← e + f] → [g ← a + d].

3.3 Analysis and Scheduling

When analysing, the first step is to determine all functions that refer to constraints; those functions are then bound to the given chip by the algorithm. These constraints can be either a GPIO, an ADC or a communication channel, e.g. UART or SPI. The operations contained in all other functions are then evaluated.
For sequential algorithms, the corresponding runtimes from any of the time-estimation sources are added. For conditions, each branch is summed up separately and the highest number is then added. Loops multiply the cost of their contained statements, where the multiplier can either be determined analytically, as in count-controlled loops, or needs to be measured, e.g. in collection-controlled loops with a constant number of elements. In the latter case, a Rust programme is generated and an iteration counter is injected that is incremented with each step. After the loop terminates, the counter is returned and fed back as the multiplier. This method works only on deterministic algorithms and cannot be used on loops waiting for inputs from other devices or users. Many of these cases can be excluded from the evaluation anyway, because requirements inside the containing function (e.g. IO) bind it to a certain device. The still remaining loops are marked as unparallelisable and exported as such.
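The counting programme is generated in Rust by the tool; for consistency with the other sketches in this rewrite, the mechanism is shown here in C, with an assumed loop body:

/* Instrumented copy of a collection-controlled loop: the injected
 * counter 'iters' is returned and fed back as the loop multiplier. */
unsigned long count_iterations(const int *data, unsigned long len,
                               int limit) {
    unsigned long iters = 0;
    for (unsigned long i = 0; i < len; i++) {
        if (data[i] > limit) { /* original loop body would run here */ }
        iters++;               /* injected iteration counter */
    }
    return iters;
}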
In a hypothetical example, the operations IN and OUT are constrained to the MCU, as the signals are bound to its pins, and therefore the function main will be exported for this chip. As an MCU executes operations sequentially, the operations of a function foo are brought into order of execution. In contrast, the exporter for the FPGA parallelises the operations and increases the execution speed. In the case of such a speed-up, the exporter also injects the necessary transmission of the arguments from the MCU to the FPGA and of the return value from the FPGA back to the MCU.
3.4 Communication Between Chips
To call a function that is swapped to a different device, the call needs to be scheduled via the underlying protocol. In the analysis process, a list of all functions is collected, and for each of those, two unique virtual function IDs are assigned: the first for calling the function and the second for fetching the return value. On the calling device, the function call is replaced with operations to send a byte array consisting of the virtual function ID and its arguments via the physical interface. On the target device, a new main function is generated that listens for incoming data, accumulates it and interprets the data-block. A multi-way branch compares the incoming data with each virtual function ID and, in case of a match, calls the corresponding function. This process is called target-communication (TC for short).
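A sketch of the generated dispatcher on the target device, under assumed helper names (tc_recv_block() and the concrete function ID are ours; only the mechanism of ID matching is described in the paper):

#include <stdint.h>

extern uint8_t *tc_recv_block(void);  /* assumed: blocks until a full
                                         data-block has been accumulated */
extern void limiter(void);            /* swapped function, generated
                                         from the AST */

#define VFID_LIMITER_CALL 0x01        /* assumed virtual function ID */

/* Generated main loop on the target device: listen, interpret, dispatch. */
void tc_main(void) {
    for (;;) {
        uint8_t *block = tc_recv_block();  /* ID first, arguments after */
        switch (block[0]) {                /* multi-way branch over IDs */
        case VFID_LIMITER_CALL:
            limiter();                     /* arguments would be unpacked
                                              from block[1..] here */
            break;
        default:
            break;                         /* unknown ID: ignore */
        }
    }
}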
While the function runs on the target device, the caller can execute operations that are independent of the function's result. Once the target has finished executing and a return value is to be transmitted, the target sends the virtual function ID and the return value via the bus to the calling device.
Additionally, the analysis has to take shared global variables and data-structures into account. The latter are containers storing elements that need to be updated and processed in a swapped function. In most cases, the system benefits from placing these data-structures directly on the target device and creating a virtual function call for access, storage or delete operations. On the primary device, these operations are then replaced with operations that transfer the corresponding ID and the value.
3.5 Re-Balancing and Injection
The time span for executing the function foo on the MCU only (t_u) is the sum of the execution times of all operators, the time to call each function, and the time for returning its return value. Each operation is added to the execution time separately, as they are executed sequentially. When swapping the function to the FPGA, the time for the data transfer from main to foo via the UART bus needs to be considered, too. Here, despite the time penalty of the data transfer, the total time for the heterogeneous system (t_h) might be shorter due to the parallelisation of multiple operations. The difference between t_u and t_h is therefore the speed-up of the heterogeneous system.
As a drawback, the data-transfer between the two chips needs to be taken into consideration. The maximum frequency and bit-width (baud-rate) for each chip are read from a database and compared against each other. The protocol with the highest throughput is then chosen. The number of bits for the virtual function ID, the arguments and the return value is calculated and divided by the baud-rate to establish the time needed to transfer the necessary data to the device and back.
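Condensed into a single decision, the partitioning criterion reads t_h = t_transfer + t_target < t_u. A sketch with assumed variable names, not the tool's code:

/* Swap a function to the target only if the heterogeneous runtime,
 * including the transfer penalty, beats local execution. */
int should_swap(double t_u, double t_target,
                unsigned long bits_id_args_ret, double baud) {
    double t_transfer = (double)bits_id_args_ret / baud;  /* seconds */
    double t_h = t_transfer + t_target;
    return t_h < t_u;
}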
Once it is decided which device fits best for each function, the whole system is re-balanced. New instances of Programme are generated, each targeting its corresponding device. Functions and their abstract logic for this chip are then moved to the new programme. Each swapped function is assigned a unique ID that can be referenced by the original and the target device.

[Figure 4 shows a log-log plot of the estimated runtime t [µs] over the number of elements for the Atmega, the STM32 and the Atmega + FPGA combination, based on the following data:]

#      Atmega      Atmega + FPGA   relative gain
1      2.687       25.125          -89.31%
10     9.467       25.125          -62.44%
100    76.938      25.125          206.22%
1k     751.937     25.125          2,892.78%
10k    7,501.937   25.125          29,758.46%

Figure 4: Estimated runtime over data sizes (at 16 MHz, in µs): With an increasing amount of data, the Atmega slows down considerably, whereas the FPGA benefits from parallelising the tasks and hence achieves a constant runtime. Once a certain amount of data is reached, swapping the function to the FPGA is favourable.
4 A HETEROGENEOUS EXAMPLE
A simple heterogeneous system is set up for analysis. It consists of an MCU and an FPGA that are interconnected via a UART-bus. The MCU is wired to an analogue input, which is defined as a constraint inside the AST. When the algorithm decides to split the logic between the two devices, code to activate the UART-block is injected and the libraries for the target-communication are included on both devices. The library provides functions to serialise data-packages and to send and receive them between the devices.
The function Main consists of a loop that calls the function Limiter twice and toggles a GPIO-pin in between. These pins are constrained to the Atmega, as they are, in this hypothetical example, physically bound to an LED. The algorithm therefore cannot swap Main to another device. The function Limiter does not take any arguments, nor does it return any data. It runs through an array of integers, and if it encounters a value that is larger than a given threshold, it sets it to a predefined constant value.
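Written out as plain C for illustration (the threshold, the constant and the array size are assumptions; in the tool this logic exists only as an AST):

#include <stdint.h>

#define N 100
#define THRESHOLD 200
#define LIMIT_VALUE 200   /* predefined constant the element is clamped to */

static int16_t samples[N];

extern void toggle_led(void);  /* GPIO toggle, constrained to the Atmega */

/* Limiter: clamp every element above THRESHOLD; no arguments, no return. */
void limiter(void) {
    for (uint16_t i = 0; i < N; i++)
        if (samples[i] > THRESHOLD)
            samples[i] = LIMIT_VALUE;
}

/* Main loop: two Limiter calls with a GPIO toggle in between. */
void main_loop(void) {
    for (;;) {
        limiter();
        toggle_led();
        limiter();
    }
}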
The estimated runtime for the given example has been logged and plotted over different data sizes in Figure 4. While the runtime of the Atmega and the STM32F3 increases linearly with the data size, the exporter for the FPGA manages to parallelise the loop and achieves a constant runtime driven only by the target-communication between the devices. When the number of elements is 33 or lower (approx. 200 for the STM32F3), the Atmega needs less time to run the function Limiter than the heterogeneous system, as the transmission of the arguments and the return value between the devices dominates the runtime. Above that threshold, the heterogeneous system with the FPGA and its parallelised execution wins out and increases its lead over the Atmega. At 1,000 elements it is almost 28 times and at 10,000 elements almost 300 times quicker.
5 EXPORT
During this research, 3 abstract exporters and 6 specialised exporters have been developed, but newer ones can be inherited or implemented from scratch in the future. C source code is used by the AVR- and STM32-exporters; for both controller families, C libraries and tool-chains are well supported by the vendors. The Rust back-end is used by the Linux/x86-exporter, as it provides much better memory safety and therefore reliability. On top of that, programmes written in Rust can be ported more easily between operating systems. The VHDL-exporter is inherited by the specialised exporters for Lattice and Xilinx FPGAs and for the open-source simulator GHDL. Depending on the deviation from the base, a specialised exporter has to provide functionality to generate code for IO and timers, or can optionally replace already implemented methods for code generation in the abstract exporter.
In the exporting process, all instances of Programme are extracted and, depending on the target defined in an assigned tag, the corresponding exporter is called. For each operation in the abstract logic, a method is called that describes the necessary logic in the programming language for the device, e.g. C for the Atmega and VHDL for an FPGA. When exporting the example, the Main-function, including the toggle of the LED, is processed by the AVR/C-exporter, while the Limiter-function is generated by the VHDL-exporter, as it was split off in the previous step.
When a function-call points to a function on another device, a target-communication request (here tc_callfunc()) with the function-ID and its arguments is sent to the target device. tc_waitfunc() waits until the result is returned by the target via the interface.
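On the calling device, the generated replacement of a remote call could therefore look like the following sketch. The signatures and IDs are assumed; the paper names only tc_callfunc() and tc_waitfunc():

#include <stdint.h>

/* Assumed signatures for the target-communication primitives. */
extern void tc_callfunc(uint8_t vfid, const uint8_t *args, uint8_t len);
extern void tc_waitfunc(uint8_t vfid, uint8_t *ret, uint8_t len);

#define VFID_FOO_CALL 0x02   /* assumed virtual function IDs */
#define VFID_FOO_RET  0x03

/* In-software call 'y = foo(x)' replaced by a remote invocation. */
static int32_t remote_foo(int32_t x) {
    int32_t y;
    tc_callfunc(VFID_FOO_CALL, (const uint8_t *)&x, sizeof x);
    /* ...operations independent of the result could run here... */
    tc_waitfunc(VFID_FOO_RET, (uint8_t *)&y, sizeof y);
    return y;
}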
On the FPGA side, all functions are generated by the VHDL-exporter from the logic described in the AST. The FSM of the Main entity waits in its receiving state for incoming data from the host device. When all data has been transferred, the FSM switches to its exec state, where it processes the data. The first byte consists of the function's unique ID. When this ID matches, the arguments for the corresponding sub-entity are assigned and the entity is activated. Once the entity has finished, the FSM transitions to the send state, which reactivates the TC-core to send the value returned by the previously executed core. As this example has neither arguments nor a return value, both are omitted.
6 RESULTS AND DISCUSSION
The generated code has been compiled and tested on an Atmega328 at 16 MHz, on an STM32F3 at 64 MHz and in GHDL as an FPGA test bench. The results for the Atmega and STM32 are shown in Table 1. Both chips were connected to an oscillator, giving them a precise clock source. However, there are three caveats:
- the time was measured with an internal timer on the chips, which lacks accuracy but is good enough for a comparison;
- an array with only one element turned out to be too small for measurement, resulted in 0 µs and was therefore omitted;
- on the Atmega, an array with a data size of 1,000 and more elements resulted in a stack overflow and could only be measured with two nested loops, which definitely harmed the performance.
At 10 elements, the relative error is 15% for the Atmega and -18% for the STM32. It is believed to be that high due to a deviation in the approximation of the loop. When the element count grows, the real runtime on the Atmega increases more quickly than estimated, which is attributed to the nested loops. On the STM32, the estimation is too conservative for larger datasets. Nevertheless, the derived error is small enough for a rough estimation when partitioning a heterogeneous system.
Table 1: Calculated and measured runtime (in µs) for the Atmega328 at 16 MHz and the STM32F3 at 64 MHz.

             Atmega328                        STM32F3
#      calculated  measured  rel. error      calculated  measured  rel. error
10     9.467       8         15.23%          1.469       1.8       -18.40%
100    76.938      77        -0.08%          11.313      11.8      -4.13%
1k     751.937     822       -9.32%          109.750     94.7      15.89%
10k    7,501.937   8,224     -9.63%          1,125.378   939.5     19.78%

7 CONCLUSIONS AND FUTURE WORK

With the idea presented here, the optimal solution for heterogeneous designs across different types of architectures can be computed within a few seconds, which makes distributed computing for embedded systems economically feasible. Users can generate hardware families of heterogeneous systems with different numbers of features, with each member running on a different architecture, for optimal performance and reduced costs from a single development project. The example that is split automatically into different tasks demonstrates the fundamentals of algorithmic partitioning and the associated injection of data-links.
The experiments in the paper show that the development of embedded systems can benefit hugely from this algorithmic partitioning approach. Even in this simple example, a manifold performance uplift can be witnessed, and similar potential is believed to be present in many embedded-system designs. It is demonstrated that users can uncover these gains without any further ado, simply by modelling the algorithm in an integral AST and applying the algorithm to split it automatically. This also enables the adoption of different designs with different topologies, as the AST can be partitioned onto different devices.
However, the presented methodology has drawbacks, mainly that the auto-generated code from the exporters will always lag behind the potential code quality and data footprint of hand-written code, since human developers are well aware of the holistic design and its context. Developers with a deep understanding of the matter are likely to create better solutions for a given problem. Furthermore, very complex systems might still be hard to implement with this new idea, or even impossible to design at all. It is therefore strongly believed that corner cases will remain that can be solved faster with traditional development tools.
Two major opportunities lie in the development ahead:
- Utilising the idle time of devices that are waiting for another to finish its operations. In the meantime, the calling device can compute ahead operations that are independent of the result of the secondary device.
- Augmenting the generation of FSMs on FPGAs. We believe that an algorithm can be developed to detect FSMs that handle streaming data better in a pipelined design. This can lower the needed clock cycles, reduce the resource utilisation and perhaps also increase the maximum clock frequency.
At the same time, further investigations need to be conducted in the fields of data size, energy consumption and costs. A glimpse of the first problem was already caught while preparing this paper, namely when the function caused a stack overflow on the micro-controller. Here, a memory-usage estimation could be introduced and would serve as a secondary evaluation criterion when partitioning the AST. The same is true for an FPGA: an immensely parallelised algorithm results in a lot of consumed die area and lower potential clock speeds. Here, a pipelined approach that divides huge data into chunks and processes them sequentially can largely reduce the demands on an FPGA. Energy consumption is often also a critical aspect of an embedded design. Here, a heterogeneous system needs careful balancing, as a data-transfer between chips always comes at the cost of energy, which can foil the benefits gained by splitting the algorithm across different chips. We believe that these two additional aspects can also be added as criteria for analysing a system.
ACKNOWLEDGEMENTS
The authors would like to thank the German Federal Ministry of Education and Research (BMBF) for supporting the project SaMoA within VIP+. This publication was also funded by the German Research Foundation (DFG) grant "Open Access Publication Funding / 2023-2024 / University of Stuttgart" (512689491).
REFERENCES
Ali, J. (2012). Optimal task partitioning model in dis-
tributed heterogeneous parallel computing environ-
ment. International Journal on Artificial Intelligence
Tools, 2:13–24.
Atmel (2016). AVR Instruction Set Manual.
Bujor, I. (2020). Getting started with SPI using MSSP on
PIC18.
Dijkstra, E. W. (1959). A note on two problems in connex-
ion with graphs. Numer. Math., 1(1):269–271.
Ferrandi, F., Lanzi, P. L., Pilato, C., Sciuto, D., and Tumeo,
A. (2010). Ant colony heuristic for mapping and
scheduling tasks and communications on heteroge-
neous embedded systems. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and
Systems, 29(6):911–924.
Hamann, A., Jersak, M., Richter, K., and Ernst, R. (2006).
A framework for modular analysis and exploration
of heterogeneous embedded systems. Real-Time Sys-
tems, 33:101–137.
Lilja, D. J. (1992). Experiments with a task partitioning
model for heterogeneous computing. Citeseer.
Microchip Technology Inc. (2020). megaAVR® data sheet.
STMicroelectronics (2012). UM1581 user manual.
Streitferdt, D., Sochos, P., Heller, C., and Philippow, I.
(2005). Configuring embedded system families using
feature models. In Proc. of Net. ObjectDays, pages
339–350.
Xie, G., Chen, Y., Liu, Y., Wei, Y., Li, R., and Li, K. (2017).
Resource consumption cost minimization of reliable
parallel applications on heterogeneous embedded sys-
tems. IEEE Transactions on Industrial Informatics,
13(4):1629–1640.
Xu, J., Subramanian, N., Alessio, A., and Hauck, S. (2010). Impulse C vs. VHDL for accelerating tomographic reconstruction. In 18th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM).