
hardware-topology right from the beginning is given
or time-consuming benchmarks require rewriting of
code for different architectures in order to find an ‘op-
timal’ combination.
2 BACKGROUND AND RELATED WORK
Industry’s aspiration to support different accelerators from a single source, for embedded devices, scientific research and massively parallel computing, has been continuous over the past years. For static analysis, four main development branches have evolved from these pursuits: cross-compilers, just-in-time compilers (JIT), hardware-abstraction layers (HAL) and Model-Based Systems Engineering software (MBSE). Dynamic approaches schedule tasks between multiple processors of the same type and can achieve a low-latency, fail-safe system for automotive and aerospace applications. Each proposal offers a solution for aspects of the problems discussed here, but lacks the focus on multiple different architectures in heterogeneous embedded systems and on the automated generation of the necessary source code for each target type.
Impulse C, an example of a cross-compiler, translates ANSI-C code into RTL for FPGAs. It is often used for image-processing algorithms, as it can significantly reduce development time, as shown by Xu, Subramanian, Alessio and Hauck (Xu et al., 2010).
However, it is not possible to balance software for an embedded system between multiple devices. In addition, the website of the original developer, Impulse Accelerated Technologies (https://impulseaccelerated.com/), is no longer available, so the future of Impulse C is uncertain.
JIT compilers are used on performance-oriented computers such as notebooks, PCs and servers, in the form of peripheral drivers (e.g. GPU) or programming languages (e.g. the JVM). These solutions are developed by the hardware manufacturers to enable users to take full advantage of their products. A more unified approach is offered by AMD with ROCm and Intel with oneAPI, but these solutions are aimed at high-performance computing and cannot be adopted on embedded systems. Java has a niche role in scientific computing and can also be offloaded to GPUs and FPGAs with an OpenCL backend (TornadoVM), but it is not widely used for microelectronics, as the lack of low-level support and the resource-intensive JVM make an implementation on small components unfeasible.
A hardware-abstraction layer (HAL), like a device driver, consists of functions that provide access to hardware without requiring knowledge of its exact operation. This allows different models of hardware to be programmed from the same source code; the ARM ecosystem is a prime example. Operating systems and drivers take this one step further and offer interfaces that can be used dynamically: programmes do not need to be recompiled and can communicate over a standardised format. Many major operating systems are built upon this principle.
All of the solutions discussed so far, however, have in common that the developer has to partition the programme into chunks in order to run them on different devices. Automated approaches have also been developed in the past.
There has also been academic work that focuses
on certain aspects of heterogeneous computing such
as compute architecture, memory model and dynamic
data transfer.
Lilja (Lilja, 1992) researches splitting a programme into tasks and scheduling them across a high-speed network between two computers – one based on a regular CPU and the other equipped with a vector machine. He shows that this approach can accelerate execution by more than a thousandfold, but that the gain depends highly on the type of task and on the amount of data to be processed and shared across the network.
SymTA/S (Symbolic Timing Analysis for Sys-
tems) analyses a design space for distributed work-
loads on multiprocessor system on chip designs (Mp-
SoCs) (Hamann et al., 2006). It evaluates events
and allocates them over multiple processor nodes for
ideal latency. In closely related work, tasks have been mapped and scheduled over busses for embedded systems (Ferrandi et al., 2010). In both cases, the targets and potential code generators are unknown to the author.
Ali et al. (Ali, 2012) describe in detail the costs of communication between different computers in a network with shared memory. Building upon these models, they experiment on the scalability of their approach. Even though they mention CPU and GPU co-computation, it is unclear if and how they manage to run tasks on different types of compute architecture. The issue of memory overhead in distributed computing networks has been addressed by Xie, Chen, Liu, Wei, Li and Li (Xie et al., 2017). In their approach, multiple processors of the same type are placed in a network and the given tasks are dynamically scheduled between them.
The work presented in this paper implements a holistic solution with a single-source description of the desired system, an analytical partitioning onto optimal architectures, and exporters to generate the necessary source code for the chosen devices. A heterogeneous system for embedded devices can auto-
MODELSWARD 2025 - 13th International Conference on Model-Based Software and Systems Engineering