The occupancy grid map, introduced by Elfes (Elfes, 1989)
and Moravec (Moravec, 1988), is a common concept
in robotics. They used their approach to describe a
static environment for a mobile robot.
Grid maps are also used in robotics for Simultaneous
Localization and Mapping (SLAM) problems, as in
(Grisettiyz et al., 2005), or for multiple object tracking
(Chen et al., 2006). Most approaches use only a
2D occupancy grid, but some works extended it to a
third dimension (Moravec, 1996). The drawback of
3D methods is their high memory consumption, which
was addressed by (Dryanovski et al., 2010). Since the
basic ideas behind robotics and autonomous vehicles are very
similar, it is not surprising that the concept of occupancy
grid maps was also introduced to the automotive sector.
The difference in the automotive context is that
in robotics a static map is often created only once, whereas
for vehicles the map has to be updated continuously
due to the dynamic environment. In (Badino et al.,
2007), the authors introduced an occupancy grid for
the automotive domain to compute the free space of
the environment. Occupancy grid maps are also used
for lane detection (Kammel and Pitzer, 2008) or path
planning (Thrun et al., 2005). Often, laser range
finders are used to create occupancy grid maps (Homm
et al., 2010; Fickenscher et al., 2016), as we consider
in this paper, but it is also possible to create such maps
with radar sensors (Werber et al., 2015). Since creating
occupancy grid maps can be very compute-intensive, the
task was parallelized on a GPU (Homm et al., 2010). The
authors used a desktop PC as a proof of concept.
However, its space and energy requirements are far
from those of a realistic ECU. In this paper, we use an
embedded platform, which is much closer to a later
ECU design. Yguel et al. (Yguel et al., 2006) used
several laser range finders, but only with one vertical
layer, and they also ran their experiments on a desktop
computer. In (Fickenscher et al., 2016), the occupancy
grid map was parallelized on an embedded GPU. In
that approach, a laser sensor with only one vertical
layer was used. In this paper instead, a sensor with
several vertical layers is used. Hereby, the accuracy of
the occupancy grid map is increased enormously, but
the computational effort also rises proportionally.
2 FUNDAMENTALS
In this section, a brief overview of environment
maps, especially the occupancy grid map, is given. Then,
our very efficient algorithm to create an occupancy grid
and its parallelization are described. Also, the specific
properties of embedded GPUs are briefly summarized.
Further, the differences when parallelizing the algorithm on a
GPU instead of on a CPU are described.

Figure 1: Structural overview of the CUDA programming
model for an embedded GPU.
2.1 Programming an Embedded GPU
The main purpose of GPUs has been the rendering of
computer graphics, but with their steadily increasing
programming and memory facilities, GPUs have recently
also become attractive for accelerating compute-intensive
algorithms from other domains. GPUs show their
strengths if an algorithm has a high arithmetic density
and can be parallelized in a data-parallel way. However,
the hardware architectures of a GPU and a CPU
are quite different. At the hardware level, a GPU comprises
several streaming processors, which in turn contain
processing units. Such a streaming processor manages,
spawns, and executes threads. The threads are
managed in groups of 32 threads, so-called warps. Within a
warp, every thread has to execute the same instruction
at the same time. For example, if there is a branch
whose condition evaluates to true for only half of the threads,
the other half of the threads has to wait. As illustrated
in Figure 1, threads are combined into logical blocks,
and these blocks are combined into a logical grid. In
2006, Nvidia introduced the framework CUDA
(NVIDIA Corp., 2016b) to ease the use of GPUs for general-purpose
programming. Program blocks that should
be executed in parallel are called kernels. These kernels
can be executed over a range (1D/2D/3D) specified by
the programmer. For a range with n elements,
n threads are spawned by the CUDA runtime system.
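As a minimal sketch of these concepts, the following CUDA fragment launches a kernel over a 1D range of n elements; the kernel name, the element-wise operation, and the block size are our own illustrative choices, not taken from the occupancy grid implementation discussed in this paper.

```cuda
#include <cstdio>

// Minimal kernel: one thread per element of the 1D range.
// The index is computed from the grid/block/thread hierarchy
// shown in Figure 1.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may be only partly used
        // Divergent branch: threads of the same warp that take
        // different sides are serialized, as described in the text.
        if (data[i] < 0.0f)
            data[i] = 0.0f;
        else
            data[i] *= 2.0f;
    }
}

int main() {
    const int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // 1D launch range: enough blocks of 256 threads to cover n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Note how the guard `i < n` handles ranges that are not a multiple of the block size, and how the branch on `data[i]` is exactly the kind of divergence within a warp that the text describes.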
The programming model for parallelizing an
algorithm also differs. On a CPU, parallel threads
operate on different data, and every thread can
process different instructions. On a GPU instead, every
thread executes the same instruction at the same time on
different data. This model is called Single Instruction
Multiple Threads (SIMT). The main difference bet-
Base Algorithms of Environment Maps and Efficient Occupancy Grid Mapping on Embedded GPUs