superpixels to generate the coarse 3D primitives (supersurfels) composing the predicted model guarantee the preservation of most of the relevant and meaningful information from the observed scene.
The proposed reconstruction system, SupersurfelFusion, based on this representation, builds a low-resolution yet pertinent 3D model at high speed and with low memory usage. This system is of interest for robots that do not need a very accurate 3D map but rather an efficient, fast and lightweight one, enabling them to perform other tasks at the same time without consuming too many of the available resources.
As future work, we would like to integrate our own camera tracking solution into this system, as a frame-to-model registration strategy, and to make it robust to dynamic environments by detecting and tracking moving objects at the superpixel level. Further optimization could speed up the computation of supersurfel attributes by minimizing branch divergence.
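The branch-divergence reduction mentioned above can be illustrated with a toy example (not taken from the paper, and shown here as plain C for clarity rather than as actual GPU kernel code): when threads of a GPU warp branch on per-pixel validity while accumulating a supersurfel attribute such as a centroid, the divergent paths serialize. Replacing the branch with a 0/1 validity mask lets all threads execute the same instructions. The function names and data layout below are hypothetical.

```c
#include <stddef.h>

/* Divergent style: samples with invalid depth skip the accumulation.
 * On a GPU, threads of a warp taking different paths here diverge. */
float mean_branching(const float *depth, const int *valid, size_t n) {
    float sum = 0.0f;
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (valid[i]) {          /* divergent path */
            sum += depth[i];
            ++count;
        }
    }
    return count ? sum / (float)count : 0.0f;
}

/* Branchless style: the validity mask weights each contribution,
 * so every thread executes the same instruction sequence. */
float mean_branchless(const float *depth, const int *valid, size_t n) {
    float sum = 0.0f, count = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float m = (float)valid[i];   /* 0.0 or 1.0, no branch */
        sum   += m * depth[i];
        count += m;
    }
    return count > 0.0f ? sum / count : 0.0f;
}
```

Both variants compute the same mean over the valid samples; only the control flow differs, which is what matters for warp execution efficiency.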
A Coarse and Relevant 3D Representation for Fast and Lightweight RGB-D Mapping