5 CONCLUSION AND OUTLOOK
In this paper, we presented an approach that uses FuseNet for the 3D segmentation of humans in industrial environments and creates corresponding point clouds for the adaptive path planning of mobile robots. To automatically generate the annotations required for training FuseNet, a Mask R-CNN pre-trained on the COCO dataset was used: the segmentation masks computed from the color image were applied to the color and depth information acquired by the camera and registered pixel-wise. On an evaluation dataset with manually annotated ground-truth masks, our trained FuseNet model achieved a higher mean intersection over union at a lower computation time than competing pre-trained segmentation models. Due to its low computation time and good recognition quality, the model is suitable for real-time 3D segmentation of persons, enabling human-aware path planning for mobile robots based on the resulting point cloud.
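The following is a minimal sketch, not the authors' implementation, of this weakly supervised label generation and point-cloud extraction. It assumes a pixel-aligned (registered) RGB-D frame, uses the COCO pre-trained Mask R-CNN available in torchvision as a stand-in for the detector described above, and treats the thresholds and the camera intrinsics fx, fy, cx, cy as placeholder values.

# Sketch of the weakly supervised annotation pipeline (assumptions as stated above).
import numpy as np
import torch
import torchvision

PERSON_CLASS_ID = 1          # "person" label in the COCO category set
SCORE_THRESHOLD = 0.7        # assumed detection confidence threshold
MASK_THRESHOLD = 0.5         # assumed per-pixel mask threshold

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def person_mask(color_hwc_uint8):
    """Return a binary H x W person mask predicted by Mask R-CNN on the color image."""
    img = torch.from_numpy(color_hwc_uint8).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([img])[0]
    keep = (out["labels"] == PERSON_CLASS_ID) & (out["scores"] > SCORE_THRESHOLD)
    if keep.sum() == 0:
        return np.zeros(color_hwc_uint8.shape[:2], dtype=bool)
    masks = out["masks"][keep, 0] > MASK_THRESHOLD      # (N, H, W) boolean masks
    return masks.any(dim=0).numpy()                     # union of all detected persons

def person_point_cloud(depth_m, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into 3D camera coordinates (pinhole model)."""
    v, u = np.nonzero(mask & (depth_m > 0))             # valid person pixels
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)                  # (M, 3) point cloud

In this sketch, the mask returned by person_mask serves as the automatically generated training label for the aligned RGB-D frame, while person_point_cloud illustrates how the segmented depth pixels yield the point cloud used for path planning.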
For future work, we intend to compare our model, which was trained with a weakly supervised strategy, against the FuseNet model trained on the NYU dataset with expensive manual ground-truth annotations. Additionally, we are interested in evaluating CNNs such as DA-RNN (Xiang and Fox, 2017) and STD2P (He et al., 2016), which take the temporal aspect of the data into account, since tracking humans from frame to frame instead of segmenting each frame in isolation could further improve the results.
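For reference, the following is a minimal sketch (assumed, not the authors' evaluation code) of the mean intersection over union metric used for such comparisons; for the person segmentation task considered here, the label maps contain only a background and a person class.

# Sketch of the mean-IoU metric over integer label maps of equal shape.
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """Mean intersection over union across classes present in either map."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:              # class absent in both prediction and ground truth
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))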
ACKNOWLEDGEMENTS
The project receives funding from the German Fed-
eral Ministry of Education and Research under grant
agreement 05K19WEA (project RAPtOr).
REFERENCES
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., and Adam,
H. (2018). Encoder-decoder with atrous separable
convolution for semantic image segmentation. CoRR,
abs/1802.02611.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical im-
age database. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pages 248–255.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-
Martinez, V., and Rodríguez, J. G. (2017). A review
on deep learning techniques applied to semantic seg-
mentation. CoRR, abs/1704.06857.
Gupta, S., Arbeláez, P., and Malik, J. (2013). Perceptual or-
ganization and recognition of indoor scenes from rgb-
d images. In 2013 IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 564–571.
Hazirbas, C., Ma, L., Domokos, C., and Cremers, D.
(2016). Fusenet: Incorporating depth into semantic
segmentation via fusion-based cnn architecture. In
Asian Conference on Computer Vision.
Hazirbas, C., Ma, L., Domokos, C., and Cremers, D.
(2017). Fusenet. https://github.com/zanilzanzan/FuseNet_PyTorch.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask r-cnn. In 2017 IEEE International Conference
on Computer Vision (ICCV), pages 2980–2988.
He, Y., Chiu, W., Keuper, M., and Fritz, M. (2016). RGBD
semantic segmentation using spatio-temporal data-
driven pooling. CoRR, abs/1604.02388.
Jafari, O. H., Mitzel, D., and Leibe, B. (2014). Real-time
rgb-d based people detection and tracking for mobile
robots and head-worn cameras. In ICRA.
Koch, J., Wettach, J., Bloch, E., and Berns, K. (2007).
Indoor localisation of humans, objects, and mobile
robots with rfid infrastructure. In 7th International
Conference on Hybrid Intelligent Systems (HIS 2007),
pages 271–276.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick,
R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
and Zitnick, C. L. (2014). Microsoft COCO: common
objects in context. CoRR, abs/1405.0312.
Liu, J., Liu, Y., Zhang, G., Zhu, P., and Qiu Chen, Y. (2015).
Detecting and tracking people in real time with rgb-d
camera. Pattern Recognition Letters, 53:16–23.
Mosberger, R. and Andreasson, H. (2013). An inexpen-
sive monocular vision system for tracking humans in
industrial environments. In 2013 IEEE International
Conference on Robotics and Automation, pages 5850–
5857.
Munaro, M., Lewis, C., Chambers, D., Hvass, P., and
Menegatti, E. (2015). Rgb-d human detection and
tracking for industrial environments. In Intelligent Au-
tonomous Systems 13, pages 1655–1668.
Silberman, N., Hoiem, D., Kohli, P., and Fergus, R.
(2012). Indoor segmentation and support inference
from rgbd images. In ECCV.
Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T.,
Leibs, J., Wheeler, R., and Ng, A. Y. (2009). Ros: an
open-source robot operating system. In ICRA work-
shop on open source software, volume 3, page 5.
Kobe, Japan.
Shi, D., Collins Jr, E. G., Goldiez, B., Donate, A., and Dun-
lap, D. (2008). Human-aware robot motion planning
with velocity constraints. In 2008 International Sym-
posium on Collaborative Technologies and Systems,
pages 490–497.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio,
M., Moore, R., Kipman, A., and Blake, A. (2011).
Real-time human pose recognition in parts from single
depth images. In CVPR 2011, pages 1297–1304.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.