In some cases the output of the 4D-NET did not match the ground truth accurately, as shown in Fig. 9. We noticed that the largest errors were caused by the rotation disparity. In most cases this resulted from an inaccurate in-place angle prediction, while the translation influenced the rotational difference only to a small extent. We established that such situations make up 8% of all cases during the testing phase.
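For reference, the rotation disparity between a predicted and a ground-truth orientation can be quantified as the geodesic angle between the two rotation matrices. The following numpy sketch implements this standard metric; it is an illustration, not necessarily the exact error measure used in our evaluation.

```python
import numpy as np

def rotation_disparity_deg(R_pred, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```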
3.3 Runtime
The biggest overhead in the computation pipeline is imposed by the Mask R-CNN network. The inference times of both 4D-NET and 2D-NET are negligible; jointly they provide a result in 7 milliseconds on an Nvidia Titan Xp. Therefore, we can report a throughput of approximately 13–15 frames per second. The computational cost does not grow significantly for multi-object pose estimation since the proposed network architectures are quite lightweight.
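Throughput figures of this kind can be reproduced with a simple benchmark. The sketch below is a minimal example assuming a PyTorch implementation; `pipeline` is a placeholder for the detection-plus-pose chain, not the actual interface of our code.

```python
import time
import torch

@torch.no_grad()
def measure_fps(pipeline, image, warmup=10, runs=100):
    """Average end-to-end frames per second of a callable pipeline."""
    for _ in range(warmup):          # warm-up excludes one-off setup costs
        pipeline(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # finish pending GPU work before timing
    start = time.perf_counter()
    for _ in range(runs):
        pipeline(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)
```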
4 CONCLUSIONS AND FUTURE
WORK
In this paper, we show that 6D pose estimation can be split into two smaller sub-problems. First, a neural network (4D-NET) estimates the camera translation and in-place rotation with respect to the object. At this stage, we assume that the camera axis goes through the center of the object. Second, the full 3D pose of the camera with respect to the object is computed using the mathematical model of the camera. The two-stage solution to object pose estimation simplifies the problem in comparison to end-to-end solutions (Kehl et al., 2017). As a result, the neural network used for pose estimation (4D-NET) can be more compact and computationally efficient.
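The second stage admits a short geometric sketch: given the camera intrinsics, the object's center on the image plane, the in-place rotation about the viewing ray, and the distance along that ray, the full pose follows from aligning the camera's principal axis with the ray through the object center. The interface and names below are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def rot_z(a):
    """Rotation by angle a (radians) about the camera z-axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def align_z_to(ray):
    """Rotation taking the camera z-axis onto a unit ray (Rodrigues formula)."""
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(z, ray)
    s, c = np.linalg.norm(v), float(np.dot(z, ray))
    if s < 1e-12:                              # ray (anti)parallel to z-axis
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)

def full_pose(center_px, inplace_angle, distance, K):
    """Lift (in-place angle, distance, image-plane center) to a full 3D pose."""
    u, v = center_px
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # back-project the center
    ray /= np.linalg.norm(ray)
    R = align_z_to(ray) @ rot_z(inplace_angle)      # roll about the ray, then tilt
    t = distance * ray                              # object center lies on the ray
    return R, t
```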
To detect the object in the RGB image we use the Mask R-CNN method (He et al., 2017). The center of the detected bounding box does not correspond to the projection of the object's center on the image plane. Thus, we designed a neural network (2D-NET) which estimates the center of the object on the image plane. Finally, we show results on a publicly available dataset. The obtained average translation error is smaller than 7% and the obtained rotation error is smaller than 5 degrees. This means that we can precisely and efficiently (up to 15 frames per second) estimate the pose of known objects in 3D space using an RGB image only.
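The mismatch between the box center and the projected object center is a direct consequence of perspective projection, as the following numeric toy example shows. All values (intrinsics, object placement) are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

K = np.array([[600.0, 0.0, 320.0],   # hypothetical pinhole intrinsics
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(X):
    x = K @ X
    return x[:2] / x[2]

# A unit cube centered off the optical axis.
center = np.array([0.4, 0.2, 1.5])
corners = center + 0.5 * np.array([[sx, sy, sz]
                                   for sx in (-1, 1)
                                   for sy in (-1, 1)
                                   for sz in (-1, 1)])
pts = np.array([project(p) for p in corners])
box_center = (pts.min(axis=0) + pts.max(axis=0)) / 2.0  # bounding-box center
true_center = project(center)                           # projected 3D center
print(box_center, true_center)  # the two centers differ by tens of pixels
```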
In the future, we are going to verify the method in real-world robotics applications by using it to detect and grasp objects with a mobile manipulation robot. We also plan to investigate the accuracy of the proposed method in real-world scenarios and to further improve the precision of the pose estimation.
ACKNOWLEDGEMENTS
This work was supported by the National Centre for
Research and Development (NCBR) through project
LIDER/33/0176/L-8/16/NCBR/2017. We gratefully
acknowledge the support of NVIDIA Corporation
with the donation of the Titan Xp GPU used for this
research.
REFERENCES
Bibby, C. and Reid, I. (2008). Robust Real-Time Visual Tracking Using Pixel-Wise Posteriors. In Forsyth, D., Torr, P., and Zisserman, A., editors, Computer Vision – ECCV 2008, Lecture Notes in Computer Science, vol. 5303, pages 831–844. Springer, Berlin, Heidelberg.
Brachmann, E., Michel, F., Krull, A., Yang, M. Y.,
Gumhold, S., and Rother, C. (2016). Uncertainty-
Driven 6D Pose Estimation of Objects and Scenes
from a Single RGB Image. In IEEE Conference
on Computer Vision and Pattern Recognition, pages
3364–3372. IEEE.
Chang, A., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., and Yu, F. (2015). ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
Do, T.-T., Cai, M., Pham, T., and Reid, I. (2018). Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image. arXiv preprint.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In IEEE International Conference on Computer Vision, pages 2980–2988. IEEE.
Hinterstoisser, S., Holzer, S., Cagniart, C., Ilic, S., Kono-
lige, K., Navab, N., and Lepetit, V. (2011). Multi-
modal templates for real-time detection of texture-less
objects in heavily cluttered scenes. In IEEE Inter-
national Conference on Computer Vision, pages 858–
865. IEEE.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., and Navab, N. (2012). Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes. In Lee, K. M., Matsushita, Y., Rehg, J. M., and Hu, Z., editors, Computer Vision – ACCV 2012, Lecture Notes in Computer Science, vol. 7724, pages 548–562. Springer, Berlin, Heidelberg.
Hodan, T., Zabulis, X., Lourakis, M., Obdrzalek, S., and Matas, J. (2015). Detection and Fine 3D Pose Estimation of Texture-less Objects in RGB-D Images. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.