Table 4 indicates that, when iterative pose refinement processes are not taken into account, the proposed approach achieves the top performance in AR, AR (VSD), and processing time. Note that all approaches under comparison were evaluated on the same test set provided by the benchmark. In sum, the experiments validated the effectiveness of the proposed framework in terms of its top-tier performance in both the accuracy of the estimated poses and the computational efficiency.
7 CONCLUSION
In this study, we presented an end-to-end deep
network framework for the 6D pose estimation of
objects under heavy occlusions in cluttered scenes.
The proposed framework integrates the cascaded YOLO-YOLACT-based occlusion-robust panoptic segmentation network with the dual associative point AE-based efficient 6D pose estimation network. In particular, we achieved occlusion-robust panoptic segmentation through two novel mechanisms: 1) depth-based tone mapping of YOLO box images and 2) scene-level segmentation refinement that fuses multiple YOLACT segmentations generated from overlapping YOLO boxes.
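For concreteness, the following minimal NumPy sketch illustrates the flavor of these two steps; the linear tone curve, the score-based merging rule, and all function names are illustrative assumptions rather than the exact operations of the proposed network.

```python
import numpy as np

def depth_tone_map(rgb_box, depth_box, d_near, d_far):
    """Hypothetical depth-based tone mapping of a YOLO box image:
    emphasize pixels closer to the camera so the foreground object
    stands out against occluders and background."""
    w = np.clip((d_far - depth_box) / (d_far - d_near), 0.0, 1.0)  # 1 = near, 0 = far
    return (rgb_box.astype(np.float32) * (0.5 + 0.5 * w[..., None])).astype(np.uint8)

def fuse_box_segmentations(scene_shape, box_masks):
    """Fuse per-box YOLACT masks into one scene-level instance map.
    box_masks: list of ((x0, y0, x1, y1), mask, score, instance_id),
    where mask is a boolean array cropped to the box. Overlaps are
    resolved by keeping the higher-confidence prediction (an assumed rule)."""
    label_map = np.zeros(scene_shape, dtype=np.int32)    # 0 = background
    score_map = np.zeros(scene_shape, dtype=np.float32)
    for (x0, y0, x1, y1), mask, score, inst_id in box_masks:
        lbl, scr = label_map[y0:y1, x0:x1], score_map[y0:y1, x0:x1]
        win = mask & (score > scr)                       # pixels this instance wins
        lbl[win], scr[win] = inst_id, score
    return label_map
```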
We also achieved highly efficient 6D pose estimation by directly transforming a 3D partial point cloud into the corresponding 6D full point cloud represented in the combined camera and object frames.
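As a point of reference for how such a 6D point cloud encodes the pose, the sketch below recovers (R, t) by a standard SVD-based rigid fit (Kabsch/Umeyama) between the camera-frame and object-frame coordinates paired in each point; whether the proposed network uses exactly this closed-form step, and the function name itself, are assumptions made for illustration.

```python
import numpy as np

def pose_from_dual_points(p_cam, p_obj):
    """Recover the pose (R, t) with p_cam ≈ R @ p_obj + t from a 6D point
    cloud whose rows pair camera-frame coordinates p_cam (N, 3) with
    object-frame coordinates p_obj (N, 3). Standard Kabsch/Umeyama fit."""
    c_cam, c_obj = p_cam.mean(axis=0), p_obj.mean(axis=0)
    H = (p_obj - c_obj).T @ (p_cam - c_cam)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T
    t = c_cam - R @ c_obj
    return R, t
```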
The robustness of the proposed panoptic segmentation against heavily occluded scenes was verified through ablation and comparative studies. The effectiveness of the integrated 6D pose estimation framework was validated through experiments on the standard benchmark datasets LM and LMO. We demonstrated top-tier performance on the LMO leaderboard in terms of both accuracy and efficiency, in particular in AR (VSD) and computation time, even though no additional 6D pose refinement process is employed. We are currently developing a 6D pose refinement stage to be attached to the current framework. We also plan to apply the framework to various industrial pick-and-place operations.
ACKNOWLEDGEMENTS
This study was supported in part by the "Intelligent Manufacturing Solution under Edge-Brain Framework" project of the Institute of Information & Communications Technology Planning & Evaluation (IITP) under Grants IITP-2022-0-00067 (EdgeBrain-2) and IITP-2022-0-00187 (EdgeBrain-3), in part by the AI Graduate School Program under Grant 2019-0-00421, and in part by the ICT Consilience Program under Grant IITP-2020-0-01821 of the IITP, funded by the Korean Ministry of Science and ICT (MSIT).
REFERENCES
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016).
You only look once: Unified, real-time object detection.
In Proceedings of the IEEE conference on computer
vision and pattern recognition (pp. 779-788).
Bolya, D., Zhou, C., Xiao, F., & Lee, Y. J. (2019). Yolact:
Real-time instance segmentation. In Proceedings of the
IEEE/CVF international conference on computer vision
(pp. 9157-9166).
Lee, S., Mai, K. T., & Jeong, W. (2012, February). Virtual
high dynamic range imaging for robust recognition. In
Proceedings of the 6th International Conference on
Ubiquitous Information Management and
Communication (pp. 1-6).
Kirillov, A., He, K., Girshick, R., Rother, C., & Dollár, P.
(2019). Panoptic segmentation. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition (pp. 9404-9413).
Cheng, B., Schwing, A., & Kirillov, A. (2021). Per-pixel
classification is not all you need for semantic
segmentation. Advances in Neural Information
Processing Systems, 34, 17864-17875.
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., &
Girdhar, R. (2022). Masked-attention mask transformer
for universal image segmentation. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition (pp. 1290-1299).
Jain, J., Li, J., Chiu, M. T., Hassani, A., Orlov, N., & Shi,
H. (2023). Oneformer: One transformer to rule
universal image segmentation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 2989-2998).
Wang, G., Manhardt, F., Tombari, F., & Ji, X. (2021).
GDR-net: Geometry-guided direct regression network
for monocular 6d object pose estimation. In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 16611-
16621).
Hu, Y., Fua, P., & Salzmann, M. (2022, October).
Perspective flow aggregation for data-limited 6d object
pose estimation. In European Conference on Computer
Vision (pp. 89-106). Cham: Springer Nature
Switzerland.
Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-
Fei, L., & Savarese, S. (2019). Densefusion: 6d object
pose estimation by iterative dense fusion. In
Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (pp. 3343-3352).