addressed. The system combines object localization and identification to improve robotic grasping. The main contribution of this work is a framework that integrates two different ML models into a single system for more complete and autonomous grasping tasks.
This paper focused on the integration of two different ML models, one for detection and one for pose estimation. Mask R-CNN and DenseFusion were successfully integrated and show adequate results in a real testing environment. The developed system detects movement and differences between consecutive video frames, reducing the need to run inference on every frame. After an initial selection of user parameters, the proposed solution runs entirely automatically and can generate grasping outputs from video data at a rate of 2.6 fps.
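The paper does not detail the change-detection step; the following is a minimal sketch of one plausible frame-difference gate, assuming OpenCV grayscale differencing with hand-picked thresholds (PIXEL_TOL, CHANGE_FRACTION, and run_pipeline are illustrative names, not the authors' implementation):

```python
import cv2
import numpy as np

PIXEL_TOL = 25          # assumed per-pixel intensity tolerance
CHANGE_FRACTION = 0.02  # assumed fraction of changed pixels that triggers inference

def frame_changed(prev_gray, curr_gray):
    """Return True when enough pixels differ between consecutive frames."""
    diff = cv2.absdiff(prev_gray, curr_gray)
    return np.count_nonzero(diff > PIXEL_TOL) / diff.size > CHANGE_FRACTION

def process_stream(frames, run_pipeline):
    """Run the detection + pose pipeline only on frames that changed."""
    prev_gray = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or frame_changed(prev_gray, gray):
            run_pipeline(frame)  # e.g. Mask R-CNN segmentation + DenseFusion pose
        prev_gray = gray
```

Gating inference this way only pays off when the scene is mostly static between grasps, which matches the reported reduction in per-frame computation.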
A positive aspect of the proposed solution is that a human remains in control of choosing the optimal grasping points for each object and recording them in a configuration file (a possible format is sketched at the end of this section). This allows for user and application adaptability while increasing the system's accuracy and speed, since these points are not estimated at run time. The main limitations of the proposed solution were the overhead required to output correct gripper rotations for certain object positions, the impact of lighting conditions, and the dependency on previously scanned 3D models of the actual objects.
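The grasp-point configuration format is not specified in the text; the following is a minimal sketch of one plausible layout, with hypothetical object and field names, together with how a stored point might be mapped into the camera frame using an estimated 6D pose (rotation R, translation t):

```python
import numpy as np

# Hypothetical per-object grasp configuration created by the user offline.
# Grasp points are expressed in the object's model frame (meters), with a
# preferred gripper approach direction; the object name is illustrative.
GRASP_CONFIG = {
    "mustard_bottle": [
        {"position": [0.00, 0.02, 0.09], "approach": [0.0, 0.0, -1.0]},
        {"position": [0.01, -0.03, 0.05], "approach": [1.0, 0.0, 0.0]},
    ],
}

def grasp_in_camera_frame(obj_name, R, t, idx=0):
    """Map a stored grasp point into the camera frame using the estimated
    6D pose: rotation R (3x3) and translation t (length 3)."""
    grasp = GRASP_CONFIG[obj_name][idx]
    position = np.asarray(R) @ np.asarray(grasp["position"]) + np.asarray(t)
    approach = np.asarray(R) @ np.asarray(grasp["approach"])
    return position, approach
```

Storing the points in the object's model frame lets a single configuration file serve any camera viewpoint, since the pose estimate carries the transformation.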
ACKNOWLEDGEMENTS
INDTECH 4.0 – New technologies for intelligent manufacturing. Support on behalf of the IS for Technological Research and Development (SI à Investigação e Desenvolvimento Tecnológico). POCI-01-0247-FEDER-026653.
REFERENCES
Bicchi, A. and Kumar, V. (2000). Robotic grasping and
contact: a review. In Proceedings 2000 ICRA. Mil-
lennium Conference. IEEE International Conference
on Robotics and Automation. Symposia Proceedings
(Cat. No.00CH37065), pages 348–353 vol.1.
Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., and Dollar, A. M. (2015). Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robotics & Automation Magazine, 22(3):36–52.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-FCN:
Object detection via region-based fully convolutional
networks. Advances in Neural Information Processing
Systems, pages 379–387.
Farhadi, A. and Redmon, J. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Lenz, I., Lee, H., and Saxena, A. (2015). Deep learning for
detecting robotic grasps. The International Journal of
Robotics Research, 34(4-5):705–724.
Li, Y., Wang, G., Ji, X., Xiang, Y., and Fox, D. (2019). DeepIM: Deep iterative matching for 6D pose estimation. International Journal of Computer Vision.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, volume 9905, pages 21–37.
Miller, A. T., Knoop, S., Christensen, H. I., and Allen,
P. K. (2003). Automatic grasp planning using shape
primitives. In 2003 IEEE International Conference
on Robotics and Automation, pages 1824–1829 vol.2.
Peng, S., Liu, Y., Huang, Q., Zhou, X., and Bao, H. (2019). PVNet: Pixel-wise voting network for 6DoF pose estimation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.
Redmon, J. and Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.
Rusinkiewicz, S. and Levoy, M. (2001). Efficient variants of the ICP algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, pages 145–152.
Saxena, A., Driemeyer, J., and Ng, A. Y. (2008). Robotic
grasping of novel objects using vision. The Interna-
tional Journal of Robotics Research, 27(2):157–173.
Song, C., Song, J., and Huang, Q. (2020). HybridPose: 6D object pose estimation under hybrid representations. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-Fei, L., and Savarese, S. (2019). DenseFusion: 6D object pose estimation by iterative dense fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2018). PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS).
Zakharov, S., Shugurov, I., and Ilic, S. (2019). DPOD: 6D pose object detector and refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).