
combining synthetic and real datasets produced the
best overall results. Interestingly, when the model was
fine-tuned on real data after initial training on
synthetic data, the impact of filtering became less
significant. The randomness introduced by the unfiltered
dataset improved generalization during fine-tuning.
This hybrid approach suggests that synthetic data can
be a valuable supplement in situations where real-world
data is limited or difficult to annotate. We also
demonstrated that the approach is applicable to objects
beyond simple cuboid shapes.
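The two-stage schedule summarized above (initial training on synthetic data, then fine-tuning on a small real dataset) can be illustrated with a minimal sketch. This is a toy 1-D linear regression, not the 6D pose network used in this work. All names (sgd_fit, make_data, mse), the learning rates, and the dataset sizes are hypothetical, and the constant offset between the two data sources is only an assumed stand-in for the synthetic-to-real domain gap.

```python
import random


def sgd_fit(w, b, data, lr, epochs, rng):
    """Run plain SGD on the toy model y = w*x + b and return the updated (w, b)."""
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(w, b, data):
    """Mean squared error of the toy model on a dataset."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def make_data(n, slope, offset, noise, rng):
    """Sample n points from y = slope*x + offset with Gaussian label noise."""
    return [(x, slope * x + offset + rng.gauss(0, noise))
            for x in (rng.uniform(-1, 1) for _ in range(n))]


rng = random.Random(0)

# Assumption: synthetic data is plentiful but noisy and biased (offset 0.0),
# while real data is scarce but matches the target domain (offset 0.5).
synthetic = make_data(500, slope=2.0, offset=0.0, noise=0.3, rng=rng)
real_train = make_data(20, slope=2.0, offset=0.5, noise=0.05, rng=rng)
real_test = make_data(200, slope=2.0, offset=0.5, noise=0.05, rng=rng)

# Stage 1: initial training on synthetic data only.
w, b = sgd_fit(0.0, 0.0, synthetic, lr=0.05, epochs=20, rng=rng)
mse_pretrained = mse(w, b, real_test)

# Stage 2: fine-tuning on the small real set with a lower learning rate.
w, b = sgd_fit(w, b, real_train, lr=0.01, epochs=50, rng=rng)
mse_finetuned = mse(w, b, real_test)

print(f"real-test MSE after synthetic pretraining: {mse_pretrained:.4f}")
print(f"real-test MSE after real fine-tuning:      {mse_finetuned:.4f}")
```

In this toy setting, the fine-tuning stage mainly corrects the bias inherited from the synthetic data. The sketch only mirrors the structure of the schedule and does not reproduce the filtering experiments.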
Conditioned Generative AI for Synthetic Training of 6D Object Pose Detection