
Efimova, V., Shalamov, V., and Filchenkov, A. (2020). Synthetic dataset generation for text recognition with generative adversarial networks. In Twelfth International Conference on Machine Vision (ICMV 2019), volume 11433, pages 310–316. SPIE.
Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. (2016). Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349.
Ge, Y., Xu, J., Zhao, B. N., Itti, L., and Vineet, V. (2022). DALL-E for detection: Language-driven context image synthesis for object detection. arXiv preprint arXiv:2206.09592.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.
Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.-Y., Cubuk, E. D., Le, Q. V., and Zoph, B. (2021). Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2918–2928.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11):139–144.
Hataya, R., Zdenek, J., Yoshizoe, K., and Nakayama, H. (2020). Faster AutoAugment: Learning augmentation strategies using backpropagation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, pages 1–16. Springer.
Hong, M., Choi, J., and Kim, G. (2021). StyleMix: Separating content and style for enhanced data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14862–14870.
Jackson, P. T., Abarghouei, A. A., Bonner, S., Breckon, T. P., and Obara, B. (2019). Style augmentation: Data augmentation via style randomization. In CVPR Workshops, volume 6, pages 10–11.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). YOLO by Ultralytics.
Kim, J.-H., Choo, W., and Song, H. O. (2020). Puzzle Mix: Exploiting saliency and local statistics for optimal mixup. In International Conference on Machine Learning, pages 5275–5285. PMLR.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., and Van Gool, L. (2022). RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471.
Luo, S., Tan, Y., Huang, L., Li, J., and Zhao, H. (2023). Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. (2023). SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
Project, F. (2023). Pothole detection system new dataset. https://universe.roboflow.com/final-project-iic7d/pothole-detection-system-new. Visited on 2023-11-22.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
Rajpal, A., Cheema, N., Illgner-Fehns, K., Slusallek, P., and Jaiswal, S. (2023). High-resolution synthetic RGB-D datasets for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1188–1198.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., and Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pages 234–241. Springer.
Sandfort, V., Yan, K., Pickhardt, P. J., and Summers, R. M. (2019). Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific Reports, 9(1):16884.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
Takase, T., Karakida, R., and Asoh, H. (2021). Self-paced data augmentation for training neural networks. Neurocomputing, 442:296–306.
Trabucco, B., Doherty, K., Gurinas, M., and Salakhutdinov, R. (2023). Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944.
Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., and Guibas, L. J. (2019). Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2642–2651.
Image Augmentation for Object Detection and Segmentation with Diffusion Models