
solid waste for recycling. Waste Management, 60:56–
74. Special Thematic Issue: Urban Mining and Circu-
lar Economy.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hong, J., Fulton, M., and Sattar, J. (2020). Trashcan: A
semantically-segmented dataset towards visual detec-
tion of marine debris.
Huang, Y., Kang, D., Jia, W., Liu, L., and He, X. (2022).
Channelized axial attention–considering channel re-
lation within spatial attention for semantic segmenta-
tion. In Proceedings of the AAAI Conference on Arti-
ficial Intelligence, volume 36, pages 1016–1025.
Jo, S. and Yu, I. (2021). Puzzle-cam: Improved localiza-
tion via matching partial and full features. CoRR,
abs/2101.11253.
Jung, H. and Oh, Y. (2021). Towards better explanations
of class activation mapping. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 1336–1344.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C., Bottou, L.,
and Weinberger, K., editors, Advances in Neural In-
formation Processing Systems, volume 25. Curran As-
sociates, Inc.
Lee, S., Lee, M., Lee, J., and Shim, H. (2021). Railroad
is not a train: Saliency as pseudo-pixel supervision
for weakly supervised semantic segmentation. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 5495–5505.
Li, Y., Chen, Y., Wang, N., and Zhang, Z. (2019). Scale-
aware trident networks for object detection. In Pro-
ceedings of the IEEE/CVF international conference
on computer vision, pages 6054–6063.
Liu, S., Zhi, S., Johns, E., and Davison, A. J. (2021a). Boot-
strapping semantic segmentation with regional con-
trast. arXiv preprint arXiv:2104.04465.
Liu, S., Zhi, S., Johns, E., and Davison, A. J. (2022a). Boot-
strapping semantic segmentation with regional con-
trast.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning,
J., Cao, Y., Zhang, Z., Dong, L., et al. (2022b). Swin
transformer v2: Scaling up capacity and resolution.
In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 12009–
12019.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S.,
and Guo, B. (2021b). Swin transformer: Hierarchical
vision transformer using shifted windows.
Majchrowska, S., Mikołajczyk, A., Ferlin, M.,
Klawikowska, Z., Plantykow, M. A., Kwasigroch, A.,
and Majek, K. (2022). Deep learning-based waste
detection in natural and urban environments. Waste
Management, 138:274–284.
Mao, A., Mohri, M., and Zhong, Y. (2023). Cross-entropy
loss functions: Theoretical analysis and applications.
arXiv preprint arXiv:2304.07288.
Minaee, S., Boykov, Y. Y., Porikli, F., Plaza, A. J., Kehtarnavaz, N., and Terzopoulos, D. (2021). Image segmentation using deep learning: A survey.
Ouali, Y., Hudelot, C., and Tami, M. (2020). Semi-
supervised semantic segmentation with cross-
consistency training. In 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 12671–12681.
Pathak, D., Krähenbühl, P., and Darrell, T. (2015). Constrained convolutional neural networks for weakly supervised segmentation.
Proença, P. F. and Simões, P. (2020). TACO: Trash annotations in context for litter detection.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Sudre, C. H., Li, W., Vercauteren, T., Ourselin, S., and
Jorge Cardoso, M. (2017). Generalised dice over-
lap as a deep learning loss function for highly unbal-
anced segmentations. In Deep Learning in Medical
Image Analysis and Multimodal Learning for Clini-
cal Decision Support: Third International Workshop,
DLMIA 2017, and 7th International Workshop, ML-
CDS 2017, Held in Conjunction with MICCAI 2017,
Québec City, QC, Canada, September 14, Proceedings 3, pages 240–248. Springer.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional conference on machine learning, pages 6105–
6114. PMLR.
Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N.,
Hechtman, B., and Shlens, J. (2021). Scaling lo-
cal self-attention for parameter efficient visual back-
bones.
Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H.
(2020). Linformer: Self-attention with linear com-
plexity.
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q.,
Aggarwal, K., Mohammed, O. K., Singhal, S., Som,
S., et al. (2022a). Image as a foreign language: Beit
pretraining for all vision and vision-language tasks.
arXiv preprint arXiv:2208.10442.
Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu,
X., Lu, T., Lu, L., Li, H., et al. (2022b). Internimage:
Exploring large-scale vision foundation models with
deformable convolutions.
Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X.,
Hu, X., Lu, T., Lu, L., Li, H., Wang, X., and Qiao,
Y. (2023). Internimage: Exploring large-scale vision
foundation models with deformable convolutions.