
REFERENCES
Armeni, I., He, Z.-Y., Gwak, J., Zamir, A. R., Fischer, M.,
Malik, J., and Savarese, S. (2019). 3d scene graph: A
structure for unified semantics, 3d space, and camera.
In Proceedings of the IEEE International Conference
on Computer Vision.
Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017).
Segnet: A deep convolutional encoder-decoder ar-
chitecture for image segmentation. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
39(12):2481–2495.
Caron, M., Bojanowski, P., Joulin, A., and Douze, M.
(2018). Deep clustering for unsupervised learning of
visual features. In European Conference on Computer
Vision.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J.,
Bojanowski, P., and Joulin, A. (2021). Emerging
properties in self-supervised vision transformers. In
Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 9650–9660.
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and
Adam, H. (2018). Encoder-decoder with atrous
separable convolution for semantic image segmentation.
In European Conference on Computer Vision, pages
801–818.
Cheng, B., Schwing, A. G., and Kirillov, A. (2021). Per-
pixel classification is not all you need for semantic
segmentation.
Cho, J. H., Mall, U., Bala, K., and Hariharan, B. (2021).
Picie: Unsupervised semantic segmentation using
invariance and equivariance in clustering. 2021
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 16789–16799.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Desai, S. M. and Ghose, D. (2022). Active learning for
improved semi-supervised semantic segmentation in
satellite images. 2022 IEEE/CVF Winter Conference
on Applications of Computer Vision (WACV), pages
1485–1495.
Genova, K., Yin, X., Kundu, A., Pantofaru, C., Cole,
F., Sud, A., Brewington, B., Shucker, B., and
Funkhouser, T. (2021). Learning 3d semantic seg-
mentation with only 2d image supervision. In 2021
International Conference on 3D Vision (3DV), pages
361–372.
Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., and
Freeman, W. T. (2022). Unsupervised semantic seg-
mentation by distilling feature correspondences. In In-
ternational Conference on Learning Representations.
Ji, X., Henriques, J. F., and Vedaldi, A. (2019). Invariant
information clustering for unsupervised image clas-
sification and segmentation. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion (ICCV).
Kim, W., Kanezaki, A., and Tanaka, M. (2020). Unsuper-
vised learning of image segmentation based on dif-
ferentiable feature clustering. IEEE Transactions on
Image Processing, 29:8055–8068.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K.,
Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J.,
Shamma, D. A., Bernstein, M., and Fei-Fei, L. (2017).
Visual genome: Connecting language and vision
using crowdsourced dense image annotations.
International Journal of Computer Vision, 123:32–73.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K.,
and Yuille, A. (2015). Semantic image segmentation
with deep convolutional nets and fully connected
CRFs. In International Conference on Learning
Representations, San Diego, United States.
Luo, Z., Pan, J., Hu, Y., Deng, L., Li, Y., Qi, C., and Wang,
X. (2024). Rs-dseg: semantic segmentation of high-
resolution remote sensing images based on a diffusion
model component with unsupervised pretraining. Sci-
entific Reports, 14(1):18609.
Mascaro, R., Teixeira, L., and Chli, M. (2021). Dif-
fuser: Multi-view 2d-to-3d label diffusion for se-
mantic scene segmentation. In 2021 IEEE Inter-
national Conference on Robotics and Automation
(ICRA), pages 13589–13595.
Neuhold, G., Ollmann, T., Rota Bulò, S., and Kontschieder,
P. (2017). The mapillary vistas dataset for semantic
understanding of street scenes. In International
Conference on Computer Vision (ICCV).
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-
net: Convolutional networks for biomedical im-
age segmentation. In Medical Image Computing
and Computer-Assisted Intervention – MICCAI 2015,
pages 234–241. Springer International Publishing.
Rottensteiner, F., Sohn, G., Jung, J., Gerke, M., Baillard,
C., Bénitez, S., and Breitkopf, U. (2012). The
isprs benchmark on urban object classification and 3d
building reconstruction. ISPRS Annals of
Photogrammetry, Remote Sensing and Spatial Information
Sciences, I-3.
Schmitt, M., Ahmadi, S., and Hänsch, R. (2021). There
is no data like more data – current status of machine
learning datasets in remote sensing. In International
Geoscience and Remote Sensing Symposium.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
In International Conference on Learning Representa-
tions.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions. In
2015 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 1–9.
Yuan, Y., Chen, X., and Wang, J. (2020). Object-contextual
representations for semantic segmentation.