Chen, K., Zou, Z., and Shi, Z. (2021b). Building extraction from remote sensing images with sparse token transformers. Remote Sensing, 13(21):4441.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., and Tao, D. (2022). A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hatamizadeh, A., Xu, Z., Yang, D., Li, W., Roth, H., and Xu, D. (2022). UNetFormer: A unified vision transformer model and pre-training framework for 3D medical image segmentation. arXiv preprint arXiv:2204.00631.
Hou, Q., Zhou, D., and Feng, J. (2021). Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13713–13722.
Jadon, S. (2020). A survey of loss functions for semantic segmentation. In 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–7. IEEE.
Ji, S., Wei, S., and Lu, M. (2018). Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing, 57(1):574–586.
Li, C., Yang, J., Zhang, P., Gao, M., Xiao, B., Dai, X., Yuan, L., and Gao, J. (2021). Efficient self-supervised vision transformers for representation learning. arXiv preprint arXiv:2106.09785.
Li, Y., Yuan, G., Wen, Y., Hu, J., Evangelidis, G., Tulyakov, S., Wang, Y., and Ren, J. (2022). EfficientFormer: Vision transformers at MobileNet speed.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125.
Liu, X., Duh, K., Liu, L., and Gao, J. (2020). Very deep transformers for neural machine translation. arXiv preprint arXiv:2008.07772.
Liu, Y., Zhang, Y., Wang, Y., Hou, F., Yuan, J., Tian, J., Zhang, Y., Shi, Z., Fan, J., and He, Z. (2021). A survey of visual transformers. arXiv preprint arXiv:2111.06091.
Maggiori, E., Tarabalka, Y., Charpiat, G., and Alliez, P. (2017). Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 3226–3229. IEEE.
Sariturk, B., Seker, D. Z., Ozturk, O., and Bayram, B. (2022). Performance evaluation of shallow and deep CNN architectures on building segmentation from high-resolution images. Earth Science Informatics, pages 1–23.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Wang, L., Li, R., Duan, C., Zhang, C., Meng, X., and Fang, S. (2022a). A novel transformer based semantic segmentation scheme for fine-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters, 19:1–5.
Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., Meng, X., and Atkinson, P. M. (2022b). UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 190:196–214.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., and Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090.
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., and Yan, S. (2021). MetaFormer is actually what you need for vision.
Zhang, J., Chang, W.-C., Yu, H.-F., and Dhillon, I. (2021). Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34:7267–7280.
Zhang, J., Yang, K., Ma, C., Reiß, S., Peng, K., and Stiefelhagen, R. (2022). Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16917–16927.
Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Jiang, Z., Hou, Q., and Feng, J. (2021). DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886.
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N., and Liang, J. (2018). UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 3–11. Springer.
A Comparative Study on Vision Transformers in Remote Sensing Building Extraction