incorporate multi-scale structural information and spatial features of small entities. To overcome this problem, a novel deformable and structural representative network (SRN) is proposed for remote sensing image captioning. In particular, semantic spatial features are first obtained from the backbone network. Subsequently, we develop a deformable network on the initial layers and the SRN on the last layers of the CNN to obtain spatially transformed information and structural representations of an RSI. Finally, a multi-stage caption decoder is utilised to produce meaningful captions. In our approach, the stack of LSTMs in the decoder helps to mitigate the vanishing-gradient problem and also provides mid-path monitoring of intermediate predictions. Our approach outperforms well-known RSIC methods.
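As a rough illustration of this pipeline, the following is a minimal PyTorch-style sketch, not the authors' implementation: a deformable convolution block stands in for the deformable network on early backbone features, a simple multi-scale pooling block stands in for the SRN on late features, and a two-stage stacked-LSTM decoder exposes intermediate logits that could receive an auxiliary "mid-path" loss. All module names, layer sizes, and design details here are illustrative assumptions.

```python
# Hedged sketch only: module names and hyper-parameters are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableBlock(nn.Module):
    """Predicts per-pixel offsets and applies a deformable convolution."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        return self.deform(x, self.offset(x))


class StructuralBlock(nn.Module):
    """Stand-in for the SRN: multi-scale pooled context fused into one map."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s), nn.Conv2d(channels, channels, 1))
            for s in scales
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [nn.functional.interpolate(b(x), size=(h, w), mode="bilinear",
                                            align_corners=False)
                  for b in self.branches]
        return x + sum(pooled)


class StackedLSTMDecoder(nn.Module):
    """Two LSTM stages; the first stage's logits allow mid-path supervision."""
    def __init__(self, feat_dim, embed_dim, hidden, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm1 = nn.LSTMCell(feat_dim + embed_dim, hidden)
        self.lstm2 = nn.LSTMCell(hidden, hidden)
        self.head1 = nn.Linear(hidden, vocab)   # intermediate (coarse) logits
        self.head2 = nn.Linear(hidden, vocab)   # final logits

    def forward(self, feats, tokens):
        b, T = tokens.shape
        h1 = c1 = feats.new_zeros(b, self.lstm1.hidden_size)
        h2 = c2 = feats.new_zeros(b, self.lstm2.hidden_size)
        mid, out = [], []
        for t in range(T):
            x = torch.cat([feats, self.embed(tokens[:, t])], dim=1)
            h1, c1 = self.lstm1(x, (h1, c1))
            h2, c2 = self.lstm2(h1, (h2, c2))
            mid.append(self.head1(h1))
            out.append(self.head2(h2))
        # Both logit sequences are returned so a loss on `mid` can act as
        # the mid-path (intermediate) supervision described above.
        return torch.stack(mid, 1), torch.stack(out, 1)
```

In such a sketch, early backbone features would pass through DeformableBlock, late features through StructuralBlock, and the pooled result would condition the decoder, whose intermediate logits carry the auxiliary loss.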