Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer
Muhammad Zia Ur Rehman, Syed Mohammed Shamsul Islam, Anwaar UlHaq, David Blake, Naeem Janjua
2025
Abstract
Multisource remote sensing data has gained significant attention in land use classification. However, effectively extracting both local and global features from various modalities and fusing them to leverage their complementary information remains a substantial challenge. In this paper, we address this by exploring the use of transformers for simultaneous local and global feature extraction while enabling cross-modality learning to improve the integration of complementary information from HSI and LiDAR data modalities. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer, which extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features. Additionally, cross-modal learning is incorporated to effectively capture complementary information from HSI and LiDAR modalities. Evaluation on the Trento dataset highlights the effectiveness of the proposed approach, achieving an average accuracy of 99.04% and surpassing comparable methods.
Paper Citation
in Harvard Style
Rehman M., Islam S., UlHaq A., Blake D. and Janjua N. (2025). Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP; ISBN 978-989-758-728-3, SciTePress, pages 143-153. DOI: 10.5220/0013191300003912
in Bibtex Style
@conference{visapp25,
author={Muhammad Zia Ur Rehman and Syed Mohammed Shamsul Islam and Anwaar UlHaq and David Blake and Naeem Janjua},
title={Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP},
year={2025},
pages={143--153},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013191300003912},
isbn={978-989-758-728-3},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP
TI - Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer
SN - 978-989-758-728-3
AU - Rehman M.
AU - Islam S.
AU - UlHaq A.
AU - Blake D.
AU - Janjua N.
PY - 2025
SP - 143
EP - 153
DO - 10.5220/0013191300003912
PB - SciTePress