
Figure 7: (a) shows the pseudo-color map of the HSI data
using selected bands (20, 15, 5), (b) shows the grayscale
visualization of the LiDAR data, (c) shows the ground truth
available for six unique classes, and (d) shows the
classification map generated using the proposed technique.
the outermost edge (indicating higher values) on the
radar chart for overall accuracy, average accuracy, and
kappa. Similarly, Figure 8(b) shows that optimal
performance is attained with 8 attention heads, where
each axis of the radar chart represents a different
number of attention heads, i.e., 2, 4, 6, or 8.
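The three metrics plotted on the radar chart (overall accuracy, average accuracy, and kappa) can all be derived from a confusion matrix. The following is a minimal sketch, assuming rows are ground-truth classes and columns are predicted classes; the function name `classification_metrics` and the toy matrix are illustrative and not taken from the paper.

```python
def classification_metrics(confusion):
    """Return (overall accuracy, average accuracy, Cohen's kappa).

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j (assumed layout).
    """
    n_classes = len(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")

    # Overall accuracy: fraction of all samples on the diagonal.
    oa = sum(confusion[i][i] for i in range(n_classes)) / total

    # Average accuracy: mean per-class recall over classes that have samples.
    recalls = [confusion[i][i] / row_totals[i]
               for i in range(n_classes) if row_totals[i] > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the chance agreement implied
    # by the row and column marginals.
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - pe) / (1.0 - pe) if pe < 1.0 else float("nan")
    return oa, aa, kappa


# Toy two-class example (hypothetical numbers).
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

In this toy case the chance agreement is 0.5, so an overall accuracy of 0.85 corresponds to a kappa of about 0.7.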
5 CONCLUSION
In this study, we presented a novel transformer-based
technique for multimodal land use classification using
HSI and LiDAR modalities. The proposed dual-stream
architecture effectively extracts features from both
modalities, with the cross-modal convolutional
transformer demonstrating its ability to learn local and
global features. The incorporation of a cross-modal
attention module enables joint learning between
modalities, using complementary information to extract
more discriminative features. The resulting gain in
classification accuracy benefits applications such as
environmental monitoring, urban planning, and other
areas where precise land use classification is essential.
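To illustrate the general idea behind cross-modal attention, the sketch below implements generic single-head scaled dot-product attention in which queries come from one modality and keys and values come from the other. It is an assumption-laden illustration, not the module proposed in this paper: learned projections, multiple heads, and the convolutional embedding are omitted, and the names `cross_attention`, `hsi_tokens`, and `lidar_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention across two token sets.

    queries come from one modality; keys and values come from the other.
    Each argument is a list of equal-length feature vectors.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_v = len(values[0])
    output = []
    for q in queries:
        # Similarity of this query token to every token of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, values))
                       for j in range(dim_v)])
    return output


# Hypothetical toy tokens: 3 HSI tokens and 2 LiDAR tokens, feature size 2.
hsi_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lidar_tokens = [[1.0, 0.0], [0.0, 1.0]]

# Each stream attends to the other, so information flows in both directions.
hsi_enriched = cross_attention(hsi_tokens, lidar_tokens, lidar_tokens)
lidar_enriched = cross_attention(lidar_tokens, hsi_tokens, hsi_tokens)
```

Each output token is a convex combination of the other modality's tokens, which is one way a model can draw on complementary information from the second stream.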
Our experimental results show that this technique
outperforms comparable existing methods, indicating
its potential for remote sensing applications. Future
work will focus on incorporating additional modalities
and on improving the model's handling of complex
datasets to broaden its applicability.
Towards Robust Multimodal Land Use Classification: A Convolutional Embedded Transformer