Towards Robust Multimodal Land Use Classification: A Convolutional
Embedded Transformer
Muhammad Zia Ur Rehman 1, Syed Mohammed Shamsul Islam 1, Anwaar UlHaq 2, David Blake 1 and Naeem Janjua 3
1 School of Science, Edith Cowan University, Australia
2 School of Engineering and Technology, Centre for Intelligent Systems, Central Queensland University, Australia
3 College of Science and Engineering, Flinders University, Australia
{m.ziaurrehman, syed.islam, d.blake}@ecu.edu.au, a.anwaarulhaq@cqu.edu.au, naeem.janjua@flinders.edu.au
Keywords:
Land Use Classification, Multimodal, Transformer, Cross-Modal Learning.
Abstract:
Multisource remote sensing data has gained significant attention in land use classification. However, effec-
tively extracting both local and global features from various modalities and fusing them to leverage their
complementary information remains a substantial challenge. In this paper, we address this by exploring the
use of transformers for simultaneous local and global feature extraction while enabling cross-modality learn-
ing to improve the integration of complementary information from HSI and LiDAR data modalities. We
propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands
while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal
convolutional transformer, which extracts both local and global features using a multi-scale convolutional em-
bedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local
and global features. Additionally, cross-modal learning is incorporated to effectively capture complementary
information from HSI and LiDAR modalities. Evaluation on the Trento dataset highlights the effectiveness of
the proposed approach, achieving an average accuracy of 99.04% and surpassing comparable methods.
1 INTRODUCTION
The use of remote sensing for land use classification is essential for effective environmental management, urban planning, and precision agriculture. It is a critical task that supports the execution of management plans through remote sensing technology.
Over the last decade, hyperspectral imaging (HSI), which has the ability to capture rich spectral information, has been widely used for land use classification (Hong et al., 2021) (Ghamisi et al., 2017). This allows HSI
to distinguish between different objects on the ground
using their unique spectral signatures. However, HSI
faces challenges in distinguishing objects that share
similar spectral signatures but different spatial struc-
tures, reducing its effectiveness in diverse landscapes.
The recent technological advancements in remote
sensing systems have significantly enhanced the pos-
sibilities of gathering multimodal remote sensing data
(Gómez-Chova et al., 2015). The utilization of mul-
timodal data has already proved its significance for
classification tasks in different fields (Tang et al.,
2022) (Ding et al., 2021) (Hermessi et al., 2021).
These advancements have shifted research in this field towards the multimodal domain, where the integration of HSI with other imaging modalities such as LiDAR provides complementary information in terms of height (Ding et al., 2022), thereby improving the classification of objects on the ground and facilitating precise and effective management and planning. However, the effective fusion of modalities, along with feature extraction, remains a challenge.
Traditional classification techniques, employed in
the early stages of research within this domain (Liao
et al., 2014), provide a baseline for further research
advancements. However, the complexity of HSI data
and the limitations associated with shallow feature ex-
traction reduce their effectiveness. Recently, con-
volutional neural networks (CNNs) have been widely
recognized in land use classification due to their dis-
criminative feature extraction capabilities (Xue et al.,
2021) (Du et al., 2021) (Zhang et al., 2021). Further-
more, encoder-decoder based architectures such as
EndNet (Hong et al., 2020) and CCR-Net (Wu et al.,
2021) have been employed to enhance feature extrac-
tion and the fusion of modalities, with their strength
lying in their compact architecture that facilitates ef-
ficient data processing and integration. Although
these techniques are good at extracting local features,
they lack global context and semantics among pixels
(Hong et al., 2021). To fill this gap, various studies
have integrated additional attention-based modules
with CNNs, which have demonstrated better perfor-
mance over conventional CNN-based models (Feng
et al., 2021) (Mohla et al., 2020). Despite these advancements in CNN-based frameworks, challenges persist in capturing global and long-range feature dependencies (Hong et al., 2022).
Vision Transformer (ViT), a variant of the trans-
former developed for natural language processing
(NLP) tasks, has emerged as a potential solution
to address limitations in extracting global infor-
mation through its unique self-attention mechanism
(Vaswani, 2017). Similar to conventional transformers, ViT generates patches and, through its self-attention mechanism, attends to each patch to capture global correlations among them. Several transformer-based techniques have been proposed to improve its feature extraction capabilities. However, transformers struggle to capture local information and features. To address this limitation, CNNs are being
integrated with transformers to complement and en-
hance their ability to process local details (Yu et al.,
2022) (Zhang et al., 2022), which shows improve-
ments. However, current studies lack interaction be-
tween local and global features that could provide bet-
ter semantic information, thus potentially improving
the performance.
To address this shortcoming, a transformer model
is proposed with a convolutional block embedded in
the encoder of the transformer. The embedded convo-
lutional block contributes towards capturing local fea-
tures, whereas the conventional transformer’s multi-
head self-attention (MHSA) captures global features.
Therefore, the proposed model has the capability to
capture both local and global features. Moreover, it
facilitates the interaction of local and global features
within the transformer, thus providing enriched semantic information and, as a result, improving classification accuracy. The main contributions of this
study are as follows:
1. A spatial feature enhancer module (SFEM) is pro-
posed to preserve spatial information for subse-
quent processing while enriching the spectral fea-
ture set.
2. A transformer model is proposed for HSI and Li-
DAR land use classification with a transformer
encoder that consists of a multiscale convolution
block, enabling the simultaneous extraction of lo-
cal and global features.
3. A cross-modal attention module is employed for feature interaction between the HSI and LiDAR modalities, which helps learn complementary information.
Several land use classification techniques are dis-
cussed in Section 2. Section 3 discusses the detailed
methodology of the proposed technique. Experimen-
tal results are presented in Section 4 while Section 5
concludes the study.
2 RELATED WORK
Effective land use classification remains a challenge
due to the heterogeneous nature of land. Researchers
have proposed several techniques to address the chal-
lenges associated with land use classification. Over
time, many traditional and deep learning-based tech-
niques for land use classification have been proposed.
Traditional techniques utilized filter and profile-based
approaches. In filter-based techniques, morphological
and Gabor filters are commonly used for feature ex-
traction (Jia et al., 2021) (Rasti and Ghamisi, 2020).
On the other hand, profile-based techniques are com-
binations of various filters that are used together to ex-
tract features. Morphological attribute profiles (Liao
et al., 2014) and extinction attribute profiles (Rasti
et al., 2017b) were commonly used. After extract-
ing features using the aforementioned techniques, fea-
tures are either stacked together and classified us-
ing different classifiers such as support vector ma-
chines (SVM) and random forest (RF) or classifiers
are applied independently, with the final decision be-
ing made through a voting-based decision strategy.
Recently, deep learning-based techniques have
overtaken traditional methods due to their superior
feature extraction capabilities and are extensively
used for land use classification. Generally, dual-
stream frameworks are proposed (Feng et al., 2019),
which include encoder-decoder-based architectures as
well. EndNet (Hong et al., 2020) and CCR-Net (Wu
et al., 2021) both follow the encoder-decoder archi-
tecture; for the mutual exchange of information be-
tween modalities, reconstruction-based fusion strate-
gies are adopted. Some proposed techniques are
composed of triple streams (Feng et al., 2019) (Li
et al., 2022), primarily to process modalities differ-
ently to extract spatial-spectral features from them.
Some advanced techniques utilize attention mecha-
nisms along with CNNs to extract robust features.
(Mohla et al., 2020) proposed FusAtNet, one of the
initial attention-based techniques employed for land
use classification. Specialized spatial and spectral
attention modules have been used along with CNN-
based feature extractors; the incorporation of atten-
tion modules showed improved classification perfor-
mance. Similarly, (Wang et al., 2022) proposed
modality-specific attention modules to highlight im-
portant features. To integrate features of different
modalities, a modality attention (MA) module is pro-
posed that provides better feature interaction between
modalities compared to traditional early and late fu-
sion strategies. (Fan et al., 2022) proposed composite
attention followed by self-calibrated convolution that
can extract multiscale features. To preserve the com-
plementary information of modalities, an attention-
based feature fusion (FF) module is employed that
adaptively fuses heterogeneous features. These tech-
niques have significantly improved classification per-
formance; yet, they do not adequately handle the vo-
luminous spectral information in HSI data due to the
intrinsic limitations of CNNs. This limitation results
in pixels with minor spectral differences not being ef-
fectively distinguishable. Transformers, with their ca-
pability to process sequential data, could effectively
distinguish between these pixels in spectral data.
(Ni et al., 2024) proposed a multiscale head selec-
tion transformer network with a hierarchical convo-
lutional feature extraction module. Features are ex-
tracted independently from both modalities and com-
bined together before further processing through mul-
tiscale 2D convolutional layers and a conventional
transformer. Finally, a classification decision is per-
formed using a CNN and transformer-based class to-
ken. (Zhang et al., 2023) proposed a transformer and
multiscale fusion network for joint HSI and LiDAR
classification. Initially, it extracts multiscale features
from both modalities using multiscale attention. Sub-
sequently, the extracted features are fused together us-
ing transformer-based attention calculation. Later on,
the multiscale and fused features are classified using
a specially designed multi-output module. (Roy et al.,
2023) introduced a multimodal fusion transformer for
classifying HSI and LiDAR data. In their approach,
LiDAR data is utilized as a learnable token, combined
with HSI tokens for feature learning. However, this
method does not completely integrate the valuable in-
formation from both modalities, which consequently
restricts the potential accuracy of the classification.
In addition to these, several transformer-based tech-
niques have been proposed for land use classification
(Roy et al., 2024) (Yang et al., 2024). Despite these
advancements, challenges still persist in the effective
interaction between local and global features during
the feature learning process and their fusion, which
is crucial for capturing the distinctive characteristics
arising from subtle differences in land cover types.
3 METHODOLOGY
In this study, a new transformer-based model is pro-
posed for multimodal land use classification that is
capable of extracting local and global features simul-
taneously. The framework of the proposed technique
is presented in Figure 1 and described in the following
subsections. Subsection 3.1 explains the preprocess-
ing steps, while Subsection 3.2 describes shallow fea-
ture extraction and tokenization required for the trans-
former model to further process the features. Subsec-
tion 3.3 discusses the proposed cross-modal convolu-
tional transformer.
3.1 Pre-Processing
The original HSI data can be denoted as $X_H \in \mathbb{R}^{H \times W \times D}$, and the LiDAR data can be represented as $X_L \in \mathbb{R}^{H \times W}$. The spatial dimensions, height and width, are represented by $H$ and $W$, respectively, while $D$ represents the spectral dimension of the HSI data. HSI data carries redundant spectral information across its hundreds of spectral bands. Therefore, principal component analysis (PCA) is used to select informative spectral bands from the original HSI data. The HSI data with reduced spectral dimensions is then represented as $X_H \in \mathbb{R}^{H \times W \times D_{red}}$, where $D_{red}$ is 30. Both the HSI and LiDAR data are further processed; padding is applied to preserve edge information, and patches of dimension $S \times S$, where $S$ is 11, are generated. The center pixel of each patch represents the class label. The resulting HSI and LiDAR patches are represented as $X_H^P \in \mathbb{R}^{S \times S \times D_{red}}$ and $X_L^P \in \mathbb{R}^{S \times S \times 1}$, respectively.
3.2 Spatial Feature Enhancer Module
and Tokenization
In this study, we propose a convolution-based spa-
tial feature enhancer module (SFEM) for initial fea-
ture extraction. This module utilizes pointwise con-
volutions, which are particularly important for HSI
data characterized by rich spectral information at each
spatial location. Pointwise convolutions facilitate the
learning of inter-band dependencies among different
spectral bands, a critical aspect in HSI where corre-
Figure 1: The proposed framework for land use classification using HSI and LiDAR modalities. The internal architecture of
the spatial feature enhancer module (SFEM) is shown in (a), the tokenizer is presented in (b), and the internal architecture of
the multi-layer perceptron (MLP) head is shown in (c).
lations between different spectral bands aid in under-
standing complex patterns within the data. Though
pointwise convolutions do not extract spatial features
directly, the enriched spectral features they produce
can indirectly enhance spatial features for subsequent
processing. Our motivation is to preserve spatial de-
tails while providing a rich feature set to downstream
modules, enabling more effective and sophisticated
feature learning. The pointwise convolutions apply
a 1×1 kernel to each spatial location, preserving the
original spatial dimensions of the data. This approach
enhances the overall framework’s capability to pro-
cess and synthesize the information across channels
without distorting the spatial information, unlike tech-
niques (Roy et al., 2023) (Roy et al., 2024) that do not
preserve the spatial information due to the use of large
kernel sizes in convolution layers. The pointwise con-
volution layers in the SFEM are followed by batch
normalization and a rectified linear unit (ReLU) layer
to stabilize the learning process and introduce neces-
sary non-linearity, respectively. The architecture of
SFEM is shown in Figure 1(a).
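A minimal PyTorch sketch of an SFEM-style block is given below, assuming the stacked pointwise convolution, batch normalization, and ReLU layers shown in Figure 1(a); the number of stacked layers and the output channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SFEM(nn.Module):
    """Spatial feature enhancer: 1x1 convolutions mix spectral bands per pixel
    while leaving the S x S spatial grid untouched."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),   # pointwise
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),  # pointwise
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):          # x: (B, C_in, S, S)
        return self.block(x)       # (B, C_out, S, S), spatial size preserved

# HSI patches with 30 PCA bands, LiDAR patches with 1 band (assumed widths)
sfem_hsi = SFEM(in_channels=30, out_channels=64)
sfem_lidar = SFEM(in_channels=1, out_channels=64)
feat_hsi = sfem_hsi(torch.randn(8, 30, 11, 11))     # -> (8, 64, 11, 11)
feat_lidar = sfem_lidar(torch.randn(8, 1, 11, 11))  # -> (8, 64, 11, 11)
```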
Initial feature extraction is performed through the
SFEM for both HSI and LiDAR modalities. For sub-
sequent processing of these extracted features in the
classification task through the transformer module,
tokenization is necessary. This tokenization is per-
formed by a tokenizer shown in Figure 1(b), which
converts feature maps into vectors by flattening them.
The 2D flattened feature map for HSI is represented as $X_H^{flat} \in \mathbb{R}^{xy \times z_h}$, and similarly, the flattened LiDAR features are represented as $X_L^{flat} \in \mathbb{R}^{xy \times z_l}$, where $x$ denotes the height, $y$ the width, and $z_h$, $z_l$ the number of spectral bands for HSI and LiDAR, respectively. The tokenization process employs two separate learnable weights, $W_1$ and $W_2$, one for each modality, initialized from a normal distribution through Xavier initialization (Zhao et al., 2022). These weights are pointwise multiplied with the feature maps, and softmax is applied to normalize the resulting scores. The output is then transposed and multiplied with the 2D flattened feature map. The tokenization process yields high-level semantic tokens for HSI and LiDAR. The tokenization of the HSI data is mathematically expressed as follows:

$x_{hsi} = \mathrm{softmax}(X_H^{flat} W_1)^T X_H^{flat}$   (1)
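A possible PyTorch realization of Equation 1 is sketched below; the number of tokens, the feature width, and the choice of applying the softmax over the spatial axis are assumptions made for illustration, and the same module would be instantiated separately for HSI and LiDAR with weights $W_1$ and $W_2$.

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Turn a flattened feature map X of shape (B, xy, z) into L semantic tokens,
    following x = softmax(X W)^T X from Equation 1."""
    def __init__(self, channels, num_tokens):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, num_tokens))
        nn.init.xavier_normal_(self.weight)        # Xavier (normal) initialization

    def forward(self, x_flat):                     # x_flat: (B, xy, z)
        scores = torch.softmax(x_flat @ self.weight, dim=1)   # (B, xy, L), softmax over pixels (assumed axis)
        tokens = scores.transpose(1, 2) @ x_flat               # (B, L, z)
        return tokens

# SFEM output flattened from (B, 64, 11, 11) to (B, 121, 64)
feat = torch.randn(8, 64, 11, 11)
x_flat = feat.flatten(2).transpose(1, 2)            # (8, 121, 64)
tokens = Tokenizer(channels=64, num_tokens=4)(x_flat)   # (8, 4, 64)
```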
3.3 Cross-Modal Convolutional
Transformer
The proposed cross-modal convolutional transformer
consists of two parts: a multiscale convolutional em-
bedded encoder and a cross-modal attention module.
Subsubsection 3.3.1 provides a detailed discussion
of the encoder, while Subsubsection 3.3.2 discusses
the cross-modal attention module for cross-modal feature
learning and the final classification.
3.3.1 Multi-Scale Convolution Embedded
Encoder
The semantic tokens generated by the tokenizer are
utilized by the transformer encoder to identify cor-
relations between tokens through a multi-head self-
attention mechanism, which comprises multiple self-
attention mechanisms. Each self-attention mech-
anism computes correlations between tokens using
three learnable weights: $W_q$ for queries, $W_k$ for keys, and $W_v$ for values. These weights are multiplied with the feature tokens and projected onto three linearly mapped matrices: $Q$ for queries, $K$ for keys, and $V$ for values. Using $Q$ and $K$, attention scores are then computed with the softmax function. The self-attention is mathematically represented as follows:

$SA = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{D}}\right)V$   (2)

In the case of multi-head self-attention, the projection dimension of the tokens is divided by the number of heads, i.e., $D = \frac{C}{h}$, where $h$ is the number of heads. After concatenation, the dimensions are restored by applying a projection matrix $W^O$ that combines the outputs from all heads back to the original dimensionality, ensuring that the multi-head self-attention effectively integrates diverse perspectives from each head. The multi-head self-attention is represented in Equation 3.

$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(SA_1, SA_2, \ldots, SA_h)W^O$   (3)

where $SA_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right)V_i$ represents the output of the $i$-th self-attention head, $Q_i$, $K_i$, and $V_i$ are the parts of $Q$, $K$, and $V$ that correspond to the $i$-th head, and $W^O$ is the output projection matrix that recombines the attention outputs into a single tensor. The MHSA computes correlations in a global manner but lacks correlation between local semantic tokens.
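Equations 2 and 3 describe standard multi-head self-attention; a compact PyTorch sketch, assuming an embedding width $C$ split evenly across $h$ heads, is shown below.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention of Equations 2-3: per-head scaled dot-product
    attention, concatenation of heads, and an output projection W_O."""
    def __init__(self, dim, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads          # D = C / h per head
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)        # output projection W_O

    def forward(self, x):                                 # x: (B, N, C)
        b, n, c = x.shape
        split = lambda t: t.view(b, n, self.heads, self.d).transpose(1, 2)  # (B, h, N, d)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)  # concatenate heads
        return self.w_o(out)

tokens = torch.randn(8, 5, 64)        # e.g. 4 semantic tokens + 1 cls token
out = MHSA(dim=64, heads=8)(tokens)   # (8, 5, 64)
```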
To address this issue, inspired by (Gulati et al.,
2020), a convolutional block is embedded in the trans-
former encoder after the multi-head self-attention to
find correlations between local tokens from those al-
ready processed through MHSA. The architecture of
the MSCE is shown in Figure 2. The adapted convo-
lution block (Khotimah et al., 2023) starts with the
pointwise convolution followed by dual stream 1D
convolutions that find correlations between tokens at
different scales for local semantics. These streams
utilize 1D dilated convolutions with kernel sizes of 3
and 5. Subsequently, batch normalization and a swish
activation function are applied. To enrich local se-
mantics further, the outputs of both convolutions are
merged using element-wise summation. Finally, the
merged output is passed through a pointwise convo-
lution again to transform the features back to their
original size. The convolutional block is sandwiched
between two feed-forward networks. Residual connections have also been used to avoid the vanishing gradient problem.
Figure 2: The architecture of the multi-scale convolutional
embedded encoder (MSCE). The left shows the transformer
encoder embedded with a convolutional block, while the
right shows the internal layer structure of the convolutional
block.
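As a rough illustration, the convolutional block embedded in the MSCE could be sketched in PyTorch as follows; only the layer types and the kernel sizes (3 and 5) are specified above, so the dilation rate, the padding, and the placement of the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Convolutional block of the MSCE: pointwise conv, two parallel 1D dilated
    convs (kernels 3 and 5) with BatchNorm + swish, summed and projected back."""
    def __init__(self, dim):
        super().__init__()
        self.pw_in = nn.Conv1d(dim, dim, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm1d(dim), nn.SiLU())                 # SiLU is the swish activation
        self.branch5 = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, padding=4, dilation=2),
            nn.BatchNorm1d(dim), nn.SiLU())
        self.pw_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                       # x: (B, N, C) token sequence
        y = x.transpose(1, 2)                   # (B, C, N) so the 1D convs run over tokens
        y = self.pw_in(y)
        y = self.branch3(y) + self.branch5(y)   # element-wise fusion of the two scales
        y = self.pw_out(y)                      # project back to the original width
        return x + y.transpose(1, 2)            # residual connection

tokens = torch.randn(8, 5, 64)
out = MultiScaleConvBlock(dim=64)(tokens)       # (8, 5, 64)
```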
3.3.2 Cross-Modal Feature Learning and
Classification
As both modalities have been treated separately thus
far, capturing information specific to their corre-
sponding modality, it is important to learn comple-
mentary details from each other. Cross learning can
improve the learning capability of the model to ef-
fectively utilize complementary information for final
classification (Chen et al., 2021), (Xue et al., 2022).
In the proposed cross-modal feature learning strat-
egy, classification tokens containing rich informa-
tion specific to each modality are exchanged between
modalities to learn the complementary information
present in each. In this process, the HSI classifica-
tion token attends to all LiDAR tokens, and the Li-
DAR classification token attends to all HSI tokens.
After this exchange of information, classification to-
kens are projected back to their respective modalities,
but now these classification tokens contain comple-
mentary information from the other modality. These
updated classification tokens interact with their own
tokens in the next encoder, thus passing learned information from the other modality, and this interaction further enriches the exchange of information. For
simplicity and to avoid confusion, we have explained
the cross-modal feature learning for HSI. This process
is also illustrated in Figure 3. As we know, LiDAR
tokens are of two types: feature tokens and classifica-
tion (cls) tokens. The cls token of LiDAR is replaced
with the HSI cls token. This gives a new set of tokens
that we call cross-modal tokens. Mathematically, the
new set of feature tokens is shown in Equation 4.

$X_{ct}^{hsi} = [\, f_l(x_{cls}^{hsi}) \,\|\, x^{lidar} \,]$,   (4)

where $f_l(\cdot)$ is the linear transformation applied to the HSI cls token for dimensionality alignment. Subsequently, cross-attention is performed between $x_{cls}^{hsi}$ and $X_{ct}^{hsi}$. The cls token ($x_{cls}^{hsi}$), which contains the information of the HSI feature tokens, is the query used for cross-attention. The cross-attention is mathematically represented as follows:

$q = x_{cls}^{hsi'} W_q, \quad k = X_{ct}^{hsi} W_k, \quad v = X_{ct}^{hsi} W_v,$
$A = \mathrm{softmax}\!\left(\frac{q k^T}{\sqrt{C/h}}\right), \quad \mathrm{CA}(X_{ct}^{hsi}) = A v$   (5)

where $W_q$, $W_k$, and $W_v$ are learnable weights. Similar to self-attention, multiple heads have been used for cross-attention. The multi-head cross-attention is mathematically shown in Equation 6, and the output of cross-attention is shown in Equation 7.

$y_{cls}^{hsi} = f_l(x_{cls}^{hsi}) + \mathrm{MHCA}(\mathrm{LN}([\, f_l(x_{cls}^{hsi}) \,\|\, x_{ft}^{lidar} \,]))$   (6)

$z^{hsi} = [\, g_l(y_{cls}^{hsi}) \,\|\, x_{ft}^{hsi} \,]$   (7)
The HSI classification token ($y_{cls}^{hsi}$) and the LiDAR classification token ($y_{cls}^{lidar}$), obtained through the process discussed above, are passed through a linear transformation layer $g_l(\cdot)$, as shown in Figure 3. Following that, these tokens are passed through two separate MLP heads, as depicted in Figure 1. Each MLP head is composed of two linear layers, a GeLU activation, and a dropout layer. The last linear layer employs softmax activation. The arrangement of the MLP layers is shown in Figure 1(c). The outputs of both MLP heads are added together for the final classification.
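A simplified PyTorch sketch of the cross-modal token exchange of Equations 4-7 is given below for the HSI branch; it reuses PyTorch's nn.MultiheadAttention for the MHCA step, and the equal token dimensions and single attention call are simplifying assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """HSI cls token attends to the LiDAR tokens (Eqs. 4-6) and is re-attached
    to the HSI feature tokens (Eq. 7)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.f_l = nn.Linear(dim, dim)     # f_l: align the HSI cls token
        self.g_l = nn.Linear(dim, dim)     # g_l: project the updated cls token back
        self.norm = nn.LayerNorm(dim)
        self.mhca = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hsi_tokens, lidar_tokens):
        hsi_cls, hsi_ft = hsi_tokens[:, :1], hsi_tokens[:, 1:]
        cls_aligned = self.f_l(hsi_cls)                               # f_l(x_cls^hsi)
        x_ct = torch.cat([cls_aligned, lidar_tokens[:, 1:]], dim=1)   # Eq. 4: replace LiDAR cls
        x_ct = self.norm(x_ct)                                        # LN(.)
        attn_out, _ = self.mhca(query=x_ct[:, :1], key=x_ct, value=x_ct)
        y_cls = cls_aligned + attn_out                                # Eq. 6
        return torch.cat([self.g_l(y_cls), hsi_ft], dim=1)            # Eq. 7

hsi = torch.randn(8, 5, 64)      # [cls | 4 feature tokens]
lidar = torch.randn(8, 5, 64)
z_hsi = CrossModalAttention(dim=64)(hsi, lidar)   # (8, 5, 64)
```

The symmetric LiDAR branch would swap the roles of the two token sets.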
4 EXPERIMENTAL DETAILS
AND RESULTS
This section presents the results obtained using the
proposed technique. The dataset employed for evalu-
ation is discussed in Subsection 4.1, while the experi-
mental setup is defined in Subsection 4.2. Subsection
Figure 3: Internal structure of the cross-modal attention
module, which jointly learns features from HSI and LiDAR
data to extract complementary information.
4.3 presents the classification results, including com-
parisons with other techniques, to demonstrate the ef-
fectiveness of the proposed approach. Subsection 4.4
discusses the ablation study to further evaluate the im-
pact of different parameters.
4.1 Dataset
The Trento dataset (Rasti et al., 2017a), utilized for evaluating the proposed technique, consists of multimodal data comprising HSI and LiDAR modalities. The dataset captures diverse land cover types over a rural area south of the city of Trento, Italy, collected using the AISA Eagle sensor for HSI and the Optech ALTM 3100EA sensor for LiDAR. The HSI data contains 63 spectral bands spanning the electromagnetic spectrum from 0.42 to 0.99 micrometers. The dataset is categorized into six distinct classes, with both HSI and LiDAR data covering 600 × 166 pixels. Detailed distribu-
tions of the training and testing samples used to assess
the proposed technique are presented in Table 1.
Table 1: Training and testing sample distribution of the Trento dataset (Rasti et al., 2017a).
No Classes Training Testing
1 Apple Trees 129 3905
2 Buildings 125 2778
3 Ground 105 374
4 Wood 154 8969
5 Vineyards 184 10317
6 Roads 122 3052
- Total 819 29395
4.2 Experimental Setup
The training of the proposed model was executed on
a system equipped with an Intel Core i7 processor,
complemented by 32GB of RAM and an 8GB GPU.
Experiments were conducted using a learning rate of
0.001 and a batch size of 64, aligning with the param-
eters commonly adopted in similar research studies.
The optimization was facilitated by the Adam opti-
mizer, selected for its robust performance in compa-
rable scenarios. This setup maintained a consistent
learning rate and batch size throughout the experi-
ments. The model was trained using the cross-entropy loss function, mathematically represented in Equation 8.
$L_{CE}(y, p) = -\sum_{i=1}^{C} y_i \log(p_i)$   (8)
where $y$ is the one-hot encoded true label vector, $p$ is the predicted probability distribution across $C$ classes, $y_i$ is the true label for class $i$, and $p_i$ is the predicted probability for class $i$.
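Under the stated settings (Adam optimizer, learning rate 0.001, batch size 64, cross-entropy loss), training could be sketched as follows; the model and data loader are placeholders, assuming the network takes an HSI patch and a LiDAR patch and returns the summed MLP-head logits.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs, device="cpu"):
    """Minimal training loop matching the stated setup: Adam, lr 0.001,
    cross-entropy loss (Equation 8); batch size 64 is set in the DataLoader."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.to(device).train()
    for _ in range(epochs):
        for hsi_patch, lidar_patch, label in train_loader:
            hsi_patch = hsi_patch.to(device)
            lidar_patch = lidar_patch.to(device)
            label = label.to(device)
            optimizer.zero_grad()
            logits = model(hsi_patch, lidar_patch)   # summed outputs of the two MLP heads
            loss = criterion(logits, label)
            loss.backward()
            optimizer.step()
```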
4.3 Classification Results
In this work, three evaluation metrics have been used
to assess the performance of the proposed technique
on the Trento dataset. The evaluation metrics are
overall accuracy, average accuracy, and kappa. Over-
all accuracy evaluates the performance by considering
the total correct predictions, whereas average accu-
racy and kappa account for class-wise correct predic-
tions.
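For reference, the three metrics can be computed from a confusion matrix as in the short NumPy sketch below; the kappa formulation follows its standard definition and is not taken from the paper.

```python
import numpy as np

def classification_metrics(conf):
    """conf[i, j] = number of samples of true class i predicted as class j."""
    total = conf.sum()
    oa = np.trace(conf) / total                                      # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))                   # average (class-wise) accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2    # chance agreement
    kappa = (oa - pe) / (1 - pe)                                     # Cohen's kappa
    return oa, aa, kappa

# Toy 3-class example
conf = np.array([[50, 2, 1],
                 [3, 45, 4],
                 [0, 5, 40]])
print(classification_metrics(conf))
```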
The results of the proposed technique are com-
pared against CNN-based (Hong et al., 2020) (Zhang
et al., 2018) and transformer-based approaches (Feng
et al., 2024) (Roy et al., 2023) (Roy et al., 2024). All
these techniques used data samples for training and
testing, as in Table 1. The performance comparison
of our proposed technique with the mentioned tech-
niques is presented in Table 2. The table depicts the
superiority of our proposed technique across all three
metrics used (highlighted in the table). We repeated
the experiments five times, reporting the mean and
standard deviation. Figure 4 shows a boxplot that
illustrates the deviation from the mean value. The
boxplot indicates that overall accuracy and kappa are
more stable compared to average accuracy.
Figure 4: Boxplot showing the variability and median val-
ues of overall accuracy, kappa, and average accuracy across
five experimental iterations.
The Cross-HL technique shows the second-best results in terms of overall accuracy and kappa, with a slightly lower average accuracy than PToP CNN. Furthermore, to conduct a thorough analysis,
class-wise accuracy is presented in Table 3, which
provides further insights into the results of the pro-
posed technique. The class-wise accuracies of our
proposed technique are consistent across all classes,
showcasing its ability to learn discriminative features.
The proposed technique attained the highest accu-
racy in two classes, namely Wood and Roads. It can
be observed from the table that all other techniques
show limitations in recognizing Buildings and Roads
classes.
Unlike other techniques, which perform well ei-
ther on Buildings or Roads class, both Cross-HL and
our proposed technique exhibit nearly equal accuracy
on these classes. However, our technique shows sig-
nificant improvements of 3.29% for Buildings and
4.51% for Roads compared to Cross-HL. MFT, de-
spite attaining the highest accuracy for the Buildings
class, demonstrated decreased accuracy in the Roads
class, reporting only 88.72%, which was the lowest
among all evaluated techniques. It can be noticed that
some of the other compared techniques perform well
on certain classes; however, they are unable to rec-
ognize difficult classes—such as Buildings, Ground,
and Roads—effectively, reducing their overall perfor-
mance. The results of our proposed method can also
be validated with a confusion matrix presented in Fig-
ure 5, which provides similar insights; Buildings and
Roads classes are the most misclassified, likely due
Table 2: Overall performance comparison with other techniques. EndNet and PToP CNN are CNN-based; S2EFT, MFT, Cross-HL, and ours are transformer-based.
Metric | EndNet | PToP CNN | S2EFT | MFT | Cross-HL | Ours
OA | 94.17 | 98.34 | 98.45 | 97.76 ± 0.40 | 98.69 ± 0.31 | 99.40 ± 0.22
AA | 93.88 | 97.53 | 97.62 | 95.91 ± 0.41 | 97.47 ± 0.30 | 99.04 ± 0.33
Kappa | 92.22 | 97.79 | 97.92 | 97.00 ± 0.53 | 98.25 ± 0.26 | 99.19 ± 0.30
Table 3: Class-wise accuracy comparison with other techniques. EndNet and PToP CNN are CNN-based; S2EFT, MFT, Cross-HL, and ours are transformer-based.
Class | EndNet | PToP CNN | S2EFT | MFT | Cross-HL | Ours
Apple Tree | 88.19 | 99.60 | 97.95 | 98.23 ± 0.38 | 99.32 ± 0.31 | 98.43 ± 0.57
Building | 98.49 | 93.90 | 97.77 | 99.34 ± 0.02 | 95.32 ± 1.42 | 98.61 ± 1.03
Ground | 95.19 | 100 | 98.40 | 89.84 ± 9.00 | 97.62 ± 0.17 | 99.36 ± 0.52
Wood | 99.30 | 99.27 | 99.73 | 99.82 ± 0.26 | 99.89 ± 0.10 | 100 ± 0.00
Vineyard | 91.96 | 100 | 99.50 | 99.93 ± 0.05 | 99.97 ± 0.04 | 99.74 ± 0.44
Road | 90.14 | 97.28 | 92.37 | 88.72 ± 0.94 | 94.17 ± 1.36 | 98.68 ± 0.67
to the similar spectral information arising from their shared construction materials. The Apple Trees class also shows
some instances of misclassification with the Ground
and Vineyard classes. Woods and Vineyard classes
are accurately classified.
Figure 5: The confusion matrix of the proposed technique
illustrates the misclassification between different classes.
This matrix represents the results achieved from one of the
five experimental runs, offering insight into the model’s
classification performance.
Moreover, we have used a t-SNE plot to visual-
ize the predictions on the test set. It also displays
the number of test samples per class through the cluster size; a larger cluster indicates a larger number of test samples for that particular class. The trend of misclassifica-
tion can also be seen through the t-SNE plot in Figure
6, where the Buildings class (purple) overlaps with
the Roads class (yellow), depicting misclassification
between them. Both the confusion matrix and t-SNE
plots support our results presented in Tables 2 and 3.
Figure 7 visually presents the HSI and LiDAR modal-
ities along with the ground truth. It also shows the
classification map generated using the proposed tech-
nique in Figure 7(d).
Figure 6: t-SNE visualization of predicted class distribu-
tions from the test set. Colors indicate predicted classes:
Apple tree (blue), Building (purple), Ground (pink), Wood
(red), Vineyard (orange), and Roads (yellow). (Best viewed
in color).
4.4 Analysis of the Impact of Patch Size
and Attention Heads
Several experiments were conducted to assess the im-
pact of varying patch sizes and the number of atten-
tion heads in the transformer encoder on model per-
formance. The patch sizes considered were 7, 9, 11,
and 13. The performance based on patch size can be
analyzed using a radar chart, where each axis rep-
resents a patch size, as shown in Figure 8(a). The
analysis revealed that the optimal performance was
achieved with a patch size of 11, as it is closest to
Figure 7: (a) shows the pseudo-color map of the HSI data
using selected bands (20, 15, 5), (b) shows the grayscale vi-
sualization of the LiDAR data, (c) shows the ground truth
available for six unique classes, (d) represents the classifi-
cation map generated using the proposed technique.
the outermost edge (indicating higher values) on the
radar chart for overall accuracy, average accuracy, and
kappa. Similarly, for the number of attention heads, each axis of the radar chart in Figure 8(b) represents a different head count, i.e., 2, 4, 6, or 8, and the chart shows that optimal performance is attained using 8 attention heads.
5 CONCLUSION
In this study, we presented a novel transformer-based
technique for multimodal land use classification, uti-
lizing HSI and LiDAR modalities. The proposed
dual-stream architecture effectively extracts features
from both modalities, with the cross-modal convolu-
tional transformer demonstrating its ability to learn
local and global features. The incorporation of a
cross-modal attention module enables joint learning
between modalities, utilizing complementary infor-
mation to extract more discriminative features. The
proposed method enhances classification accuracy for
impactful applications in environmental monitoring,
urban planning, and other areas where precise land
use classification is essential.
Our experimental results demonstrate the superior
performance of this technique over comparable exist-
ing methods, showcasing its potential to advance the
field of remote sensing. Future work will focus on in-
corporating additional modalities and further improv-
ing the model to handle complex datasets, ensuring
broader applicability and continued advancements in
performance.
REFERENCES
Chen, C.-F. R., Fan, Q., and Panda, R. (2021). Crossvit:
Cross-attention multi-scale vision transformer for im-
age classification. In Proceedings of the IEEE/CVF
international conference on computer vision, pages
357–366.
Ding, K., Lu, T., Fu, W., Li, S., and Ma, F. (2022). Global–
local transformer network for hsi and lidar data joint
classification. IEEE Transactions on Geoscience and
Remote Sensing, 60:1–13.
Ding, Z., Zhou, D., Li, H., Hou, R., and Liu, Y.
(2021). Siamese networks and multi-scale local ex-
trema scheme for multimodal brain medical image
fusion. Biomedical Signal Processing and Control,
68:102697.
Du, X., Zheng, X., Lu, X., and Doudkin, A. A. (2021). Mul-
tisource remote sensing data classification with graph
fusion network. IEEE Transactions on Geoscience
and Remote Sensing, 59(12):10062–10072.
Fan, Y., Qian, Y., Qin, Y., Wan, Y., Gong, W., Chu, Z., and
Liu, H. (2022). Mslaenet: Multiscale learning and at-
tention enhancement network for fusion classification
of hyperspectral and lidar data. IEEE Journal of Se-
lected Topics in Applied Earth Observations and Re-
mote Sensing, 15:10041–10054.
Feng, M., Gao, F., Fang, J., and Dong, J. (2021). Hy-
perspectral and lidar data classification based on lin-
ear self-attention. In 2021 IEEE International Geo-
science and Remote Sensing Symposium IGARSS,
pages 2401–2404.
Feng, Q., Zhu, D., Yang, J., and Li, B. (2019). Multisource
hyperspectral and lidar data fusion for urban land-use
mapping based on a modified two-branch convolu-
tional neural network. ISPRS International Journal
of Geo-Information, 8(1):28.
Feng, Y., Zhu, J., Song, R., and Wang, X. (2024). S2eft:
Spectral-spatial-elevation fusion transformer for hy-
perspectral image and lidar classification. Knowledge-
Based Systems, 283:111190.
Ghamisi, P., Yokoya, N., Li, J., Liao, W., Liu, S., Plaza, J.,
Rasti, B., and Plaza, A. (2017). Advances in hyper-
spectral image and signal processing: A comprehen-
sive overview of the state of the art. IEEE Geoscience
and Remote Sensing Magazine, 5(4):37–78.
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu,
J., Han, W., Wang, S., Zhang, Z., Wu, Y., et al. (2020).
Conformer: Convolution-augmented transformer for
speech recognition. arXiv preprint arXiv:2005.08100.
Gómez-Chova, L., Tuia, D., Moser, G., and Camps-Valls, G. (2015). Multimodal classification of remote sensing images: A review and future directions. Proceedings of the IEEE, 103(9):1560–1584.
Figure 8: Radar charts depicting the analysis of model performance: (a) shows the impact of patch size, while (b) illustrates
the effect of the number of attention heads. The chart highlights optimal performance at a patch size of 11 and 8 attention
heads. (Best viewed in color).
Hermessi, H., Mourali, O., and Zagrouba, E. (2021). Mul-
timodal medical image fusion review: Theoretical
background and recent advances. Signal Processing,
183:108036.
Hong, D., Gao, L., Hang, R., Zhang, B., and Chanussot,
J. (2020). Deep encoder–decoder networks for clas-
sification of hyperspectral and lidar data. IEEE Geo-
science and Remote Sensing Letters, 19:1–5.
Hong, D., Han, Z., Yao, J., Gao, L., Zhang, B., Plaza, A.,
and Chanussot, J. (2021). Spectralformer: Rethinking
hyperspectral image classification with transformers.
IEEE Transactions on Geoscience and Remote Sens-
ing, 60:1–15.
Hong, D., Han, Z., Yao, J., Gao, L., Zhang, B., Plaza, A.,
and Chanussot, J. (2022). Spectralformer: Rethinking
hyperspectral image classification with transformers.
IEEE Transactions on Geoscience and Remote Sens-
ing, 60:1–15.
Jia, S., Zhan, Z., Zhang, M., Xu, M., Huang, Q., Zhou, J.,
and Jia, X. (2021). Multiple feature-based superpixel-
level decision fusion for hyperspectral and lidar data
classification. IEEE Transactions on Geoscience and
Remote Sensing, 59(2):1437–1452.
Khotimah, W. N., Bennamoun, M., Boussaid, F., Xu, L.,
Edwards, D., and Sohel, F. (2023). Mce-st: Classi-
fying crop stress using hyperspectral data with a mul-
tiscale conformer encoder and spectral-based tokens.
International Journal of Applied Earth Observation
and Geoinformation, 118:103286.
Li, J., Ma, Y., Song, R., Xi, B., Hong, D., and Du, Q. (2022).
A triplet semisupervised deep network for fusion clas-
sification of hyperspectral and lidar data. IEEE Trans-
actions on Geoscience and Remote Sensing, 60:1–13.
Liao, W., Pižurica, A., Bellens, R., Gautama, S., and
Philips, W. (2014). Generalized graph-based fusion of
hyperspectral and lidar data using morphological fea-
tures. IEEE Geoscience and Remote Sensing Letters,
12(3):552–556.
Mohla, S., Pande, S., Banerjee, B., and Chaudhuri, S.
(2020). Fusatnet: Dual attention based spectrospatial
multimodal fusion network for hyperspectral and lidar
classification. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 92–93.
Ni, K., Wang, D., Zheng, Z., and Wang, P. (2024). Mhst:
Multiscale head selection transformer for hyperspec-
tral and lidar classification. IEEE Journal of Selected
Topics in Applied Earth Observations and Remote
Sensing.
Rasti, B. and Ghamisi, P. (2020). Remote sensing image
classification using subspace sensor fusion. Informa-
tion Fusion, 64:121–130.
Rasti, B., Ghamisi, P., and Gloaguen, R. (2017a). Hy-
perspectral and lidar fusion using extinction pro-
files and total variation component analysis. IEEE
Transactions on Geoscience and Remote Sensing,
55(7):3997–4007.
Rasti, B., Ghamisi, P., Plaza, J., and Plaza, A. (2017b). Fu-
sion of hyperspectral and lidar data using sparse and
low-rank component analysis. IEEE Transactions on
Geoscience and Remote Sensing, 55(11):6354–6365.
Roy, S. K., Deria, A., Hong, D., Rasti, B., Plaza, A., and
Chanussot, J. (2023). Multimodal fusion transformer
for remote sensing image classification. IEEE Trans-
actions on Geoscience and Remote Sensing, 61:1–20.
Roy, S. K., Sukul, A., Jamali, A., Haut, J. M., and Ghamisi,
P. (2024). Cross hyperspectral and lidar attention
transformer: An extended self-attention for land use
and land cover classification. IEEE Transactions on
Geoscience and Remote Sensing.
Tang, W., He, F., Liu, Y., and Duan, Y. (2022). Matr: Mul-
timodal medical image fusion via multiscale adaptive
transformer. IEEE Transactions on Image Processing,
31:5134–5149.
Vaswani, A. (2017). Attention is all you need. Advances in
Neural Information Processing Systems.
Wang, X., Feng, Y., Song, R., Mu, Z., and Song, C. (2022).
Multi-attentive hierarchical dense fusion net for fusion
classification of hyperspectral and lidar data. Informa-
tion Fusion, 82:1–18.
Wu, X., Hong, D., and Chanussot, J. (2021). Convolutional
neural networks for multimodal remote sensing data
classification. IEEE Transactions on Geoscience and
Remote Sensing, 60:1–10.
Xue, Z., Tan, X., Yu, X., Liu, B., Yu, A., and Zhang, P.
(2022). Deep hierarchical vision transformer for hy-
perspectral and lidar data classification. IEEE Trans-
actions on Image Processing, 31:3095–3110.
Xue, Z., Yu, X., Tan, X., Liu, B., Yu, A., and Wei, X.
(2021). Multiscale deep learning network with self-
calibrated convolution for hyperspectral and lidar data
collaborative classification. IEEE Transactions on
Geoscience and Remote Sensing, 60:1–16.
Yang, B., Wang, X., Xing, Y., Cheng, C., Jiang, W., and
Feng, Q. (2024). Modality fusion vision transformer
for hyperspectral and lidar data collaborative classifi-
cation. IEEE Journal of Selected Topics in Applied
Earth Observations and Remote Sensing.
Yu, Y., Jiang, T., Gao, J., Guan, H., Li, D., Gao, S., Tang, E.,
Wang, W., Tang, P., and Li, J. (2022). Capvit: Cross-
context capsule vision transformers for land cover
classification with airborne multispectral lidar data.
International Journal of Applied Earth Observation
and Geoinformation, 111:102837.
Zhang, M., Gao, F., Zhang, T., Gan, Y., Dong, J., and Yu,
H. (2023). Attention fusion of transformer-based and
scale-based method for hyperspectral and lidar joint
classification. Remote Sensing, 15(3):650.
Zhang, M., Li, W., Du, Q., Gao, L., and Zhang, B. (2018).
Feature extraction for classification of hyperspectral
and lidar data using patch-to-patch cnn. IEEE trans-
actions on cybernetics, 50(1):100–111.
Zhang, M., Li, W., Tao, R., Li, H., and Du, Q. (2021). Infor-
mation fusion for classification of hyperspectral and
lidar data using ip-cnn. IEEE Transactions on Geo-
science and Remote Sensing, 60:1–12.
Zhang, Y., Peng, Y., Tu, B., and Liu, Y. (2022). Local
information interaction transformer for hyperspectral
and lidar data classification. IEEE Journal of Se-
lected Topics in Applied Earth Observations and Re-
mote Sensing, 16:1130–1143.
Zhao, G., Ye, Q., Sun, L., Wu, Z., Pan, C., and Jeon,
B. (2022). Joint classification of hyperspectral and
lidar data using a hierarchical cnn and transformer.
IEEE Transactions on Geoscience and Remote Sens-
ing, 61:1–16.