
like edges and corners, while subsequent layers inter-
pret these elements to recognize objects. This hierar-
chical learning process allows Convolutional Neural
Networks (CNNs) to excel at capturing global con-
texts and spatial relationships. Despite these capabili-
ties, CNNs can struggle to preserve very fine-grained
details like small textures or subtle variations in color
due to the pooling operations that often follow convo-
lutional layers.
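The loss of fine detail under pooling can be illustrated with a toy example. The sketch below is not taken from any specific CNN; it assumes non-overlapping 2x2 max pooling, and the helper name `max_pool_2x2` is hypothetical. It shows a fine checkerboard texture and a uniform patch collapsing to the same pooled map.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even side lengths."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 patches: a fine checkerboard texture and a uniform patch.
checkerboard = np.indices((4, 4)).sum(axis=0) % 2  # alternating 0/1 pixels
uniform = np.ones((4, 4), dtype=checkerboard.dtype)

pooled_texture = max_pool_2x2(checkerboard)
pooled_uniform = max_pool_2x2(uniform)

# Every 2x2 window of the checkerboard contains a 1, so both patches
# pool to the same 2x2 map and the texture can no longer be told apart.
print(np.array_equal(pooled_texture, pooled_uniform))
```

Local descriptors such as SIFT are computed from image gradients around keypoints rather than from pooled activations, which is one reason they may retain cues of this kind.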
In this work, we aim to investigate and quantify
the potential benefits of synergizing the complemen-
tary strengths of handcrafted SIFT descriptors and
features learned from CNNs. Fusing these features
is intended to leverage both the local detailed cues
provided by SIFT and the global context captured
by CNNs. Specifically, we investigate three fusion
strategies: a straightforward early fusion approach,
and novel late and mid-level fusion approaches inte-
grating attention mechanisms. Attention mechanisms
enable neural networks to prioritize informative input
elements by assigning them weights that reflect their
relative importance. The late fusion approach uses
attention to dynamically weigh and integrate salient
features from both the CNN and SIFT modalities before
making final classification decisions. The mid-level
fusion approach generates two distinct feature
maps: one prioritizing global context and another
locally-attended feature map weighted according to
SIFT features, emphasizing local details. Our experimental
study on the real-world EuroSAT dataset reveals that
the different fusion approaches vary in effectiveness.
Our study also suggests that, while the
prevalent fine-tuning of pre-trained models remains
a powerful tool for LULC classification, alternatives
such as integrating handcrafted and CNN-learned fea-
tures warrant exploration.
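To make the difference between early fusion and attention-based late fusion concrete, the following minimal NumPy sketch may help. It is an illustration under stated assumptions, not the architecture evaluated in this paper: both feature vectors are assumed to be already projected to a common dimension `d`, the random vectors stand in for real CNN and SIFT features, and the scoring vector `w` and the function names are hypothetical placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def early_fusion(cnn_feat, sift_feat):
    """Early fusion: concatenate the two feature vectors."""
    return np.concatenate([cnn_feat, sift_feat])

def attention_late_fusion(cnn_feat, sift_feat, w):
    """Score each modality with w, then return the softmax-weighted sum."""
    feats = np.stack([cnn_feat, sift_feat])  # shape (2, d)
    alpha = softmax(feats @ w)               # one weight per modality, sums to 1
    return alpha @ feats, alpha              # fused vector of shape (d,)

d = 8
cnn_feat = rng.normal(size=d)   # stand-in for a CNN embedding
sift_feat = rng.normal(size=d)  # stand-in for aggregated SIFT features (assumed projected to d dims)
w = rng.normal(size=d)          # hypothetical scoring vector, learned in practice

early = early_fusion(cnn_feat, sift_feat)
fused, alpha = attention_late_fusion(cnn_feat, sift_feat, w)
print(early.shape, fused.shape, alpha)
```

In this sketch, early fusion leaves the relative weighting of the two modalities to the downstream classifier, whereas the attention weights `alpha` make that weighting explicit and input-dependent.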
The remainder of this paper is organized as follows.
Section 2 provides a concise overview of previous
research. Section 3 introduces our feature fusion
approaches. Section 4 presents a comparative experi-
mental analysis. Finally, Section 5 concludes the pa-
per.
2 RELATED WORK
The EuroSAT dataset (Helber et al., 2018) is a widely
recognized and extensively used dataset for LULC
classification. It includes 27,000 geotagged image
patches, each measuring 64x64 pixels with
a spatial resolution of 10 meters. The dataset com-
prises ten distinct classes, with each class including
2,000 to 3,000 images. As illustrated in Figure 1,
these classes represent a diverse range of land use and
land cover types. For brevity, we focus here on
approaches that are similar to ours or that use
EuroSAT. Existing remote sensing image classification
approaches can be broadly divided into two families:
Machine Learning (ML)-based and Deep Learning
(DL)-based methods.
The studies by Chen and Tian (Chen and Tian, 2015)
and Thakur and Panse (Thakur and Panse, 2022) are
representative of ML-based approaches. (Chen and
Tian, 2015) introduced the Pyramid of Spatial Relations
(PSR) model, designed to incorporate both relative
and complete spatial information into the Bag of
Visual Words (BoVW) framework. Experiments
conducted on a high-resolution remote sensing im-
age revealed that the PSR model achieves an aver-
age classification accuracy of 89.1%. In (Thakur and
Panse, 2022), the performance of four machine learn-
ing algorithms was evaluated on the EuroSAT dataset:
Decision Tree (DT), K-Nearest Neighbour (KNN),
Support Vector Machine (SVM), and Random Forest
(RF). The study revealed distinct performance levels
among the algorithms: RF achieved the highest over-
all accuracy of 56.70%, significantly outperforming
DT and KNN.
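For readers unfamiliar with the BoVW framework mentioned above, the sketch below shows its core step: each local descriptor is assigned to its nearest visual word and the image is summarized as a normalized histogram of word counts. This is a generic illustration with hypothetical toy values, not the PSR model; real SIFT descriptors have 128 dimensions and the codebook is usually obtained by clustering.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and count occurrences."""
    # Squared Euclidean distance between every descriptor and every codeword.
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    # L1-normalise so images with different descriptor counts are comparable.
    return hist / hist.sum()

# Toy 2-D descriptors and a 3-word codebook (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 0.1], [1.1, -0.1], [0.0, 0.8]])

hist = bovw_histogram(descriptors, codebook)
print(hist)  # [0.25 0.5  0.25]
```

The histogram records how often each visual word occurs but not where, which is the spatial information that models such as PSR aim to reintroduce.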
The studies (Temenos et al., 2023), (Dewangkoro
and Arymurthy, 2021), (Helber et al., 2019), (Wang
et al., 2024) and (Neumann et al., 2020) are rep-
resentative of DL-based approaches. In (Temenos
et al., 2023), the authors introduce an interpretable
DL framework for LULC classification using SHapley
Additive exPlanations (SHAP). They employ
a compact CNN model for image classification, fol-
lowed by feeding the results to a SHAP deep ex-
plainer, achieving an overall accuracy of 94.72% on
EuroSAT. The approach in (Dewangkoro and Ary-
murthy, 2021) utilizes different CNN architectures for
feature extraction, including VGG19, ResNet50, and
InceptionV3. These extracted features are then recal-
ibrated using the Channel Squeeze & Spatial Excita-
tion (sSE) block, with Twin SVM (TWSVM) serving
as classifier, achieving an accuracy of 94.39% on Eu-
roSAT. In (Helber et al., 2019), various CNN archi-
tectures were compared, including a shallow CNN,
a ResNet50-based model, and a GoogLeNet-based
model. The achieved classification accuracies on Eu-
roSAT were 89.03%, 98.57%, and 98.18%, respec-
tively. (Neumann et al., 2020) explored in-domain
fine-tuning using five diverse remote sensing datasets
and the ResNet50V2 architecture. They demonstrated
that models fine-tuned on in-domain datasets
significantly outperform those pre-trained on
general-purpose datasets like ImageNet.
The pre-trained ResNet50V2 fine-tuned on in-domain
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence