
free images. This model employs multimodal diffu-
sion for training and sequential inference, integrating
multi-temporal information in a time-invariant man-
ner. Experiments conducted on the four bands of the
public SEN12MS-CR-TS dataset demonstrated that
this model outperforms other approaches in the lit-
erature, highlighting its flexibility in processing se-
quences of arbitrary length.
Zhao et al. (2023) propose a CNN-based method
that takes advantage of radio frequency signals in
the ultra-high and super-high frequency bands, allow-
ing it to “see” through clouds to assist in image re-
construction. This innovative multimodal and multi-
temporal approach demonstrated effectiveness in pro-
ducing cloud-free images in experiments with public
satellite data.
Ebel et al. (2023) present UnCRtainTS, a mul-
titemporal cloud removal method that combines
attention-based features with specialized architec-
tures to predict multivariate uncertainties. Experi-
ments conducted on two public datasets showed this
approach’s high effectiveness in reconstructing im-
ages obscured by clouds.
The study by Wang et al. (2023) introduces a
cloud removal algorithm based on (i) time-series ref-
erence imagery, (ii) selection of similar pixels through
weighted temporal and spectral distances, and (iii)
residual image estimation. The algorithm creates two
“buffer zones” around clouded areas, enabling auto-
matic selection of an “optimal” set of time-series ref-
erence images. Experiments across four diverse lo-
cations, such as urban, rural, and humid areas, which
demonstrated the model’s quantitative effectiveness,
adaptability to varying cloud sizes, and superior per-
formance compared to other methods, with efficient
computational time that makes it suitable for large
datasets.
3 MATERIAL AND METHODS
3.1 Image Dataset
For our experiments, we used the multitemporal
SEN2 MTC New dataset, a heterogeneous collection
of images from various Earth regions (Huang and Wu,
2022). This dataset consists of 50 tiles, each divided
into 256×256 patches across four bands: Red, Green,
Blue, and Near Infrared, including both thin and thick
cloud coverage. Areas with thin clouds contain more
land information, which is crucial for the reconstruc-
tion process; however, thin clouds pose challenges in
cloud segmentation, potentially impacting cloud re-
moval accuracy. In contrast, thick clouds simplify
segmentation, but the land information is more lim-
ited in these images.
The dataset contains 2,380 image patches for
training, 350 for validation, and 687 for testing, with
each patch containing pairs of cloud-covered images
and their corresponding cloud-free counterparts. For
our experiments, we kept the same splits as in
the original dataset. Figure 1 shows samples of the
patches in the dataset.
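To make the patch and split conventions above concrete, the following is a minimal sketch of a loader-side sanity check for paired cloudy/cloud-free patches. The function and variable names are illustrative assumptions, not the dataset's actual API or file layout.

```python
# Sketch: validate one (cloudy, cloud-free) patch pair for a
# SEN2_MTC_New-style dataset. Names and layout are illustrative
# assumptions, not the dataset's actual structure.
import numpy as np

PATCH_SIZE = 256
NUM_BANDS = 4  # Red, Green, Blue, Near Infrared

def load_patch_pair(cloudy: np.ndarray, cloud_free: np.ndarray):
    """Check that both patches have the expected (bands, H, W) shape."""
    for name, img in (("cloudy", cloudy), ("cloud_free", cloud_free)):
        if img.shape != (NUM_BANDS, PATCH_SIZE, PATCH_SIZE):
            raise ValueError(
                f"{name} patch has shape {img.shape}, "
                f"expected ({NUM_BANDS}, {PATCH_SIZE}, {PATCH_SIZE})")
    return cloudy, cloud_free

# Split sizes as reported for the dataset.
splits = {"train": 2380, "val": 350, "test": 687}
```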
3.2 DiffCR
Zou et al. (2024) introduced DiffCR, a diffusion
model for cloud removal that generates Gaussian
noise from cloud-free images and uses this noise to
produce new synthetic cloudless images for a given
cloudy input image. This Gaussian noise represents
a latent space where encoding and decoding occur
through a U-Net architecture (Ronneberger et al.,
2015). A key innovation in DiffCR is the integra-
tion of the Time and Condition Fusion Block (TCF-
Block) in place of traditional transformer mechanisms
used in latent diffusion models. TCFBlock reduces
computational costs and enhances the model’s per-
formance on cloud removal tasks by improving the
visual correspondence between the generated image
and the ground truth.
DiffCR comprises three main components: (i) the
condition encoder, (ii) the time encoder, and (iii) the
denoising autoencoder. The condition and time en-
coders extract spatial and multiscale features from
clouded images and incorporate temporal features
based on noise levels from the diffusion model. These
features then guide the denoising autoencoder, aiding
in the gradual reduction of noise to create clear im-
ages. The authors emphasize that the choice of loss
function is essential in directing the generation of re-
alistic, cloud-free synthetic images, a motivation that
also supports this research.
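The conditional sampling described above can be summarized schematically: a denoiser, guided by features from the condition and time encoders, gradually turns Gaussian noise into a cloud-free estimate. All components below are toy placeholders standing in for the real networks; this is not the authors' implementation.

```python
# Schematic of DiffCR-style conditional diffusion sampling. The
# encoders and the denoising step are toy placeholders, not the
# authors' architecture.
import numpy as np

rng = np.random.default_rng(0)

def condition_encoder(cloudy):
    # Placeholder for spatial/multiscale feature extraction
    # from the cloudy input image.
    return cloudy.mean()

def time_encoder(t, num_steps):
    # Placeholder temporal embedding tied to the noise level.
    return t / num_steps

def denoise_step(x, cond_feat, t_feat):
    # Toy "denoising autoencoder": shrink the current noise while
    # mixing in information from the conditioning features.
    return 0.9 * x + 0.1 * (cond_feat + t_feat)

def sample(cloudy, num_steps=10):
    x = rng.standard_normal(cloudy.shape)  # start from Gaussian noise
    cond = condition_encoder(cloudy)
    for t in range(num_steps, 0, -1):      # iterate from noisy to clean
        x = denoise_step(x, cond, time_encoder(t, num_steps))
    return x

out = sample(np.ones((4, 8, 8)))
```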
3.3 Optimizers
Due to the complexity of deep learning models, espe-
cially with regard to the large number of trainable pa-
rameters, optimizers play a crucial role in the learning
process. In general terms, optimizers iteratively ad-
just model weights, helping guide the model toward
an efficient and optimal solution (Ruder, 2017).
Stochastic Gradient Descent (SGD) (Robbins and
Monro, 1951) is an optimization algorithm that up-
dates weights by moving in the opposite direction of
the loss function's gradient. In general terms, Equation 1 defines the SGD weight update:
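As a minimal numerical illustration of this update rule, written here in generic notation (w ← w − η∇L(w), which may differ from the paper's exact symbols), SGD on a simple quadratic loss converges to its minimum:

```python
# Minimal SGD sketch: minimize L(w) = (w - 3)^2, whose gradient is
# dL/dw = 2 * (w - 3). Notation is illustrative only.
def grad(w):
    return 2.0 * (w - 3.0)

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)  # step opposite to the gradient direction
    return w

w_final = sgd(w=0.0)  # converges toward the minimum at w = 3
```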