
free images. This model employs multimodal diffu-
sion for training and sequential inference, integrating
multi-temporal information in a time-invariant man-
ner. Experiments conducted on the four bands of the
public SEN12MS-CR-TS dataset demonstrated that
this model outperforms other approaches in the lit-
erature, highlighting its flexibility in processing se-
quences of arbitrary length.
Zhao et al. (2023) propose a CNN-based method
that takes advantage of radio frequency signals in
the ultra-high and super-high frequency bands, allow-
ing it to “see” through clouds to assist in image re-
construction. This innovative multimodal and multi-
temporal approach demonstrated effectiveness in pro-
ducing cloud-free images in experiments with public
satellite data.
Ebel et al. (2023) present UnCRtainTS, a mul-
titemporal cloud removal method that combines
attention-based features with specialized architec-
tures to predict multivariate uncertainties. Experi-
ments conducted on two public datasets showed this
approach’s high effectiveness in reconstructing im-
ages obscured by clouds.
The study by Wang et al. (2023) introduces a
cloud removal algorithm based on (i) time-series ref-
erence imagery, (ii) selection of similar pixels through
weighted temporal and spectral distances, and (iii)
residual image estimation. The algorithm creates two
“buffer zones” around clouded areas, enabling auto-
matic selection of an “optimal” set of time-series ref-
erence images. Experiments across four diverse lo-
cations, such as urban, rural, and humid areas, which
demonstrated the model’s quantitative effectiveness,
adaptability to varying cloud sizes, and superior per-
formance compared to other methods, with efficient
computational time that makes it suitable for large
datasets.
3 MATERIAL AND METHODS
3.1 Image Dataset
For our experiments, we used the multitemporal
SEN2 MTC New dataset, a heterogeneous collection
of images from various Earth regions (Huang and Wu,
2022). This dataset consists of 50 tiles, each divided
into 256×256 patches across four bands: Red, Green,
Blue, and Near Infrared, including both thin and thick
cloud coverage. Areas with thin clouds contain more
land information, which is crucial for the reconstruc-
tion process; however, thin clouds pose challenges in
cloud segmentation, potentially impacting cloud re-
moval accuracy. In contrast, thick clouds simplify
segmentation, but the land information is more lim-
ited in these images.
The dataset contains 2,380 image patches for
training, 350 for validation, and 687 for testing, with
each patch containing pairs of cloud-covered images
and their corresponding cloud-free counterparts. For
our experiments, we kept the same splits as in
the original dataset. Figure 1 shows samples of the
patches in the dataset.
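To make the patch and split conventions above concrete, the following is a minimal sketch of a loader-side sanity check for paired cloudy/cloud-free patches. The function and variable names are illustrative assumptions, not the dataset's actual API or file layout.

```python
# Sketch: validate one (cloudy, cloud-free) patch pair for a
# SEN2_MTC_New-style dataset. Names and layout are illustrative
# assumptions, not the dataset's actual structure.
import numpy as np

PATCH_SIZE = 256
NUM_BANDS = 4  # Red, Green, Blue, Near Infrared

def load_patch_pair(cloudy: np.ndarray, cloud_free: np.ndarray):
    """Check that both patches have the expected (bands, H, W) shape."""
    for name, img in (("cloudy", cloudy), ("cloud_free", cloud_free)):
        if img.shape != (NUM_BANDS, PATCH_SIZE, PATCH_SIZE):
            raise ValueError(
                f"{name} patch has shape {img.shape}, "
                f"expected ({NUM_BANDS}, {PATCH_SIZE}, {PATCH_SIZE})")
    return cloudy, cloud_free

# Split sizes as reported for the dataset.
splits = {"train": 2380, "val": 350, "test": 687}
```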
3.2 DiffCR
Zou et al. (2024) introduced DiffCR, a diffusion
model for cloud removal that generates Gaussian
noise from cloud-free images and uses this noise to
produce new synthetic cloudless images for a given
cloudy input image. This Gaussian noise represents
a latent space where encoding and decoding occur
through a U-Net architecture (Ronneberger et al.,
2015). A key innovation in DiffCR is the integra-
tion of the Time and Condition Fusion Block (TCF-
Block) in place of traditional transformer mechanisms
used in latent diffusion models. TCFBlock reduces
computational costs and enhances the model’s per-
formance on cloud removal tasks by improving the
visual correspondence between the generated image
and the ground truth.
DiffCR comprises three main components: (i) the
condition encoder, (ii) the time encoder, and (iii) the
denoising autoencoder. The condition and time en-
coders extract spatial and multiscale features from
clouded images and incorporate temporal features
based on noise levels from the diffusion model. These
features then guide the denoising autoencoder, aiding
in the gradual reduction of noise to create clear im-
ages. The authors emphasize that the choice of loss
function is essential in directing the generation of re-
alistic, cloud-free synthetic images, a motivation that
also supports this research.
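The conditional sampling described above can be summarized schematically: a denoiser, guided by features from the condition and time encoders, gradually turns Gaussian noise into a cloud-free estimate. All components below are toy placeholders standing in for the real networks; this is not the authors' implementation.

```python
# Schematic of DiffCR-style conditional diffusion sampling. The
# encoders and the denoising step are toy placeholders, not the
# authors' architecture.
import numpy as np

rng = np.random.default_rng(0)

def condition_encoder(cloudy):
    # Placeholder for spatial/multiscale feature extraction
    # from the cloudy input image.
    return cloudy.mean()

def time_encoder(t, num_steps):
    # Placeholder temporal embedding tied to the noise level.
    return t / num_steps

def denoise_step(x, cond_feat, t_feat):
    # Toy "denoising autoencoder": shrink the current noise while
    # mixing in information from the conditioning features.
    return 0.9 * x + 0.1 * (cond_feat + t_feat)

def sample(cloudy, num_steps=10):
    x = rng.standard_normal(cloudy.shape)  # start from Gaussian noise
    cond = condition_encoder(cloudy)
    for t in range(num_steps, 0, -1):      # iterate from noisy to clean
        x = denoise_step(x, cond, time_encoder(t, num_steps))
    return x

out = sample(np.ones((4, 8, 8)))
```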
3.3 Optimizers
Due to the complexity of deep learning models, espe-
cially with regard to the large number of trainable pa-
rameters, optimizers play a crucial role in the learning
process. In general terms, optimizers iteratively ad-
just model weights, helping guide the model toward
an efficient and optimal solution (Ruder, 2017).
Stochastic Gradient Descent (SGD) (Robbins and
Monro, 1951) is an optimization algorithm that up-
dates weights by moving in the opposite direction of
the loss function's gradient. In general terms, Equation 1 defines the SGD weight update:
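As a minimal numerical illustration of this update rule, written here in generic notation (w ← w − η∇L(w), which may differ from the paper's exact symbols), SGD on a simple quadratic loss converges to its minimum:

```python
# Minimal SGD sketch: minimize L(w) = (w - 3)^2, whose gradient is
# dL/dw = 2 * (w - 3). Notation is illustrative only.
def grad(w):
    return 2.0 * (w - 3.0)

def sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)  # step opposite to the gradient direction
    return w

w_final = sgd(w=0.0)  # converges toward the minimum at w = 3
```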