AKDT: Adaptive Kernel Dilation Transformer for Effective Image Denoising

Alexandru Brateanu¹ (https://orcid.org/0009-0001-2752-2357), Raul Balmez¹ (https://orcid.org/0009-0003-6446-2669), Adrian Avram² and Ciprian Orhei² (https://orcid.org/0000-0002-0071-958X)
¹ University of Manchester, Manchester, U.K.
² Politehnica University of Timișoara, Timișoara, Romania
{alexandru.brateanu, raul.balmez}@student.manchester.ac.uk, {adrian.avram, ciprian.orhei}@upt.ro
Keywords:
Image Denoising, Transformer Models, Dilated Convolutions, Deep Learning, Computer Vision.
Abstract:
Image denoising is a fundamental yet challenging task, especially when dealing with high-resolution images
and complex noise patterns. Most existing methods rely on standard Transformer architectures, which often
suffer from high computational complexity and limited adaptability to varying noise levels. In this paper,
we introduce the Adaptive Kernel Dilation Transformer (AKDT), a novel Transformer-based model that fully
harnesses the power of learnable dilation rates within convolutions. AKDT consists of several layers and
custom-designed blocks, including our novel Learnable Dilation Rate (LDR) module, which is utilized to
construct a Noise Estimator module (NE). At the core of AKDT, the NE is seamlessly integrated within
standard Transformer components to form the Noise-Guided Feed-Forward Network (NG-FFN) and Noise-
Guided Multi-Headed Self-Attention (NG-MSA). These noise-modulated Transformer components enable
the model to achieve unparalleled denoising performance while significantly reducing computational costs.
Extensive experiments across multiple image denoising benchmarks demonstrate that AKDT sets a new state-
of-the-art, effectively handling both real and synthetic noise. The source code and pre-trained models are
publicly available at https://github.com/albrateanu/AKDT.
1 INTRODUCTION
Image enhancement encompasses various techniques
aimed at improving the quality, clarity, and percep-
tibility of images. The main goal is to create visu-
ally appealing images, correct an image, or manip-
ulate certain features for specific use cases. Typical categories of enhancement include image sharpening (Orhei and Vasiu, 2023; Bogdan et al., 2024), low-light enhancement (Wang et al., 2020; Brateanu et al., 2024), de-hazing (Ancuti et al., 2017), and denoising (Liang et al., 2021).
Image denoising is a crucial field within image
restoration which aims to enhance image quality by
eliminating noise introduced by digital and natural
factors alike. The noise, manifesting as random varia-
tions of brightness and color information, often com-
plicates the task of maintaining image characteristics
such as sharpness and texture. As such, the complex-
ity of this problem has led to the adoption of deep
neural networks, which have shown success in var-
ious image denoising tasks, as evidenced by recent
benchmarks.
Figure 1: AKDT outperforms SOTA methods, including BRDNet (Tian et al., 2020), SwinIR (Liang et al., 2021), Restormer (Zamir et al., 2022), GRL-B (Li et al., 2023), and HWFormer (Tian et al., 2024), on the McMaster denoising benchmark at various noise levels.
The field of image denoising has seen a paradigm
shift with the introduction of Transformer models
(Dosovitskiy et al., 2021), which excel at capturing extensive contextual information thanks to receptive fields larger than those of earlier Convolutional Neural Network (CNN)-based methods. However,
the high computational cost of Transformer architec-
tures remains a barrier for high-resolution tasks such
as image denoising.
Dilated kernels, also known as atrous convolutions or convolutions with holes, are a type of convolutional kernel used in deep learning architectures and in classical image processing. The idea behind this operation is that, by dilating filters rather than enlarging them, the region covered by the transformation grows in terms of pixel distance rather than pixel density. In re-
cent years, dilated filters proved particularly useful in
domains ranging from image segmentation (Yu and
Koltun, 2016; Yu et al., 2017) to edge detection (Bog-
dan et al., 2020; Orhei et al., 2021).
In this paper we propose the Adaptive Kernel Di-
lation Transformer (AKDT), a Transformer architec-
ture that incorporates convolutions within the stan-
dard Transformer components. AKDT innovates through dilated convolutions equipped with a mechanism that lets them learn and adapt the dilation rate of their kernels. These learnable dilation rates give us a weighted expansion of the informational domain without increasing computational cost.
On image denoising benchmarks with both real and synthetic noise, we demonstrate that AKDT achieves state-of-the-art (SOTA) performance, and we highlight its computational efficiency.
The main contributions of our work can be sum-
marized as follows:
• AKDT, a Transformer architecture tailored to perform highly effective image denoising by exploiting the strengths of dilated convolutions.
• A novel approach to using dilated convolutions in Transformer architectures in order to produce dynamic, adaptable modules that adapt to the specifics of the task.
• Two novel components of the Transformer architecture: the Noise-Guided Feed-Forward Network (NG-FFN) and the Noise-Guided Multi-Headed Self-Attention (NG-MSA).
• Through quantitative and qualitative experiments, AKDT demonstrates SOTA performance on standardized denoising benchmarks.
2 RELATED WORK
Image denoising is a critical area in computer vi-
sion that aims to remove noise from corrupted im-
ages to recover the original, uncorrupted content (Fat-
tal, 2007; He et al., 2011; Kopf et al., 2008;
Michaeli and Irani, 2013). Traditional approaches,
often based on CNNs, have been pivotal in advanc-
ing early image restoration techniques. These meth-
ods leverage spatial hierarchies of learned filters to
address various levels of degradation in images (Tu
et al., 2022; Zamir et al., 2022; Zhang et al., 2020;
Zamir et al., 2020b; Chen et al., 2021; Zamir et al.,
2020a). CNNs have been effective due to their ability
to enforce local connectivity and share weights across
spatial domains.
With the advent of Transformer architectures, the
computer vision paradigm has drastically shifted.
Originally developed for natural language processing
tasks (Vaswani et al., 2017), Transformers apply self-
attention mechanisms to capture long-range depen-
dencies in data, a significant advantage over CNNs
when dealing with complex image structures (Doso-
vitskiy et al., 2021; Ramachandran et al., 2017). In
image denoising, Transformers tokenize images and
learn intricate relationships, offering enhanced capa-
bilities for handling detailed textures and patterns in
high-resolution images (Touvron et al., 2021; Yuan
et al., 2021).
Despite their advantages, the computational de-
mands of Transformers increase quadratically with
the input size, presenting challenges for practical ap-
plications. Recent research has focused on developing
more efficient Transformer models to mitigate these
issues. Techniques such as the locality-constrained self-attention introduced in Swin Transformers (Liu et al., 2021), the rectangle-window self-attention of CAT (Chen et al., 2022), and the channel-wise attention proposed in Restormer (Zamir et al., 2022) have shown promising results in improving the computational efficiency of Transformers.
3 PROPOSED METHOD
In Figure 2, we detail the architecture of AKDT,
which features a foundational Transformer struc-
ture. However, AKDT diverges from conventional
models by incorporating two novel components:
the Noise-Guided Multi-headed Self-attention (NG-
MSA) and the Noise-Guided Feed-Forward Network
(NG-FFN). Both blocks utilize our proposed Noise
Estimator Module (NE), which consists of two Learn-
able Dilation Rate (LDR) submodules.
The process begins as the input image in RGB
format is first subjected to a conv3×3 layer, which
projects the image into a high-dimensional feature
space, setting the stage for more complex trans-
formations. Following this initial projection, the
enhanced feature map is introduced into the core
of the Transformer architecture, which is organized
in a U-shaped configuration utilizing skip connec-
tions (Ronneberger et al., 2015). Up-sampling and
Figure 2: Framework of AKDT. Our Transformer architecture utilizes the proposed Noise Estimator module (NE) within the
Noise-Guided Feed-Forward Network (NG-FFN) and the Noise-Guided Multi-headed Self-attention (NG-MSA).
down-sampling are achieved through pixel-shuffle
and pixel-unshuffle operations (Shi et al., 2016).
After the feature map traverses through the U-
shaped Transformer sequence, it enters an advanced
refinement stage. This stage employs additional
Transformer blocks that further refine the features, en-
suring that even subtle nuances are captured and en-
hanced. This refinement is critical in restoration, par-
ticularly in the case of heavily degraded images.
The refined feature map is then processed through
another conv3×3 layer, which compresses the high-
dimensional features back into the standard RGB
color space. Additionally, a residual connection from
the original input is incorporated at this final stage.
This connection aids in preserving essential image de-
tails by allowing the original data to flow directly into
the output, thereby enhancing the fidelity of the re-
stored image and ensuring that important textural and
color details are maintained.
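To make the data flow above concrete, the PyTorch sketch below mirrors only the top-level pipeline of Figure 2: the 3×3 projection into feature space, pixel-(un)shuffle resampling in the U-shaped body, a refinement stage, the 3×3 projection back to RGB, and the global residual connection. The channel width, the single resampling level, and the Identity stand-ins for the Transformer and refinement blocks are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AKDTSkeleton(nn.Module):
    """Top-level data flow of the architecture (Fig. 2), with placeholder blocks."""
    def __init__(self, channels=48):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, 3, padding=1)              # RGB -> feature space
        self.down = nn.Sequential(nn.PixelUnshuffle(2),                # H x W x C -> H/2 x W/2 x 4C
                                  nn.Conv2d(4 * channels, 2 * channels, 1))
        self.body = nn.Identity()                                      # stand-in for the Transformer levels
        self.up = nn.Sequential(nn.Conv2d(2 * channels, 4 * channels, 1),
                                nn.PixelShuffle(2))                    # back to H x W x C
        self.refine = nn.Identity()                                    # stand-in for the refinement blocks
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        f = self.embed(x)
        f = self.up(self.body(self.down(f)))
        f = self.refine(f)
        return self.to_rgb(f) + x                                      # global residual from the noisy input

denoised = AKDTSkeleton()(torch.randn(1, 3, 128, 128))                 # -> (1, 3, 128, 128)
```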
3.1 Learnable Dilation Rate Module
The LDR module is defined as the weighted concatenation of $N$ convolutions with $N$ different dilation rates. The input feature map is first projected through a conv1×1. The projected input $F_{in} \in \mathbb{R}^{H \times W \times C}$ is then passed through multiple parallel dilated convolutions, each of which has a learnable weight that is adjusted during training. As such, the output feature map $F_{LDR}$ can be expressed as:

$$F_{LDR} = \mathrm{conv}_{1\times1}\Big(\bigoplus_{i=1}^{N} \alpha_i \cdot \mathrm{conv}_{3\times3}^{(i)}(F_{in})\Big), \tag{1}$$

where $\bigoplus$ denotes channel-wise concatenation and $\alpha_i$ are the learnable branch weights.
The choice of dilation rates in the LDR is impor-
tant, as they directly influence how the module cap-
tures and integrates information from different spa-
tial scales of the input data. This structured approach
to combining multiple dilated convolutions, each ad-
justed for a specific scale of feature extraction, en-
hances the ability of the model to discern finer details
and contributes to more robust and scale-invariant fea-
ture representations.
This implementation ensures that the LDR Mod-
ule is easily adjustable for different use-cases. In con-
sequence, our Global and Local LDR Modules have
the same implementation but employ different-scale
dilation rates.
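A minimal sketch of Eq. (1) in PyTorch is given below: parallel dilated 3×3 convolutions, each scaled by a learnable weight, concatenated and fused by a conv1×1. The class name, the dilation-rate sets, and the channel width are illustrative assumptions; the paper's exact rate choices are not specified here.

```python
import torch
import torch.nn as nn

class LDR(nn.Module):
    """Learnable Dilation Rate module (Eq. 1): weighted concatenation of
    parallel dilated 3x3 convolutions followed by a 1x1 fusion."""
    def __init__(self, channels, dilation_rates=(1, 2, 3)):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d)              # padding=d keeps the spatial size
            for d in dilation_rates
        ])
        # One learnable scalar weight (alpha_i) per dilated branch.
        self.alpha = nn.Parameter(torch.ones(len(dilation_rates)))
        self.proj_out = nn.Conv2d(channels * len(dilation_rates), channels, kernel_size=1)

    def forward(self, x):
        x = self.proj_in(x)
        feats = [a * branch(x) for a, branch in zip(self.alpha, self.branches)]
        return self.proj_out(torch.cat(feats, dim=1))     # weighted concatenation + conv1x1

# Hypothetical Local/Global instances differing only in dilation-rate scale.
ldr_local = LDR(channels=48, dilation_rates=(1, 2, 3))    # finer context
ldr_global = LDR(channels=48, dilation_rates=(3, 6, 9))   # broader context
y = ldr_local(torch.randn(1, 48, 64, 64))                 # -> (1, 48, 64, 64)
```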
3.2 Noise Estimator Module
The NE in the proposed architecture serves a critical
role by integrating both global and local context un-
derstanding through its unique structure. This mod-
ule consists of two distinct parallel components: the
Global and Local LDR modules.
The LDR$_{Global}$ module is designed to employ higher-scale dilation rates, which enables it to grasp the broader context and underlying patterns across the entire image. We define $F^{Global}_{LDR}$ as follows:

$$F^{Global}_{LDR} = \mathrm{LDR}_{Global}(F_{in}), \quad F^{Global}_{LDR} \in \mathbb{R}^{H \times W \times C} \tag{2}$$
Conversely, the LDR$_{Local}$ module utilizes lower-scale dilation rates, focusing on capturing finer, more detailed aspects of the image. This attention to detail is crucial for restoring specific features and textures. $F^{Local}_{LDR}$ is defined as:

$$F^{Local}_{LDR} = \mathrm{LDR}_{Local}(F_{in}), \quad F^{Local}_{LDR} \in \mathbb{R}^{H \times W \times C} \tag{3}$$

where $F_{in} \in \mathbb{R}^{H \times W \times C}$ is the input feature map in both Equations (2) and (3).

Figure 3: Context captured by the NE at various levels.
Both of these modules operate in parallel, allowing the NE to efficiently combine insights from both the global and local perspectives. The resulting feature map $F_{NE}$ is given in Equation (4):

$$F_{NE} = F^{Global}_{LDR} \circledast F^{Local}_{LDR}, \quad F_{NE} \in \mathbb{R}^{H \times W \times C} \tag{4}$$

where $\circledast$ is the Noise Estimation Fusion operation, a convolutional block that merges the local and global noise-modulated contexts.

In Figure 3, we illustrate the context captured by the NE at various stages in the network. AKDT leverages the NE across all stages of processing, ensuring that both local nuances and global patterns are consistently considered and preventing over-specialization of the model.
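A possible realization of Eqs. (2)-(4) is sketched below. It assumes the LDR class from the sketch in Section 3.1 is in scope, and it realizes the Noise Estimation Fusion as a single 1×1 convolution over the concatenated branches; the fusion depth and the dilation-rate sets are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class NoiseEstimator(nn.Module):
    """Noise Estimator (Eqs. 2-4): parallel Local and Global LDR branches
    merged by a convolutional fusion block. Reuses the LDR sketch above."""
    def __init__(self, channels):
        super().__init__()
        self.ldr_local = LDR(channels, dilation_rates=(1, 2, 3))    # fine-scale context
        self.ldr_global = LDR(channels, dilation_rates=(3, 6, 9))   # broad-scale context
        # Noise Estimation Fusion: merge both contexts back to C channels.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        f_local = self.ldr_local(x)      # F_LDR^Local,  H x W x C
        f_global = self.ldr_global(x)    # F_LDR^Global, H x W x C
        return self.fuse(torch.cat([f_local, f_global], dim=1))     # F_NE
```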
3.3 Noise-Guided Attention Block
Given an input image of size $H \times W \times C$, traditional attention mechanisms reach quadratic complexity with respect to the spatial resolution (i.e., $\mathcal{O}(H^2W^2)$). In this work, we utilize a channel-wise attention mechanism (Zamir et al., 2022) to reduce the complexity to $\mathcal{O}(C^2)$, allowing our model to work seamlessly with high-resolution images.
In our novel Transformer architecture, we prioritize the integration of the NE to refine the attention mechanism. Starting with a layer-normalized input feature $F_{in} \in \mathbb{R}^{H \times W \times C}$, we apply the NE to produce an enhanced feature representation $F_{NE}$. Utilizing $F_{NE}$, we then compute the query ($Q$), key ($K$), and value ($V$) components essential for the attention mechanism as:

$$Q = W_Q F_{NE}, \quad K = W_K F_{NE}, \quad V = W_V F_{NE}, \tag{5}$$

where $W_Q$, $W_K$, and $W_V$ denote the learnable parameters for projecting $F_{NE}$ into the respective components.
This method underscores that our architecture departs from conventional convolution-based approaches, spotlighting the NE's role in facilitating a nuanced, context-aware generation of attention components. Following the computation of $Q$, $K$, and $V$, the attention operation is defined as:

$$\mathrm{Attention}(Q, K, V) = V \cdot \mathrm{Softmax}\!\left(\frac{K^{T} Q}{\alpha}\right), \tag{6}$$
where α is a learnable scaling parameter that fine-
tunes the influence of the dot product between K and
Q prior to the softmax application. Channels are di-
vided into multiple ’heads’, resulting in multiple fea-
ture maps, as seen in the traditional multi-headed self-
attention (Dosovitskiy et al., 2021). This approach
ensures that our attention mechanism is both compu-
tationally efficient and capable of capturing extensive
contextual interactions within the input features.
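The sketch below illustrates how such channel-wise (transposed) attention over NE-enhanced features can be written: the attention map is $C \times C$ per head rather than $HW \times HW$. The 1×1-convolution projections, the q/k normalization, and the per-head learnable temperature (standing in for $\alpha$ in Eq. 6) are assumptions in the style of Restormer, not the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseGuidedChannelAttention(nn.Module):
    """Channel-wise multi-head self-attention over NE-enhanced features (Eqs. 5-6):
    cost scales with the channel count, not the spatial resolution."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))   # learnable scale (alpha in Eq. 6)
        self.qkv = nn.Conv2d(channels, 3 * channels, 1)                # W_Q, W_K, W_V as 1x1 convs
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, f_ne):                                           # f_ne: output of the Noise Estimator
        b, c, h, w = f_ne.shape
        q, k, v = self.qkv(f_ne).chunk(3, dim=1)
        reshape = lambda t: t.reshape(b, self.num_heads, c // self.num_heads, h * w)
        q, k, v = map(reshape, (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature            # (b, heads, c/h, c/h) channel map
        out = attn.softmax(dim=-1) @ v                                 # (b, heads, c/h, h*w)
        return self.proj_out(out.reshape(b, c, h, w))

y = NoiseGuidedChannelAttention(channels=48, num_heads=4)(torch.randn(1, 48, 64, 64))
```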
3.4 Noise-Guided Feed-Forward Network
Leveraging the NE to enrich the input features, our NG-FFN significantly refines the feature transformation process. After applying the NE to the input feature map $F_{in} \in \mathbb{R}^{H \times W \times C}$ to obtain $F_{NE} = \mathrm{NE}(F_{in})$, the NG-FFN employs a gating mechanism (Zamir et al., 2022) which enhances control over the feature transformation. The gating mechanism is articulated through the element-wise multiplication of two parallel transformation paths, where one of the paths incorporates the GELU (Hendrycks and Gimpel, 2016) non-linearity.
The process is mathematically represented as:

$$F_{NG\text{-}FFN} = \phi(W_1 F_{NE}) \odot W_2 F_{NE} + F_{NE}, \tag{7}$$

where $\phi$ denotes the GELU activation function, $\odot$ represents element-wise multiplication, and $W_1$, $W_2$ are the learnable parameters of the parallel paths. Here, $F_{NG\text{-}FFN}$ denotes the output feature map of the feed-forward network, an enhanced representation for subsequent processing stages.

Figure 4: Qualitative results on Color Gaussian Denoising - Urban100 dataset.
This approach distinctly bypasses the conven-
tional reliance on convolutions or fully-connected
layers, focusing instead on the synergy between the
NE-enhanced features and the gating mechanism.
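Eq. (7) translates almost directly into code. In the sketch below, realizing $W_1$ and $W_2$ as 1×1 convolutions applied to the NE output is an assumption; the paper does not pin down their exact form here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseGuidedFFN(nn.Module):
    """Gated feed-forward transformation of Eq. 7: two parallel projections of the
    NE-enhanced features, one gated by GELU, fused element-wise, plus a residual."""
    def __init__(self, channels):
        super().__init__()
        self.w1 = nn.Conv2d(channels, channels, 1)   # W_1 path (GELU-activated)
        self.w2 = nn.Conv2d(channels, channels, 1)   # W_2 path (linear)

    def forward(self, f_ne):
        return F.gelu(self.w1(f_ne)) * self.w2(f_ne) + f_ne   # phi(W1 F) ⊙ (W2 F) + F

y = NoiseGuidedFFN(channels=48)(torch.randn(1, 48, 64, 64))   # -> (1, 48, 64, 64)
```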
4 EXPERIMENTS AND RESULTS
4.1 Implementation Details
In our evaluation, we select image denoising bench-
marks like CBSD68 (Martin et al., 2001), Urban100
(Huang et al., 2015), and McMaster (Zhang et al.,
2011) for synthetic cases, alongside SIDD (Abdel-
hamed et al., 2018) for real noise environments.
Our transformer-based model is architecturally
devised with a quad-level framework, hosting
[1, 2, 2, 4] transformer blocks, complemented by
[1, 2, 4, 8] attention heads at respective levels. The at-
tention dimensions are assigned as [34, 68, 134, 268]
across the levels. Three refinement blocks are employed at the end of the encoder-decoder setup.
A distinctive feature of our methodology is the
adoption of progressive training, starting with image
patches of 128 × 128 pixels and gradually advancing
through sizes of 194 × 194, 256 × 256, 320 × 320,
and ultimately 384×384 pixels. This strategy accom-
modates a comprehensive learning spectrum from lo-
cal to global image characteristics and enhances the
model’s adaptability to diverse noise patterns.
To improve the training dynamics, random rota-
tions and flipping are employed as data augmenta-
tion techniques, strengthening the robustness of the
model. We utilize AdamW as the optimizer, parameterized by $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a weight decay of $1\mathrm{e}{-4}$, across 300K iterations. The learning rate begins at $3\mathrm{e}{-4}$ and is methodically tapered to $1\mathrm{e}{-6}$ via cosine annealing (Loshchilov and Hutter, 2017), ensuring a smooth and effective model refinement.
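This optimization schedule maps directly onto standard PyTorch utilities, as in the sketch below. Only the hyperparameters come from the text; the stand-in model, the placeholder loss, and the loop skeleton are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)               # stand-in for the AKDT network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
total_iters = 300_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters, eta_min=1e-6)            # taper 3e-4 -> 1e-6

for it in range(total_iters):
    patch = torch.randn(1, 3, 128, 128)                    # progressive patch sizes would grow over training
    loss = F.l1_loss(model(patch), patch)                  # placeholder loss; the real target is the clean image
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                                       # one scheduler step per iteration
```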
Our evaluation framework employs the classical PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) metrics to quantita-
tively measure the denoising capability of the model,
ensuring a thorough appraisal of its capability to re-
construct clean images from noisy inputs. We also
utilize GMACs (Giga Multiply-Accumulate Opera-
tions) to highlight the computational complexity of
different methods.
4.2 Results
Real Image Denoising. Table 2 shows the performance of multiple models on the SIDD dataset. Restormer achieves the highest PSNR of 40.02 dB, making it the strongest in terms of raw PSNR. On the other hand, our AKDT model showcases outstanding efficiency, with the lowest computational demand, measured at 56.15 GMACs, and the highest SSIM of 0.961. These results highlight the SOTA efficiency of our method. Qualitative results in Fig. 5 further illustrate the performance of our method despite its reduced computational load.
Gaussian Denoising. Table 1 reports PSNR scores of different models on various benchmark datasets. We test color Gaussian denoising capabilities using the standard synthetic noise levels σ = 15, 25, and 50. Qualitative results are presented in Fig. 4.
From the data presented, we can conclude that AKDT stands out, achieving the best results on the CBSD68 and McMaster datasets across all noise levels while remaining competitive on Urban100.
Table 1: Gaussian color denoising benchmark results (PSNR). "-": not reported. Red and blue represent best and second-best values, respectively.

Model                            | CBSD68              | McMaster            | Urban100            | MACs
                                 | σ=15  σ=25  σ=50    | σ=15  σ=25  σ=50    | σ=15  σ=25  σ=50    |
DnCNN (Zhang et al., 2017)       | 33.90 31.24 27.95   | 33.45 31.52 28.62   | 32.98 30.81 27.59   | 37 G
FFDNet (Zhang et al., 2018)      | 33.87 31.21 27.96   | 34.66 32.35 29.18   | 33.83 31.40 28.05   | -
DSNet (Peng et al., 2019)        | 33.91 31.28 28.05   | 34.67 32.40 29.28   | -     -     -       | -
BRDNet (Tian et al., 2020)       | 34.10 31.43 28.16   | 35.08 32.75 29.52   | 34.42 31.99 28.56   | -
DRUNet (Zhang et al., 2021)      | 34.30 31.69 28.51   | 35.40 33.14 30.08   | 34.81 32.60 29.61   | 144 G
SwinIR (Liang et al., 2021)      | 34.42 31.78 28.56   | 35.61 33.20 30.22   | 35.13 32.90 29.82   | 759 G
Restormer (Zamir et al., 2022)   | 34.40 31.79 28.60   | 35.61 33.34 30.30   | 35.13 32.96 30.02   | 141 G
NAFNet-RCD (Zhang et al., 2023)  | 34.14 31.49 28.26   | 35.11 32.84 29.81   | 34.45 32.12 29.09   | -
GRL-B (Li et al., 2023)          | 34.45 31.82 28.62   | 35.73 33.46 30.36   | 35.54 33.35 30.46   | -
DCANet (Wu et al., 2024)         | 34.05 31.45 28.28   | 34.84 32.62 29.59   | -     -     -       | 75 G
HWFormer (Tian et al., 2024)     | -     -     -       | 35.64 33.36 30.24   | 35.26 33.10 30.14   | 303 G
AKDT (Ours)                      | 34.64 31.94 28.68   | 36.71 34.21 30.95   | 35.63 33.14 29.82   | 56 G
Figure 5: Qualitative results on Real Image Denoising - SIDD dataset.
Table 2: Performance of different models on the SIDD dataset. Red and blue text indicate best and second-best values, respectively.

Model                           | PSNR  | SSIM  | GMACs
MPRNet (Zamir et al., 2021)     | 39.17 | 0.958 | 588
CycleISP (Zamir et al., 2020a)  | 39.52 | 0.957 | 189.5
HINet (Chen et al., 2021)       | 39.99 | 0.958 | 170.7
MAXIM (Tu et al., 2022)         | 39.96 | 0.960 | 169.5
Restormer (Zamir et al., 2022)  | 40.02 | 0.960 | 141
MIRNet (Zamir et al., 2020b)    | 39.72 | 0.959 | 786
Uformer (Wang et al., 2021)     | 39.89 | 0.960 | 88.8
PCformer (Wan et al., 2023)     | 39.80 | 0.959 | 152
DCANet (Wu et al., 2024)        | 39.27 | 0.956 | 75.30
AKDT (Ours)                     | 39.70 | 0.961 | 56.15
This underscores the effectiveness of
AKDT in handling color denoising tasks, even in
challenging noisy environments.
The comparative analysis also brings to light the
trade-offs between performance and computational
efficiency. For instance, SwinIR, while achieving
near top marks, demands a substantial computational
cost with 759 GMACs, whereas AKDT maintains a
competitive edge with significantly lower computa-
tional needs (56 GMACs). This aspect is crucial for
practical applications where computational resources
are limited or cost-effectiveness is a priority. The data
suggests that AKDT not only provides strong denoising capabilities but also does so with remarkable efficiency.
From a qualitative perspective, AKDT demonstrates its superior performance by producing sharper images without introducing the artifacts visible in the outputs of DRUNet and SwinIR, and without over-smoothing details as in the case of Restormer.
By leveraging our proposed NE, we construct
a robust and efficient Transformer architecture
(AKDT) that achieves SOTA performance on various denoising benchmarks.
5 ABLATION STUDY
As ablation studies, we provide insight into our architectural choices, accompanied by performance metrics on the SIDD dataset.
In Table 3 we present the impact of various LDR configurations within the NE, employing specialized paths for local and/or global context (i.e., Dual-Path). The Dual-Path implementation enforces specialization by assigning dedicated dilation rates to each type of context, preventing excessive focus on one type of information.
Table 4 shows experiments with different compression/expansion rates within the NE. C represents the input dimension of the NE, and C×K means that the inner dimension of the NE equals the input dimension multiplied by a factor of K. As the results suggest, performing feature compression within the NE yields better performance.
Table 3: Impact of LDR.

Local | Global | PSNR
  ✓   |        | 39.47
      |   ✓    | 39.32
  ✓   |   ✓    | 39.70
Table 4: Impact of feature expansion.

Change         | PSNR
C → C × 0.125  | 39.28
C → C × 0.25   | 39.70
C → C × 0.5    | 39.67
C → C × 2      | 38.63
Table 5: Transformer Block variations impact.

MSA | FFN  | PSNR  | GMACs
V   | V    | 35.29 | 51.13
NG  | V    | 37.48 | 53.27
V   | NG   | 36.46 | 65.35
NG  | NG   | 39.59 | 67.49
NG  | NG+G | 39.70 | 56.16
This study suggests that the NE performs best by extracting only the most relevant features from both the Global and Local contexts, through LDR$_{Global}$ and LDR$_{Local}$ respectively.
Table 5 presents a more extensive study on poten-
tial Transformer Block implementations. V represents
vanilla MSA/FFN implementations. NG is the noise-
guidance integration. +G represents the gating mech-
anism. As illustrated, the addition of the NE used for
the NG-MSA improves performance over the vanilla
(V) MSA by 2.19 dB. Similarly, the NG-FFN outper-
forms vanilla FFN by 1.17 dB. Furthermore, the gat-
ing mechanism (+G) improves performance over the
non-gated NG-FFN by 0.11 dB while also reducing
GMACs by 17%.
The ablation studies underscore the critical impact
of our architectural choices on the performance of the
Adaptive Kernel Dilation Transformer (AKDT).
6 CONCLUSION & FUTURE
WORKS
We introduced the LDR block, a novel approach to utilizing learnable-dilation-rate convolutions, and used it to build the NE, a mechanism that effectively distinguishes and learns local and global noise patterns. We proposed the integration of the NE into the attention and feed-forward mechanisms prevalent in Transformer architectures, producing an efficient SOTA model.
Our comprehensive evaluation demonstrates that
our proposed method has superior performance and
computational efficiency in comparison to existing
denoising methods. Furthermore, our ablation experiments serve as empirical backing for the proposed design, demonstrating the positive impact of learnable dilation rate convolutions in Transformer architectures.
REFERENCES
Abdelhamed, A., Lin, S., and Brown, M. S. (2018). A high-
quality denoising dataset for smartphone cameras. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 1692–1700.
Ancuti, C. O., Ancuti, C., De Vleeschouwer, C., and
Bekaert, P. (2017). Color balance and fusion for un-
derwater image enhancement. IEEE Transactions on Image Processing, 27(1):379–393.
Bogdan, V., Bonchis, C., and Orhei, C. (2024). An im-
age sharpening technique based on dilated filters and
2d-dwt image fusion. In Proceedings of the 19th
International Joint Conference on Computer Vision,
Imaging and Computer Graphics Theory and Applica-
tions - Volume 3: VISAPP, pages 591–598. INSTICC,
SciTePress.
Bogdan, V., Bonchiș, C., and Orhei, C. (2020). Custom dilated edge detection filters. Journal of WSCG, 28:161–168.
Brateanu, A., Balmez, R., Avram, A., and Orhei, C. (2024).
Lyt-net: Lightweight yuv transformer-based network
for low-light image enhancement. arXiv preprint
arXiv:2401.15204.
Chen, L., Lu, X., Zhang, J., Chu, X., and Chen, C. (2021).
Hinet: Half instance normalization network for image
restoration. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 182–192.
Chen, Z., Zhang, Y., Gu, J., Kong, L., Yuan, X., et al.
(2022). Cross aggregation transformer for image
restoration. Advances in Neural Information Process-
ing Systems, 35:25478–25490.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2021). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. In ICLR.
Fattal, R. (2007). Image upsampling via imposed edge
statistics. In ACM SIGGRAPH 2007 papers, pages
95–es.
He, K., Sun, J., and Tang, X. (2011). Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353.
Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear
units (GELUs). arXiv:1606.08415.
Huang, J.-B., Singh, A., and Ahuja, N. (2015). Single image
super-resolution from transformed self-exemplars. In
CVPR.
Kopf, J., Neubert, B., Chen, B., Cohen, M., Cohen-Or,
D., Deussen, O., Uyttendaele, M., and Lischinski, D.
(2008). Deep photo: Model-based photograph en-
hancement and viewing. ACM transactions on graph-
ics (TOG), 27(5):1–10.
Li, Y., Fan, Y., Xiang, X., Demandolx, D., Ranjan, R.,
Timofte, R., and Van Gool, L. (2023). Efficient
and explicit modelling of image hierarchies for im-
age restoration. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 18278–18289.
Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and
Timofte, R. (2021). SwinIR: Image restoration using
swin transformer. In ICCV Workshops.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z.,
Lin, S., and Guo, B. (2021). Swin transformer: Hi-
erarchical vision transformer using shifted windows.
arXiv:2103.14030.
Loshchilov, I. and Hutter, F. (2017). SGDR: Stochastic gra-
dient descent with warm restarts. In ICLR.
Martin, D., Fowlkes, C., Tal, D., and Malik, J. (2001). A
database of human segmented natural images and its
application to evaluating segmentation algorithms and
measuring ecological statistics. In ICCV.
Michaeli, T. and Irani, M. (2013). Nonparametric blind
super-resolution. In Proceedings of the IEEE Inter-
national Conference on Computer Vision, pages 945–
952.
Orhei, C., Bogdan, V., Bonchis, C., and Vasiu, R. (2021).
Dilated filters for edge-detection algorithms. Applied
Sciences, 11(22):10716.
Orhei, C. and Vasiu, R. (2023). An analysis of extended and
dilated filters in sharpening algorithms. IEEE Access,
11:81449–81465.
Peng, Y., Zhang, L., Liu, S., Wu, X., Zhang, Y., and Wang,
X. (2019). Dilated residual networks with symmetric
skip connection for image denoising. Neurocomput-
ing.
Ramachandran, P., Zoph, B., and Le, Q. V. (2017).
Swish: a self-gated activation function. arXiv preprint
arXiv:1710.05941.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net:
convolutional networks for biomedical image segmen-
tation. In MICCAI.
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., and Wang, Z. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR.
Tian, C., Xu, Y., and Zuo, W. (2020). Image denoising
using deep cnn with batch renormalization. Neural
Networks.
Tian, C., Zheng, M., Lin, C.-W., Li, Z., and Zhang, D.
(2024). Heterogeneous window transformer for im-
age denoising. IEEE Transactions on Systems, Man,
and Cybernetics: Systems, 54(11):6621–6632.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In ICML.
Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik,
A., and Li, Y. (2022). Maxim: Multi-axis mlp for
image processing. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 5769–5780.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In NeurIPS.
Wan, Y., Shao, M., Cheng, Y., Meng, D., and Zuo, W.
(2023). Progressive convolutional transformer for im-
age restoration. Engineering Applications of Artificial
Intelligence, 125:106755.
Wang, W., Wu, X., Yuan, X., and Gao, Z. (2020). An
experiment-based review of low-light image enhance-
ment methods. IEEE Access, 8:87884–87917.
Wang, Z., Cun, X., Bao, J., and Liu, J. (2021). Uformer:
A general u-shaped transformer for image restoration.
arXiv:2106.03106.
Wu, W., Lv, G., Duan, Y., Liang, P., Zhang, Y., and Xia,
Y. (2024). Dual convolutional neural network with
attention for image blind denoising.
Yu, F. and Koltun, V. (2016). Multi-scale context aggrega-
tion by dilated convolutions. In International Confer-
ence on Learning Representations.
Yu, F., Koltun, V., and Funkhouser, T. (2017). Dilated
residual networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 472–480.
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.,
Tay, F. E., Feng, J., and Yan, S. (2021). Tokens-to-
token vit: Training vision transformers from scratch
on imagenet. arXiv:2101.11986.
Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan,
F. S., and Yang, M.-H. (2022). Restormer: Efficient
transformer for high-resolution image restoration. In
CVPR.
Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S.,
Yang, M.-H., and Shao, L. (2020a). Cycleisp: Real
image restoration via improved data synthesis. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 2696–2705.
Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S.,
Yang, M.-H., and Shao, L. (2020b). Learning enriched
features for real image restoration and enhancement.
In ECCV.
Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S.,
Yang, M.-H., and Shao, L. (2021). Multi-stage pro-
gressive image restoration. In CVPR.
Zhang, K., Li, Y., Zuo, W., Zhang, L., Van Gool, L., and
Timofte, R. (2021). Plug-and-play image restoration
with deep denoiser prior. TPAMI.
Zhang, K., Luo, W., Zhong, Y., Ma, L., Stenger, B., Liu, W.,
and Li, H. (2020). Deblurring by realistic blurring. In
CVPR.
Zhang, K., Zuo, W., Chen, Y., Meng, D., and Zhang, L.
(2017). Beyond a gaussian denoiser: Residual learn-
ing of deep cnn for image denoising. TIP.
Zhang, K., Zuo, W., and Zhang, L. (2018). FFDNet: To-
ward a fast and flexible solution for CNN-based image
denoising. TIP.
Zhang, L., Wu, X., Buades, A., and Li, X. (2011). Color de-
mosaicking by local directional interpolation and non-
local adaptive thresholding. JEI.
Zhang, Z., Jiang, Y., Shao, W., Wang, X., Luo, P., Lin, K.,
and Gu, J. (2023). Real-time controllable denoising
for image and video. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 14028–14038.