REPVSR: Efficient Video Super-Resolution via Structural Re-Parameterization

KunLei Hu (Tetras.Ai, China, https://orcid.org/0009-0005-1309-0951)
Dahai Yu (TCL Corporate Research (HK) Co., Ltd, China, https://orcid.org/0000-0003-1427-8807)

Keywords: Video Super-Resolution, Re-Parameterization Method, Efficiency Network.
Abstract:
Recent advances in video super-resolution (VSR) have explored the power of deep learning to achieve better reconstruction performance. However, the high computational cost still hinders practical usage that demands real-time performance (24 fps). In this paper, we propose a re-parameterization video super-resolution network (REPVSR) to accelerate reconstruction with an efficient and generic network. Specifically, we propose re-parameterizable building blocks, namely the Super-Resolution Multi-Branch block (SRMB) for the efficient SR part and the FlowNet Multi-Branch block (FNMB) for the optical flow estimation part. The blocks extract features along multiple paths in the training stage, and merge the multiple operations into one single 3×3 convolution in the inference stage. We then propose an extremely efficient VSR network based on SRMB and FNMB, namely REPVSR. Extensive experiments demonstrate the effectiveness and efficiency of REPVSR.
1 INTRODUCTION
Video super-resolution (VSR) is developed from single image super-resolution. It aims to generate a high-resolution (HR) video from its corresponding low-resolution (LR) observation by filling in missing details, restoring the definition of the video and improving subjective visual quality. Thanks to deep learning, VSR based on neural networks has experienced significant improvements over the last few years. However, the main research directions (Wang et al., 2019; Chan et al., 2021; Liu et al., 2021) lie in the pursuit of high fidelity scores by employing very deep and complicated network structures, ignoring computational efficiency and memory constraints.
In order to deploy VSR models on resource-limited devices, recent research has demonstrated meaningful advances in lightweight model structure design (Xia et al., 2023; Fuoli et al., 2023). However, models with fewer FLOPs may have even larger latency because of hardware-unfriendly operators (Wang et al., 2019), and some tiny VSR models such as VESPCN (Caballero et al., 2017) can reach nearly real-time speed while their VSR performance measured by PSNR remains quite limited.
Thus, model parameter reduction and hardware-friendly operator design have attracted more and more attention. It is always challenging to design a VSR model that is both lightweight and efficient at inference due to very limited hardware resources, but with growing commercial and industrial demand, it is also very necessary to design a lightweight VSR model with fewer parameters and efficient structures.
In this paper, inspired by Ding et al. (Ding et al., 2021b; Ding et al., 2022; Zhou et al., 2023), we propose two rigorous and effective re-parameterizable blocks, SRMB and FNMB, that are theoretically verified and experimentally validated. Based on the SRMB and FNMB structures, we further propose a recurrent VSR network (REPVSR) using a super-light model design and the re-parameterization technique to accelerate inference speed and enhance reconstruction quality. The contributions of this study are listed as follows:
(1) The SRMB and FNMB blocks proposed in this paper can be used to improve super-resolution performance and optical flow estimation results respectively, without introducing any extra burden on inference or deployment.
(2) We propose a super efficient and lightweight VSR model termed REPVSR by embedding the SRMB and FNMB blocks into a recurrent, end-to-end trainable VSR framework. Extensive experiments and comparisons validate the computational efficiency
Figure 1: Overview of the RepVSR model. In the figure, green rectangles and red dash-lined rectangles represent the LR input frames and HR predicted frames, respectively.
and effectiveness of our proposed REPVSR network, which surpasses recent re-parameterization schemes and lightweight VSR models and comes close to large-parameter models.
2 RELATED WORK
2.1 Deep-Learning Based Video
Super-Resolution
Recently, deep-learning based VSR algorithms have risen rapidly. Existing VSR approaches can be mainly divided into sliding-window methods and recurrent methods. Sliding-window frameworks compute optical flow between multiple frames to aggregate information and perform spatial warping for alignment (Haris et al., 2019; Xue et al., 2019). Deformable convolution networks have been developed to address feature misalignment (Wang et al., 2019; Tian et al., 2020). Recurrent VSR structures can pass the previous HR estimate directly to the next step, recreating fine details and producing temporally consistent videos. FRVSR (Sajjadi et al., 2018) stores the HR estimate of the previous frame and uses it to generate the subsequent frame. Bidirectional recurrent methods such as BasicVSR (Chan et al., 2021; Chan et al., 2022) can enforce the forward and backward consistency of the LR warped inputs and HR predicted frames.
2.2 Structural Re-Parameterization
Techniques
Several studies on re-parameterization have shown its effectiveness on high-level vision tasks such as image classification, object detection and semantic segmentation. DiracNet (Zagoruyko and Komodakis, 2017) builds deep plain models by encoding the kernel of convolution layers, achieving performance comparable to ResNet. Related to DiracNet, RepVGG (Ding et al., 2021b) first proposed a structural re-parameterization technique. ACNet (Ding et al., 2019) and ExpandNet (Marnerides et al., 2018) can also be viewed as structural re-parameterization. Previous re-parameterization methods are mainly employed on high-level vision tasks and super-resolution tasks. In this paper, we embed the re-parameterization mechanism into a recurrent video super-resolution framework, proposing a lightweight VSR model without introducing additional cost in the inference stage.
3 PROPOSED METHOD
Our REPVSR is based on a recurrent framework, as Figure 1 illustrates. Specifically, we employ the re-parameterization mechanism to design the optical flow estimation network (RepFlowNet) and the super-resolution network (RepSRNet).
3.1 Multi-Branch Training Block
As Figure 2 (a) and (b) show, the design of RepSRNet follows residual architectures (Sajjadi et al., 2017) and RepFlowNet uses an encoder-decoder style architecture. Inspired by the Diverse Branch Block (DBB) (Ding et al., 2021a), which enhances the representational capacity of a single convolution by combining diverse branches of different scales and complexities, we introduce SRMB and FNMB in this paper. Figure 2 (c) illustrates the architectures of the SRMB and FNMB blocks, which are summarized as follows:
Component I: Common 3×3 Convolution. A common 3×3 convolution $W_0 \in \mathbb{R}^{D \times C \times 3 \times 3}$ is employed on the C-channel input $I \in \mathbb{R}^{C \times H \times W}$ to ensure the base performance. The bias $B_0$ is added onto the result of the convolution. The convolution operation is formulated as:

$$O = W_0 \circledast I + B_0 \quad (1)$$

where $\circledast$ denotes the convolution operation.
Component II: A Conv for Sequential Convolutions. We merge a sequence of 1×1 conv - 3×3 conv, 3×3 conv - 1×1 conv and 1×1 conv - 3×3 conv - 1×1 conv into one 3×3 conv, as wider features can improve the expressive power. Taking the first sequence as an example, $W^{(1)} \in \mathbb{R}^{D \times C \times 1 \times 1}$ and $W^{(2)} \in \mathbb{R}^{C \times D \times 3 \times 3}$ represent the 1×1 and 3×3 convolution kernels respectively, used to expand and squeeze features. The feature is extracted as:

$$O' = W^{(2)} \circledast (W^{(1)} \circledast I + B^{(1)}) + B^{(2)} \quad (2)$$

The other two sequences can be merged following the same mechanism detailed above.
Component III: A Conv for Convolution with Laplacian. Since the Laplacian filter is useful for finding the fine details of a video frame (Jian et al., 2008), we first employ a 1×1 conv (whose weights and bias are $W_l$ and $B_l$) and then use the Laplacian filter (denoted as $D_{lap}$) to extract the spatial derivative (Zhang et al., 2021). The edge-information feature is formulated as follows:

$$O_{lap} = (S_{lap} \cdot D_{lap}) \otimes (W_l \circledast I + B_l) + B_{lap} \quad (3)$$

where $S_{lap}$ and $B_{lap}$ respectively represent the scaling factors and bias of the depth-wise convolution, and $\otimes$ denotes depth-wise convolution (DWConv).

In general, the output of FNMB is the combination of the first two components, and the output of SRMB is the combination of all three components.
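To make the multi-branch design concrete, the following PyTorch sketch shows a minimal training-time SRMB-style block. It is illustrative only: the module name SRMBSketch, the channel expansion factor, the padding choices, and the fixed 3×3 Laplacian kernel values are our own assumptions rather than the exact REPVSR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRMBSketch(nn.Module):
    """Training-time multi-branch block (illustrative sketch, not the official code).

    Branches: (I) plain 3x3 conv, (II) 1x1 conv -> 3x3 conv,
    (III) 1x1 conv -> fixed Laplacian depth-wise conv with learnable scale/bias.
    All branches keep the same spatial size so their outputs can be summed.
    """
    def __init__(self, channels: int, expand: int = 2):
        super().__init__()
        d = channels * expand                       # expanded width (assumption)
        # Component I: common 3x3 convolution
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        # Component II: expand (1x1) then squeeze (3x3)
        self.expand1 = nn.Conv2d(channels, d, 1)
        self.squeeze3 = nn.Conv2d(d, channels, 3, padding=1)
        # Component III: 1x1 conv followed by a fixed Laplacian depth-wise filter
        self.pre1 = nn.Conv2d(channels, channels, 1)
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("lap", lap.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.scale = nn.Parameter(torch.ones(channels, 1, 1, 1))    # S_lap
        self.bias_lap = nn.Parameter(torch.zeros(channels))         # B_lap

    def forward(self, x):
        out = self.conv3(x)                                     # Component I
        out = out + self.squeeze3(self.expand1(x))               # Component II
        edge_in = self.pre1(x)
        out = out + F.conv2d(edge_in, self.scale * self.lap,
                             bias=self.bias_lap, padding=1,
                             groups=edge_in.shape[1])             # Component III (DWConv)
        return out
```

During training, `SRMBSketch(64)` would act as a drop-in replacement for a 64-channel 3×3 convolution; an FNMB-style block would simply omit the Laplacian branch.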
3.2 Re-Parameterization for VSR
Inference
We re-parameterize FNMB and SRMB into a single 3×3 convolution for efficient inference. The sequence of 1×1 conv - 3×3 conv in Component II can be merged into one single normal convolution with parameters $W_1$, $B_1$:

$$W_1 = \mathrm{perm}(W^{(1)}) \circledast W^{(2)}, \qquad B_1 = W^{(2)} \circledast \mathrm{rep}(B^{(1)}) + B^{(2)}, \quad (4)$$

where perm represents the permute operation and rep means using spatial transmission to replicate the bias to the specified dimension. Similarly, the sequences 3×3 conv - 1×1 conv and 1×1 conv - 3×3 conv - 1×1 conv can be merged as $W_2$, $B_2$ and $W_3$, $B_3$. As for Component III, which employs a 1×1 conv and a 3×3 DWConv, we have:

$$W_{lap}[i, i, :, :] = (S_{lap} \cdot D_{lap})[i, 1, :, :], \qquad W_{lap}[i, j, :, :] = 0, \; i \neq j, \quad (5)$$

where $W_{lap}$ denotes the weight of the standard convolution that is equivalent to the DWConv, and $i$, $j$ index the channels. Thus, the weights of FNMB after re-parameterization are:

$$W_{FNMB} = \sum_{i=0}^{3} W_i, \qquad B_{FNMB} = \sum_{i=0}^{3} B_i \quad (6)$$

and the weights of SRMB after re-parameterization are:

$$W_{SRMB} = \sum_{i=0}^{3} W_i + \mathrm{perm}(W_l) \circledast W_{lap}, \qquad B_{SRMB} = \sum_{i=0}^{3} B_i + \mathrm{perm}(B_l) \circledast B_{lap} \quad (7)$$

The output feature of the multi-branch architecture can therefore be obtained with a single normal convolution at inference time via the re-parameterization technique.
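As a concrete illustration of how the 1×1 conv - 3×3 conv sequence of Eq. (4) collapses into one kernel, the sketch below merges two PyTorch convolutions and checks the result numerically. It is a minimal sketch under our own assumptions (no padding in the 3×3 conv, so border handling is ignored); the helper name merge_1x1_3x3 is ours, not from the REPVSR code.

```python
import torch
import torch.nn.functional as F

def merge_1x1_3x3(w1, b1, w2, b2):
    """Merge conv1x1 (w1: DxCx1x1, b1: D) followed by conv3x3 (w2: ExDx3x3, b2: E)
    into a single 3x3 convolution (weight ExCx3x3, bias E)."""
    # perm(W1): swap in/out channel axes, then convolve the 3x3 kernel with it (Eq. 4).
    w_merged = F.conv2d(w2, w1.permute(1, 0, 2, 3))
    # The constant bias b1 passes through the 3x3 kernel: sum over input channels and taps.
    b_merged = (w2 * b1.view(1, -1, 1, 1)).sum(dim=(1, 2, 3)) + b2
    return w_merged, b_merged

# Numerical check on random tensors (padding=0 everywhere, so only interior pixels).
C, D, E = 4, 8, 4
x = torch.randn(1, C, 16, 16)
w1, b1 = torch.randn(D, C, 1, 1), torch.randn(D)
w2, b2 = torch.randn(E, D, 3, 3), torch.randn(E)

seq = F.conv2d(F.conv2d(x, w1, b1), w2, b2)          # 1x1 conv -> 3x3 conv
w_m, b_m = merge_1x1_3x3(w1, b1, w2, b2)
merged = F.conv2d(x, w_m, b_m)                        # single re-parameterized 3x3 conv

print(torch.allclose(seq, merged, atol=1e-5))         # True
```

The depth-wise Laplacian branch of Eq. (5) can be folded in the same spirit by embedding each depth-wise kernel on the diagonal of a full C×C×3×3 weight, after which all branch kernels and biases are simply summed as in Eqs. (6) and (7).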
3.3 Loss Function
As Figure 1 illustrates, there are two streams during the training stage: the HR and the LR frames. The loss on HR frames, $\mathcal{L}_{SR}$, is computed between the output of RepSRNet and the HR frames, where $I^{HR}_t$ denotes the ground-truth frame and $\hat{I}^{HR}_t$ denotes the generated frame at time $t$. Since the optical flow of our video dataset has no ground truth, we use the LR frame warped from $t-1$ to $t$ to define the loss of RepFlowNet, $\mathcal{L}_{Flow}$. For each recurrent step, the SR loss and the flow loss are calculated as:

$$\mathcal{L}_{SR} = \left\| \hat{I}^{HR}_t - I^{HR}_t \right\|_2^2 \quad (8)$$

and

$$\mathcal{L}_{Flow} = \left\| \mathrm{warp}(I^{LR}_{t-1}, F^{LR}_{t-1 \to t}) - I^{LR}_t \right\|_2^2, \quad (9)$$

where warp(·) represents the warping operation. In all, the overall loss function for training is:

$$\mathcal{L}_{total} = \mathcal{L}_{SR} + \mathcal{L}_{Flow} \quad (10)$$
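For clarity, here is a minimal PyTorch sketch of this combined objective, assuming the flow is expressed in pixel units and frames are NCHW tensors; the grid construction, the use of a mean-squared error in place of the summed squared L2 norm, and the function name recurrent_step_loss are our own illustrative choices rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N,C,H,W) with a pixel-unit flow (N,2,H,W) via grid_sample."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # sampling positions
    # Normalize to [-1, 1] as required by grid_sample, and move coordinates to the last dim.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                      # (N,H,W,2)
    return F.grid_sample(img, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def recurrent_step_loss(sr_pred, hr_gt, lr_prev, lr_cur, flow_prev_to_cur):
    """L_total = L_SR + L_Flow for one recurrent step (cf. Eqs. 8-10)."""
    loss_sr = F.mse_loss(sr_pred, hr_gt)                              # Eq. (8), mean over pixels
    loss_flow = F.mse_loss(warp(lr_prev, flow_prev_to_cur), lr_cur)   # Eq. (9)
    return loss_sr + loss_flow                                        # Eq. (10)
```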
4 EXPERIMENTS
4.1 Experiment Settings
4.1.1 Baseline Methods
The most popular dataset for testing is Vid4, which includes more high-frequency details than other datasets.
Figure 2: Network architectures of RepSRNet and RepFlowNet. Sub-figure (c) details the re-parameterization blocks embedded in RepSRNet and RepFlowNet (SRMB and FNMB, respectively).
Table 1: Quantitative comparisons on several benchmarks (PSNR/SSIM).

| Scale | Dataset | Bicubic | VESPCN | SOFVSR | FRVSR | TecoGAN | BasicVSR | REPVSR |
|-------|-----------|-------------|-------------|-------------|-------------|-------------|-------------|-------------|
| x4 | Vid4 | 23.53/0.628 | 25.35/0.756 | 26.01/0.772 | 26.69/0.822 | 25.89/0.737 | 27.24/0.825 | 26.85/0.817 |
| x4 | Vimeo-90k | 31.32/0.868 | 33.55/0.907 | 34.89/0.923 | 35.64/0.932 | 34.27/0.925 | 37.18/0.945 | 35.62/0.928 |
| x2 | Set14 | 31.85/0.802 | 32.99/0.872 | 33.23/0.916 | 32.18/0.917 | 32.22/0.922 | 33.63/0.949 | 33.02/0.926 |
| x2 | Vimeo-90k | 36.52/0.871 | 37.76/0.899 | 37.53/0.938 | 37.71/0.941 | 38.01/0.945 | 38.27/0.960 | 37.65/0.953 |
Thus, Vid4 is frequently used for evaluating the performance of VSR methods. Vimeo-90K and Set14 include videos with hard, real scenes, which are challenging for VSR methods. We therefore choose these three datasets as test data in the following section.
Several DL-based methods are selected for comparison, including VESPCN (Caballero et al., 2017), SOFVSR (Wang et al., 2020), FRVSR (Sajjadi et al., 2018) and TecoGAN (Chu et al., 2020). This selection takes the number of model parameters into consideration: the parameters of the selected models are similar to or larger than those of the model proposed in this paper. BasicVSR (Chan et al., 2021) is included to quantify the gap in numerical metrics between our proposed method and a leading large-parameter model.
4.1.2 Implementation Details
We conduct experiments on data captured from 40 high-resolution videos (720p, 1080p and 4K) downloaded from vimeo.com. We apply Gaussian blur with standard deviation σ = 1.5 to the HR frames and downsample them by 4× to produce the input LR videos, which is also known as Blur-Downsampling (BD) degradation. Our model is implemented with the PyTorch framework on a PC with a single NVIDIA GeForce RTX 2080 Ti GPU. The Adam optimizer is used to train the network with β1 = 0.9 and β2 = 0.999, with a base learning rate of 0.0001 that is decayed by 0.5 every 150,000 iterations. We set the mini-batch size to 4 and the total number of iterations to 4×10^5.
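As a rough sketch of this BD degradation pipeline (our own assumptions: a 13-tap Gaussian kernel and strided subsampling; the paper only specifies σ = 1.5 and the 4× factor), LR frames could be prepared as follows:

```python
import torch
import torchvision.transforms.functional as TF

def bd_degrade(hr, sigma=1.5, scale=4, kernel_size=13):
    """Blur-Downsample (BD): Gaussian blur then 4x downsampling.

    hr: (N, C, H, W) tensor with H and W divisible by `scale`.
    kernel_size and the strided subsampling are assumptions; the paper only
    specifies sigma = 1.5 and a 4x downscaling factor.
    """
    blurred = TF.gaussian_blur(hr, kernel_size=kernel_size, sigma=sigma)
    return blurred[..., ::scale, ::scale]   # keep every `scale`-th pixel

# Example: a 1280x720 HR frame becomes a 320x180 LR input.
hr_frame = torch.rand(1, 3, 720, 1280)
lr_frame = bd_degrade(hr_frame)
print(lr_frame.shape)   # torch.Size([1, 3, 180, 320])
```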
4.2 Evaluation Results and Discussion
4.2.1 Quantitative Results and Qualitative Evaluations
As Table 1 shows, the quantitative metrics peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are computed on the RGB channels for an objective assessment of VSR image quality on the Vid4 dataset and the Vimeo-90k test set under the BD degradation.
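As a reminder of how these numbers are obtained, a minimal PSNR computation on RGB tensors might look as follows (our own sketch; whether borders are cropped or other color spaces are used is not specified by the paper):

```python
import torch

def psnr_rgb(pred, target, max_val=1.0):
    """PSNR in dB between two RGB tensors scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example with random tensors standing in for an SR output and its ground truth.
sr, gt = torch.rand(3, 180, 320), torch.rand(3, 180, 320)
print(psnr_rgb(sr, gt).item())
```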
(1) Compared with competitive lightweight VSR networks, our REPVSR obtains a 0.46 dB gain on Vid4 over FRVSR, and also has a clear advantage over the other compared models. Note that, unlike the original FRVSR network, we merely deploy the SRMB and FNMB techniques on a shortened backbone network, and we obtain superior performance while consuming only a fraction of the FLOPs of the original FRVSR network.
(2) It is interesting that our small model, despite being much more efficient, gets very close re-
Figure 3: Qualitative comparison on Vid4 (Bicubic, VESPCN, SOFVSR, FRVSR, TecoGAN, Ours, and GT).
Table 2: Comparison of computation cost (FLOPs) and inference speed (FPS) at different resolutions.

| Method | Parameters (M) | Source | Target | FLOPs (G) | FPS (GPU) |
|---------|----------------|----------|--------|-----------|-----------|
| VESPCN | 0.879 | 320×180 | 720p | 96.56 | 48.48 |
| VESPCN | 0.879 | 480×270 | 1080p | 221.08 | 24.76 |
| VESPCN | 0.879 | 960×540 | 4K | 886.47 | 6.78 |
| SOFVSR | 1.640 | 320×180 | 720p | 226.12 | 13.31 |
| SOFVSR | 1.640 | 480×270 | 1080p | 508.78 | 5.993 |
| SOFVSR | 1.640 | 960×540 | 4K | 2035.11 | 1.73 |
| FRVSR | 2.589 | 320×180 | 720p | 190.81 | 31.16 |
| FRVSR | 2.589 | 480×270 | 1080p | 429.30 | 15.10 |
| FRVSR | 2.589 | 960×540 | 4K | 1718.65 | 3.76 |
| TecoGAN | 2.589 | 320×180 | 720p | 190.81 | 31.15 |
| TecoGAN | 2.589 | 480×270 | 1080p | 429.30 | 15.05 |
| TecoGAN | 2.589 | 960×540 | 4K | 1718.65 | 3.74 |
| RepVSR | 0.274 | 320×180 | 720p | 29.435 | 96.76 |
| RepVSR | 0.274 | 480×270 | 1080p | 66.228 | 37.52 |
| RepVSR | 0.274 | 960×540 | 4K | 264.926 | 14.36 |
sults compared to the much larger BasicVSR model on the validation datasets, demonstrating that our REPVSR method makes better use of the re-parameterized network structure and increases the efficiency of the learned network parameters.
From the objective results, the above quantitative evaluation is consistent with the qualitative evaluation shown in Figure 3. We can see that our model is able to recover fine details and produce visually pleasing results. REPVSR achieves strong restoration ability while maintaining a slim framework.
4.2.2 Running Time Analysis
The running frame rates of different VSR models during the inference stage are presented in this part. The experimental results are shown in Table 2. The second column lists the parameters of each VSR model and the fifth column reports the corresponding computation cost. The total computation cost required by our REPVSR during inference is only 31.17% of that of VESPCN, 16.71% of SOFVSR, and 10.58% of FRVSR and TecoGAN, not to mention BasicVSR, whose computation cost is as large as 338.5G FLOPs. The last column lists the average FPS
at different resolutions. When generating 1080p video, the proposed method can run in real time on NVIDIA GeForce GTX 1080 level graphics cards. Thanks to structural re-parameterization, our REPVSR model runs two times faster or more on the GPU platform compared with other deep models.
5 CONCLUSION
In this paper, we design a recurrent VSR network based on re-parameterization (REPVSR), which re-parameterizes a multi-branch training design into compact single-convolution blocks for inference. The results show a favorable speed-accuracy trade-off compared to existing VSR models. In the future, we aim to embed the re-parameterization mechanism into other efficient VSR architectures.
REFERENCES
Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J.,
Wang, Z., and Shi, W. (2017). Real-time video super-
resolution with spatio-temporal networks and motion
compensation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition,
pages 4778–4787.
Chan, K. C., Wang, X., Yu, K., Dong, C., and Loy, C. C.
(2021). Basicvsr: The search for essential components
in video super-resolution and beyond. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 4947–4956.
Chan, K. C., Zhou, S., Xu, X., and Loy, C. C. (2022). Ba-
sicvsr++: Improving video super-resolution with en-
hanced propagation and alignment. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, pages 5972–5981.
Chu, M., Xie, Y., Mayer, J., Leal-Taixé, L., and Thuerey, N. (2020). Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics (TOG), 39(4):75–1.
Ding, X., Chen, H., Zhang, X., Huang, K., Han, J.,
and Ding, G. (2022). Re-parameterizing your op-
timizers rather than architectures. arXiv preprint
arXiv:2205.15242.
Ding, X., Guo, Y., Ding, G., and Han, J. (2019). Acnet:
Strengthening the kernel skeletons for powerful cnn
via asymmetric convolution blocks. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 1911–1920.
Ding, X., Zhang, X., Han, J., and Ding, G. (2021a). Diverse
branch block: Building a convolution as an inception-
like unit. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
10886–10895.
Ding, X., Zhang, X., Ma, N., Han, J., Ding, G., and Sun,
J. (2021b). Repvgg: Making vgg-style convnets great
again. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
13733–13742.
Fuoli, D., Danelljan, M., Timofte, R., and Van Gool,
L. (2023). Fast online video super-resolution with
deformable attention pyramid. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 1735–1744.
Haris, M., Shakhnarovich, G., and Ukita, N. (2019).
Recurrent back-projection network for video super-
resolution. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 3897–3906.
Jian, S., Xu, Z., and Shum, H. Y. (2008). Image super-
resolution using gradient profile prior. In 2008 IEEE
Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2008), 24-26 June
2008, Anchorage, Alaska, USA.
Liu, H., Zhao, P., Ruan, Z., Shang, F., and Liu, Y. (2021).
Large motion video super-resolution with dual subnet
and multi-stage communicated upsampling. In Pro-
ceedings of the AAAI conference on artificial intelli-
gence, volume 35, pages 2127–2135.
Marnerides, D., Bashford-Rogers, T., Hatchett, J., and De-
battista, K. (2018). Expandnet: A deep convolu-
tional neural network for high dynamic range expan-
sion from low dynamic range content. In Computer
Graphics Forum, volume 37, pages 37–49. Wiley On-
line Library.
Sajjadi, M. S., Scholkopf, B., and Hirsch, M. (2017). En-
hancenet: Single image super-resolution through au-
tomated texture synthesis. In Proceedings of the IEEE
International Conference on Computer Vision, pages
4491–4500.
Sajjadi, M. S., Vemulapalli, R., and Brown, M. (2018).
Frame-recurrent video super-resolution. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6626–6634.
Tian, Y., Zhang, Y., Fu, Y., and Xu, C. (2020). Tdan:
Temporally-deformable alignment network for video
super-resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition, pages 3360–3369.
Wang, L., Guo, Y., Liu, L., Lin, Z., Deng, X., and An, W. (2020). Deep video super-resolution using hr optical flow estimation. IEEE Transactions on Image Processing, 29:4323–4336.
Wang, X., Chan, K. C., Yu, K., Dong, C., and Change Loy,
C. (2019). Edvr: Video restoration with enhanced de-
formable convolutional networks. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops, pages 0–0.
Xia, B., He, J., Zhang, Y., Wang, Y., Tian, Y., Yang, W.,
and Van Gool, L. (2023). Structured sparsity learning
for efficient video super-resolution. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 22638–22647.
Xue, T., Chen, B., Wu, J., Wei, D., and Freeman, W. T.
(2019). Video enhancement with task-oriented flow.
International Journal of Computer Vision, 127:1106–
1125.
Zagoruyko, S. and Komodakis, N. (2017). Diracnets:
Training very deep neural networks without skip-
connections. arXiv preprint arXiv:1706.00388.
Zhang, X., Zeng, H., and Zhang, L. (2021). Edge-oriented
convolution block for real-time super resolution on
mobile devices. In Proceedings of the 29th ACM In-
ternational Conference on Multimedia, pages 4034–
4043.
Zhou, D., Gu, C., Xu, J., Liu, F., Wang, Q., Chen, G.,
and Heng, P.-A. (2023). Repmode: Learning to re-
parameterize diverse experts for subcellular structure
prediction. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 3312–3322.