tion in a video encoder. Error concealment based on neighboring motion vectors, e.g., (Chen et al., 1997), assumes that the motion vectors surrounding the lost macroblock (MB) are available. Error concealment using side information, e.g., (Hadizadeh et al., 2013), sends an additional low-resolution version of the image frame as side information to assist error concealment at the decoder. Pixel-wise post-processing, e.g., (Atzori et al., 2001), is another form of error concealment in which the MBs inside the concealed loss area are refined using mesh-based warping. In error concealment with error propagation, e.g., (Usman et al., 2016), a missing frame between two received frames is interpolated along the motion trajectory, and the concealment quality is then improved by adaptive filtering. Furthermore, shape-preserving loss concealment techniques, e.g., (Shirani et al., 2000a), aim to recover object shapes in the lossy frames.
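As an illustration of the classical motion-vector approach, the Python sketch below conceals a lost MB by taking the median of the motion vectors of its available neighbors and copying the motion-compensated block from the reference frame. This is a minimal sketch of the general idea rather than the exact algorithm of (Chen et al., 1997); the 16x16 MB size, array shapes, and function names are illustrative assumptions.

import numpy as np

MB = 16  # assumed macroblock size in pixels

def conceal_lost_mb(ref_frame, cur_frame, mb_row, mb_col, neighbor_mvs):
    # Estimate the lost MB's motion vector as the median of the
    # motion vectors of its available neighboring MBs; the median
    # is robust to a single outlier neighbor.
    if neighbor_mvs:
        dy, dx = np.median(np.asarray(neighbor_mvs), axis=0).astype(int)
    else:
        dy, dx = 0, 0  # no neighbors available: copy the co-located block

    y, x = mb_row * MB, mb_col * MB
    ry = int(np.clip(y + dy, 0, ref_frame.shape[0] - MB))
    rx = int(np.clip(x + dx, 0, ref_frame.shape[1] - MB))
    # Motion-compensated copy from the reference frame.
    cur_frame[y:y + MB, x:x + MB] = ref_frame[ry:ry + MB, rx:rx + MB]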
In recent years, researchers have focused on deep neural networks for video error concealment. For example, the FINNiGAN model (Koren et al., 2017) uses a generative adversarial network (GAN) to perform frame interpolation. Similarly, the authors in (Xiang et al., 2019) use a GAN consisting of a completion network with an encoder-decoder structure and a discriminator network. Likewise, an adversarial learning framework using a conditional GAN (cGAN) is proposed in (Mahmud et al., 2018) to reconstruct a frame when one or more frames are missing in a multi-camera scenario. However, the FINNiGAN model produces unrelated details when filling in the high-motion regions of a video, and the GAN in (Xiang et al., 2019) uses only temporal information from past frames and omits information from future frames.
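To make the common setup of these GAN-based methods concrete, the following PyTorch sketch pairs a small encoder-decoder completion network with a patch discriminator; the generator is trained with an L1 reconstruction term on the lost region plus an adversarial term. The layer sizes, loss weighting, and class names are illustrative assumptions, not the configuration of any cited model.

import torch
import torch.nn as nn

class CompletionNet(nn.Module):
    # Encoder-decoder that fills the masked region of a frame.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),   # RGB + mask
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame, mask):  # mask: 1 inside the lost region
        return self.net(torch.cat([frame * (1 - mask), mask], dim=1))

# Patch discriminator: one real/fake score per image region.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1))

def generator_loss(pred, target, mask, lam=0.01):
    rec = ((pred - target).abs() * mask).mean()   # L1 on the lost region
    adv = -discriminator(pred).mean()             # reward fooling the discriminator
    return rec + lam * adv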
Other works involving neural networks include image inpainting. The authors in (Liu et al., 2018) proposed a partial convolution layer with an automatic mask update: the convolution is computed only over valid (non-hole) pixels, and the hole mask shrinks layer by layer as regions are filled. However, this model does not work well on images containing thin structures, e.g., the bars of a door.
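The sketch below, again in PyTorch, shows one way such a partial convolution layer could be written, assuming a single-channel validity mask (1 = valid pixel, 0 = hole): the convolution sees only valid pixels, the output is renormalized by the number of valid inputs in each window, and the updated mask marks every output position that received at least one valid pixel. It is a simplified sketch of the idea in (Liu et al., 2018), not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        # Fixed all-ones kernel used to count valid pixels per window.
        self.register_buffer('ones', torch.ones(1, 1, k, k))
        self.window = k * k

    def forward(self, x, mask):  # mask: (N, 1, H, W), 1 = valid, 0 = hole
        out = self.conv(x * mask)  # holes contribute zero to the sums
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, padding=self.conv.padding[0])
        # Renormalize by the fraction of valid inputs per window and zero
        # out positions whose window contained no valid pixel at all.
        out = out * (self.window / valid.clamp(min=1.0)) * (valid > 0)
        return out, (valid > 0).float()  # output and updated mask

Stacking such layers progressively shrinks the hole: each layer passes its updated mask to the next, so deeper layers operate on increasingly complete feature maps.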
The authors in (Radford et al., 2015) introduced a class of CNNs called DCGAN, which works well for image classification tasks but not for regression tasks such as video error concealment. Similarly, the authors in (Yu et al., 2018) proposed a feed-forward CNN that can process images with multiple losses at arbitrary locations and of variable sizes. It is an enhancement of the baseline generative image inpainting network (Iizuka et al., 2017), which has shown promising visual results for inpainting images of faces, building facades, and natural images.
However, these image inpainting techniques consider only spatial information and do not exploit the temporal information available in video sequences. The approach in (Sankisa et al., 2018) combines a convolutional long short-term memory (LSTM) model with simple convolutional layers to predict optical flow from the optical flows of previous frames. However, the model needs to know the location of the error in the frame, and it uses only past frames for training.
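For illustration, a flow predictor in this spirit could be sketched in PyTorch as follows: a single ConvLSTM cell consumes the optical flows of the past frames, and a convolutional head maps its final hidden state to the flow of the lost frame, which can then drive motion-compensated concealment. The hidden size and layer shapes are assumptions; this is not the architecture of (Sankisa et al., 2018).

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four LSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class FlowPredictor(nn.Module):
    def __init__(self, hid_ch=32):
        super().__init__()
        self.cell = ConvLSTMCell(2, hid_ch)          # a flow field has 2 channels
        self.head = nn.Conv2d(hid_ch, 2, 3, padding=1)

    def forward(self, past_flows):                   # (N, T, 2, H, W)
        n, t, _, height, width = past_flows.shape
        h = past_flows.new_zeros(n, self.cell.hid_ch, height, width)
        c = torch.zeros_like(h)
        for step in range(t):                        # roll the cell over time
            h, c = self.cell(past_flows[:, step], h, c)
        return self.head(h)                          # predicted flow for the lost frame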
Similarly, the authors in (Sankisa et al., 2020) presented a deep learning framework based on a capsule network architecture that uses motion as an instantiation parameter to encode the motion in videos, followed by motion-compensated error concealment using the extracted motion. However, this network has only been demonstrated on video sequences from the same dataset used for training. Very recently, a flow-based video completion algorithm is proposed in (Gao et al., 2020) that maintains the sharpness of the video but produces arbitrary content in large missing regions within the video frames. Similarly, the authors in (Zeng et al., 2020) proposed a joint Spatial-Temporal Transformer Network (STTN) for video inpainting that concurrently fills lost regions in all video frames. However, STTN fails to generate accurate content for lost regions in frames that contain motion. Finally, a video inpainting method is proposed in (Liu et al., 2021) that aligns reference frames with the target frame at the feature level via implicit motion estimation and aggregates the resulting temporal features to synthesize the missing content. However, this method is not suitable for practical applications.
The above discussion shows that GANs, inpainting models, and various CNN architectures have been applied to video error concealment. However, to the best of our knowledge, no existing work on video error concealment uses a CNN architecture that exploits information from both past and future video frames. Moreover, the existing approaches have not considered transfer learning to make use of the spatial and temporal information of both past and future frames to conceal errors in video data.
3 PROPOSED APPROACH
Figure 1 and Figure 2 jointly illustrate our CECNN approach. Figure 1 shows the model training stage. In this first stage, the original images and video frames from the training datasets, along with simulated errors (such as missing blocks and slices) in those images/frames, are passed to the network, and the voxel information of the error-concealed image/frame is