2.2 Proposed Approach
The network pipeline integrates U-Net with
ResNet18 and VGG16 backbones as the generator's
architecture for image colorization tasks, with a focus
on efficiently translating grayscale images into
colored outputs. The U-Net structure, known for its
encoder-decoder configuration with skip
connections, is pretrained on the training dataset
comprising pairs of grayscale and colored images.
Pretraining the generator for grayscale image
colorization ensures initial sample diversity and a
smooth transition towards covering the entire target
color distribution, resulting in more gradual image
evolution (Grigoryev, 2022). The PatchGAN
discriminator evaluates the authenticity of the
generated images on a localized patch basis, refining
the generator's output through adversarial training.
The training procedure involves a cycle of updating
the discriminator using both real and synthesized
images, followed by refining the generator to create
images that are progressively more difficult to
distinguish from real ones. This comprehensive
approach, illustrated in Figure 1, combines advanced
architectures and a detailed training strategy to
successfully colorize grayscale images with high
quality.
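The alternating update described above can be sketched in a few lines of PyTorch. The one-layer "generator" and "discriminator" below are illustrative stand-ins, not the paper's actual architectures; the point is the order of the two optimization steps.

```python
# Sketch of one adversarial update cycle: the discriminator is updated on
# real and generated images, then the generator is updated to fool it.
# All module sizes here are illustrative placeholders.
import torch
import torch.nn as nn

G = nn.Conv2d(1, 3, 3, padding=1)            # stand-in generator: gray -> color
D = nn.Sequential(nn.Conv2d(3, 1, 4, 2, 1))  # stand-in PatchGAN: patch scores
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

gray = torch.rand(4, 1, 16, 16)   # grayscale batch
real = torch.rand(4, 3, 16, 16)   # paired color batch

# 1) Discriminator step: real patches labeled 1, fake patches labeled 0.
fake = G(gray)
d_real = D(real)
d_fake = D(fake.detach())
loss_d = (bce(d_real, torch.ones_like(d_real))
          + bce(d_fake, torch.zeros_like(d_fake))) / 2
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# 2) Generator step: try to make D label the fake patches as real.
d_fake = D(fake)
loss_g = bce(d_fake, torch.ones_like(d_fake))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Note the `detach()` in the discriminator step: it blocks gradients from the discriminator loss from flowing back into the generator.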
2.2.1 U-Net
The utilization of U-Net as the foundational
architecture for the network's generator proves
advantageous for tasks centered around image-to-
image translation (Isola, 2017). In many image
translation scenarios, a substantial amount of
low-level information is shared between the input
and the output, and it is advantageous to pass this
information directly across the network. To provide the generator
with an effective mechanism to overcome
information bottlenecks, skip connections are
introduced, inspired by the architecture of a "U-Net"
(Ronneberger, 2015). These connections are
strategically placed between every layer i and its
corresponding layer n - i, where n represents the total
number of layers. Each skip connection functions by
concatenating all channels at layer i with those at
layer n - i.
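In code, such a skip connection amounts to a channel-wise concatenation of the encoder feature map at layer i with the decoder feature map at the mirrored layer n - i (the tensor shapes below are illustrative):

```python
# Channel-wise skip connection: features from encoder layer i are
# concatenated with the decoder features at the mirrored layer n - i,
# doubling the channel count while preserving spatial resolution.
import torch

enc_i = torch.rand(1, 64, 32, 32)   # encoder activation at layer i
dec_ni = torch.rand(1, 64, 32, 32)  # decoder activation at layer n - i
skip = torch.cat([enc_i, dec_ni], dim=1)  # (1, 128, 32, 32)
```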
This framework revolves around integrating the
U-Net structure as the central generator, as illustrated
in Figure 2, while incorporating diverse backbone
architectures such as ResNet18 and VGG16 to bolster
feature extraction and representation. The U-Net
generators, augmented with ResNet18 and VGG16
backbones, undergo pretraining on a curated training
dataset comprising pairs of grayscale and colored
images. Employing the L1 loss function during
pretraining and optimizing with the Adam optimizer
contributes to refining the generators' capacity to
generate realistic color predictions from grayscale
inputs.
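A minimal sketch of this pretraining step is shown below. The one-layer "generator" is a placeholder for the U-Net with its ResNet18 or VGG16 backbone, and the learning rate is an assumed value, not taken from the paper.

```python
# Pretraining sketch: the generator minimizes an L1 loss between its
# color prediction and the ground-truth color image, optimized with Adam.
import torch
import torch.nn as nn

gen = nn.Conv2d(1, 3, 3, padding=1)  # placeholder for the U-Net generator
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)  # assumed learning rate
l1 = nn.L1Loss()

gray = torch.rand(8, 1, 16, 16)    # grayscale inputs
color = torch.rand(8, 3, 16, 16)   # paired color targets

pred = gen(gray)
loss = l1(pred, color)
opt.zero_grad(); loss.backward(); opt.step()
```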
Figure 2: The structure of U-Net (Photo/Picture credit:
Original).
The "U-Net" configuration in Figure 2 is an
encoder-decoder structure distinguished by skip
connections, where 'X' is the grayscale input and 'Y'
is the resulting colorized image. This design
facilitates the direct flow of information across the
network, allowing the model to preserve details from
the input for a precise colorization output.
2.2.2 PatchGAN
The discriminator utilizes a convolutional PatchGAN,
a model composed of stacked blocks of convolutional,
batch-normalization, and Leaky ReLU layers, as
illustrated in Figure 3, that decides
whether the input image is real or fake. The first and
last blocks do not use normalization and the last block
has no activation function. The PatchGAN solely
penalizes structural inconsistencies within patches of
an image, operating at a localized scale rather
than evaluating the image as a whole. By
applying convolution across the entire image, the
discriminator combines all responses to produce its
final output.
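The block structure just described can be sketched as follows. The channel widths and number of blocks are illustrative assumptions; what matters is that the first block omits normalization and the last is a bare convolution.

```python
# PatchGAN discriminator sketch: stacked Conv-BatchNorm-LeakyReLU blocks.
# The first block has no normalization, and the last block is a plain
# convolution with no normalization or activation, producing a grid of
# per-patch real/fake scores rather than a single scalar.
import torch
import torch.nn as nn

def block(c_in, c_out, norm=True):
    layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2))
    return layers

patchgan = nn.Sequential(
    *block(3, 64, norm=False),       # first block: no normalization
    *block(64, 128),
    *block(128, 256),
    nn.Conv2d(256, 1, 4, 1, 1),      # last block: no norm, no activation
)

scores = patchgan(torch.rand(1, 3, 64, 64))  # one logit per local patch
```

Each entry of `scores` judges one receptive-field patch of the input, and these local responses are what the discriminator averages into its final output.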
Initially, the discriminator is trained: fake images
generated by the generator are fed into it and labeled
as fake. Subsequently, a batch of real
images from the training set is fed into the
discriminator and labeled as real. The losses incurred
from both fake and real images are summed up,
averaged, and subjected to the backward operation to
update the discriminator. Then, the generator is
trained by feeding fake images into the discriminator