stance of each clothing category in the input image.
Each output channel of ‘BBbranch’ is dedicated to
bounding box prediction of a specific clothing cate-
gory, and no complex region proposal subnetwork is
needed. It is important to note that no additional anno-
tation work was done, the ground truth bounding box
position calculation was inferred automatically from
the ground truth segmentation mask.
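For illustration, deriving a category's box from the mask amounts to taking the extremes of its pixel coordinates. A minimal sketch (the function name and the (H, W) label-map layout are our own assumptions, not code from the paper):

```python
import torch

def bboxes_from_mask(mask, num_classes):
    """Derive per-category bounding boxes from an (H, W) label mask
    (0 = background, 1..num_classes = clothing categories).

    Returns a (num_classes, 4) tensor of (x_min, y_min, x_max, y_max),
    normalised by the image size; rows of absent categories stay zero.
    """
    h, w = mask.shape
    boxes = torch.zeros(num_classes, 4)
    for c in range(1, num_classes + 1):
        ys, xs = torch.nonzero(mask == c, as_tuple=True)
        if ys.numel() == 0:          # category not present in this image
            continue
        boxes[c - 1] = torch.stack([xs.min() / w, ys.min() / h,
                                    xs.max() / w, ys.max() / h])
    return boxes
```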
Output layers of these additional branches are injected back into the segmentation network by concatenation. More specifically, the logit outputs of the branches are expanded so that they match the resolution of the input image (e.g., for the ‘ClassBranch’, the eight logit output neurons are transformed into eight layers of resolution 288x192, where each layer contains 288x192 clones of the corresponding output neuron). Injecting the branches in this way enables gradient flow from the main segmentation loss to the branches. The concatenation is followed by a residual convolutional layer and a final convolutional layer outputting the segmentation mask logit prediction. We also experimented with a setting in which the additional branches were not injected back into the main segmentation part.
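A minimal sketch of this injection step, assuming the branch outputs a flat logit vector and the decoder features have already been upsampled to the input resolution (names and shapes are illustrative):

```python
import torch

def inject_branch_logits(feats, branch_logits):
    """Tile per-image branch logits over the spatial grid and concatenate
    them with the decoder feature maps.

    feats:         (B, C, H, W) decoder feature maps at input resolution
    branch_logits: (B, K)       auxiliary branch logits (e.g. K = 8 classes)
    returns:       (B, C + K, H, W)
    """
    b, k = branch_logits.shape
    _, _, h, w = feats.shape
    tiled = branch_logits.view(b, k, 1, 1).expand(b, k, h, w)  # K constant maps
    return torch.cat([feats, tiled], dim=1)

# e.g. 64 decoder channels + 8 'ClassBranch' logits at 288x192 resolution
out = inject_branch_logits(torch.randn(2, 64, 288, 192), torch.randn(2, 8))
assert out.shape == (2, 72, 288, 192)
```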
3.3 Loss Function
The loss function used in our network is a weighted sum of three terms:
The first is a weighted cross-entropy loss (WCE) pertaining to the main pixel-wise segmentation task. Because of the class imbalance (dominance of the background class), we set the WCE class weight to 1 for the background class and 2 for each of the foreground classes. This loss term was used with a weight of 1.
The second loss term is a binary cross-entropy pertaining to the multilabel classification task of the ‘ClassBranch’. This loss term was used with a weight of 0.5.
The third component is the smooth L1 loss described in (Girshick, 2015), which, according to that paper, is less sensitive to outliers. The ground truth bounding box positions were expressed relative to the input image size (resulting in values in the [0, 1] range); therefore, an additional sigmoid function was applied before the smooth L1 loss calculation. Since not all clothing categories are present in a given input image, only the channels corresponding to the categories actually present contribute to this loss term. This loss term was used with a weight of 75.
The weights of the individual loss terms were determined empirically so that each term contributes approximately evenly to the final compound loss (the unweighted loss terms have different value ranges).
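For concreteness, the compound loss could be assembled roughly as follows. The tensor shapes, argument names, and the assumption that channel 0 of the segmentation output is the background class are ours; the class weights and the term weights 1, 0.5 and 75 follow the description above:

```python
import torch
import torch.nn.functional as F

def compound_loss(seg_logits, class_logits, bbox_logits,
                  seg_target, class_target, bbox_target,
                  num_classes=8):
    """seg_logits: (B, num_classes + 1, H, W), seg_target: (B, H, W) labels,
    class_logits / class_target: (B, num_classes), bbox_logits / bbox_target:
    (B, num_classes, 4) with targets normalised to [0, 1]."""
    # 1) weighted cross-entropy (WCE): weight 1 for background (index 0),
    #    weight 2 for every foreground class
    ce_weights = torch.ones(num_classes + 1, device=seg_logits.device)
    ce_weights[1:] = 2.0
    seg_loss = F.cross_entropy(seg_logits, seg_target, weight=ce_weights)

    # 2) binary cross-entropy for the multilabel 'ClassBranch'
    cls_loss = F.binary_cross_entropy_with_logits(class_logits, class_target)

    # 3) smooth L1 on sigmoid-squashed box predictions, restricted to the
    #    channels of categories actually present in the image
    present = class_target.bool()                     # (B, num_classes)
    if present.any():
        box_loss = F.smooth_l1_loss(torch.sigmoid(bbox_logits)[present],
                                    bbox_target[present])
    else:
        box_loss = bbox_logits.sum() * 0.0            # no foreground boxes

    # empirically chosen term weights: 1 / 0.5 / 75
    return 1.0 * seg_loss + 0.5 * cls_loss + 75.0 * box_loss
```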
3.4 Network Training Specifics
For reproducibility, this subsection provides our net-
work training details.
Preprocessing - based on the most prevalent height-to-width ratio in our dataset (3:2) and the fact that the Resnet34 backbone downsamples images by a factor of 32, we resize every image to 288x192 (both dimensions divisible by 32).
Data augmentation - rather conservative data augmentation was performed, namely horizontal flipping, mild lighting changes, and mild rotation. It is important to note that the segmentation mask was transformed in the same way as the original input image, and the ground truth bounding box was inferred from the transformed mask (a calculation performed efficiently on the GPU).
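The sketch below illustrates the fixed resize and the idea of applying identical geometric transforms to the image and its mask; the parameter ranges and the use of torchvision's functional transforms are our assumptions (the paper only specifies the transform types, and performs the bounding box inference on the GPU):

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def preprocess_and_augment(image, mask):
    """`image` and `mask` are assumed to be PIL Images; the mask uses
    nearest-neighbour interpolation so that label values are preserved.
    The augmentation ranges are illustrative."""
    # resize to the fixed 288x192 resolution
    image = TF.resize(image, [288, 192])
    mask = TF.resize(mask, [288, 192], interpolation=InterpolationMode.NEAREST)

    # horizontal flip, applied identically to image and mask
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)

    # mild rotation, again applied to both
    angle = random.uniform(-10.0, 10.0)
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle)

    # mild lighting change, applied to the image only
    image = TF.adjust_brightness(image, random.uniform(0.9, 1.1))
    return image, mask
```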
We used a batch size of 16. To fit a batch of sufficient size within the GPU memory constraints, the weights and activations were kept in half precision (16-bit floating point) throughout training, while the loss terms and gradients were kept in 32-bit floating-point precision.
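The mixed-precision handling itself was provided by the training framework; as a rough illustration of the same idea (float16 forward pass, float32 loss scaling and gradients), current PyTorch offers torch.cuda.amp. The toy model and shapes below are placeholders, not the network from this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(3, 9, 3, padding=1).cuda()      # stand-in for the segmentation net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # loss scaling, float32 gradients

images = torch.randn(16, 3, 288, 192, device="cuda")
targets = torch.randint(0, 9, (16, 288, 192), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                   # forward pass / activations in float16
    loss = F.cross_entropy(model(images), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```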
We used an NVidia GTX 1080 GPU. The training
was two-phased, which is characteristic of transfer
learning. In the first phase, we froze the Resnet34
backbone pre-trained on Imagenet and trained only
the rest of the network (all the branches included).
This phase lasted ten epochs, using the Adam optimizer and a cyclical learning rate scheduler (Smith, 2017) with a learning rate maximum of 10^-4. Then we unfroze the whole network and continued for another ten epochs, again with the Adam optimizer and a cyclical learning rate scheduler with a learning rate maximum of 10^-4. A weight decay of 10^-3 was used in both phases (with the weight decay decoupled from the Adam optimizer, as described in (Loshchilov and Hutter, 2017)). This training procedure was determined empirically based on the training behavior of the Plain U-Net. The training of all models was closely monitored in terms of losses and metrics. There was no indication of performance suffering from poorly chosen training hyperparameters, hence no model-specific modifications of the training hyperparameters were made.
The network design and training were implemented in the FastAI (Howard et al., 2018) and PyTorch (Paszke et al., 2017) environment. The implementation of the DeepLabV3, DeepLabV3+ and FPN architectures from (Yakubovskiy, 2020) was used.
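In terms of the FastAI v1 API, the two-phase schedule described above corresponds roughly to the following sketch; the DataBunch `data`, the multi-branch `model`, the `compound_loss` function, and the split of the model into backbone and head layer groups are assumed to be defined elsewhere:

```python
from fastai.vision import *   # fastai v1 (Howard et al., 2018)

# `data`, `model` and `compound_loss` are assumed to exist; the model is
# assumed to be split into a backbone group and a head group for freezing.
learn = Learner(data, model, loss_func=compound_loss, wd=1e-3).to_fp16()

# phase 1: Resnet34 backbone frozen, decoder and branches trained for 10 epochs
learn.freeze()
learn.fit_one_cycle(10, max_lr=1e-4)

# phase 2: whole network unfrozen and fine-tuned for another 10 epochs
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=1e-4)
```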