Table 1: Mean and per-class IOU (%) on the Helen and Vistas
(selected subset) datasets at training convergence.

              Helen Dataset          |           Vistas Dataset
Class          U-Net  U-Net+HL       | Class          U-Net  U-Net+HL
Background     92.04  92.653         | Car            80.35  80.91
Face skin      86.53  87.04          | Terrain        54.21  56.07
Left eyebrow   63.12  62.68          | Lane Marking   49.14  51.69
Right eyebrow  63.65  64.25          | Building       77.31  79.67
Left eye       63.98  67.81          | Road           82.31  82.69
Right eye      64.92  72.72          | Trash Can       5.38  18.63
Nose           84.05  82.55          | Manhole         2.25  16.96
Upper lip      52.88  56.48          | Catch Basin     1.58  13.59
Inner mouth    62.17  67.94          | Snow           56.97  71.46
Lower lip      65.64  67.92          | Person         39.23  48.30
Hair           65.41  66.11          | Water          29.87  16.10
Mean           69.49  71.65          | Mean           24.74  26.51
gain using a simple baseline architecture. Networks use Kaiming uniform initialisation with the same random seed, so that the vanilla and hierarchically trained networks are initialised identically. Pre-training is not used. We use Stochastic Gradient Descent with a learning rate of 0.01 and a batch size of 5 (due to memory constraints). During training, images and labels were randomly square-cropped along the shortest dimension and re-sized to 256². The only further augmentation used was random flipping (p = 0.5).
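For concreteness, a minimal PyTorch sketch of this setup is given below. The U-Net itself and names such as augment and train_set are our own illustrative choices, not the implementation used here.

    import random

    import torch
    import torch.nn as nn
    import torchvision.transforms.functional as TF
    from torchvision.transforms import InterpolationMode


    def kaiming_init(model, seed=0):
        # Fixed seed so the vanilla and hierarchically trained
        # networks start from identical weights.
        torch.manual_seed(seed)
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)


    def augment(image, label):
        # image: (3, H, W) float tensor; label: (1, H, W) long tensor.
        h, w = image.shape[-2:]
        side = min(h, w)  # square crop using the shortest dimension
        top = random.randint(0, h - side)
        left = random.randint(0, w - side)
        image = TF.resized_crop(image, top, left, side, side, [256, 256])
        label = TF.resized_crop(label, top, left, side, side, [256, 256],
                                interpolation=InterpolationMode.NEAREST)
        if random.random() < 0.5:  # random flip, p = 0.5
            image, label = TF.hflip(image), TF.hflip(label)
        return image, label


    # net = UNet(num_classes=11)  # any U-Net implementation
    # kaiming_init(net, seed=0)
    # optimiser = torch.optim.SGD(net.parameters(), lr=0.01)
    # loader = DataLoader(train_set, batch_size=5, shuffle=True)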
Datasets. For experimenting with hierarchical losses on segmentation we chose two very different datasets: the Helen (Le et al., 2012) facial dataset and Mapillary's Vistas (Neuhold et al., 2017) road scene dataset. The Helen dataset covers a wide variety of facial types (age, ethnicity, colour/grayscale, expression, facial pose) and was originally built for facial feature localisation (Le et al., 2012). We use an augmented Helen dataset (Smith et al., 2013) with semantic segmentation labels. Helen contains 2000, 230 and 100 images/annotations for training, validation and testing respectively, covering 11 classes (10 facial classes plus background; see Tab. 1, left). It should be noted that the ground truth annotations are occasionally inaccurate, particularly for hair, which makes that class challenging to learn. The road scene Vistas dataset (Neuhold et al., 2017) is composed of 25000 images/annotations (18000 training, 2000 validation, 5000 testing) with 66 classes. As Vistas contains too many classes to illustrate easily, Tab. 1 (right) lists only a representative subset, chosen to show the most significant differences in performance, together with the mean over all classes. Further, our intention is to indicate the performance improvement gained by hierarchical learning rather than to compare between the datasets. The Vistas hierarchy is three levels deep, with 66 leaf nodes and 11 internal nodes.
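To make such a hierarchy concrete, one simple way to encode the tree for training purposes is as leaf-to-ancestor index maps, sketched below. The groupings shown are illustrative examples only, not the exact Vistas taxonomy.

    # Illustrative three-level hierarchy in the spirit of Vistas;
    # each leaf class maps to (depth-2 ancestor, depth-1 ancestor).
    hierarchy = {
        "car":          ("vehicle",   "object"),
        "person":       ("human",     "object"),
        "road":         ("flat",      "construction"),
        "lane-marking": ("flat",      "construction"),
        "building":     ("structure", "construction"),
        "snow":         ("nature",    "nature"),
    }

    leaves = sorted(hierarchy)
    d2_names = sorted({a for a, _ in hierarchy.values()})
    d1_names = sorted({b for _, b in hierarchy.values()})

    # Index maps: position = leaf index, value = ancestor index.
    leaf_to_d2 = [d2_names.index(hierarchy[c][0]) for c in leaves]
    leaf_to_d1 = [d1_names.index(hierarchy[c][1]) for c in leaves]

Re-indexing a leaf label image through these maps yields the ground truth at each shallower depth, which is all a per-depth loss requires.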
Results. Fig. 4 (right) shows the losses at each abstraction depth of the class hierarchy for the Helen experiment. Note that a deeper loss is always larger than a shallower one. Our hierarchically trained method benefits significantly from the hierarchical structure in the class labels, particularly in the early phase of training, learning much faster than the vanilla model. Fig. 4 (left) illustrates the mean Intersection over Union (IOU) during training. The performance gain is most significant after epoch 35 and can be observed in the qualitative results of Fig. 6. At performance convergence we observe some qualitative differences between the hierarchically trained network and the vanilla one. For example, in Fig. 6 the U-Net+HL predictions at epoch 200 show somewhat fewer hair artefacts, while the first example shows an improvement on a difficult angled facial pose. The epoch 50 results clearly show the faster convergence.

Figure 5: Training behaviour on Vistas. Left: mean IOU versus epoch. Right: classification loss for each abstraction depth D = 1..3 versus epoch. We show results trained with vanilla (U-Net) and hierarchical (U-Net+HL) loss.
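The per-depth losses plotted in Fig. 4 and Fig. 5 can be realised in several ways. As a rough, minimal sketch (not the implementation used here), one can sum a cross-entropy term at every depth, obtaining each parent class's probability by summing those of its children through leaf-to-ancestor maps such as leaf_to_d1 and leaf_to_d2 above:

    import torch
    import torch.nn.functional as F

    def hierarchical_loss(logits, target, ancestor_maps):
        """Cross-entropy summed over every depth of the hierarchy.

        logits:        (B, C_leaf, H, W) raw network outputs.
        target:        (B, H, W) ground-truth leaf-class indices.
        ancestor_maps: one LongTensor per shallower depth, of length
                       C_leaf, mapping leaf index to ancestor index.
        """
        loss = F.cross_entropy(logits, target)  # deepest (leaf) level
        probs = logits.softmax(dim=1)
        for amap in ancestor_maps:
            n_parents = int(amap.max()) + 1
            # A parent's probability is the sum of its children's.
            parent_probs = probs.new_zeros(
                probs.shape[0], n_parents, *probs.shape[2:])
            parent_probs.index_add_(1, amap, probs)
            parent_target = amap[target]  # relabel at this depth
            loss = loss + F.nll_loss(torch.log(parent_probs + 1e-8),
                                     parent_target)
        return loss

    # e.g. ancestor_maps = [torch.tensor(leaf_to_d2),
    #                       torch.tensor(leaf_to_d1)]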
For Vistas, the IOU performance gain is less notable than on Helen, but the hierarchically trained model still outperforms the vanilla model in both the per-level losses and the mean IOU (Fig. 5 and Tab. 1, right). The qualitative results in Fig. 7 illustrate predictions for both methods at epochs 1 and 80. Most interestingly, after a single epoch the hierarchically trained model is able to correctly classify a significant proportion of lane markings whereas the vanilla model cannot, showing how quickly our hierarchical model learns. Relative to the vanilla model, our hierarchically trained model achieves 3% and 7% relative improvements in mean IOU on Helen and Vistas respectively (see Tab. 1).
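For reference, the per-class and mean IOU figures in Tab. 1 follow the standard definition, computable from a confusion matrix as in the generic sketch below (variable names are ours):

    import numpy as np

    def per_class_and_mean_iou(conf):
        """IOU from a (C, C) confusion matrix where conf[i, j]
        counts pixels of true class i predicted as class j."""
        tp = np.diag(conf).astype(float)          # true positives
        fp = conf.sum(axis=0) - tp                # false positives
        fn = conf.sum(axis=1) - tp                # false negatives
        iou = tp / np.maximum(tp + fp + fn, 1.0)  # avoid divide-by-zero
        return 100.0 * iou, 100.0 * iou.mean()    # percentages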
7 CONCLUSIONS
Our results illustrate the great potential of losses that encourage semantically similar classes within a hierarchy to be classified close together: the model parameters are guided towards a solution that is not only quantitatively better, but also reached faster in training than with a standard loss implementation. We speculate that the hierarchically trained models perform better due to learning more robust features from visually