supervised training of a segmentation method, which
needs only image-wise labeled data to train a classification
network but outputs a pixel-wise segmentation
mask. Further, we show that our approach yields results
comparable to dedicated segmentation networks,
but without the cumbersome requirement of pixel-wise
labeled ground-truth data for training. We put
our approach to the test on two different datasets; example
images can be found in Figure 2 and Figure 3.
In the remainder of this paper, we summarize related
publications in section 2, followed by an explanation
of our system in section 3. In section 4, we evaluate
our approach against a classical semantic segmentation
network and discuss several design options and their
performance impacts.
2 RELATED WORK
Over the past decades, generations of researchers have
developed image segmentation algorithms, and the literature
divides them into three related problems. Semantic
segmentation describes the task of assigning
a class label to every pixel in the image. With instance
segmentation, every pixel is assigned to a
class along with an object identifier, thus separating
different instances of the same class. Panoptic
segmentation performs both tasks concurrently, producing
a semantic segmentation for “stuff” (e.g., background,
sky) and “things” (objects like houses, cars,
persons).
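As a toy illustration of how these three tasks differ, consider per-pixel label maps for a small image; the class and instance ids below are made up for this sketch and carry no meaning beyond it:

```python
import numpy as np

# Toy 4x4 "image" with two objects of class 1 ("thing") on class 0 ("stuff").
semantic = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [1, 1, 0, 0],
                     [1, 1, 0, 0]])        # semantic: one class id per pixel

instance = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [2, 2, 0, 0],
                     [2, 2, 0, 0]])        # instance: object id (0 = no object)

# Panoptic: one (class, instance) pair per pixel; "stuff" pixels share
# instance id 0, while "things" keep their own ids, separating the
# two instances of class 1 that semantic segmentation alone merges.
panoptic = np.stack([semantic, instance], axis=-1)

print(np.unique(instance[semantic == 1]))  # -> [1 2]: two instances of class 1
```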
There are classical approaches like active contours
(Kass et al., 1988), watershed methods (Vincent and
Soille, 1991) and graph-based segmentation (Felzenszwalb
and Huttenlocher, 2004), to name only a few.
A good overview of these classical approaches can
be found in (Szeliski, 2011, Ch. 5). With the advance
of deep learning techniques in the area of image segmentation,
new levels of robustness and accuracy have become
attainable. A comprehensive recent review of deep
learning techniques for image segmentation can be found
in (Minaee et al., 2021). It covers different architectures,
training strategies and loss functions, so we refer the
interested reader there for a current overview. All these
approaches share one drawback: they need pixel-wise
annotated images for training.
In the field of Explainable AI (XAI), several algorithms
for the explanation of network decisions have
been proposed. The authors of (Sundararajan et al.,
2017) proposed a method called Integrated Gradients,
in which gradients are calculated for linear interpolations
between a baseline (for example, a black
image) and the actual input image. By averaging
over these gradients, the pixels with the strongest effect on
the model’s decision are highlighted. SmoothGrad
(Smilkov et al., 2017) generates a sensitivity map
by averaging over several instances of one input image,
each augmented with added noise; this way,
smoother sensitivity maps are obtained. There
are many more methods, and a comprehensive review
would be out of scope for this paper, so we refer the
interested reader to (Linardatos et al., 2021; Barredo
Arrieta et al., 2020; Samek et al., 2019). From this
range of methods, we selected Layer-wise Relevance
Propagation (LRP), as it produces comparatively sharp
heatmaps and is therefore the best starting point for
the generation of segmentation maps. Furthermore,
the authors of (Seibold et al., 2021) showed that, with
a simple extension, LRP can be utilized to localize
traces of forgery in morphed face images.
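To make the Integrated Gradients idea concrete, the following is a minimal NumPy sketch. The quadratic toy function stands in for a trained network's class score (its gradient is analytic, so no autodiff framework is needed); all names and the step count are illustrative choices, not part of the original method's reference implementation:

```python
import numpy as np

# Toy differentiable "model": f(x) = sum of squared pixel values,
# standing in for a network's class score.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    return 2.0 * x

def integrated_gradients(x, baseline, steps=50):
    """Average the gradient along the straight path from the baseline
    to x, then scale by (x - baseline), following Sundararajan et al. (2017)."""
    alphas = (np.arange(steps) + 0.5) / steps   # midpoint rule on [0, 1]
    grads = np.mean(
        [grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

x = np.array([0.2, 0.8, 0.5])       # "image" flattened to 3 pixels
baseline = np.zeros_like(x)         # e.g. a black image
attr = integrated_gradients(x, baseline)

# Completeness axiom: the attributions sum to f(x) - f(baseline).
print(np.allclose(attr.sum(), f(x) - f(baseline), atol=1e-3))  # -> True
```

The completeness check at the end is what distinguishes Integrated Gradients from plain input gradients: the per-pixel attributions account exactly for the change in the model's output relative to the baseline.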
3 METHODS
3.1 Overview
Massive amounts of pixel-wise labeled segmentation
masks are usually the training foundation of deep neural
networks for image segmentation tasks. Obtaining
them is typically a tedious manual process that introduces
problems of its own, owing to the variance
in the annotations and the often fuzzy definitions
of object boundaries. Therefore, we take
an indirect route. In our approach, we train a classification
network instead of a segmentation network,
consequently reducing the annotation work by a great
margin and removing the requirement of an exact definition
of object boundaries. To segment an image,
we first pass it through the classification network (see
subsubsection 3.2.1), which outputs whether the object
we want to segment is in the image or not. If it is,
we pass the image through the classification network a second
time, but this time from back to front, using
the LRP technique (Bach et al., 2015) described in
subsection 3.3, an XAI technique that highlights
the pixels contributing to the DNN’s decision in a
heatmap. We use this heatmap to generate a
segmentation mask without the cumbersome task of
manual pixel-wise labeling.
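The backward pass sketched above can be illustrated on a toy two-layer ReLU network using the common epsilon-variant of LRP; the random weights, layer sizes, and mask threshold below are illustrative stand-ins, not our trained model or the exact rules used later in subsection 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained classifier: x -> h = relu(W1 x) -> score = w2 . h
W1 = rng.normal(size=(4, 6))
w2 = rng.normal(size=4)
x = rng.uniform(size=6)              # "image" flattened to 6 pixels

h = np.maximum(W1 @ x, 0.0)
score = w2 @ h                       # class evidence to be explained

def lrp_linear(a, W, R_out, eps=1e-6):
    """Epsilon-rule: redistribute the relevance R_out of a layer's outputs
    over its inputs a, proportionally to the contributions z_ij = W_ij * a_j."""
    z = W * a[None, :]                             # (m, n) contributions
    s = z.sum(axis=1)                              # pre-activations
    denom = s + eps * np.where(s >= 0, 1.0, -1.0)  # stabilized denominator
    return (z / denom[:, None] * R_out[:, None]).sum(axis=0)

# Propagate the class score from the output back to the input pixels.
R_h = lrp_linear(h, w2[None, :], np.array([score]))
heatmap = lrp_linear(x, W1, R_h)

# Turn the heatmap into a binary mask, e.g. by thresholding relevance.
mask = heatmap > 0.5 * np.abs(heatmap).max()

print(np.isclose(heatmap.sum(), score, atol=1e-3))  # relevance is conserved
```

The final check reflects LRP's conservation property: up to the epsilon stabilizer, the relevance arriving at the input pixels sums to the class score being explained, which is what makes the heatmap a meaningful basis for a segmentation mask.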
3.2 Network Architectures
3.2.1 Classification
For binary classification, we resort to the classical
VGG-11 architecture without batch normalization, as
described in (Simonyan and Zisserman, 2015), but
From Explanations to Segmentation: Using Explainable AI for Image Segmentation