There is prior evidence in related work that deep segmentation networks can have difficulties with the variability and size of segmented objects. For instance, in (Badrinarayanan et al., 2017) the authors evaluate and compare approaches on the SUN RGB-D dataset (Song et al., 2015), a very challenging and large dataset of indoor scenes with 5,285 training and 5,050 testing images. The results show that all the deep architectures obtain low Intersection over Union (IoU) and boundary metrics: larger classes reach reasonable accuracy, while smaller classes are segmented much less accurately.
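For reference, the per-class IoU metric underlying these comparisons can be computed directly from predicted and ground-truth label maps; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def class_iou(pred, target, class_id):
    """Intersection over Union for one class, given integer label maps."""
    pred_mask = (pred == class_id)
    target_mask = (target == class_id)
    intersection = np.logical_and(pred_mask, target_mask).sum()
    union = np.logical_or(pred_mask, target_mask).sum()
    return intersection / union if union > 0 else float("nan")
```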
Next, we briefly review some of the milestones in the evolution of deep learning segmentation networks, from the earliest models to DeepLabV3 and related architectures.
The Fully Convolutional Network (FCN) for image segmentation was proposed in (Long et al., 2015). It modified well-known architectures, such as VGG16 (Simonyan and Zisserman, 2014), replacing all the fully connected layers by convolutional layers with large receptive fields and adding upsampling layers based on simple interpolation filters. Only the convolutional part of the network was fine-tuned, so that deconvolution was learned indirectly. The authors achieved more than 62% IoU on the PASCAL VOC 2012 segmentation challenge using models pretrained on the ImageNet 2012 dataset.
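To make the idea concrete, a minimal PyTorch-style sketch of such a fully convolutional head is given below; the layer sizes and class name are ours, and fixed bilinear upsampling stands in for the interpolation-based upsampling described above:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TinyFCN(nn.Module):
    """FCN-style model: fully connected layers replaced by a 1x1 convolution,
    followed by interpolation-based upsampling back to input resolution."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = vgg16(weights="IMAGENET1K_V1").features  # conv layers only
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        x = self.backbone(x)      # downsampled feature maps
        x = self.classifier(x)    # per-pixel class scores at low resolution
        # simple interpolation-based upsampling to the input size
        return nn.functional.interpolate(x, size=(h, w), mode="bilinear",
                                         align_corners=False)
```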
The authors in (Noh et al., 2015) proposed an improved semantic segmentation algorithm that learns a deconvolution network. The convolutional layers are again adapted from VGG16, while the deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. The proposed approach reached 72.5% IoU on the same PASCAL VOC 2012 dataset.
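In PyTorch terms, the two building blocks of such a decoder correspond to transposed convolution and max-unpooling; a hedged sketch with illustrative channel sizes:

```python
import torch
import torch.nn as nn

# Encoder pooling must return indices so the decoder can unpool them.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=3, padding=1)

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)            # 16x16, remembers argmax locations
restored = unpool(pooled, indices)   # back to 32x32, sparse activations
out = deconv(restored)               # densify with learned filters
print(out.shape)                     # torch.Size([1, 32, 32, 32])
```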
(Ronneberger et al., 2015) proposed U-Net, a DCNN specially designed for the segmentation of biomedical images. The authors trained the network end-to-end from very few images and outperformed the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopy stacks. The contracting part of U-Net computes features, while the expanding part localizes patterns spatially in the image. The contracting part has an FCN-like architecture, extracting features with 3x3 convolutions, while the expanding part uses deconvolutions to reduce the number of feature maps while increasing the size of the images. Cropped feature maps from the contracting part are also copied into the expanding part to avoid losing pattern information. At the end, a 1x1 convolution processes the feature maps to generate a segmentation map that assigns a label to each pixel of the input image.
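The characteristic copy-and-crop skip connection can be sketched as follows (a single simplified level; the sizes are illustrative, not the original U-Net configuration):

```python
import torch
from torchvision.transforms.functional import center_crop

def unet_skip(encoder_feat, decoder_feat):
    """Crop the larger encoder map to the decoder's spatial size and
    concatenate along channels, as in the U-Net copy-and-crop connections."""
    cropped = center_crop(encoder_feat, list(decoder_feat.shape[-2:]))
    return torch.cat([cropped, decoder_feat], dim=1)

enc = torch.randn(1, 64, 68, 68)   # feature map from the contracting path
dec = torch.randn(1, 64, 56, 56)   # upsampled map in the expanding path
print(unet_skip(enc, dec).shape)   # torch.Size([1, 128, 56, 56])
```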
DeepLab (Chen et al., 2017) proposed three main innovations. First, convolution with upsampled filters, or 'atrous convolution', explicitly controls the resolution at which feature responses are computed and enlarges the field of view of the filters to incorporate larger context, without increasing the number of parameters or the amount of computation. Atrous convolution is also known as dilated convolution: its filters sample the input sparsely, at a fixed rate. Second, atrous spatial pyramid pooling (ASPP) segments objects at multiple scales by probing the incoming convolutional feature maps with filters at multiple sampling rates and effective fields of view, thus capturing both objects and image context at multiple scales. Finally, localization of object boundaries is improved by combining methods from deep convolutional neural networks (DCNNs) and probabilistic graphical models: the responses of the final DCNN layer are combined with a fully connected Conditional Random Field (CRF), which improves localization both qualitatively and quantitatively. DeepLab achieved 79.7% IoU on the PASCAL VOC 2012 semantic image segmentation task, an improvement over (Long et al., 2015) and (Noh et al., 2015).
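In modern frameworks, atrous convolution is an ordinary convolution with a dilation parameter, and a simplified ASPP can be written as parallel dilated branches. The sketch below is only illustrative; the rates, channel counts and class name are ours, not the exact DeepLab configuration:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates; their outputs
    are concatenated and fused, probing the same features at several scales."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # padding = dilation keeps all branch outputs at the same spatial size
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 256, 33, 33)
print(MiniASPP(256, 64)(feats).shape)  # torch.Size([1, 64, 33, 33])
```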
3 MATERIALS AND METHOD
For this work we apply one of the most recent and best-performing DCNN segmentation architectures, DeepLabV3 (Chen et al., 2017), as well as SegNet (Badrinarayanan et al., 2017) for comparison purposes. In this section we briefly review the two architectures and the IDRiD dataset used in our experimental analysis. Finally, we discuss the training setup, timings and results.
3.1 DCNN Architectures
The DeepLabV3 (Chen et al., 2017) architecture used in this work is built on a ResNet-18 network pretrained on ImageNet, with atrous convolutions as its main feature extractor. DeepLabV3 introduces a set of innovations. Figure 3 shows a plot of the overall DeepLabV3 architecture we used. First of all, it uses multiscale processing, by passing multiple rescaled versions of the original image to parallel CNN branches (image pyramid) and by using multiple parallel atrous convolutional layers with different sampling rates (ASPP). In the modified ResNet model, the last ResNet block uses atrous convolutions with different dilation rates, and atrous spatial pyramid pooling and bilinear upsampling are used in the decoder module on top of the modified ResNet block. Additionally, structured prediction is done by a fully connected Conditional Random Field (CRF). The CRF is a postprocessing step used to refine the segmentation along object boundaries.
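As a rough illustration of such a pipeline (not the exact model used in this work: the setup described above builds on ResNet-18, whereas torchvision only ships a pre-assembled DeepLabV3 with a ResNet-50 backbone, and the class count below is arbitrary):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Pre-assembled DeepLabV3 (ASPP + bilinear upsampling included); the CRF
# postprocessing described above is not part of this model and would be
# applied separately to the predicted logits.
model = deeplabv3_resnet50(weights_backbone="IMAGENET1K_V1", num_classes=5)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 512, 512))["out"]
print(logits.shape)  # torch.Size([1, 5, 512, 512])
```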