Semantic Segmentation using Light Attention Mechanism
Yuki Hiramatsu and Kazuhiro Hotta
Meijo University, Japan
Keywords: Semantic Segmentation, Attention Mechanism, Encoder-decoder Structure.
Abstract: Semantic segmentation using convolutional neural networks (CNNs) can be applied to various fields such as
autonomous driving. Semantic segmentation is pixel-wise class classification, and various CNN-based methods
have been proposed. We introduce a light attention mechanism into an encoder-decoder network. A network
equipped with the light attention mechanism attends to the features extracted during training and, for each pixel,
emphasizes the features judged to be effective for training while suppressing those judged to be irrelevant.
As a result, training can focus on only the necessary features. We evaluated the proposed method on the
CamVid dataset and obtained higher accuracy than conventional segmentation methods.
1 INTRODUCTION
Convolutional neural networks (CNNs) (Krizhevsky,
2012) have achieved very high accuracy in image
recognition. Semantic segmentation using CNNs can
be applied to various fields such as autonomous driving
(Badrinarayanan, 2017) and medical imaging
(Ronneberger, 2015). Semantic segmentation refers
to pixel-wise class classification. Typical CNN-based
methods for semantic segmentation include Fully
Convolutional Networks (Long, 2015) and
encoder-decoder networks (Ronneberger, 2015);
these form the basic structure of most segmentation
methods. Many recent methods extract features using
very deep CNNs. In general, the deeper a CNN is, the
better its feature extraction ability and the higher its
accuracy. However, deepening a CNN increases the
amount of computation and the number of parameters.
To address this problem, we propose a light attention
mechanism that can be introduced into a basic
encoder-decoder network. A network equipped with
the proposed attention mechanism attends to the
extracted features, emphasizing those judged to be
effective for training and suppressing those judged to
be irrelevant. As a result, the network can be trained
while focusing only on the necessary features, so the
increase in computational complexity and in the
number of parameters can be mitigated. In the
experiments, we evaluate the proposed method on the
CamVid dataset (Brostow, 2009), which consists of
images taken by an in-vehicle camera and labelled
with 11 classes. The proposed method obtained higher
accuracy than conventional segmentation methods.
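As a conceptual illustration, the following PyTorch sketch shows one possible form of such a per-pixel gate, assuming the attention weights are produced by a 1x1 convolution followed by a sigmoid. It illustrates the general idea of emphasizing and suppressing features per pixel, not the exact mechanism proposed in this paper.

```python
import torch
import torch.nn as nn

class LightAttentionGate(nn.Module):
    """Illustrative per-pixel gate (assumed design, not the proposed one):
    a 1x1 convolution followed by a sigmoid produces a weight in [0, 1]
    for every channel at every pixel and rescales the feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.sigmoid(self.conv(x))  # (N, C, H, W) attention map
        return x * weights                    # emphasize / suppress features


# Example: reweight a 64-channel feature map of size 32x32.
gate = LightAttentionGate(64)
features = torch.randn(1, 64, 32, 32)
out = gate(features)  # same shape, features reweighted per pixel
```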
2 RELATED WORKS
This section describes related works. Section 2.1
describes the encoder-decoder structure, and Section 2.2
describes attention mechanisms (Wang, 2017; Hu, 2018).
2.1 Encoder-decoder
U-Net (Ronneberger, 2015) was proposed as a
CNN-based segmentation method. It adopts the
encoder-decoder structure. The encoder extracts
features using convolution and downsampling, and the
decoder restores the resolution of the feature maps
step by step while extracting further features. In
addition, a skip connection is introduced at each
resolution between the encoder and decoder: the
feature maps obtained by the encoder are concatenated
with the corresponding decoder feature maps of the
same resolution. This restores information lost during
feature extraction.
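As a rough sketch of this structure, the PyTorch example below builds a single-stage encoder-decoder with one skip connection. The channel widths and the 11-class output are illustrative choices, not the original U-Net configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions with ReLU, as in a typical U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Single-skip encoder-decoder: downsample once, upsample once, and
    concatenate the encoder feature map with the decoder feature map of
    the same resolution before the final per-pixel classifier."""

    def __init__(self, in_ch: int = 3, num_classes: int = 11):
        super().__init__()
        self.enc = conv_block(in_ch, 64)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = conv_block(128, 64)          # 128 = 64 (skip) + 64 (upsampled)
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e = self.enc(x)                          # encoder features, full resolution
        b = self.bottleneck(self.down(e))        # bottleneck at half resolution
        d = self.up(b)                           # restore resolution
        d = self.dec(torch.cat([e, d], dim=1))   # skip connection: concatenate
        return self.classifier(d)                # per-pixel class scores


# Example: one 64x64 RGB image produces an 11-channel score map of the same size.
scores = TinyUNet()(torch.randn(1, 3, 64, 64))
```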