Semantic Segmentation in Red Relief Image Map by UX-Net

Tomoya Komiyama, Kazuhiro Hotta, Kazuo Oda, Satomi Kakuta and Mikako Sano

Meijo University, Shiogamaguchi, 468-0073, Nagoya, Japan

Asia Air Survey Co., Ltd., Kawasaki, 215-0004, Kanagawa, Japan

Keywords: Semantic Segmentation, Red Relief Image Map, U-Net, UX-Net.

Abstract: This paper proposes a semantic segmentation method for the Red Relief Image Map, which is a kind of aerial laser
image. We modify the U-Net by adding paths between convolutional and deconvolutional layers
with different resolutions. Using the feature maps obtained at different layers improves the segmentation
accuracy. We compare the segmentation accuracy of the proposed UX-Net with that of the original U-Net. The
proposed method improves class-average accuracy over the U-Net.

1 INTRODUCTION

Red Relief Image Map is a new topographical expression technique (Chiba et al., 2010).
Figure 1 shows an example of a Red Relief Image Map. A Red Relief Image Map is created
from Digital Elevation Model (DEM) data obtained by aerial laser survey, and the ground
truth image is created by visual inspection with reference to the DEM data. A Red Relief
Image Map expresses the amount of inclination with red chroma and ridges, valleys, and
the like with red brightness, which makes it highly readable. For example, it reveals
roads and rivers in the mountains, as well as defective areas where the ground could not
be estimated because of trees. When topographic changes occur, a computer should be able
to detect them immediately from the Red Relief Image Map. Therefore, in this paper, we
carry out semantic segmentation of four classes (road, river, defective areas by trees
and others) in Red Relief Image Maps.

Deep learning has achieved high accuracy on various image recognition tasks such as
object categorization (Huang et al., 2016), object detection (Ren et al., 2014) and
object segmentation (Long et al., 2015). For object segmentation, Encoder-Decoder
Convolutional Neural Networks (CNNs) (Kendall et al., 2016) such as U-Net (Ronneberger
et al., 2015) work well. We modify the U-Net to improve the accuracy of semantic
segmentation of Red Relief Image Maps.

U-Net uses paths between the encoder and decoder at the same resolution in order to
compensate for the information eliminated by the encoder. However, the information at
different layers could also be effective for semantic segmentation because each layer
extracts a different kind of information. For example, a shallower layer has fine
information such as small objects and the correct position of objects, while a deeper
layer has information related to classification. Thus, we add paths between the encoder
and decoder at different resolutions to the U-Net. Using the feature maps with different
resolutions improves the segmentation accuracy.

Figure 1: Example of Red Relief Image Map (left) and its ground truth image with 4 class
labels (right). Black pixels are “defective areas by trees”, blue pixels are “road”, pink
pixels are “river” and white pixels are “others”.

We evaluated our method on a semantic segmentation problem using eleven Red Relief
Image Maps. We segment four categories in the experiments: defective areas by trees,
road, river and others. Our proposed method improved the class-average accuracy in
comparison with the U-Net.

Komiyama, T., Hotta, K., Oda, K., Kakuta, S. and Sano, M.
Semantic Segmentation in Red Relief Image Map by UX-Net.
DOI: 10.5220/0006716805970602
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 597-602
ISBN: 978-989-758-276-9

Figure 2: Structure of two networks. (a) Structure of U-Net (left). (b) Structure of UX-Net (right).

This paper is organized as follows. Section 2 describes the details of the proposed
method. Section 3 shows the experimental results, including a comparison with the
original U-Net. Finally, Section 4 presents conclusions and future work.

2 PROPOSED METHOD

In general, the amount of training data for the U-Net depends on the number of pixels
in the training images, because every labeled pixel serves as a training sample. Thus,
we do not need a large number of training images. In this paper, we have only 11 Red
Relief Image Maps with ground truth. Therefore, we use the U-Net as the baseline and
modify it.

We explain the original U-Net in Section 2.1. The proposed method is explained in
Section 2.2.

2.1 U-Net

U-Net is a kind of encoder-decoder CNN and is effective for semantic segmentation. In
recent years, it has also been used for image generation tasks such as pix2pix (Isola
et al., 2017), which improved on Deep Convolutional Generative Adversarial Networks
(Radford et al., 2016). An encoder-decoder CNN carries out convolution in the encoder
part and deconvolution in the decoder part in order to produce the segmentation result.

U-Net improves the segmentation accuracy by using the feature maps of the encoder part
in the decoder part at the same resolution, as shown in Figure 2 (a). The paths from the
encoder part to the decoder part compensate for the small objects and edges eliminated
in the encoder part.

2.2 UX-Net

The structure of the proposed network is shown in Figure 2 (b). In addition to the
original paths of the U-Net, we add a path from the shallow layer of the encoder part to
the beginning of the decoder part in order to use the fine information of the shallow
layer in the low-resolution decoder part. Since the beginning of the decoder part does
not have fine information such as small objects, edges and the correct position of
objects, the features of the shallow layer should be useful. Furthermore, we also add a
path from the deep layer of the encoder part to the final layer of the decoder part.
Since the feature map at the deep layer of the encoder part has information about object
categories, it should be useful for producing the final segmentation result. The newly
added paths form an “X” shape. Thus, we call the proposed network “UX-Net”.


Table 1: Accuracy of the proposed method and U-Net.

However, the feature maps of the shallow layer of the encoder part and those of the
beginning layer of the decoder part differ in size. Thus, we use pooling to make them
the same size. Similarly, since the feature maps of the deep layer of the encoder part
and those of the final layer of the decoder part differ in size, we use unpooling to
make them the same size.
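As a rough illustration of the size matching described above, the following NumPy sketch pools a shallow encoder feature map down to the resolution of the first decoder layer, and upsamples a deep encoder feature map to the resolution of the final decoder layer, before concatenating along the channel axis. The tensor shapes, the resizing factor of 8, the helper names and the use of nearest-neighbour repetition as the unpooling operator are all assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def max_pool(x, k):
    """Max-pool a (C, H, W) feature map with a k x k window and stride k."""
    c, h, w = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    # Split each spatial axis into (blocks, k) and take the max inside each block.
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def unpool_nearest(x, k):
    """Upsample a (C, H, W) feature map by k (assumed nearest-neighbour unpooling)."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def concat_channels(a, b):
    """Concatenate two feature maps that share the same spatial size."""
    assert a.shape[1:] == b.shape[1:], "spatial sizes must match"
    return np.concatenate([a, b], axis=0)

# Hypothetical feature maps (channels, height, width); sizes are illustrative only.
rng = np.random.default_rng(0)
enc_shallow = rng.random((4, 32, 32))   # fine, high-resolution encoder features
enc_deep = rng.random((8, 4, 4))        # semantic, low-resolution encoder features
dec_begin = rng.random((8, 4, 4))       # first decoder layer
dec_final = rng.random((4, 32, 32))     # last decoder layer

# One stroke of the "X": shallow encoder features pooled into the first decoder layer.
x_down = concat_channels(dec_begin, max_pool(enc_shallow, 8))
# The other stroke: deep encoder features unpooled into the final decoder layer.
x_up = concat_channels(dec_final, unpool_nearest(enc_deep, 8))

print(x_down.shape, x_up.shape)
```

In a trained network the concatenated maps would be followed by further convolution layers; only the resizing and concatenation are sketched here.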

We use batch normalization (Ioffe and Szegedy, 2015) at each layer, though the original
U-Net did not use it. Class balancing (Badrinarayanan et al., 2016) is also used to
improve the segmentation accuracy of objects with small area.
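The exact class balancing scheme is not stated here. One common choice, used in the cited SegNet work, is median frequency balancing, in which each class is weighted by the median class frequency divided by its own frequency. The sketch below assumes that scheme; the function name and the toy label map (class indices 0 to 3) are hypothetical.

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Per-class loss weights by median frequency balancing (assumed scheme)."""
    pixel_count = np.zeros(num_classes)    # pixels of class c over all images
    image_pixels = np.zeros(num_classes)   # total pixels of images containing class c
    for labels in label_maps:
        counts = np.bincount(labels.ravel(), minlength=num_classes)
        pixel_count += counts
        image_pixels += (counts > 0) * labels.size
    present = image_pixels > 0
    freq = np.zeros(num_classes)
    freq[present] = pixel_count[present] / image_pixels[present]
    weights = np.zeros(num_classes)
    # Rare classes get weights above 1, frequent classes get weights below 1.
    weights[present] = np.median(freq[present]) / freq[present]
    return weights

# Toy 4 x 4 label map: mostly class 3 ("others"), a few pixels of the small classes.
toy = np.full((4, 4), 3)
toy[0, 0] = toy[0, 1] = 0
toy[1, 0] = 1
toy[2, 2] = 2
weights = median_frequency_weights([toy], num_classes=4)
print(weights)
```

The resulting weights would multiply the per-pixel loss of each class, so that errors on small-area classes count more than errors on the background.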

3 EXPERIMENTS

We show experimental results on semantic segmentation of Red Relief Image Maps. First,
Section 3.1 describes the dataset used in the following experiments. The comparison
methods are explained in Section 3.2. Experimental results are shown in Section 3.3.

3.1 Dataset

In this paper, we use eleven Red Relief Image Maps. Five images are used for training
and the remaining six images are used for testing. Since deep learning requires a
certain quantity of training images, we crop local regions of 256 x 256 pixels with an
overlap ratio of 0.7 from the Red Relief Image Maps of 1,500 x 2,000 pixels. In
addition, we rotate those cropped regions at intervals of 90 degrees to enlarge the
number of training images. As a result, the number of training images is 7,344. Test
regions of 256 x 256 pixels are cropped without overlap from the original six images.
The total number of test regions is 185.
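The cropping and rotation procedure can be sketched as follows. The stride rounding (256 x 0.3 rounded to 77 pixels), the discarding of partial windows at the image border and the function names are assumptions of this sketch; the paper does not state these details, so the counts it produces need not match the reported 7,344 and 185.

```python
import numpy as np

def crop_origins(length, crop, stride):
    """Offsets of all full windows of size `crop` along an axis of size `length`."""
    return list(range(0, length - crop + 1, stride))

def training_crops(image, crop=256, overlap=0.7):
    """Yield overlapping crops, each rotated by 0, 90, 180 and 270 degrees."""
    stride = max(1, int(round(crop * (1.0 - overlap))))  # assumed rounding
    h, w = image.shape[:2]
    for y in crop_origins(h, crop, stride):
        for x in crop_origins(w, crop, stride):
            patch = image[y:y + crop, x:x + crop]
            for k in range(4):
                yield np.rot90(patch, k)

def test_crops(image, crop=256):
    """Yield non-overlapping crops; partial windows at the border are dropped."""
    h, w = image.shape[:2]
    for y in crop_origins(h, crop, crop):
        for x in crop_origins(w, crop, crop):
            yield image[y:y + crop, x:x + crop]

# Placeholder image with the size quoted in the text (content is irrelevant here).
image = np.zeros((1500, 2000), dtype=np.uint8)
n_train = sum(1 for _ in training_crops(image))
n_test = sum(1 for _ in test_crops(image))
print(n_train, n_test)
```

The ground truth label map would be cropped and rotated with the same offsets so that every training patch keeps its pixel-wise labels.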

3.2 Comparison Methods

We compare our method with several networks including the original U-Net. The first
method is the U-Net. The second method is our proposed method. When we concatenate
feature maps of different resolutions, the size of each feature map is changed by
pooling and convolution, or by unpooling and deconvolution. We call this method
“UX-Net1”.

The third method is also our method, but we do not use convolution and deconvolution
when we change the size of the feature maps. Only pooling and unpooling are used to
change the size of the feature maps. We call this network “UX-Net2”.

3.3 Experimental Results

We show the experimental results of all methods. As evaluation measures, we use the
pixel-wise accuracy and the class-average accuracy. Pixel-wise accuracy is the accuracy
over all pixels. It is dominated by objects of large area such as the background.
Class-average accuracy is the average of the per-class accuracies. It is influenced by
objects of small area such as defective areas by trees, road and river. In this paper,
class-average accuracy is more important than pixel-wise accuracy because we want to
segment defective areas by trees, road and river well.
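The difference between the two measures can be made concrete with a small sketch. The confusion-matrix helpers and the toy labels below are illustrative assumptions, not the evaluation code used in the experiments. In the toy case a prediction that labels every pixel as background scores well on pixel-wise accuracy but poorly on class-average accuracy.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground truth classes, columns are predicted classes."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of all pixels that are labeled correctly."""
    return np.trace(cm) / cm.sum()

def class_average_accuracy(cm):
    """Mean of per-class accuracies over the classes present in the ground truth."""
    per_class_total = cm.sum(axis=1)
    present = per_class_total > 0
    return (np.diag(cm)[present] / per_class_total[present]).mean()

# Toy example: 90 background pixels (class 3) and 10 road pixels (class 1).
y_true = np.array([3] * 90 + [1] * 10)
y_pred = np.full(100, 3)  # everything predicted as background
cm = confusion_matrix(y_true, y_pred, num_classes=4)
print(pixel_accuracy(cm), class_average_accuracy(cm))
```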

We show the segmentation results of all methods in Figures 3 and 4. The first row shows
the input image and the ground truth label. The second row shows the results of U-Net
and UX-Net1. The bottom row shows the result of UX-Net2.

We show the pixel-wise accuracy and the class-average accuracy of each method in
Table 1. The best result for each class is shown in red.

We found that our proposed UX-Net has higher accuracy for defective areas by trees,
road and river than the original U-Net. The pixel-wise accuracy of the proposed method
is worse than that of the U-Net because the pixel-wise accuracy is dominated by the
background, which is not the main target.

Note that our proposed method improves the accuracy for defective areas by trees, which
are hard to segment with the U-Net. This is because we use the “X-path”, in which the
fine information obtained at the shallow layer is used in the deep layer and the
semantic information obtained at the deep layer is used to generate the final
segmentation result. When we compare UX-Net1 with UX-Net2, UX-Net2 gave better results
than UX-Net1. The main difference is how the size of the feature map is changed. The
experimental results show that using only pooling and unpooling is effective for
changing the size. When we use pooling and convolution, the feature map obtained at the
shallow layer is changed by the convolution, and fine information is lost. Similarly,
the semantic information may be lost by unpooling and deconvolution. These are the
reasons why UX-Net2 is better.

Figure 3: Segmentation results from Red Relief Image Maps. The first row shows the input
image and the ground truth label. The second row shows the results of U-Net and UX-Net1.
The bottom row shows the result of UX-Net2.

4 CONCLUSION

In this paper, we carried out semantic segmentation of the Red Relief Image Map, which
is a kind of aerial laser image. We added the “X-path” to the original U-Net. The X-path
means that fine information is used in the deep layer and semantic information is used
to generate the final segmentation result. Experimental results demonstrated the
effectiveness of our proposed UX-Net. In particular, the accuracy for defective areas by
trees, road and river is much improved in comparison with the original U-Net.

However, our proposed method over-detects defective areas by trees. Therefore, we want
to improve the accuracy by effectively using not only the information at the shallow and
deep encoder parts but also the information in various feature maps. Moreover, we would
like to adopt a loss function that accounts for objects which are hard to detect, in
order to improve the class-average accuracy further. These are subjects for future work.

Figure 4: Segmentation results from Red Relief Image Maps. The first row shows the input
image and the ground truth label. The second row shows the results of U-Net and UX-Net1.
The bottom row shows the result of UX-Net2.

REFERENCES

Chiba, T., Suzuki, Y., Arai, K., Tomita, Y., Koizumi, S.,

Nakashima, K., Ogawa, K., 2010. The measurement of

magma discharge volume of the "Jogan" eruption in

Aokigahara on Fuji volcano, based on the micro

topography by LiDAR and result of the drilling.

Journal of the Japan Society of Erosion Control

Engineering.

Huang, S., Xu, Z., Tao, D., Zhang, Y., 2016. Part-Stacked

CNN for Fine-Grained Visual Categorization.

Computer Vision and Pattern Recognition.

Long, J., Shelhamer, E., Darrell, T., 2015. Fully

Convolutional Networks for Semantic Segmentation.

Computer Vision and Pattern Recognition.

Ren, S., He, K., Girshick, R., Sun, J., 2014. Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks. Computer Vision and Pattern Recognition.

Badrinarayanan, V., Kendall, A., Cipolla, R., 2016. SegNet:

A Deep Convolutional Encoder-Decoder Architecture

for Image Segmentation. IEEE Transactions on

Pattern Analysis and Machine Intelligence.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net:

Convolutional Networks for Biomedical Image

Segmentation. Medical Image Computing and

Computer Assisted Intervention.

Isola, P., Zhu, J., Zhou, T., Efros, A. A., 2017. Image-to-Image Translation with
Conditional Adversarial Networks. Computer Vision and Pattern Recognition.

Radford, A., Metz, L., Chintala, S., 2016. Unsupervised Representation Learning with
Deep Convolutional Generative Adversarial Networks. International Conference on
Learning Representations.

Ioffe, S., Szegedy, C., 2015. Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.

