3.1 Dataset
We use four satellite images captured by Hodoyoshi-1 in our experiments. The resolution of each original image is 4,152×4,003 pixels. Since four images are too few to evaluate accuracy, we crop regions of 128×128 pixels from the original images with an overlap ratio of 0.25. In general, a large number of labeled images is required to train a deep neural network. Thus, we also rotate the cropped regions at intervals of 90 degrees, which makes the method robust to the orientation of roads.
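The cropping and rotation described above can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes NumPy arrays and a sliding window whose stride follows from the stated 0.25 overlap ratio (128 × 0.75 = 96 pixels).

```python
import numpy as np

def crop_and_rotate(image, patch=128, overlap=0.25):
    """Crop patches with the stated overlap and add 90-degree rotations."""
    stride = int(patch * (1 - overlap))  # 96-pixel stride for a 0.25 overlap
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            region = image[y:y + patch, x:x + patch]
            # Augment each crop with rotations of 0, 90, 180 and 270 degrees.
            for k in range(4):
                patches.append(np.rot90(region, k))
    return patches
```

In a complete pipeline one would also skip crops overlapping the black region mentioned in Section 3.1; that filtering step is omitted here.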
In the experiments, three of the original images are used for training and the remaining one is used for testing. To evaluate generalization accuracy fairly, we create four datasets by changing the combination of three training images and one test image, so that every original image serves as a test image once. As a result, the number of training regions is 124,828 in dataset 1, 125,704 in dataset 2, 127,820 in dataset 3, and 129,704 in dataset 4. The test set of each dataset consists of 12,372 regions cropped without overlap. The number of training regions differs among the datasets because we exclude local regions that contain the black area shown in Figure 4.
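The four-way split described above is a leave-one-image-out scheme, which can be sketched as follows. The image identifiers are placeholders, not the authors' file names.

```python
# Placeholder identifiers for the four Hodoyoshi-1 images.
images = ["img1", "img2", "img3", "img4"]

# Build four datasets: each holds out one image for testing and
# trains on the remaining three (leave-one-image-out).
datasets = []
for i, test_img in enumerate(images):
    train_imgs = [img for j, img in enumerate(images) if j != i]
    datasets.append({"train": train_imgs, "test": test_img})
```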
3.2 Evaluation Method
In this paper, since we have only four satellite images, we cannot prepare a validation set for choosing the most suitable model for testing. Instead, we train each method for 100 epochs, save the model every 5 epochs, and compute the Precision-Recall Curve (PRC) and the Area Under the Curve (AUC). We then draw a graph whose horizontal axis is the number of epochs and whose vertical axis is the AUC, and evaluate each method by this graph and by its maximum AUC.
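The checkpoint-selection procedure above can be sketched as follows, assuming scikit-learn for the PRC/AUC computation; `predict_fn` is a hypothetical loader that returns the model's scores on the test set at a given epoch.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_auc(y_true, y_score):
    """Area under the precision-recall curve for binary road masks."""
    precision, recall, _ = precision_recall_curve(y_true.ravel(), y_score.ravel())
    return auc(recall, precision)

def best_checkpoint(y_true, predict_fn, epochs=range(5, 101, 5)):
    """Score the checkpoints saved every 5 epochs and keep the best."""
    scores = {e: pr_auc(y_true, predict_fn(e)) for e in epochs}
    best = max(scores, key=scores.get)
    return best, scores[best]
```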
3.3 Comparison Methods
First, we compare our method with the original U-Net. We also evaluate a network that adds the feature map of the encoder part to that of the decoder part, in order to investigate the effectiveness of using the difference of feature maps. This summation of feature maps is similar to ResNet 3), and we call this network "Add".
In addition, we evaluate networks in which the difference of feature maps is used at only the first layer or only the second layer, in order to investigate which layer is effective. The first network has the path only between the first layers and no path between the second layers; we call this method Ours (fp: first path). The second network has the path only between the second layers and no path between the first layers; we call this method Ours (sp: second path).
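The three ways of merging an encoder feature map with the matching decoder map can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; in particular, the order of subtraction in the "diff" mode is an assumption.

```python
import numpy as np

def merge_features(enc, dec, mode="diff"):
    """Combine encoder and decoder feature maps of shape (N, C, H, W).

    'diff'   - the difference of features studied in this paper
               (the subtraction order enc - dec is an assumption),
    'add'    - ResNet-style summation, the "Add" baseline,
    'concat' - channel concatenation, as in the original U-Net.
    """
    if mode == "diff":
        return enc - dec
    if mode == "add":
        return enc + dec
    if mode == "concat":
        return np.concatenate([enc, dec], axis=1)
    raise ValueError(f"unknown mode: {mode}")
```

Note that "diff" and "add" preserve the channel count, while the original U-Net's concatenation doubles it, which changes the input size of the following convolution.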
3.4 Experimental Results
In this experiment, we classify a satellite image into two classes: road and background. We evaluate all methods using two values of the alpha (learning rate) parameter of the Adam optimizer (Kingma et al., 2015): 1e-4 and 1e-5. The AUC graphs of each method at alpha 1e-4 and 1e-5 are shown in Figures 2 and 3. In addition, we show the maximum AUC of each network in Table 1, where the top three AUCs are highlighted in red.
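The training configuration above can be sketched as follows. This is a hypothetical sketch assuming PyTorch (the paper does not name a framework); each network variant would be trained for 100 epochs under both Adam learning-rate ("alpha") settings.

```python
import torch
import torch.nn as nn

def train_variant(model, loader, alpha, epochs=100):
    """Train one network variant at the given Adam alpha (learning rate)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=alpha)
    loss_fn = nn.BCEWithLogitsLoss()  # binary road/background classification
    for _ in range(epochs):
        for patches, masks in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(patches), masks)
            loss.backward()
            optimizer.step()
    return model

# Each variant (U-Net, Add, Ours, Ours(fp), Ours(sp)) would be trained
# once per alpha value: for alpha in (1e-4, 1e-5): train_variant(...)
```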
As we can see from Table 1, our method using two paths for the difference of feature maps improves the AUC by approximately 3% compared with the original U-Net. The best AUC is obtained by the method Ours (sp), which is approximately 5% better than the U-Net. This result demonstrates that the difference of feature maps is effective for classifying small objects such as roads.
Figure 2: AUC graphs when alpha is 1e-4 (upper left: dataset 1, upper right: dataset 2, bottom left: dataset 3, bottom right: dataset 4).
When we compare Ours (fp) with Ours (sp), the difference of the feature maps between the conv2 and deconv2 layers is more effective, while the difference between the conv1 and deconv1 layers improves the accuracy only slightly. Road detection requires both the high-frequency components indicating the position of roads and the semantic information of roads. The shallow convolutional layers carry this high-frequency information.
Road Detection from Satellite Images by Improving U-Net with Difference of Features