techniques with a convolutional neural network. The experiments were carried out using 19 multitemporal images of the region of Ukraine acquired by the Landsat-8 and Sentinel-1A RS satellites. Two variations of a convolutional neural network architecture, called 1-D and 2-D, were used to explore spectral and spatial features, respectively, and reached accuracies of 93.5% and 94.6%, while Random Forest and MLP obtained 88.7% and 92.7%, respectively. Consequently, the proposed architecture proved more useful for the described problem.
Another deep learning approach for feature detection in the scope of remote sensing was proposed by (Zou et al., 2015), which treats feature extraction as a feature-reconstruction problem: the method selects the most reconstructive characteristics as the discriminative ones. In the experiments, 2800 orbital images divided into seven categories (grass, farm, industry, river, forest, residential, parking) were used for performance evaluation. To address the reconstruction problem, a Deep Belief Network (DBN) was employed, together with an iterative feature-learning algorithm developed to obtain reliable reconstruction weights and characteristics with small reconstruction errors. On average, an accuracy of 77% was reached; the most misclassified category was industry, with 65%, while the least confused class was forest, with an accuracy of 93.5%. Considering the complexity of this type of classification, the experiments validated the efficiency of the proposed method.
Our proposal uses the CNN architectures U-Net and Auto-Encoder, which have not yet been applied to the extraction of cartographic features, specifically roads, from aerial images. Consequently, this approach has not been reported in the scientific literature for such a problem.
3 PROPOSED METHODOLOGY
Our methodology comprises two DL network architectures: U-Net and Auto-Encoder. We have chosen these two networks because they are representative of the state of the art in the proposed task and, more importantly, are computationally efficient and capable of considering large amounts of contextual information, which is crucial in this case. The purpose of this paper is to compare both architectures for road network detection using aerial images. Each DL architecture is described in the subsections below.
3.1 U-Net Architecture
The U-Net has two phases: contraction and expansion. In the first step of the contraction path, the input image goes through two 3 × 3 convolutions with stride 1, generating 8 feature channels, each followed by a ReLU activation function, and then through a 2 × 2 max-pooling operation with stride 2. After each max-pooling operation, the number of feature channels is increased by a factor of two, while the input size is reduced by the same factor due to the pooling. A step of the contraction path is therefore defined by two convolutions and one max-pooling operation.
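For concreteness, the sketch below illustrates one contraction step in PyTorch. It is an illustration only, not our implementation: the number of input channels (3) and the use of padding that preserves spatial size are assumptions, while the two 3 × 3 convolutions, the ReLU, and the 2 × 2 max pooling follow the description above.

import torch
import torch.nn as nn

class ContractionStep(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # two 3x3 convolutions with stride 1, each followed by ReLU
        # (padding that preserves the spatial size is assumed here)
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        # 2x2 max pooling with stride 2 halves the spatial size
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.convs(x)   # kept for concatenation in the expansion path
        return self.pool(skip), skip

# first step: an assumed 3-channel aerial tile mapped to 8 feature channels
down, skip = ContractionStep(3, 8)(torch.randn(1, 3, 256, 256))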
After 4 steps, the resulting output is fed to a 2 × 2 up-convolution with stride 2 and 64 feature channels in the first step, which is the beginning of the expansion path. Every step of this phase consists of the up-convolution with the parameters described above; in addition, the output of the same stage of the contraction path is concatenated, and convolutions are applied as in the contraction path. After each step, the number of feature channels is reduced by a factor of two. In the last layer, a convolution with a single 1 × 1 kernel is applied, and the resulting tensor passes through the sigmoid function. The output is a single-channel image with pixel values in the interval [0, 1], which, during inference, is thresholded at 0.5 and mapped to black or white for visualization purposes. Black (zero) pixels are those classified as roads, and white pixels otherwise. Figure 2 represents the U-Net used in our work.
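The following PyTorch sketch illustrates one expansion step and the output layer. It is an illustration under assumptions (spatial-size-preserving padding and 8 channels at the last expansion stage); the 2 × 2 up-convolution, the concatenation with the same-stage contraction features, the 1 × 1 convolution, and the sigmoid follow the description above.

import torch
import torch.nn as nn

class ExpansionStep(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 2x2 up-convolution with stride 2 doubles the spatial size
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        # after concatenating the contraction-path features, two 3x3
        # convolutions are applied as in the contraction path
        self.convs = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([skip, x], dim=1)  # same-stage contraction features
        return self.convs(x)

# output layer: a single 1x1 convolution followed by a sigmoid; at inference
# the resulting [0, 1] map is thresholded at 0.5, and road pixels are
# rendered black (zero) for visualization
head = nn.Sequential(nn.Conv2d(8, 1, kernel_size=1), nn.Sigmoid())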
3.2 Auto-Encoder Architecture
This architecture is based on (Long et al., 2015), which proposed fully convolutional networks for semantic segmentation. A fully convolutional network does not have any fully connected layer, i.e., a layer in which all neurons are connected to all neurons of the next layer. Such layers were common in most topologies up to this point, especially in the deeper layers. However, there are benefits in eliminating full connections: 1) a decrease in the number of trainable parameters; 2) preservation of spatial correlation; 3) images of any size can be processed by the same network, as illustrated by the sketch below.
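The toy fully convolutional network below (a PyTorch sketch with arbitrary, assumed filter counts, unrelated to our topology) illustrates the third point: since no layer fixes the input resolution, the same weights process images of different sizes.

import torch
import torch.nn as nn

# no fully connected layer, so the parameter count does not depend on the
# input resolution
fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),
)

for h, w in [(128, 128), (256, 384)]:  # two different input sizes
    out = fcn(torch.randn(1, 3, h, w))
    print(out.shape)                   # the spatial size follows the input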
Aiming at future online applications and onboard processing, a relatively simple topology was used in this work, composed of three convolutional layers with filter size 5 × 5 and three deconvolutional layers (transposed convolutions) with the same filter size. Each layer is followed by a ReLU (Xu et al., 2015) to introduce non-linearities; convolutional layers are followed by a max-pooling of 2, which downsamples the spatial dimensions, while deconvolutional layers upsample their inputs by the same factor, doubling the spatial dimensions.
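A minimal PyTorch sketch of this encoder-decoder topology is given below for illustration; the per-layer filter counts and the single-channel output are assumptions, since the text fixes only the kernel sizes and the number of layers (the 3 × 3 layer connecting the two stacks is described next).

import torch.nn as nn

autoencoder = nn.Sequential(
    # three 5x5 convolutional layers, each followed by ReLU and a
    # max-pooling of 2 that halves the spatial dimensions
    nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    # 3x3 convolutional layer connecting the two stacks (see below)
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    # three 5x5 deconvolutional (transposed convolution) layers, each
    # followed by ReLU and each doubling the spatial dimensions
    nn.ConvTranspose2d(64, 32, kernel_size=5, stride=2,
                       padding=2, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, kernel_size=5, stride=2,
                       padding=2, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=5, stride=2,
                       padding=2, output_padding=1), nn.ReLU(),
)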
Convolutional and deconvolutional layers are connected by a fourth convolutional layer with filter size 3 × 3 and no activation func-