where freq(c) is the number of pixels of class c divided by the total number of pixels in the images where class c appears, and median freq is the median of those frequencies.
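For concreteness, here is a minimal sketch of how these weights could be computed, assuming labels are stored as integer masks and that each class is weighted by median freq / freq(c), as in the standard median frequency balancing scheme (the function name and data layout are our own, not the paper's):

```python
import numpy as np

def median_freq_weights(label_masks, num_classes):
    """Median frequency balancing weights (illustrative sketch).

    freq(c) = pixels of class c / total pixels of images where c appears;
    weight(c) = median_freq / freq(c).
    """
    class_pixels = np.zeros(num_classes)  # pixels labelled c, over all images
    image_pixels = np.zeros(num_classes)  # pixels of images containing c
    for mask in label_masks:
        for c in np.unique(mask):
            class_pixels[c] += np.count_nonzero(mask == c)
            image_pixels[c] += mask.size
    freq = class_pixels / np.maximum(image_pixels, 1)  # avoid 0/0
    present = freq > 0
    weights = np.zeros(num_classes)
    weights[present] = np.median(freq[present]) / freq[present]
    return weights
```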
In addition, since we use a new database that we had to label ourselves, we decided to use only 3 classes. However, as we will see in section 5, we do not train the neural network from scratch; we fine-tune it in order to take advantage of the information learned in its original training. This is done by initializing the neural network with pretrained weights and then training it with the new database. The pretrained weights we have used are those of the final SegNet model. Therefore, the layers of the network are initially configured to segment an image into 11 classes. To deal with the change in the number of classes, we set the class frequencies of the remaining 8 classes to 0 in the Softmax layer of the neural network, where the class balancing is done.
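As an illustration of this zeroing, the following sketch shows a per-pixel weighted cross-entropy in which the 8 unused classes receive zero weight; the class ordering and the concrete weight values are hypothetical, and in practice the balancing happens inside SegNet's loss layer rather than in standalone code:

```python
import numpy as np

def weighted_pixel_loss(probs, labels, class_weights):
    """Per-pixel weighted cross-entropy (sketch).

    probs: (H, W, C) softmax output; labels: (H, W) integer mask.
    A class with weight 0 contributes nothing to the loss, which is
    how the 8 pretrained classes we do not use are silenced.
    """
    h, w, _ = probs.shape
    p = probs[np.arange(h)[:, None], np.arange(w), labels]
    return float(-(class_weights[labels] * np.log(p + 1e-12)).mean())

# 11 classes from the pretrained SegNet model; only our 3 classes keep
# a nonzero (median frequency) weight. Ordering and values hypothetical.
weights = np.zeros(11)
weights[:3] = [0.5, 1.8, 0.9]
```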
With these changes, SegNet is modified to fit our purpose. Figure 2 shows the final appearance of the image segmentation block. SegNet now provides 3 probability images, one for each class. In each of these images, every pixel is set to its probability of belonging to the corresponding class. This is done for all the frames in a video, and the resulting probabilities are then passed on to the temporal processing block.
Figure 2: Input and output of SegNet. The output shows a probability image for each class (sky, road and other, in that order). The whiter the pixel, the more likely it belongs to the corresponding class.
3.2 Temporal Processing
In order to make the most of the temporal characteristics of the videos, we have added an additional processing block after SegNet's segmentation. We take advantage of the temporal continuity of the videos, as they do not show abrupt changes in the scene from frame to frame. Therefore, we can perform image rectification between a frame and the next one with minimal error. There are two main methods to align two frames. On the one hand, one can look for interest points (mostly corners) in one of the images and try to find those same points in the other frame. Then, the movement of these points from one frame to the other is calculated, and one of the frames is transformed to counterbalance the motion of the camera. On the other hand, it is possible to align two frames taking into account the whole image and not only some points. This is the method we use, as the interest points in our images would mostly belong to moving vehicles. Therefore, if we took those points as a reference, we would counterbalance the whole scene based on the movement of a vehicle, not the motion of the camera. In order to estimate the transformation that has to be applied to one of the frames to align them, we use the Enhanced Correlation Coefficient (ECC) algorithm (Evangelidis and Psarakis, 2008). This iterative algorithm provides the transformation that maximizes the correlation coefficient (1)
between the first frame ($i_r$) and the second one after it is transformed ($i_w$) with transformation parameters $p$:

$$\rho(p) = \frac{i_r^{T}\, i_w(p)}{\|i_r\|\,\|i_w(p)\|} \tag{1}$$
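A minimal sketch of this alignment step, using OpenCV's findTransformECC (an implementation of the ECC algorithm of Evangelidis and Psarakis); the Euclidean motion model, the termination criteria and the function name align_frames are assumptions, as the paper does not specify them:

```python
import cv2
import numpy as np

def align_frames(prev_gray, curr_gray):
    """Estimate the ECC transform aligning curr_gray to prev_gray.

    Sketch only: we assume a Euclidean motion model and grayscale
    float32 inputs; the paper does not state either choice.
    """
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    # OpenCV >= 4.1 signature: inputMask=None, gaussFiltSize=5
    _, warp = cv2.findTransformECC(prev_gray, curr_gray, warp,
                                   cv2.MOTION_EUCLIDEAN, criteria, None, 5)
    # Warp the second frame to counterbalance the camera motion.
    aligned = cv2.warpAffine(curr_gray, warp, prev_gray.shape[::-1],
                             flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    return warp, aligned
```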
Once we have found the optimal transformation, we can align both frames and, therefore, know which pixel in the first frame corresponds to which pixel in the second (although some of them will remain unpaired). With this, we can add the probabilities of a pixel throughout a sequence of images in a video, thus obtaining a cumulative probability for each class. When classifying a pixel of a frame, we choose the class with the highest cumulative probability, taking into account the last 10 frames. The results obtained with this additional processing are more robust and coherent in time. Details of the results can be found in section 5.
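To make the accumulation concrete, here is a sketch of one temporal step, assuming the per-frame probability maps are kept in a sliding window and re-warped with each new ECC transform (the deque-based bookkeeping and all names are our own, not the paper's):

```python
from collections import deque
import cv2
import numpy as np

WINDOW = 10  # the classification accounts for the last 10 frames

def temporal_step(history, new_probs, warp):
    """Warp stored probability maps into the current frame and classify.

    history: deque of (H, W, 3) maps, all kept in the previous frame's
             coordinates (this function maintains that invariant).
    new_probs: (H, W, 3) SegNet probabilities for the current frame.
    warp: 2x3 affine from ECC mapping the current frame onto the
          previous one.
    """
    h, w, _ = new_probs.shape
    for i in range(len(history)):
        # Re-express each past map in the current frame's coordinates;
        # pixels coming from outside the image stay unpaired (zero).
        history[i] = cv2.warpAffine(history[i], warp, (w, h),
                                    flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    history.append(new_probs)
    if len(history) > WINDOW:
        history.popleft()
    cumulative = sum(history)             # cumulative probability per class
    return np.argmax(cumulative, axis=2)  # most probable class per pixel
```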
4 DATABASE
Since the main component of our road detector is a neural network, a training phase is required before good performance can be obtained. However, training a neural network requires a labelled database. As we could not find one suitable for our work, nor had the resources to create enough material of our own, we gathered unlabelled videos from different sources and labelled them ourselves.
This database¹ consists of 3,323 images from 22 different scenes. All the images have been captured by a flying drone and the main element in the scene is a road, but the illumination and viewpoint vary greatly.
¹ https://imatge.upc.edu/web/resources/spatio-temporal-road-detection-aerial-imagery-using-cnns-dataset