Table 1: Semantic Segmentation Results on SYNTHIA Sequences. We split the test sequences into two parts: Highway for high speeds and City for medium speeds.
Dataset Architecture Mean IoU Sky Building Road Sidewalk Fence Vegetation Pole Car Lane
Highway
FCN 85.42 0.91 0.67 0.89 0.02 0.71 0.79 0.01 0.81 0.72
MSFCN-2 93.44 0.92 0.66 0.94 0.28 0.85 0.78 0.11 0.82 0.71
RFCN-2 94.17 0.93 0.71 0.95 0.31 0.82 0.83 0.13 0.87 0.7
MSFCN-3 94.38 0.93 0.69 0.96 0.31 0.87 0.81 0.12 0.87 0.72
City
FCN 73.88 0.94 0.94 0.72 0.78 0.34 0.54 0 0.69 0.56
MSFCN-2 87.77 0.87 0.94 0.84 0.83 0.68 0.64 0 0.8 0.8
RFCN-2 88.24 0.91 0.92 0.87 0.78 0.56 0.67 0 0.8 0.74
MSFCN-3 88.89 0.88 0.89 0.86 0.74 0.64 0.53 0 0.71 0.72
Table 2: Semantic Segmentation Results on KITTI Video Sequence.
Architecture NumParams Mean IoU Sky Building Road Sidewalk Fence Vegetation Car Sign
FCN 23,668,680 74.00 46.18 86.50 80.60 69.10 37.25 81.94 74.35 35.11
MSFCN-2 (shared weights) 23,715,272 85.31 47.89 91.08 97.58 88.02 62.60 92.01 90.26 58.11
RFCN-2 (shared weights) 31,847,828 84.19 50.20 93.74 94.90 88.17 59.73 87.73 87.66 55.55
MSFCN-2 47,302,984 85.47 48.72 92.29 96.36 90.21 59.60 92.43 89.27 70.47
RFCN-2 55,435,540 83.38 44.80 92.84 91.77 91.67 58.53 86.01 87.25 52.87
Table 3: Semantic Segmentation Results on SYNTHIA Video Sequence.
Architecture Mean IoU Sky Building Road Sidewalk Fence Vegetation Pole Car Sign Pedestrian Cyclist Lane
FCN 84.08 97.2 92.97 87.74 81.58 34.44 62 1.87 72.75 0.21 0.01 0.33 93.08
MSFCN-2 (shared) 88.88 97.08 93.14 93.58 86.81 47.47 75.11 46.78 88.22 0.27 32.12 2.27 95.26
RFCN-2 (shared) 88.16 96.85 91.07 94.17 85.62 28.29 83.2 47.28 87.6 19.12 16.89 3.01 93.97
MSFCN-2 90.01 97.34 95.97 93.14 86.76 73.52 73.63 35.02 87.86 3.62 27.57 1.11 95.35
RFCN-2 89.48 97.15 94.01 93.76 85.88 76.26 70.35 39.86 87.5 8.16 28.05 1.28 94.67
encoder feature extraction per frame suffices, and the fused encoder output is computed as a combination of previously computed encoder features. This weight sharing approach drastically reduces the complexity, with negligible additional computation relative to the single-stream encoder. We demonstrate experimentally that the weight-shared encoder still provides a significant improvement in accuracy.
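As an illustration, the sketch below shows how a two-stream network with a shared encoder can be assembled in Keras. The encoder layers, the fusion by concatenation, the decoder head and the function names are simplified placeholders assumed for illustration, not the exact MSFCN-2 architecture; the sketch only demonstrates that a single set of encoder weights is applied to both frames before fusion.

from tensorflow import keras
from tensorflow.keras import layers

def build_encoder(input_shape=(224, 384, 3)):
    # Placeholder convolutional encoder; a stand-in for the FCN encoder used in the paper.
    inp = keras.Input(shape=input_shape)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    return keras.Model(inp, x, name="shared_encoder")

def build_msfcn2_shared(num_classes=12, input_shape=(224, 384, 3)):
    encoder = build_encoder(input_shape)            # single set of encoder weights
    frame_t   = keras.Input(shape=input_shape, name="frame_t")
    frame_tm1 = keras.Input(shape=input_shape, name="frame_t_minus_1")
    # The same encoder is applied to both frames, so only one encoder pass
    # per new frame is needed at inference time.
    feat_t, feat_tm1 = encoder(frame_t), encoder(frame_tm1)
    fused = layers.Concatenate()([feat_t, feat_tm1])  # fuse encoder features
    # Lightweight decoder head (illustrative only).
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(fused)
    x = layers.UpSampling2D(size=4)(x)
    out = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return keras.Model([frame_t, frame_tm1], out, name="msfcn2_shared")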
4 EXPERIMENTS
In this section, we describe the experimental setting, including the datasets used and training details, and discuss the results.
4.1 Experimental Setup
In most datasets, the frames in a video sequence are sparsely sampled in time to obtain a better diversity of objects. Thus consecutive video frames are not available for training our multi-stream algorithm. Synthetic datasets incur no annotation cost, and ground truth annotation is available for all consecutive frames. Hence we made use of the synthetic autonomous driving dataset SYNTHIA (Ros et al., 2016) for our experiments. We also made use of DAVIS2017 (Pont-Tuset et al., 2017) and SegTrack V2 (Li et al., 2013), which provide consecutive frames; they are not automotive datasets but are realistic.
We implemented the different proposed multi-stream architectures using Keras (Chollet et al., 2015). We used the ADAM optimizer as it provided faster convergence. The maximum order (number of consecutive frames) used in training is three (MSFCN-3) because of the memory needed for training. Categorical cross-entropy is used as the loss function. The maximum number of training epochs is set to 30, and early stopping with a patience of 10 epochs monitoring the gains is added. Mean class IoU and per-class IoU were used as accuracy metrics. All input images were resized to 224x384 because of the memory requirements of multiple streams.
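The corresponding training configuration can be sketched as follows. It wires up the settings listed above (ADAM, categorical cross-entropy, a 30-epoch cap, early stopping with patience 10, 224x384 inputs); the dummy data, validation split, monitored quantity and the mean-IoU metric implementation (OneHotMeanIoU, available in recent Keras versions) are assumptions for illustration, not details taken from the paper.

import numpy as np
from tensorflow import keras

NUM_CLASSES = 12
model = build_msfcn2_shared(num_classes=NUM_CLASSES, input_shape=(224, 384, 3))  # from the earlier sketch

model.compile(
    optimizer=keras.optimizers.Adam(),                 # ADAM for faster convergence
    loss="categorical_crossentropy",                   # per-pixel categorical cross-entropy
    metrics=[keras.metrics.OneHotMeanIoU(NUM_CLASSES)],  # mean class IoU (recent Keras versions)
)

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # assumed monitored quantity
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True,
)

# Dummy frame pairs resized to 224x384 and one-hot label maps; replace with real sequence data.
x_t   = np.random.rand(8, 224, 384, 3).astype("float32")
x_tm1 = np.random.rand(8, 224, 384, 3).astype("float32")
y     = keras.utils.to_categorical(
    np.random.randint(0, NUM_CLASSES, size=(8, 224, 384)), num_classes=NUM_CLASSES)

model.fit([x_t, x_tm1], y,
          validation_split=0.1,
          epochs=30,             # maximum number of training epochs
          callbacks=[early_stopping])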
4.2 Experimental Results and Discussion
We performed four sets of experiments, summarized in four tables. Qualitative results are provided in Figure 4 for KITTI, Figure 5 for DAVIS and Figure 6 for SYNTHIA. We also provide a video sequence demonstrating qualitative results on a larger set of frames.
Table 1: Firstly, we wanted to evaluate different multi-stream orders and understand their impact. We also wanted to understand the impact in high-speed and medium-speed scenarios. The SYNTHIA dataset was used for this experiment as it provides separate sequences for different speeds and is also relatively large. Two-stream networks provided a considerable increase in accuracy compared to the baseline. MSFCN-2 improved mean IoU by about 8 percentage points for the Highway sequence and 14 for the City sequence. RFCN-2 provided slightly better accuracy than MSFCN-2. MSFCN-3 provided only a marginal improvement over MSFCN-2, and thus we did not explore higher orders.
Table 2: KITTI is a popular automotive dataset and