works can be used for the task of semantic segmenta-
tion. Our approach performs classification of image
patches at each pixel position. We analyzed differ-
ent popular network architectures along with differ-
ent techniques to improve the training. Furthermore,
we demonstrated how spatial prior information like
pixel positions can be incorporated into the learning
process leading to a significant performance gain.
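The per-pixel patch classification with a spatial prior summarized above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the paper: the function names `extract_patch`, `spatial_prior`, and `patch_inputs` are hypothetical, border pixels are handled by clamping (an assumption), and the pixel position is normalized to [0, 1] (also an assumption). How the position is fed into the network is not shown here.

```python
from typing import Iterator, List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of intensities


def extract_patch(image: Image, row: int, col: int, half: int) -> List[float]:
    """Return the flattened (2*half+1) x (2*half+1) patch centered at (row, col).

    Assumption: coordinates outside the image are clamped to the nearest
    border pixel, so every pixel position yields a patch of the same size.
    """
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        rr = min(max(r, 0), height - 1)
        for c in range(col - half, col + half + 1):
            cc = min(max(c, 0), width - 1)
            patch.append(image[rr][cc])
    return patch


def spatial_prior(row: int, col: int, height: int, width: int) -> Tuple[float, float]:
    """Return the pixel position as normalized (y, x) coordinates in [0, 1]."""
    y = row / (height - 1) if height > 1 else 0.0
    x = col / (width - 1) if width > 1 else 0.0
    return y, x


def patch_inputs(
    image: Image, half: int = 1
) -> Iterator[Tuple[int, int, List[float], Tuple[float, float]]]:
    """Yield (row, col, patch, position) for every pixel of the image.

    A classifier would receive the patch together with the position and
    predict one label for the pixel at (row, col).
    """
    height, width = len(image), len(image[0])
    for row in range(height):
        for col in range(width):
            yield (
                row,
                col,
                extract_patch(image, row, col, half),
                spatial_prior(row, col, height, width),
            )
```

Because one patch is generated and classified for every pixel, the cost of prediction grows with the number of pixels in the image.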
For evaluation, we used two different application
scenarios: road detection and urban scene understand-
ing. We were able to achieve very good results in the
road detection challenge of the popular KITTI Vision
Benchmark Suite. In this scenario we outperformed
several competitors, even those that use stereo images
or laser data.
For a second set of experiments, we used the
LabelMeFacade dataset (Fröhlich et al., 2010),
which poses a multi-class classification task and shows
very diverse urban scenes. We were again able to
achieve state-of-the-art results. Future work will
focus on speeding up the prediction phase, since we
currently need around 30 s per image to infer the
label at each position.
REFERENCES
Alvarez, J. M., Gevers, T., LeCun, Y., and Lopez, A. M.
(2012). Road scene segmentation from a single image.
In European Conference on Computer Vision (ECCV),
pages 376–389.
Alvarez, J. M. and Lopez, A. M. (2011). Road detection
based on illuminant invariance. IEEE Transactions on
Intelligent Transportation Systems, 12(1):184–193.
Chellapilla, K., Puri, S., Simard, P., et al. (2006). High
performance convolutional neural networks for docu-
ment processing. In Tenth International Workshop on
Frontiers in Handwriting Recognition.
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2014).
Convolutional nets and watershed cuts for real-time
semantic labeling of RGBD videos. Journal of Machine
Learning Research (JMLR), 15:3489–3511.
Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Effi-
cient graph-based image segmentation. International
Journal of Computer Vision, 59(2):1–26.
Fritsch, J., Kühnl, T., and Geiger, A. (2013). A new
performance measure and evaluation benchmark for road
detection algorithms. In IEEE International Conference
on Intelligent Transportation Systems, pages
1693–1700.
Fröhlich, B., Rodner, E., and Denzler, J. (2010). A fast
approach for pixelwise labeling of facade images. In
Proceedings of the International Conference on Pattern
Recognition (ICPR), volume 7, pages 3029–3032.
Fröhlich, B., Rodner, E., and Denzler, J. (2012). Semantic
segmentation with millions of features: Integrating
multiple cues in a combined random forest approach.
In Asian Conference on Computer Vision
(ACCV), pages 218–231.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? The KITTI vision benchmark
suite. In Computer Vision and Pattern Recognition
(CVPR), pages 3354–3361.
Glorot, X. and Bengio, Y. (2010). Understanding the dif-
ficulty of training deep feedforward neural networks.
In International Conference on Artificial Intelligence
and Statistics (AISTATS), pages 249–256.
Gupta, S., Girshick, R., Arbeláez, P., and Malik, J. (2014).
Learning rich features from RGB-D images for object
detection and segmentation. In European Conference
on Computer Vision (ECCV).
Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J.
(2014). Simultaneous detection and segmentation. In
European Conference on Computer Vision (ECCV).
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. R. (2012). Improving neural
networks by preventing co-adaptation of feature de-
tectors. arXiv preprint arXiv:1207.0580.
Kang, Y., Yamaguchi, K., Naito, T., and Ninomiya, Y.
(2011). Multiband image segmentation and object
recognition for understanding road scenes. IEEE
Transactions on Intelligent Transportation Systems,
12(4):1423–1433.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
ImageNet classification with deep convolutional neural
networks. In Advances in Neural Information Processing
Systems (NIPS), pages 1097–1105.
Kühnl, T., Kummert, F., and Fritsch, J. (2011). Monocular
road segmentation using slow feature analysis. In
Proceedings of the IEEE Intelligent Vehicles Symposium,
pages 800–806.
Kühnl, T., Kummert, F., and Fritsch, J. (2012). Spatial
ray features for real-time ego-lane extraction. In
IEEE Conference on Intelligent Transportation Systems,
pages 288–293.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
R. E., Hubbard, W., and Jackel, L. D. (1989). Back-
propagation applied to handwritten zip code recogni-
tion. Neural computation, 1(4):541–551.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (2001).
Gradient-based learning applied to document recogni-
tion. In Intelligent Signal Processing, pages 306–351.
IEEE Press.
Masci, J., Giusti, A., Ciresan, D. C., Fricout, G., and
Schmidhuber, J. (2013). A fast learning algorithm for
image segmentation with max-pooling convolutional
networks. arXiv preprint arXiv:1302.1690.
Nowozin, S. (2014). Optimal decisions from probabilis-
tic models: the intersection-over-union case. In Com-
puter Vision and Pattern Recognition (CVPR).
Scharwaechter, T., Enzweiler, M., Franke, U., and Roth, S.
(2013). Efficient multi-cue scene segmentation. In
German Conference on Pattern Recognition (GCPR),
Lecture Notes in Computer Science, pages 435–445.
Torralba, A. (2003). Contextual priming for object de-
tection. International Journal of Computer Vision
(IJCV), 53(2):169–191.
Zhang, C., Wang, L., and Yang, R. (2010). Semantic seg-
mentation of urban scenes using dense depth maps. In
Daniilidis, K., Maragos, P., and Paragios, N., editors,
European Conference on Computer Vision (ECCV),
pages 708–721.