Joint Semantic and Motion Segmentation for Dynamic Scenes using Deep Convolutional Networks

Nazrul Haque, Dinesh Reddy, K. Madhava Krishna

2017

Abstract

Dynamic scene understanding is a challenging problem, and motion segmentation plays a crucial role in solving it. Incorporating semantics and motion enhances the overall perception of a dynamic scene. For outdoor robotic navigation, joint learning methods have not been extensively used for extracting spatiotemporal features or for adding different priors into the formulation. The task becomes even more challenging when no stereo information is available. This paper proposes an approach that fuses semantic features and motion cues using CNNs to address the problem of monocular semantic motion segmentation. We deduce semantic and motion labels by integrating optical flow, as a constraint, with semantic features in a dilated convolution network. The pipeline consists of three main stages: feature extraction, feature amplification, and multi-scale context aggregation, which together fuse the semantic and flow features. Our joint formulation shows significant improvements in monocular motion segmentation over state-of-the-art methods on the challenging KITTI tracking dataset.
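As a rough illustration of the fusion idea in the abstract, the following minimal NumPy sketch concatenates a semantic feature map with a two-channel optical flow field and passes the result through stacked 3x3 dilated convolutions. It is not the authors' network: the channel counts, the random weights, the dilation rates (1, 2, 4), and the function names `dilated_conv2d`, `aggregate_context`, and `receptive_field` are all assumptions made for this example. It only shows how dilation enlarges the receptive field while keeping the spatial resolution of the fused features.

```python
import numpy as np


def dilated_conv2d(x, w, dilation):
    """3x3 dilated convolution with zero padding, so H and W are preserved.

    x: (C_in, H, W) feature map; w: (C_out, C_in, 3, 3) weights.
    """
    _, h, width = x.shape
    padded = np.pad(x, ((0, 0), (dilation, dilation), (dilation, dilation)))
    out = np.zeros((w.shape[0], h, width))
    for i in range(3):
        for j in range(3):
            # Shifted view of the input for kernel tap (i, j).
            patch = padded[:, i * dilation:i * dilation + h,
                           j * dilation:j * dilation + width]
            # Sum over input channels: (C_out, C_in) x (C_in, H, W).
            out += np.tensordot(w[:, :, i, j], patch, axes=([1], [0]))
    return out


def aggregate_context(semantic, flow, weights, dilations=(1, 2, 4)):
    """Concatenate semantic and flow features, then apply stacked dilated convs."""
    x = np.concatenate([semantic, flow], axis=0)  # channel-wise fusion
    for w, d in zip(weights, dilations):
        x = np.maximum(dilated_conv2d(x, w, d), 0.0)  # ReLU
    return x


def receptive_field(dilations, kernel=3):
    """Receptive field (pixels per side) of stacked dilated convolutions."""
    return 1 + sum((kernel - 1) * d for d in dilations)


rng = np.random.default_rng(0)
semantic = rng.standard_normal((8, 16, 32))  # hypothetical 8-channel semantic features
flow = rng.standard_normal((2, 16, 32))      # optical flow (u, v) per pixel
channels = semantic.shape[0] + flow.shape[0]
weights = [rng.standard_normal((channels, channels, 3, 3)) * 0.1 for _ in range(3)]
fused = aggregate_context(semantic, flow, weights)
print(fused.shape, receptive_field((1, 2, 4)))
```

With these illustrative settings, three layers cover a 15x15 pixel neighbourhood without any downsampling, which is the property that makes dilated convolutions attractive for dense per-pixel labelling.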



Paper Citation


in Harvard Style

Haque N., Reddy D. and Madhava Krishna K. (2017). Joint Semantic and Motion Segmentation for Dynamic Scenes using Deep Convolutional Networks. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017) ISBN 978-989-758-226-4, pages 75-85. DOI: 10.5220/0006129200750085


in Bibtex Style

@conference{visapp17,
author={Nazrul Haque and Dinesh Reddy and K. Madhava Krishna},
title={Joint Semantic and Motion Segmentation for Dynamic Scenes using Deep Convolutional Networks},
booktitle={Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)},
year={2017},
pages={75-85},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006129200750085},
isbn={978-989-758-226-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017)
TI - Joint Semantic and Motion Segmentation for Dynamic Scenes using Deep Convolutional Networks
SN - 978-989-758-226-4
AU - Haque N.
AU - Reddy D.
AU - Madhava Krishna K.
PY - 2017
SP - 75
EP - 85
DO - 10.5220/0006129200750085