Starting from Xavier-initialised weights and then performing self-training results in a higher mIoU (by +0.47%) than starting from ImageNet pre-trained weights and also performing self-training. Still, model initialisation using ImageNet pre-trained weights outperforms Xavier initialisation by 1.8% in the mIoU score.
Initialising the Batch Normalisation layers with Xavier instead of keeping their self-trained weights improves the mIoU by +2.66%.
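As a minimal PyTorch-style sketch of this initialisation scheme (the checkpoint path and state-dict layout are placeholders, and treating "random/Xavier initialisation of the BN layers" as simply resetting them to the framework defaults instead of loading the self-trained values is our reading, not the paper's code):

```python
import torch
import torch.nn as nn

def init_segmentation_net(model: nn.Module, checkpoint_path: str) -> nn.Module:
    """Initialise from self-trained weights, but re-initialise the BN layers."""
    # Hypothetical checkpoint produced by the self-supervised pre-training stage.
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state, strict=False)  # copy matching pre-trained tensors

    # Discard the self-trained Batch Normalisation parameters/statistics and
    # fall back to a fresh re-initialisation instead.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_parameters()      # affine weight/bias back to defaults
            m.reset_running_stats()   # running mean/variance reset
    return model
```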
Using the same machine, the same dataset and the same image input size (1024 × 64), our best model outperforms SalsaNext by +5.53% in the mIoU score. For approximately the same mIoU score, the segmentation network initialised with self-trained weights yields a lower average PPNLL than the network initialised with Xavier, which shows that self-training reduces the epistemic uncertainty of the model. For the same architecture and the same self-training, the lower the segmentation validation loss, the lower the model's epistemic uncertainty.
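For illustration, a hedged sketch of how an average PPNLL of this kind can be computed with Monte-Carlo dropout follows; reading PPNLL as a per-pixel posterior-predictive negative log-likelihood, keeping the dropout layers stochastic at test time, and the choice of 20 samples are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def average_ppnll(model, images, labels, num_samples=20, ignore_index=255):
    """Average per-pixel NLL under the MC-dropout posterior predictive."""
    model.eval()
    # Keep dropout active at inference time (MC dropout).
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()

    probs = None
    for _ in range(num_samples):
        p = F.softmax(model(images), dim=1)          # (B, C, H, W)
        probs = p if probs is None else probs + p
    probs = probs / num_samples                      # posterior predictive mean

    # Negative log-likelihood of the true labels under the averaged predictive.
    nll = F.nll_loss(torch.log(probs.clamp_min(1e-12)), labels,
                     ignore_index=ignore_index, reduction="mean")
    return nll.item()
```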
The recipe that produced the best results was the following: start from the TransUnet architecture, keeping the Transformer block but replacing the ResNet CNN and decoder blocks with those of the SalsaNext architecture; apply self-supervised pre-training with an input-reconstruction objective; initialise the segmentation network with the pre-trained weights, except for the Batch Normalisation layers, which are randomly initialised; and use the cross-entropy loss plus the Lovász-Softmax loss as the semantic segmentation loss.
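A minimal sketch of this combined segmentation loss is given below. The `lovasz_softmax` function is assumed to come from an external implementation (e.g. the reference code released with the Lovász-Softmax paper) and is not reproduced here; the import path, the 1:1 weighting of the two terms and the `ignore_index` value are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical import path for an existing Lovász-Softmax implementation.
from lovasz_losses import lovasz_softmax

class SegmentationLoss(nn.Module):
    """Cross-entropy plus Lovász-Softmax for semantic segmentation."""
    def __init__(self, class_weights=None, ignore_index=255):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(weight=class_weights,
                                      ignore_index=ignore_index)
        self.ignore_index = ignore_index

    def forward(self, logits, labels):
        # logits: (B, C, H, W) raw network outputs; labels: (B, H, W) class ids.
        ce_term = self.ce(logits, labels)
        lovasz_term = lovasz_softmax(F.softmax(logits, dim=1), labels,
                                     ignore=self.ignore_index)
        # Equal weighting of the two terms is an illustrative choice.
        return ce_term + lovasz_term
```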
REFERENCES
Assran, M., Caron, M., Misra, I., Bojanowski, P., Joulin, A.,
Ballas, N., and Rabbat, M. (2021). Semi-supervised
learning of visual features by non-parametrically pre-
dicting view assignments with support samples. arXiv
preprint arXiv:2104.13963.
Atito, S., Awais, M., and Kittler, J. (2021). Sit:
Self-supervised vision transformer. arXiv preprint
arXiv:2104.03602.
Balan, A. K., Rathod, V., Murphy, K., and Welling,
M. (2015). Bayesian dark knowledge. CoRR,
abs/1506.04416.
Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke,
S., Stachniss, C., and Gall, J. (2019). SemanticKITTI:
A Dataset for Semantic Scene Understanding of Li-
DAR Sequences. In Proc. of the IEEE/CVF Interna-
tional Conf. on Computer Vision (ICCV).
Bhattacharyya, P., Huang, C., and Czarnecki, K. (2021). Sa-
det3d: Self-attention based context-aware 3d object
detection.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E.,
Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Bei-
jbom, O. (2019). nuscenes: A multimodal dataset for
autonomous driving. CoRR, abs/1903.11027.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kir-
illov, A., and Zagoruyko, S. (2020). End-to-end
object detection with transformers. arXiv preprint
arXiv:2005.12872.
Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu,
Z., Ma, S., Xu, C., Xu, C., and Gao, W. (2020).
Pre-trained image processing transformer. CoRR,
abs/2012.00364.
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wangy, Y.,
Lu, L., Yuille, A. L., and Zhou, Y. (2021). Transunet:
Transformers make strong encoders for medical image
segmentation. arXiv preprint arXiv:2102.04306.
Cortinhal, T., Tzelepis, G., and Aksoy, E. E. (2020). Sal-
sanext: Fast, uncertainty-aware semantic segmenta-
tion of lidar point clouds for autonomous driving.
Dai, Z., Cai, B., Lin, Y., and Chen, J. (2021). Up-detr: Un-
supervised pre-training for object detection with trans-
formers. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1601–1610.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In ICLR.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian
approximation: Representing model uncertainty in
deep learning.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for Autonomous Driving? The KITTI Vision Bench-
mark Suite. In Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), pages 3354–
3361.
Glorot, X. and Bengio, Y. (2010). Understanding the diffi-
culty of training deep feedforward neural networks. In
Teh, Y. W. and Titterington, M., editors, Proceedings
of the Thirteenth International Conference on Artifi-
cial Intelligence and Statistics, volume 9 of Proceed-
ings of Machine Learning Research, pages 249–256,
Chia Laguna Resort, Sardinia, Italy. PMLR.
Graves, A. (2011). Practical variational inference for neural
networks. In Shawe-Taylor, J., Zemel, R., Bartlett,
P., Pereira, F., and Weinberger, K. Q., editors, Ad-
vances in Neural Information Processing Systems,
volume 24. Curran Associates, Inc.
Hahner, M., Dai, D., Liniger, A., and Gool, L. V. (2020).
Quantifying data augmentation for lidar based 3d ob-
ject detection. CoRR, abs/2004.01643.
Hao, W., Li, C., Li, X., Carin, L., and Gao, J. (2020).
Towards learning a generic agent for vision-and-
language navigation via pre-training. Conference on
Computer Vision and Pattern Recognition (CVPR).
Hernández-Lobato, J. M. and Adams, R. P. (2015). Prob-
abilistic backpropagation for scalable learning of
bayesian neural networks.