attention): one with the feature transform disabled,
and one with it enabled.
As can be seen in Table 2, the baseline already
refines inverse-depth significantly. Without our feature
transformation, the models are unable to exploit multi-
view information because of the vastly different view-
points; indeed, adding the extra views without it slightly
hurts performance. Only when the features are trans-
formed between viewpoints does performance rise above
the baseline, highlighting the importance of our method
for successfully aggregating multiple views.
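To make the ablated operation concrete, the sketch below shows one way a neighbour's feature map could be warped into the reference view and then passed through a learned transform before aggregation. It is a minimal illustration under assumed details: the module name, channel count, layer choices, and the use of a grid_sample-based warp are illustrative assumptions, not details taken from our implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class WarpAndTransform(nn.Module):
    """Illustrative sketch: bring a neighbour's feature map into the
    reference view (warp), then adapt it with a learned transform.
    Layer choices and channel sizes are assumptions for illustration."""

    def __init__(self, channels=32):
        super().__init__()
        # Stand-in for the feature transformation ablated in Table 2.
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, nbr_feat, warp_grid, use_transform=True):
        # nbr_feat:  (B, C, H, W) feature map rendered in the neighbour view
        # warp_grid: (B, H, W, 2) sampling grid mapping reference pixels to
        #            neighbour image coordinates (assumed precomputed from
        #            the mesh geometry)
        warped = F.grid_sample(nbr_feat, warp_grid, mode="bilinear",
                               padding_mode="zeros", align_corners=False)
        if not use_transform:
            return warped              # warping only: the weaker ablation variant
        return self.transform(warped)  # warping plus feature transformation
```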
4.3 Generalisation
To assess the ability of our method to generalise to
unseen reconstructions, we divide our training data by
sequence: we use two of the sequences for training,
and the third for testing. Sequences 00 and 05 are
recorded in a suburban area with narrow roads, while
sequence 06 is a loop on a divided road with a median
strip, a much wider and visually distinct space. We
train models for 200 000 steps and aggregate feature
maps by averaging. The results in Table 3 show that
our method successfully uses information from mul-
tiple views, even in areas of a city different from the
ones it was trained on. Furthermore, they reaffirm the
need for our feature transformation method in addition
to warping.
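For completeness, a minimal sketch of the averaging aggregation used in this experiment is given below; the tensor shapes and the validity-mask convention are illustrative assumptions rather than details of our implementation.

```python
import torch

def average_aggregate(ref_feat, warped_feats, valid_masks):
    """Per-pixel mean of the reference features and all warped neighbour
    features, counting each neighbour only where its warp is valid
    (illustrative sketch, not the exact implementation)."""
    # ref_feat:     (B, C, H, W) features in the reference view
    # warped_feats: list of (B, C, H, W), already warped and transformed
    # valid_masks:  list of (B, 1, H, W) in {0, 1}, 1 where the warp is valid
    total = ref_feat.clone()
    count = torch.ones_like(ref_feat[:, :1])  # the reference view always counts
    for feat, mask in zip(warped_feats, valid_masks):
        total = total + feat * mask
        count = count + mask
    return total / count
```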
5 CONCLUSION AND FUTURE WORK
In conclusion, we have presented a new method for
correcting dense reconstructions via 2D mesh feature
renderings. In contrast to previous work, we make pre-
dictions on multiple views at the same time by warping
and aggregating feature maps inside a CNN. In addi-
tion to warping the feature maps, we also transform the
features between views and show that this is necessary
for using arbitrary viewpoints.
The method presented here aggregates feature
maps between every pair of overlapping input views.
This scales quadratically with the number of views
and thus limits the size of the neighbourhood we can
reasonably process. Future work will consider aggre-
gation into a shared 2D spatial representation, such as
a 360° view, which would scale linearly with the input
neighbourhood size.
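The following back-of-the-envelope sketch makes the scaling comparison explicit, counting warp operations under the simplifying assumption that each warp has unit cost; the shared-representation variant is the hypothetical future scheme, not something implemented here.

```python
def pairwise_warps(num_views):
    # Current scheme: every view's features are warped into every other
    # overlapping view, i.e. up to N * (N - 1) warps -- quadratic in N.
    return num_views * (num_views - 1)

def shared_representation_warps(num_views):
    # Hypothetical future scheme: each view is warped once into a shared
    # 2D representation (e.g. a 360-degree view) and the fused result is
    # warped back once per view -- 2 * N warps, linear in N.
    return 2 * num_views

for n in (2, 4, 8, 16):
    print(n, pairwise_warps(n), shared_representation_warps(n))
```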
While in this paper we have applied our method
to correct stereo reconstructions using lidar as high-
quality supervision, our approach operates strictly on
meshes and is therefore agnostic to the types of sensors
used to produce the low- and high-quality reconstruc-
tions, provided it is trained accordingly.