are manually annotated for the instruction generation task in the navigation work. For our task, all videos are re-annotated such that each query, which is randomly chosen from each video with a length varying from 5 to 120 seconds, is aligned to its original video. Specifically, we assign the interval [l_t, u_t] in the original video to represent the start and end indices of the query input; given the interval predicted by the proposed model, we can then estimate the alignment error for each pair.
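As a concrete illustration of the last step, the alignment error between a ground-truth interval [l_t, u_t] and a predicted one could be computed as follows. This is a minimal sketch under an assumption: the error is taken here as the mean absolute deviation of the two interval endpoints, since the exact metric is not specified in this excerpt.

```python
def alignment_error(gt, pred):
    """Mean absolute deviation between the ground-truth and predicted
    interval endpoints [l_t, u_t] (indices in frames or seconds).
    Assumption: averaging the two endpoint errors; the paper's exact
    error metric may differ."""
    (l_gt, u_gt), (l_pr, u_pr) = gt, pred
    return (abs(l_gt - l_pr) + abs(u_gt - u_pr)) / 2.0

# Hypothetical example: ground truth [30, 90], prediction [28, 95].
print(alignment_error((30, 90), (28, 95)))  # 3.5
```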
We can qualitatively observe the performance of our proposed approach on two video pairs from the navigation dataset: one captured indoors, as shown in Figure 2, and the other outdoors, as shown in Figure 3. In both cases, we can clearly observe well-defined correspondences between the two videos, which supports our view that the approach is a promising solution to the video alignment task.
5 CONCLUSIONS
In this work, we present a new technique to solve the temporal alignment task between two overlapping videos, with no restrictions imposed on the capturing process. The proposed technique uses the pretrained CNN "VGG-16" to obtain highly descriptive features of the video frames. It also exploits the bi-directional attention flow mechanism, which has already proved its efficiency in machine comprehension (MC), in order to capture the interactions between the two input videos in both directions. Initial results obtained on a training dataset of around 10k video pairs from "YouTube" show that this approach is highly effective in mapping the input query video to its corresponding part in the context video. We plan to test our model on state-of-the-art video alignment datasets in order to assess its accuracy thoroughly.
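The bi-directional attention idea summarized above can be sketched as follows. This is a toy NumPy illustration only, not the actual architecture: the dot-product similarity, the feature dimensions, and the aggregation are all simplifying assumptions standing in for the model's trainable components.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(Q, C):
    """Toy bi-directional attention between a query video's frame
    features Q (m x d) and a context video's features C (n x d).
    Dot-product similarity is an assumption; the real model uses a
    trainable similarity function."""
    S = Q @ C.T                       # (m, n) frame-to-frame similarity
    # Query-to-context: each query frame attends over context frames.
    q2c = softmax(S, axis=1) @ C      # (m, d)
    # Context-to-query: each context frame attends over query frames.
    c2q = softmax(S, axis=0).T @ Q    # (n, d)
    return q2c, c2q

# Hypothetical features: 4 query frames, 6 context frames, 8-dim each.
rng = np.random.default_rng(0)
q2c, c2q = bidirectional_attention(rng.normal(size=(4, 8)),
                                   rng.normal(size=(6, 8)))
print(q2c.shape, c2q.shape)  # (4, 8) (6, 8)
```

The two attended feature maps summarize, for every frame in one video, the most relevant content in the other, in both directions.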
ACKNOWLEDGEMENTS
This work has been supported by the Ministry of Higher Education (MoHE) of Egypt and Waseda University in Japan through a PhD scholarship.
VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications