REFERENCES
Alwassel, H., Caba Heilbron, F., Escorcia, V., and Ghanem,
B. (2018). Diagnosing error in temporal action detec-
tors. In ECCV.
Alwassel, H., Giancola, S., and Ghanem, B. (2020). TSP:
temporally-sensitive pretraining of video encoders for
localization tasks. CoRR, abs/2011.11479.
Arandjelovic, R. and Zisserman, A. (2017). Look, listen
and learn. In ICCV.
Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., and
Niebles, J. (2019). End-to-end, single-stream tempo-
ral action detection in untrimmed videos. In BMVC.
Caba Heilbron, F., Escorcia, V., Ghanem, B., and Car-
los Niebles, J. (2015). Activitynet: A large-scale
video benchmark for human activity understanding. In
CVPR.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset. In
CVPR.
Delphin-Poulat, L. and Plapous, C. (2019). Mean teacher
with data augmentation for dcase 2019 task 4. Tech-
nical report, Orange Labs Lannion, France.
Ebbers, J. and Haeb-Umbach, R. (2020). Convolutional
recurrent neural networks for weakly labeled semi-
supervised sound event detection in domestic environ-
ments. Technical report, DCASE2020 Challenge.
Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016). Con-
volutional two-stream network fusion for video action
recognition.
Gao, J., Chen, K., and Nevatia, R. (2018). CTAP: comple-
mentary temporal action proposal generation. CoRR,
abs/1807.04821.
Gao, J., Yang, Z., Sun, C., Chen, K., and Nevatia, R.
(2017). TURN TAP: temporal unit regression network
for temporal action proposals. CoRR, abs/1703.06189.
Girshick, R. (2015). Fast r-cnn. In ICCV, pages 1440–1448.
Hao, J., Hou, Z., and Peng, W. (2020). Cross-domain sound
event detection: from synthesized audio to real audio.
Technical report, DCASE2020 Challenge.
He, Y., Xu, X., Liu, X., Ou, W., and Lu, H. (2021). Multi-
modal transformer networks with latent interaction for
audio-visual event localization. In ICME.
Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F.,
Jansen, A., Moore, R. C., Plakal, M., Platt, D.,
Saurous, R. A., Seybold, B., et al. (2017). Cnn ar-
chitectures for large-scale audio classification. In
ICASSP.
Jiang, H., Li, Y., Song, S., and Liu, J. (2018). Rethink-
ing Fusion Baselines for Multi-modal Human Action
Recognition: 19th Pacific-Rim Conference on Mul-
timedia, Hefei, China, September 21-22, 2018, Pro-
ceedings, Part III.
Jiang, Y.-G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev,
I., Shah, M., and Sukthankar, R. (2014). THUMOS
challenge: Action recognition with a large number of
classes.
Kazakos, E., Nagrani, A., Zisserman, A., and Damen, D.
(2019). Epic-fusion: Audio-visual temporal binding
for egocentric action recognition. In ICCV.
Lee, J.-T., Jain, M., Park, H., and Yun, S. (2021). Cross-
attentional audio-visual fusion for weakly-supervised
action localization. In ICLR.
Li, X., Lin, T., Liu, X., Gan, C., Zuo, W., Li, C., Long, X.,
He, D., Li, F., and Wen, S. (2019). Deep concept-wise
temporal convolutional networks for action localiza-
tion.
Lin, C., Li, J., Wang, Y., Tai, Y., Luo, D., Cui, Z., Wang, C.,
Li, J., Huang, F., and Ji, R. (2020). Fast learning of
temporal action proposal via dense boundary genera-
tor. Proceedings of the AAAI Conference on Artificial
Intelligence.
Lin, C., Xu, C., Luo, D., Wang, Y., Tai, Y., Wang, C., Li, J.,
Huang, F., and Fu, Y. (2021). Learning salient bound-
ary feature for anchor-free temporal action localiza-
tion. In CVPR, pages 3320–3329.
Lin, L. and Wang, X. (2019). Guided learning convolution
system for dcase 2019 task 4. Technical report, Insti-
tute of Computing Technology, Chinese Academy of
Sciences, Beijing, China.
Lin, T., Liu, X., Li, X., Ding, E., and Wen, S. (2019). BMN:
boundary-matching network for temporal action pro-
posal generation. CoRR, abs/1907.09702.
Lin, T., Zhao, X., and Shou, Z. (2017). Single shot temporal
action detection. CoRR, abs/1710.06236.
Lin, T., Zhao, X., Su, H., Wang, C., and Yang, M. (2018).
BSN: boundary sensitive network for temporal action
proposal generation. CoRR, abs/1806.02964.
Liu, X., Hu, Y., Bai, S., Ding, F., Bai, X., and Torr, P.
H. S. (2021). Multi-shot temporal event localization:
A benchmark. In CVPR, pages 12596–12606.
Liu, Y., Ma, L., Zhang, Y., Liu, W., and Chang, S. (2018).
Multi-granularity generator for temporal action pro-
posal. CoRR, abs/1811.11524.
Long, X., Gan, C., De Melo, G., Liu, X., Li, Y., Li, F., and
Wen, S. (2018a). Multimodal keyless attention fusion
for video classification. In AAAI 2018.
Long, X., Gan, C., de Melo, G., Wu, J., Liu, X., and Wen,
S. (2018b). Attention clusters: Purely attention based
local feature integration for video classification. In
CVPR.
Miyazaki, K., Komatsu, T., Hayashi, T., Watanabe, S.,
Toda, T., and Takeda, K. (2020). Convolution-
augmented transformer for semi-supervised sound
event detection. Technical report, DCASE2020 Chal-
lenge.
Montes, A., Salvador, A., and Gir
´
o-i-Nieto, X. (2016).
Temporal activity detection in untrimmed videos with
recurrent neural networks. CoRR, abs/1608.08128.
Ono, N., Harada, N., Kawaguchi, Y., Mesaros, A., Imoto,
K., Koizumi, Y., , and Komatsu, T. (2020). Proceed-
ings of the Fifth Workshop on Detection and Classifi-
cation of Acoustic Scenes and Events (DCASE 2020).
Owens, A. and Efros, A. A. (2018). Audio-visual scene
analysis with self-supervised multisensory features. In
ECCV.
Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., and
Torralba, A. Learning sight from sound: Ambient
sound provides supervision for visual learning. Int.
J. Comput. Vis.
Hear Me out: Fusional Approaches for Audio Augmented Temporal Action Localization
153