
pervised player classification. In Proceedings of the
5th International ACM Workshop on Multimedia Con-
tent Analysis in Sports, MM ’22. ACM.
Deliège, A., Cioppa, A., Giancola, S., Seikavandi, M. J.,
Dueholm, J. V., Nasrollahi, K., Ghanem, B., Moes-
lund, T. B., and Van Droogenbroeck, M. (2021).
SoccerNet-v2: A dataset and benchmarks for holis-
tic understanding of broadcast soccer videos. In 2021
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 4503–
4514.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
SlowFast networks for video recognition. In 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 6201–6210.
FIFA, I. (2024). FIFA publishes Professional Football Report 2023.
https://inside.fifa.com/legal/news/fifa-publishes-professional-football-report-2023.
Accessed: 04/06/2024.
Gan, Y., Togo, R., Ogawa, T., and Haseyama, M. (2022).
Transformer based multimodal scene recognition in
soccer videos. In 2022 IEEE International Confer-
ence on Multimedia and Expo Workshops (ICMEW),
pages 1–6.
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen,
A., Lawrence, W., Moore, R. C., Plakal, M., and Rit-
ter, M. (2017). Audio Set: An ontology and human-
labeled dataset for audio events. In 2017 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 776–780.
Giancola, S. and Ghanem, B. (2021). Temporally-aware
feature pooling for action spotting in soccer broad-
casts. In 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW),
pages 4485–4494.
He, B., Yang, X., Wu, Z., Chen, H., Lim, S.-N., and Shri-
vastava, A. (2020). GTA: Global temporal atten-
tion for video action understanding. arXiv preprint
arXiv:2012.08510.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke,
J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D.,
Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J.,
and Wilson, K. (2017). CNN architectures for large-
scale audio classification. In 2017 IEEE International
Conference on Acoustics, Speech and Signal Process-
ing (ICASSP), pages 131–135.
Intelligence, M. (2024). FOOTBALL MARKET.
https://www.mordorintelligence.com/industry-reports/football-market.
Accessed: 04/06/2024.
Lei, J., Li, G., Zhang, J., Guo, Q., and Tu, D. (2016).
Continuous action segmentation and recognition using
hybrid convolutional neural network-hidden Markov
model model. IET Computer Vision, 10(6):537–544.
Neimark, D., Bar, O., Zohar, M., and Asselmann, D. (2021).
Video transformer network. In 2021 IEEE/CVF Inter-
national Conference on Computer Vision Workshops
(ICCVW), pages 3156–3165.
OpenAI (2024a). GPT-4o.
https://platform.openai.com/docs/models/gpt-4o.
OpenAI (2024b). text embedding large 3.
https://platform.openai.com/docs/models/embeddings.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey,
C., and Sutskever, I. (2023). Robust speech recogni-
tion via large-scale weak supervision. In International
Conference on Machine Learning, pages 28492–28518.
PMLR.
Shaikh, M. B., Chai, D., Islam, S. M. S., and Akhtar, N.
(2022). MAiVAR: Multimodal audio-image and video
action recognizer. In 2022 IEEE International Confer-
ence on Visual Communications and Image Process-
ing (VCIP), pages 1–5.
SYSTRAN (2024). Faster whisper.
https://github.com/SYSTRAN/faster-whisper. Accessed: 2024-06-10.
Teranishi, M., Tsutsui, K., Takeda, K., and Fujii, K. (2022).
Evaluation of creating scoring opportunities for team-
mates in soccer via trajectory prediction. In Interna-
tional Workshop on Machine Learning and Data Min-
ing for Sports Analytics, pages 53–73. Springer.
Tran, D., Wang, H., Feiszli, M., and Torresani, L. (2019).
Video classification with channel-separated convolu-
tional networks. In 2019 IEEE/CVF International
Conference on Computer Vision (ICCV), pages 5551–
5560.
Vanderplaetse, B. and Dupont, S. (2020). Improved soccer
action spotting using both audio and video streams.
In 2020 IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW), pages
3921–3931.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is all you need. In Proceedings of
the 31st International Conference on Neural Information
Processing Systems, NIPS’17, pages 6000–6010,
Red Hook, NY, USA. Curran Associates Inc.
Xarles, A., Escalera, S., Moeslund, T. B., and Clapés, A.
(2023). ASTRA: An action spotting transformer for
soccer videos. In Proceedings of the 6th International
Workshop on Multimedia Content Analysis in Sports,
pages 93–102.
Yang, C., Xu, Y., Shi, J., Dai, B., and Zhou, B. (2020).
Temporal pyramid network for action recognition. In
2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 588–597.
Zhang, S. and Feng, Y. (2023). Hidden Markov transformer
for simultaneous machine translation. ArXiv, abs/2303.00257.
https://api.semanticscholar.org/CorpusID:257255341.
Zhou, X., Kang, L., Cheng, Z., He, B., and Xin, J. (2021).
Feature combination meets attention: Baidu soccer
embeddings and transformer based temporal detec-
tion. CoRR, abs/2106.14447.
ASPERA: Exploring Multimodal Action Recognition in Football Through Video, Audio, and Commentary