
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Duma, I., Burnete, N., and Todorut, A. (2022). A review of road traffic accidents reconstruction methods and their limitations with respect to the national legal frameworks. IOP Conference Series: Materials Science and Engineering, 1220(1):012055.
Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., and Misra, I. (2023). Imagebind: One embedding space to bind them all.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009.
Huang, J. and Chang, K. C.-C. (2023). Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.
Li, W., Hacid, H., Almazrouei, E., and Debbah, M. (2023). A comprehensive review and a taxonomy of edge machine learning: Requirements, paradigms, and techniques. AI, 4(3):729–786.
Liang, P. P., Zadeh, A., and Morency, L.-P. (2023). Foundations and trends in multimodal machine learning: Principles, challenges, and open questions.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.
Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., Shi, S., and Tu, Z. (2023). Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration.
Man, Y., Gui, L.-Y., and Wang, Y.-X. (2023). Bev-guided multi-modality fusion for driving perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21960–21969.
Mohammed, A. A., Ambak, K., Mosa, A. M., and Syamsunur, D. (2019). A Review of the Traffic Accidents and Related Practices Worldwide. The Open Transportation Journal, 13(1):65–83.
Moon, S., Madotto, A., Lin, Z., Dirafzoon, A., Saraf, A., Bearman, A., and Damavandi, B. (2022). Imu2clip: Multimodal contrastive learning for imu motion sensors from egocentric videos and text.
Moon, S., Madotto, A., Lin, Z., Nagarajan, T., Smith, M., Jain, S., Yeh, C.-F., Murugesan, P., Heidari, P., Liu, Y., Srinet, K., Damavandi, B., and Kumar, A. (2023). Anymal: An efficient and scalable any-modality augmented language model.
Najafi Moghaddam Gilani, V., Hosseinian, S. M., Ghasedi, M., and Nikookar, M. (2021). Data-Driven Urban Traffic Accident Analysis and Prediction Using Logit and Machine Learning-Based Pattern Recognition Models. Mathematical Problems in Engineering, 2021:9974219.
OpenAI (2023). Gpt-4 technical report.
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. (2023). Dinov2: Learning robust visual features without supervision.
Radenovic, F., Dubey, A., Kadian, A., Mihaylov, T., Vandenhende, S., Patel, Y., Wen, Y., Ramanathan, V., and Mahajan, D. (2023). Filtering, distillation, and hard negatives for vision-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6967–6977.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
Stechly, K., Marquez, M., and Kambhampati, S. (2023). Gpt-4 doesn’t know it’s wrong: An analysis of iterative prompting for reasoning problems.
Touvron, H., Martin, L., et al. (2023). Llama 2: Open foundation and fine-tuned chat models.
Vepakomma, P., Gupta, O., Swedish, T., and Raskar, R. (2018). Split learning for health: Distributed deep learning without sharing raw patient data.
Wu, P., Wang, Z., Zheng, B., Li, H., Alsaadi, F. E., and Zeng, N. (2023a). Aggn: Attention-based glioma grading network with multi-scale feature extraction and multi-modal information fusion. Computers in Biology and Medicine, 152:106457.
Wu, S., Fei, H., Qu, L., Ji, W., and Chua, T.-S. (2023b). Next-gpt: Any-to-any multimodal llm.
Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., and Dubnov, S. (2023c). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
Xu, Z., Peng, S., Lin, H., He, G., Sun, J., Shen, Y., Bao, H., and Zhou, X. (2023). 4k4d: Real-time 4d view synthesis at 4k resolution.
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., and Yue, X. (2023). Meta-transformer: A unified framework for multimodal learning.
Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. (2023). Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.