
rithm. Pattern Analysis and Applications, 23(2):611–
623.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y.,
Sun, J., and Wang, H. (2023). Retrieval-Augmented
Generation for Large Language Models: A Survey.
arXiv preprint arXiv:2312.10997.
Hassner, T., Itcher, Y., and Kliper-Gross, O. (2012). Violent
Flows: Real-Time Detection of Violent Crowd Behav-
ior. In Proc. IEEE CVPR Workshops, pages 1–6.
Herrenkohl, T. I. (2011). Violence in Context: Current Evi-
dence on Risk, Protection, and Prevention. OUP USA.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
Li, J., Li, D., Savarese, S., and Hoi, S. (2023a). BLIP-
2: Bootstrapping Language-Image Pre-Training with
Frozen Image Encoders and Large Language Models.
In Proc. ICML, pages 19730–19742.
Li, Y., Wang, C., and Jia, J. (2023b). LLaMA-VID: An
Image is Worth 2 Tokens in Large Language Models.
arXiv preprint arXiv:2311.17043.
Lin, B., Zhu, B., Ye, Y., Ning, M., Jin, P., and Yuan, L.
(2023). Video-LLaVA: Learning United Visual Rep-
resentation by Alignment Before Projection. arXiv
preprint arXiv:2311.10122.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilac-
qua, M., Petroni, F., and Liang, P. (2024). Lost in
the Middle: How Language Models Use Long Con-
texts. Transactions of the Association for Computa-
tional Linguistics, 12:157–173.
Mumtaz, N., Ejaz, N., Habib, S., Mohsin, S. M., Tiwari, P.,
Band, S. S., and Kumar, N. (2023). An Overview of
Violence Detection Techniques: Current Challenges
and Future Directions. Artificial Intelligence Review,
56(5):4641–4666.
Panagopoulou, A., Xue, L., Yu, N., Li, J., Li, D., Joty, S.,
Xu, R., Savarese, S., Xiong, C., and Niebles, J. C.
(2023). X-InstructBLIP: A Framework for Aligning
X-Modal Instruction-Aware Representations to LLMs
and Emergent Cross-Modal Reasoning. arXiv preprint
arXiv:2311.18799.
Park, J.-H., Mahmoud, M., and Kang, H.-S. (2024).
Conv3D-Based Video Violence Detection Network
Using Optical Flow and RGB Data. Sensors,
24(2):317.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020).
Exploring the Limits of Transfer Learning with a Uni-
fied Text-to-Text Transformer. Journal of Machine
Learning Research, 21(140):1–67.
Senst, T., Eiselein, V., Kuhn, A., and Sikora, T.
(2017). Crowd Violence Detection Using Global
Motion-Compensated Lagrangian Features and Scale-
Sensitive Video-Level Representation. IEEE Trans-
actions on Information Forensics and Security,
12(12):2945–2956.
Sudhakaran, S. and Lanz, O. (2017). Learning to Detect
Violent Videos Using Convolutional Long Short-Term
Memory. In Proc. IEEE AVSS, pages 1–6.
Sun, Z., Ke, Q., Rahmani, H., Bennamoun, M., Wang, G.,
and Liu, J. (2022). Human Action Recognition from
Various Data Modalities: A Review. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
45(3):3200–3225.
Szeliski, R. (2022). Computer Vision: Algorithms and Ap-
plications. Springer Nature.
Traoré, A. and Akhloufi, M. A. (2020). Violence Detection in Videos Using Deep Recurrent and Convolutional Neural Networks. In Proc. IEEE SMC, pages 154–159.
Ullah, F. U. M., Obaidat, M. S., Ullah, A., Muhammad, K., Hijji, M., and Baik, S. W. (2023). A Comprehensive Review on Vision-Based Violence Detection in Surveillance Videos. ACM Computing Surveys, 55(10).
Ullah, F. U. M., Ullah, A., Muhammad, K., Haq, I. U., and
Baik, S. W. (2019). Violence Detection Using Spa-
tiotemporal Features with 3D Convolutional Neural
Network. Sensors, 19(11):2472.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2013). Dense Trajectories and Motion Boundary Descriptors for Action Recognition. International Journal of Computer Vision, 103:60–79.
Wu, P., Liu, J., Shi, Y., Sun, Y., Shao, F., Wu, Z., and Yang,
Z. (2020). Not Only Look, But Also Listen: Learning
Multimodal Violence Detection Under Weak Supervi-
sion. In Proc. ECCV, pages 322–339. Springer.
Zhang, H., Li, X., and Bing, L. (2023). Video-
LLaMA: An Instruction-Tuned Audio-Visual Lan-
guage Model for Video Understanding. arXiv preprint
arXiv:2306.02858.