Leveraging Vision Language Models for Understanding and Detecting Violence in Videos

Jose Alejandro Avellaneda Gonzalez, Tetsu Matsukawa, Einoshin Suzuki

2025

Abstract

Detecting violent behaviors in video content is crucial for public safety and security. Accurate identification of such behaviors can prevent harm and enhance surveillance. Traditional methods rely on manual feature extraction and classical machine learning algorithms, which lack robustness and adaptability in diverse real-world scenarios. These methods struggle with environmental variability and often fail to generalize across contexts. Due to the sensitive nature of violent content, ethical and legal challenges in dataset collection result in a scarcity of data. This limitation impacts modern deep learning approaches, which, despite their effectiveness, often produce models that struggle to generalize well across diverse contexts. To address these challenges, we propose VIVID: Vision-Language Integration for Violence Identification and Detection. VIVID leverages Vision Language Models (VLMs) and a database of violence definitions to mitigate biases in Large Language Models (LLMs) and operates effectively with limited video data. VIVID operates in two steps: key-frame selection based on optical flow to capture high-motion frames, and violence detection using VLMs to translate visual representations into tokens, enabling LLMs to comprehend video content. By incorporating an external database with definitions of violence, VIVID ensures accurate and contextually relevant understanding, addressing inherent biases in LLMs. Experimental results on five datasets (Movies, Surveillance Fight, RWF-2000, Hockey, and XD-Violence) demonstrate that VIVID outperforms LLM-based methods and achieves competitive performance compared with deep learning-based methods, with the added benefit of providing explanations for its detections.
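
The key-frame selection step is described only at a high level in the abstract. Below is a minimal, illustrative sketch of how optical-flow-based selection of high-motion frames could look, assuming OpenCV's Farneback dense optical flow; the function name select_key_frames, the num_frames parameter, and the scoring by mean flow magnitude are assumptions made for illustration, not details taken from the paper.

import cv2
import numpy as np

def select_key_frames(video_path: str, num_frames: int = 8) -> list:
    """Return the frames with the largest mean optical-flow magnitude."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    scored = []  # (mean flow magnitude, frame)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames (Farneback method).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Average per-pixel motion magnitude as a simple motion score.
        magnitude = np.linalg.norm(flow, axis=2).mean()
        scored.append((magnitude, frame))
        prev_gray = gray
    cap.release()

    # Keep the num_frames frames with the highest motion.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [frame for _, frame in scored[:num_frames]]

In VIVID, the selected frames would then be passed to the VLM, which translates the visual content into tokens that the LLM interprets together with the retrieved definitions of violence; that stage is not sketched here because it depends on the specific VLM interface used.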



Paper Citation


in Harvard Style

Gonzalez J., Matsukawa T. and Suzuki E. (2025). Leveraging Vision Language Models for Understanding and Detecting Violence in Videos. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-728-3, SciTePress, pages 99-113. DOI: 10.5220/0013160000003912


in Bibtex Style

@conference{visapp25,
author={Jose Gonzalez and Tetsu Matsukawa and Einoshin Suzuki},
title={Leveraging Vision Language Models for Understanding and Detecting Violence in Videos},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2025},
pages={99-113},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013160000003912},
isbn={978-989-758-728-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - Leveraging Vision Language Models for Understanding and Detecting Violence in Videos
SN - 978-989-758-728-3
AU - Gonzalez J.
AU - Matsukawa T.
AU - Suzuki E.
PY - 2025
SP - 99
EP - 113
DO - 10.5220/0013160000003912
PB - SciTePress
ER -