Unsupervised Domain Adaptation for Video Violence Detection in the Wild

Luca Ciampi¹, Carlos Santiago², Joao Paulo Costeira², Fabrizio Falchi¹, Claudio Gennaro¹ and Giuseppe Amato¹

¹Institute of Information Science and Technologies, National Research Council, Pisa, Italy
²Instituto Superior Técnico (LARSyS/IST), Lisbon, Portugal
Keywords:
Video Violence Detection, Video Violence Classification, Action Recognition, Unsupervised Domain
Adaptation, Deep Learning, Deep Learning for Visual Understanding, Video Surveillance.
Abstract:
Video violence detection is a subset of human action recognition aiming to detect violent behaviors in trimmed
video clips. Current Computer Vision solutions based on Deep Learning approaches provide astonishing re-
sults. However, their success relies on large collections of labeled datasets for supervised learning to guarantee
that they generalize well to diverse testing scenarios. Although plentiful annotated data may be available for
some pre-specified domains, manual annotation is infeasible for every ad-hoc target domain or task. As a
result, in many real-world applications, there is a domain shift between the distributions of the training (source)
and test (target) domains, causing a significant drop in performance at inference time. To tackle this problem,
we propose an Unsupervised Domain Adaptation scheme for video violence detection based on single image
classification that mitigates the domain gap between the two domains. We conduct experiments considering
as the labeled source domain some datasets containing violent/non-violent clips in general contexts and, as the
target domain, a collection of videos specifically recorded for detecting violent actions on public transport, showing that our
proposed solution can improve the performance of the considered models.
1 INTRODUCTION
In recent years, in the Computer Vision field,
there has been an increasing interest in develop-
ing applications and services that make life eas-
ier for citizens. Thanks to the significant growth
of Deep Learning (DL) and the ubiquity of video
surveillance cameras in modern cities, smart ap-
plications ranging from pedestrian detection (Am-
ato et al., 2019b) (Ciampi et al., 2020a) (Cafarelli
et al., 2022) to people tracking (Spremolla et al.,
2016) (Staniszewski et al., 2020), crowd count-
ing (Benedetto et al., 2022) (Avvenuti et al., 2022),
parking lot management (Ciampi et al., 2022b) (Am-
ato et al., 2019a) (Amato et al., 2018) (Ciampi
et al., 2018) and even facial reconstruction (Pęszor
et al., 2016) have been proposed and are nowadays
widely employed worldwide, helping to manage pub-
lic spaces and preventing many criminal activities by
exploiting AI systems that automatically analyze this
deluge of visual data. However, the success of these
supervised DL-based approaches hinges on two as-
sumptions: (i) the existence of large collections of
labeled data required for accurate model fitting dur-
ing the training phase, and (ii) training (or source)
and test (or target) datasets are independent and iden-
tically distributed (i.i.d.) (Huo et al., 2022). Al-
though plentiful annotated data may be available for a
few pre-specified domains, such as ImageNet (Deng
et al., 2009) for image classification or COCO (Lin
et al., 2014) for object detection, manual annotations
are often prohibitive to obtain for every ad-hoc tar-
get domain or task. As a result, models trained by
leveraging already existing labeled data are applied
to target domains never seen during the training and
consequently suffer from shifts in data distributions,
i.e., Domain Shifts between source and target do-
mains (Torralba and Efros, 2011).
One possible solution to tackle this issue is Unsupervised Domain Adaptation (UDA).
Specifically, it aims at mitigating domain shifts be-
tween different domains, relying on labeled data in
the source domain and unlabelled data in the target
domain. In other words, UDA techniques exploit an-
notated data from the source domain as well as non-
annotated data coming from the target domain that is
easy to gather since it does not require human effort
for labeling. The challenge here is to automatically
infer some knowledge from this latter data flow to re-
duce the gap between the two domains and, specifi-
cally, to learn feature representations that should be (i)
discriminative for the main learning task on the source
domain and (ii) invariant with respect to the shift between the domains.
In this work, we focus on the specific task of violence detection in trimmed videos, i.e., clips that capture a single action (either violent or non-violent). This task is therefore a subset of human action recognition. Specifically, the goal is to classify clips in a binary fashion to predict whether they contain any behaviors considered to be violent, differing from violence detection in untrimmed videos, a subset of action localization where the action must also be located along the temporal dimension. Despite its importance in many practical, real-world scenarios, this task is relatively unexplored compared to other action recognition tasks. Although some annotated datasets for video violence detection in general contexts already exist, they are limited in size and in the variety of scenarios they cover. Therefore, existing Deep Learning-based solutions trained on these data systematically experience performance degradation when applied to new specific contexts, such as violence detection in public transport environments (Ciampi et al., 2022a).
To mitigate this problem, in this paper, we pro-
pose an end-to-end DL-based UDA solution to detect
violent situations in videos in specific target scenarios
where annotated data is scarce or lacking. Our pro-
posal relies on the classification of single images randomly sampled from the frames making up the video, a simple technique already explored by (Akti et al., 2022). Starting from this, some UDA techniques for image classification are employed during the training pipeline, automatically gathering knowledge from the unlabeled data belonging to the target domain. To the best of our knowledge, this is the first attempt at using a UDA scheme for video violence detection. We conducted experiments by exploiting, as
the source domain, several annotated datasets present
in the literature dealing with video violence detection
in general contexts and, as the target domain, the re-
cently introduced Bus Violence benchmark (Ciampi
et al., 2022a), a collection of clips specifically recorded for the detection of violent behaviors inside a moving bus. Exper-
imental results show that by using our UDA pipeline,
we can improve the performance of the considered
models by a significant margin, thus suggesting that
they generalize better over this new scenario without
the need to use new labels.
Summarizing, the contributions of this work can be listed as follows:
• we introduce a UDA scheme for video violence detection based on single-image classification, which can mitigate the domain gap between a labeled source dataset and an unlabeled target one; to the best of our knowledge, this is the first time that UDA has been applied to video violence detection;
• we conduct an experimental evaluation considering as the source domain some annotated datasets containing violent/non-violent clips in general contexts and, as the target domain, a recently introduced collection of videos specifically recorded for the detection of violent behaviors on public transport;
• preliminary results show that our proposed UDA scheme can improve the performance of the considered models, which generalize better to new scenarios for which labels are absent.
The rest of the paper is structured as follows. Sec-
tion 2 reviews some works related to ours. Section 3
describes the proposed methodology. Section 4 presents the experimental evaluation. Finally, Section 5 concludes the paper, providing some insights into future directions.
2 RELATED WORKS
In the literature, there are several methods and
datasets specific to video violence detection. Most
deal with trimmed clips, i.e., clips capturing a single action (either violent or non-violent). Therefore, this task falls within action recognition, aiming at classifying videos in a binary fashion to predict whether they contain violent human behaviors. On the other hand, a few works also deal with untrimmed videos. In this case, the task is no longer a subset of action recognition but is treated as action localization, i.e., the starting and ending time points of the actions must also be identified. This distinction is also reflected in the datasets required for the learning phase: in the former case, they are annotated at the video level, while in the latter, frame-level annotations are necessary. In this paper, we consider video violence detection in trimmed videos. Hereafter, we describe some of the most popular techniques and
collections of trimmed clips in the literature, conclud-
ing the section by reviewing some existing UDA ap-
proaches.
2.1 Video Violence Detection Methods
In (Sudhakaran and Lanz, 2017), the authors intro-
duced a Deep Learning-based model consisting of
a series of convolutional layers for spatial feature extraction, followed by a Convolutional Long Short-Term Memory (ConvLSTM) (Shi et al., 2015) network for encoding frame-level changes. On the other hand, a
variant of this architecture is presented in (Hanson
et al., 2019), where a spatio-temporal encoder built
on a standard convolutional backbone for feature extraction is combined with a Bidirectional Convolutional LSTM (BiConvLSTM) for extracting the long-term movement information present in the clips. In contrast, the authors in (Akti et al., 2022)
proposed classifying videos using single frames ran-
domly sampled from the clips. Alternatively, it is also
possible to exploit methods designed for human action recognition: in this case, fine-tuning is needed to recognize only two classes, violence and non-violence. For instance, the ResNet 3D network (Tran et al., 2018) considers actions as spatiotemporal objects and handles both spatial and temporal dimensions using 3D convolutional layers (Tran et al., 2015); on the other hand, the ResNet (2+1)D architecture (Tran et al., 2018) decomposes the convolutions into separate 2D spatial and 1D temporal filters (Feichtenhofer et al., 2016). Another popular model is SlowFast (Feichtenhofer et al., 2019), a two-pathway architecture in which the first pathway captures semantic information from images or a few sparse frames, operating at a low frame rate, while the other captures rapidly changing motion by working at a high frame rate. Finally, architectures relying on Transformer attention modules have recently been introduced, such as the Video Swin Transformer (Liu et al., 2022), which extends the shifted-window Transformers proposed for image processing (Liu et al., 2021) to the temporal axis, obtaining an excellent efficiency-effectiveness trade-off.
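As an illustration of this fine-tuning route, the sketch below adapts a pretrained action-recognition backbone to the two classes. It assumes a PyTorch/torchvision environment; the choice of backbone, pretrained weights, and input size is ours and is not tied to the experimental setup of this paper.

```python
# Minimal sketch (our illustration): fine-tuning a pretrained action-recognition
# backbone to the two classes violence / non-violence.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

# Load a ResNet (2+1)D backbone; the pretrained weights used here are an
# illustrative choice (the paper starts from ImageNet-pretrained models, see Sec. 4.2).
model = r2plus1d_18(weights=R2Plus1D_18_Weights.DEFAULT)

# Replace the original classification head with a binary one.
model.fc = nn.Linear(model.fc.in_features, 2)

# A clip is a tensor of shape (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)
logits = model(clip)                      # shape: (1, 2)
probs = torch.softmax(logits, dim=1)      # violence / non-violence probabilities
```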
2.2 Video Violence Detection Datasets
In recent years, some benchmarks of trimmed clips suitable for video violence detection have been introduced. In (Padamwar, 2020), the authors presented two video benchmarks for violence detection: the Hockey Fight and the Movies Fight datasets. The former consists of 1,000 fight and non-fight clips from ice hockey games, while the latter comprises 200 clips extracted from short movies. More re-
non-fight clips from the ice hockey game. More re-
cently, another dataset, named Surveillance Camera
Fight, has been presented in (Akti et al., 2019). It con-
sists of 300 videos in total, 150 of which describe fight
sequences and 150 depict non-fight scenes, recorded
from several surveillance cameras located in public
spaces. Moreover, the RWF-2000 (Cheng et al., 2021)
and the Real-Life Violence Situations (Soliman et al.,
2019) datasets were gathered from public surveillance
cameras. In both collections, the authors collected 2,000 video clips: half of them include violent behaviors, while the other half depict non-violent activities.
Finally, in the Bus Violence benchmark (Ciampi et al.,
2022a), the authors gathered and made publicly avail-
able 1,400 videos of violent/non-violent actions sim-
ulated by several actors in a moving bus.
2.3 Unsupervised Domain Adaptation
Traditional UDA approaches have been developed
to address the problem of image classification, and
they try to align features across the two domains.
Some notable examples are (Ganin et al., 2016) (Jin
et al., 2020) (Tzeng et al., 2017). However, their
usage in other applications is not straightforward,
as pointed out by (Zhang et al., 2017), and the literature offers only a limited number of UDA approaches suitable for other tasks. More recent ad-
vances involve semantic segmentation (Hong et al.,
2018) (Chen et al., 2019) and visual counting (Ciampi
et al., 2020b) (Ciampi et al., 2021). In this work, we
propose a UDA scheme for video violence detection. To the best of our knowledge, this is the first attempt to exploit UDA for this task.
3 METHOD
3.1 Background
Following the notation introduced in (Pan and Yang, 2010) (Csurka, 2017), we define a domain $\mathcal{D}$ consisting of two components: a $d$-dimensional feature space $\mathcal{X} \subseteq \mathbb{R}^d$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$. Given a specific domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, we formulate a task $\mathcal{T}$ defined by a label space $\mathcal{Y}$ and the conditional probability distribution $P(Y|X)$, where $Y = \{y_1, \ldots, y_n\} \in \mathcal{Y}$ is the set of the corresponding labels for $X$. In general, $P(Y|X)$ can be learned in a supervised manner from these feature-label pairs $(x_i, y_i)$.

When considering Unsupervised Domain Adaptation (UDA), there is (i) a source domain $\mathcal{D}_S = \{\mathcal{X}_S, P(X_S)\}$ with $\mathcal{T}_S = \{\mathcal{Y}_S, P(Y_S|X_S)\}$ and (ii) a target domain $\mathcal{D}_T = \{\mathcal{X}_T, P(X_T)\}$ with $\mathcal{T}_T = \{\mathcal{Y}_T, P(Y_T|X_T)\}$, where $Y_T$ is unknown, i.e., we do not have any labels. Due to the difference between the two domains, the distributions are assumed to be different, i.e., $P(X_S) \neq P(X_T)$ and $P(Y_S|X_S) \neq P(Y_T|X_T)$. UDA aims to learn a model with a lower generalization error in the target domain by mitigating the domain discrepancy.
3.2 UDA for Video Violence Detection
In this work, the source domain $\mathcal{D}_S$ consists of a labeled set of videos with $\mathcal{Y}_S = \{0, 1\}$, where 0 and 1 indicate the absence/presence of violent actions occurring in the clips, respectively. Specifically, we considered some general violence detection datasets present in the literature, which collect very heterogeneous, everyday-life violent and non-violent actions. On the other hand, the target domain $\mathcal{D}_T$ consists of a
different set of videos for which we do not have anno-
tations. In this case, clips include violent/non-violent
actions performed in a more specific and different sce-
nario compared to the ones characterizing the source
domain. The goal is to infer some knowledge from
the unlabeled target domain during the training phase,
mitigating the domain discrepancy with the source domain so that the model can better generalize to the new specific scenario for which annotations are absent.
Our method relies on Deep Learning-based mod-
els trained end-to-end together with some UDA tech-
niques attached to them. The peculiarity of our UDA
scheme is that it is based on image classification.
Specifically, we cast the task of video classification to
image classification since the scenes including violent
actions can be discriminated from non-violent scenes
just by classifying an image randomly sampled from
the entire video clip (Akti et al., 2022). Starting from
this baseline, we plug into the training pipeline two different UDA techniques designed for image classification, which we feed with images sampled from the target domain and which are responsible for the cross-domain knowledge transfer.
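As an illustration of the frame-sampling step just described, the following minimal sketch (assuming OpenCV is available; the function name and error handling are ours) extracts one random frame from a clip:

```python
# Minimal sketch (our illustration): sampling a single random frame from a clip,
# which is then classified as violent / non-violent instead of the whole video.
import random
import cv2  # OpenCV

def sample_random_frame(video_path: str):
    """Return one randomly sampled RGB frame from the given video file."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idx = random.randrange(max(n_frames, 1))
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)    # seek to the sampled frame index
    ok, frame_bgr = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read frame {idx} from {video_path}")
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
```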
In more detail, we considered some Convolutional Neural Networks (CNNs) as backbones for feature extraction, cutting off the final classification layers. We replaced the last classification head with a binary classification layer, outputting the probability that the given video contains (or does not contain) violent actions, and we added an additional linear layer followed by a ReLU to map the feature maps coming from the feature extractor to a fixed dimension. This fixed-dimensional feature representation is then fed to a UDA module.
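The following is a minimal sketch of this architecture, assuming PyTorch/torchvision; the class name, the fixed feature dimension of 256, and the exact wiring of the projection with respect to the classifier are our illustrative assumptions, not details specified in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ViolenceClassifier(nn.Module):
    """CNN backbone with a binary head and a projection layer feeding a UDA module (sketch)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        backbone = resnet50(weights=None)        # ImageNet weights could be loaded instead
        num_feats = backbone.fc.in_features      # 2048 for ResNet50
        backbone.fc = nn.Identity()              # cut off the original classification head
        self.backbone = backbone
        self.classifier = nn.Linear(num_feats, 2)          # violent / non-violent logits
        self.projection = nn.Sequential(                   # extra linear layer + ReLU
            nn.Linear(num_feats, feat_dim), nn.ReLU())     # fixed-dimensional features

    def forward(self, x):
        feats = self.backbone(x)                 # image features from the backbone
        logits = self.classifier(feats)          # binary classification output
        uda_feats = self.projection(feats)       # representation handed to the UDA module
        return logits, uda_feats
```

During training, the logits feed the supervised classification loss on labeled source images, while the projected features are passed to the UDA module described next.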
We considered two different UDA strategies. The first one is the Domain-Adversarial Neural Network (DANN) (Ganin et al., 2016), where a domain classifier competes against the label classifier in an adversarial way. Here, UDA is achieved by connecting the domain classifier to the feature extractor via a gradient reversal layer that produces an adversarial loss by multiplying the gradient by a certain negative constant during the backpropagation-based training. Apart from this, the training proceeds in a standard way by minimizing the label prediction loss (for source examples) and the domain classification loss (for all samples). The adversarial loss ensures that the feature distributions over the two domains are made similar (as indistinguishable as possible for the domain classifier), thus resulting in domain-invariant features. We refer the reader to (Ganin et al., 2016) for further details. The second one is the Minimum Class Confusion (MCC) loss (Jin et al., 2020), which can be considered a UDA approach that does not explicitly perform domain alignment. It is based on class confusion, i.e., the tendency of a classifier to confuse the predictions between the correct and the ambiguous classes. Specifically, given the feature extractor, MCC is defined on the class predictions produced by the classifier on the target data. Since less class confusion implies more transferability, MCC is minimized during training using standard backpropagation to obtain more transferable features. We refer the reader to (Jin et al., 2020) for further details.
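To make the adversarial mechanism concrete, here is a minimal sketch of a gradient reversal layer, assuming PyTorch; the names GradReverse/grad_reverse and the default value of the scaling constant lambd are ours.

```python
# Minimal sketch (our illustration, simplified from (Ganin et al., 2016)): a gradient
# reversal layer. In the forward pass it is the identity; in the backward pass it
# multiplies the gradient by a negative constant, so the feature extractor is pushed
# to fool the domain classifier while the domain classifier is trained normally.
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None     # reversed (scaled) gradient for x

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage: domain_logits = domain_classifier(grad_reverse(uda_feats, lambd)),
# where uda_feats come from the projection layer of the sketch above.
```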
4 PERFORMANCE ANALYSIS
4.1 Evaluation Metrics
Following previous works regarding video violence detection, we used Accuracy to measure the performance of the considered methods, defined as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad (1)$$

where TP, TN, FP, and FN are the True Positives, True Negatives, False Positives, and False Negatives, respectively. To have a more in-depth comparison between the obtained results, we also considered as metrics the F1-score, the False Alarm, and the Missing Alarm, defined as follows:

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad (2)$$

$$\mathrm{FalseAlarm} = \frac{FP}{TN + FP}, \quad (3)$$

$$\mathrm{MissingAlarm} = \frac{FN}{TP + FN}, \quad (4)$$

where Precision and Recall are defined as $\frac{TP}{TP + FP}$ and $\frac{TP}{TP + FN}$, respectively. Finally, to account also for the probabilities of the detections, we employed the Area Under the Receiver Operating Characteristics curve (ROC AUC), computed as the area under the curve obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings, where $TPR = \frac{TP}{TP + FN}$ and $FPR = \frac{FP}{TN + FP}$.
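For reference, these metrics can be computed as in the following sketch, assuming NumPy and scikit-learn are available (the function name is ours):

```python
# Minimal sketch (our illustration): computing the metrics of Eqs. (1)-(4) and the
# ROC AUC from binary predictions and predicted violence probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred are 0/1 arrays (1 = violent); y_prob are predicted violence probabilities."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),          # Eq. (1)
        "f1": 2 * precision * recall / (precision + recall),  # Eq. (2)
        "false_alarm": fp / (tn + fp),                        # Eq. (3)
        "missing_alarm": fn / (tp + fn),                      # Eq. (4)
        "roc_auc": roc_auc_score(y_true, y_prob),
    }
```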
4.2 Experimental Setting
We exploited three datasets present in the literature as the source domain: Surveillance Camera Fight (Akti et al., 2019), Real-Life Violence Situations (Soliman et al., 2019), and RWF-2000 (Cheng et al., 2021), already mentioned in Section 2. These
videos have been gathered from fixed security cam-
eras and include trimmed heterogeneous violent and
non-violent scenes, thus containing very general vi-
olent situations. On the other hand, we considered
the recently introduced Bus Violence dataset (Ciampi
et al., 2022a) as the target domain. In this case,
trimmed clips are recorded inside a moving bus where
some actors simulated violent/non-violent actions.
This latter scenario is, therefore, more specific as it
involves violent situations in public transport, and it
represents the perfect testing ground for evaluating
the generalization capabilities of Deep Learning mod-
els trained with more generic labeled data. We depict
the considered scenario in Figure 1.
As the backbone for feature extraction, we con-
sidered two popular CNNs: ResNet50 (He et al., 2016) and VGG16 (Simonyan and Zisserman, 2015).
As already mentioned, we replaced their final clas-
sification head with a binary classification layer and
exploited them as baselines, i.e., without any UDA
modules, as well as the feature extractors and classifiers for our proposed UDA schemes. Furthermore,
to compare the obtained results with the literature,
we also considered other existing approaches tailored
for video violence detection and video action recog-
nition. Specifically, we exploited the architectures in-
troduced in (Sudhakaran and Lanz, 2017) and (Han-
son et al., 2019) that employ ConvLSTM and BiCon-
vLSTM as spatio-temporal encoders, together with
some popular video action classifiers: the ResNet (2+1)D network (Tran et al., 2018), the SlowFast architecture (Feichtenhofer et al., 2019), and the Video Swin Transformer (Liu et al., 2022). We refer the reader
to Section 2 and the related papers for further details
about the employed models. For a fair comparison,
we used the ImageNet pre-trained versions of all these models as the starting point, without using any extra data. Furthermore, we always ap-
plied the same data augmentation strategy during the
learning phase: horizontal flipping with a probability
of 0.5 and image resizing to 256 × 256 pixels.
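As a concrete reference, this augmentation could be expressed with torchvision transforms as in the following sketch (our illustration; the inclusion of ToTensor and the transform order are our assumptions):

```python
# Minimal sketch (our illustration) of the data augmentation described above:
# horizontal flipping with probability 0.5 and resizing to 256 x 256 pixels.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),            # resize frames to 256 x 256 pixels
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal flip with probability 0.5
    transforms.ToTensor(),                    # conversion to tensor is our addition
])
```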
4.3 Results and Discussion
We employed the following evaluation protocol to
have reliable statistics on the final metrics. For each
of the three considered source (training) domains, i.e.,
Surveillance Camera Fight, Real-Life Violence Situa-
tions, and RWF-2000, we randomly varied the train-
ing and validation subsets three times, selecting the best model in terms of accuracy and testing it on the target (test) domain, i.e., the Bus Violence bench-
mark. Finally, we reported the mean and the standard
deviation of these three runs.
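This protocol can be summarized by the following sketch, assuming NumPy; train_and_validate and evaluate_on_target are hypothetical callables standing in for the actual training and evaluation code:

```python
# Minimal sketch (our illustration) of the evaluation protocol: repeat training over
# three random train/validation splits of a source dataset, keep the model with the
# best validation accuracy, evaluate it on the target set, and report mean and std.
import numpy as np

def run_protocol(train_and_validate, evaluate_on_target, n_runs: int = 3):
    """train_and_validate(seed) -> best model for that split; evaluate_on_target(model) -> accuracy."""
    target_scores = []
    for seed in range(n_runs):
        model = train_and_validate(seed)              # best model on a random train/val split
        target_scores.append(evaluate_on_target(model))
    return float(np.mean(target_scores)), float(np.std(target_scores))
```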
Results are shown in Table 1. Overall, all the considered models exhibit moderate performance, indicating their difficulty in generalizing to the detection of violent actions in videos coming from the target domain. However, the best-performing model in terms of the main metric, i.e., Accuracy, is the ResNet50 architecture with the MCC UDA module. Specifically, we gain 7.4%, 3.7%, and 12.9% of accuracy compared with the ResNet50 network without UDA on the Surveillance Camera Fight, RWF-2000, and Real-Life Violence Situations source domains, respectively, outperforming all the other considered methods present in the literature.
Considering False Alarms and Missing Alarms, it
can be noted that, in general, all the methods obtained
very good results regarding the first metric, while they
struggled with the latter. Considering that missing
alarms are crucial for video violence detection since
they indicate violent actions that happened but were
not detected, this represents the main limitation for
all the violence detectors. However, it is worth not-
ing that the proposed approach, composed of the ResNet50 architecture and the MCC module, can mitigate this issue, achieving better performance compared with the single ResNet50 model and often surpassing all the other techniques. This behavior is linked to a lower number of False Negatives and consequently improves Recall and the F1-score. In Figure 2, we report some samples of True Positives, True Negatives, False Positives, and False Negatives produced by the best model, i.e., the ResNet50 architecture with the MCC UDA module attached.
Figure 1: The considered scenario. We propose an Unsupervised Domain Adaptation scheme for video violence detection to
mitigate the domain gap existing between a source domain (on the left) and a target domain (on the right). The source domain
consists of three collections of annotated videos depicting violent/non-violent scenes in general contexts. On the other hand,
the target domain is represented by a set of unlabeled clips of violent/non-violent actions in public transport.
5 CONCLUSIONS AND FUTURE
DIRECTIONS
In this paper, we tackled the problem of video vio-
lence detection in the context of data scarcity. In-
deed, current Deep Learning solutions hinge on vast
quantities of labeled data needed for supervised learn-
ing, and they suffer when applied to new scenarios
never seen during the training phase. Thus, a model
trained on one domain, named source, usually expe-
riences a drastic drop in performance when applied to another domain, named target. To tackle this is-
sue, we proposed an Unsupervised Domain Adapta-
tion scheme for detecting violent/non-violent actions
present in trimmed videos, which relies on supervised
learning in the source domain and, at the same time,
exploits an unlabeled target dataset to reduce the do-
main shift between the two sets. Our proposed solu-
tion is based on the classification of single images randomly sampled from the frames making up the clips. The feature representations generated from the target images are extracted and fed to a UDA module responsible for making them invariant with respect to the shift between the domains. To the best of our knowledge, this is the first attempt at using a UDA scheme for video violence detection. We conducted exper-
iments considering as source domain three datasets
composed of videos of violent/non-violent scenes in
general contexts and, as the target domain, a collec-
tion of clips of violent/non-violent actions in public
transport. Preliminary results showed that our UDA
scheme can help to improve the generalization capa-
bilities of the considered models, mitigating the domain gap.
In the future, we plan to extend our experimenta-
tion by considering and designing other UDA strate-
gies to be attached to the classifier. Indeed, although
we obtained a significant performance boost, the con-
sidered models still exhibit moderate generalization
capabilities, suggesting that a more effective domain
gap reduction is needed. Furthermore, we plan to also incorporate into the pipeline the spatio-temporal information provided by consecutive frames making up the clips.
ACKNOWLEDGEMENTS
This work was partially funded by: AI4Media - A
European Excellence Centre for Media, Society and
Democracy (EC, H2020 n. 951911); PNRR - M4C2
- Investimento 1.3, Partenariato Esteso PE00000013
- "FAIR - Future Artificial Intelligence Research" -
Spoke 1 "Human-centered AI", funded by European
Union - NextGenerationEU.
Table 1: Performance Evaluation. We considered three datasets for video violence detection in general contexts as source
domains and a collection of clips with violent situations in public transport as the target domain. We randomly varied the
training and validation subsets of the source domains three times, picking up the best model in terms of accuracy. Mean ±
st.dev is reported.
Source Domain: Surveillance Camera Fight (Akti et al., 2019) - Target Domain: Bus Violence (Ciampi et al., 2022a)
Model Accuracy F1 False Alarm Miss Alarm ROC AUC
(Hanson et al., 2019) 0.54 ± 0.02 0.19 ± 0.11 0.04 ± 0.03 0.89 ± 0.07 0.68 ± 0.02
(Sudhakaran and Lanz, 2017) 0.52 ± 0.01 0.27 ± 0.18 0.16 ± 0.17 0.79 ± 0.18 0.55 ± 0.02
ResNet (2+1)D (Tran et al., 2018) 0.52 ± 0.02 0.44 ± 0.34 0.52 ± 0.44 0.44 ± 0.46 0.54 ± 0.05
SlowFast (Feichtenhofer et al., 2019) 0.55 ± 0.01 0.40 ± 0.21 0.27 ± 0.32 0.62 ± 0.35 0.62 ± 0.02
Video Swin Transformer (Liu et al., 2022) 0.52 ± 0.01 0.65 ± 0.01 0.86 ± 0.01 0.10 ± 0.01 0.50 ± 0.01
ResNet50 (He et al., 2016) 0.54 ± 0.02 0.52 ± 0.06 0.44 ± 0.12 0.48 ± 0.12 0.55 ± 0.03
VGG16 (Simonyan and Zisserman, 2015) 0.51 ± 0.01 0.45 ± 0.07 0.39 ± 0.21 0.59 ± 0.19 0.51 ± 0.01
ResNet50 + DANN (Ganin et al., 2016) 0.55 ± 0.01 0.51 ± 0.04 0.39 ± 0.03 0.51 ± 0.06 0.56 ± 0.03
ResNet50 + MCC (Jin et al., 2020) 0.58 ± 0.01 0.52 ± 0.03 0.45 ± 0.05 0.47 ± 0.04 0.63 ± 0.01
VGG16 + DANN (Ganin et al., 2016) 0.53 ± 0.01 0.51 ± 0.04 0.49 ± 0.12 0.46 ± 0.10 0.51 ± 0.01
VGG16 + MCC (Jin et al., 2020) 0.53 ± 0.01 0.43 ± 0.01 0.28 ± 0.03 0.64 ± 0.01 0.52 ± 0.01
Source Domain: RWF-2000 (Cheng et al., 2021) - Target Domain: Bus Violence (Ciampi et al., 2022a)
Model Accuracy F1 False Alarm Miss Alarm ROC AUC
(Hanson et al., 2019) 0.51 ± 0.01 0.07 ± 0.03 0.01 ± 0.01 0.96 ± 0.02 0.67 ± 0.05
(Sudhakaran and Lanz, 2017) 0.51 ± 0.01 0.08 ± 0.08 0.03 ± 0.03 0.95 ± 0.05 0.52 ± 0.02
ResNet (2+1)D (Tran et al., 2018) 0.53 ± 0.03 0.43 ± 0.05 0.29 ± 0.01 0.64 ± 0.05 0.54 ± 0.03
SlowFast (Feichtenhofer et al., 2019) 0.53 ± 0.03 0.40 ± 0.10 0.26 ± 0.08 0.67 ± 0.12 0.55 ± 0.03
Video Swin Transformer (Liu et al., 2022) 0.53 ± 0.01 0.52 ± 0.04 0.45 ± 0.12 0.49 ± 0.09 0.57 ± 0.01
ResNet50 (He et al., 2016) 0.54 ± 0.01 0.49 ± 0.04 0.34 ± 0.05 0.56 ± 0.06 0.58 ± 0.01
VGG16 (Simonyan and Zisserman, 2015) 0.54 ± 0.01 0.41 ± 0.03 0.25 ± 0.06 0.67 ± 0.04 0.54 ± 0.01
ResNet50 + DANN (Ganin et al., 2016) 0.55 ± 0.01 0.52 ± 0.01 0.40 ± 0.01 0.50 ± 0.01 0.57 ± 0.01
ResNet50 + MCC (Jin et al., 2020) 0.56 ± 0.01 0.59 ± 0.02 0.49 ± 0.05 0.37 ± 0.05 0.60 ± 0.02
VGG16 + DANN (Ganin et al., 2016) 0.55 ± 0.02 0.52 ± 0.03 0.39 ± 0.04 0.51 ± 0.03 0.54 ± 0.02
VGG16 + MCC (Jin et al., 2020) 0.55 ± 0.01 0.41 ± 0.02 0.20 ± 0.05 0.69 ± 0.06 0.55 ± 0.01
Source Domain: Real-life Violence Situations (Soliman et al., 2019) - Target Domain: Bus Violence (Ciampi et al., 2022a)
Model Accuracy F1 False Alarm Miss Alarm ROC AUC
(Hanson et al., 2019) 0.58 ± 0.02 0.49 ± 0.09 0.26 ± 0.12 0.57 ± 0.14 0.61 ± 0.01
(Sudhakaran and Lanz, 2017) 0.52 ± 0.01 0.45 ± 0.02 0.35 ± 0.04 0.61 ± 0.04 0.55 ± 0.02
ResNet (2+1)D (Tran et al., 2018) 0.51 ± 0.01 0.01 ± 0.01 0.01 ± 0.01 0.99 ± 0.01 0.57 ± 0.08
SlowFast (Feichtenhofer et al., 2019) 0.51 ± 0.01 0.02 ± 0.02 0.01 ± 0.01 0.99 ± 0.01 0.54 ± 0.04
Video Swin Transformer (Liu et al., 2022) 0.51 ± 0.02 0.30 ± 0.20 0.22 ± 0.17 0.76 ± 0.20 0.53 ± 0.02
ResNet50 (He et al., 2016) 0.54 ± 0.01 0.49 ± 0.03 0.38 ± 0.08 0.54 ± 0.06 0.56 ± 0.01
VGG16 (Simonyan and Zisserman, 2015) 0.53 ± 0.01 0.54 ± 0.02 0.33 ± 0.09 0.51 ± 0.08 0.58 ± 0.01
ResNet50 + DANN (Ganin et al., 2016) 0.57 ± 0.01 0.49 ± 0.03 0.25 ± 0.04 0.59 ± 0.03 0.57 ± 0.02
ResNet50 + MCC (Jin et al., 2020) 0.61 ± 0.01 0.54 ± 0.09 0.32 ± 0.15 0.51 ± 0.13 0.61 ± 0.01
VGG16 + DANN (Ganin et al., 2016) 0.54 ± 0.01 0.52 ± 0.03 0.40 ± 0.05 0.49 ± 0.03 0.54 ± 0.02
VGG16 + MCC (Jin et al., 2020) 0.57 ± 0.01 0.54 ± 0.04 0.36 ± 0.08 0.50 ± 0.08 0.59 ± 0.01
Figure 2: Some samples of predictions over the target domain. In the four rows, we report some samples of True Positives,
True Negatives, False Positives, and False Negatives concerning the best model, i.e., ResNet50 + MCC, for each of the
considered source domains (one for each column).
REFERENCES
Akti, S., Ofli, F., Imran, M., and Ekenel, H. K. (2022).
Fight detection from still images in the wild. In
2022 IEEE/CVF Winter Conference on Applications
of Computer Vision Workshops (WACVW). IEEE.
Akti, S., Tataroglu, G. A., and Ekenel, H. K. (2019). Vision-
based fight detection from surveillance cameras. In
2019 Ninth International Conference on Image Pro-
cessing Theory, Tools and Applications (IPTA). IEEE.
Amato, G., Bolettieri, P., Moroni, D., Carrara, F., Ciampi,
L., Pieri, G., Gennaro, C., Leone, G. R., and Vairo, C.
(2018). A wireless smart camera network for parking
monitoring. In 2018 IEEE Globecom Workshops (GC
Wkshps). IEEE.
Amato, G., Ciampi, L., Falchi, F., and Gennaro, C. (2019a).
Counting vehicles with deep learning in onboard UAV
imagery. In 2019 IEEE Symposium on Computers and
Communications (ISCC). IEEE.
Amato, G., Ciampi, L., Falchi, F., Gennaro, C., and
Messina, N. (2019b). Learning pedestrian detection
from virtual worlds. In Image Analysis and Processing
- ICIAP 2019 - 20th International Conference, Trento,
Italy, September 9-13, 2019, Proceedings, Part I, vol-
ume 11751 of Lecture Notes in Computer Science,
pages 302–312. Springer.
Avvenuti, M., Bongiovanni, M., Ciampi, L., Falchi, F., Gen-
naro, C., and Messina, N. (2022). A spatio-temporal
attentive network for video-based crowd counting.
CoRR, abs/2208.11339.
Benedetto, M. D., Carrara, F., Ciampi, L., Falchi, F.,
Gennaro, C., and Amato, G. (2022). An embed-
ded toolset for human activity monitoring in criti-
cal environments. Expert Systems with Applications,
199:117125.
Cafarelli, D., Ciampi, L., Vadicamo, L., Gennaro, C.,
Berton, A., Paterni, M., Benvenuti, C., Passera, M.,
and Falchi, F. (2022). MOBDrone: A drone video
dataset for man OverBoard rescue. In Image Anal-
ysis and Processing – ICIAP 2022, pages 633–644.
Springer International Publishing.
Chen, Y., Li, W., Chen, X., and Gool, L. V. (2019). Learn-
ing semantic segmentation from synthetic data: A ge-
ometrically guided input-output adaptation approach.
In 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). IEEE.
Cheng, M., Cai, K., and Li, M. (2021). RWF-2000: An
open large scale video database for violence detec-
tion. In 2020 25th International Conference on Pat-
tern Recognition (ICPR). IEEE.
Ciampi, L., Amato, G., Falchi, F., Gennaro, C., and Rabitti,
F. (2018). Counting vehicles with cameras. In Pro-
ceedings of the 26th Italian Symposium on Advanced
Database Systems, Castellaneta Marina (Taranto),
Italy, June 24-27, 2018, volume 2161 of CEUR Work-
shop Proceedings. CEUR-WS.org.
Ciampi, L., Foszner, P., Messina, N., Staniszewski, M.,
Gennaro, C., Falchi, F., Serao, G., Cogiel, M., Golba,
D., Szczęsna, A., and Amato, G. (2022a). Bus vio-
lence: An open benchmark for video violence detec-
tion on public transport. Sensors, 22(21):8345.
Ciampi, L., Gennaro, C., Carrara, F., Falchi, F., Vairo, C.,
and Amato, G. (2022b). Multi-camera vehicle count-
ing using edge-AI. Expert Systems with Applications,
207:117929.
Ciampi, L., Messina, N., Falchi, F., Gennaro, C., and Am-
ato, G. (2020a). Virtual to real adaptation of pedes-
trian detectors. Sensors, 20(18):5250.
Ciampi, L., Santiago, C., Costeira, J., Gennaro, C., and
Amato, G. (2021). Domain adaptation for traffic den-
sity estimation. In Proceedings of the 16th Interna-
tional Joint Conference on Computer Vision, Imag-
ing and Computer Graphics Theory and Applications.
SCITEPRESS - Science and Technology Publications.
Ciampi, L., Santiago, C., Costeira, J. P., Gennaro, C., and
Amato, G. (2020b). Unsupervised vehicle counting
via multiple camera domain adaptation. In Proceed-
ings of the First International Workshop on New Foun-
dations for Human-Centered AI (NeHuAI) co-located
with 24th European Conference on Artificial Intelli-
gence (ECAI 2020), Santiago de Compostella, Spain,
September 4, 2020, volume 2659 of CEUR Workshop
Proceedings, pages 82–85. CEUR-WS.org.
Csurka, G. (2017). Domain adaptation for visual
applications: A comprehensive survey. CoRR,
abs/1702.05374.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). ImageNet: A large-scale hierarchical im-
age database. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition. IEEE.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
SlowFast networks for video recognition. In 2019
IEEE/CVF International Conference on Computer Vi-
sion (ICCV). IEEE.
Feichtenhofer, C., Pinz, A., and Wildes, R. P. (2016). Spa-
tiotemporal residual networks for video action recog-
nition. In Advances in Neural Information Processing
Systems 29: Annual Conference on Neural Informa-
tion Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain, pages 3468–3476.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P.,
Larochelle, H., Laviolette, F., Marchand, M., and
Lempitsky, V. S. (2016). Domain-adversarial training
of neural networks. J. Mach. Learn. Res., 17:59:1–
59:35.
Hanson, A., PNVR, K., Krishnagopal, S., and Davis, L.
(2019). Bidirectional convolutional LSTM for the de-
tection of violence in videos. In Lecture Notes in Com-
puter Science, pages 280–295. Springer International
Publishing.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR). IEEE.
Hong, W., Wang, Z., Yang, M., and Yuan, J. (2018). Con-
ditional generative adversarial network for structured
domain adaptation. In 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition. IEEE.
Huo, X., Xie, L., Hu, H., Zhou, W., Li, H., and Tian, Q.
(2022). Domain-agnostic prior for transfer seman-
tic segmentation. In 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR).
IEEE.
Jin, Y., Wang, X., Long, M., and Wang, J. (2020). Mini-
mum class confusion for versatile domain adaptation.
In Computer Vision – ECCV 2020, pages 464–480.
Springer International Publishing.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra-
manan, D., Dollár, P., and Zitnick, C. L. (2014). Mi-
crosoft COCO: Common objects in context. In Com-
puter Vision – ECCV 2014, pages 740–755. Springer
International Publishing.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 10012–10022.
Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and
Hu, H. (2022). Video swin transformer. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 3202–3211.
Padamwar, B. (2020). Violence detection in surveillance
video using computer vision techniques. International
Journal for Research in Applied Science and Engi-
neering Technology, 8(8):533–536.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learn-
ing. IEEE Transactions on Knowledge and Data En-
gineering, 22(10):1345–1359.
Pęszor, D., Staniszewski, M., and Wojciechowska, M.
(2016). Facial reconstruction on the basis of video
surveillance system for the purpose of suspect identi-
fication. In Intelligent Information and Database Sys-
tems, pages 467–476. Springer Berlin Heidelberg.
Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., and
Woo, W. (2015). Convolutional LSTM network: A
machine learning approach for precipitation nowcast-
ing. In Advances in Neural Information Processing
Systems 28: Annual Conference on Neural Informa-
tion Processing Systems 2015, December 7-12, 2015,
Montreal, Quebec, Canada, pages 802–810.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
In Bengio, Y. and LeCun, Y., editors, 3rd Interna-
tional Conference on Learning Representations, ICLR
2015, San Diego, CA, USA, May 7-9, 2015, Confer-
ence Track Proceedings.
Soliman, M. M., Kamal, M. H., Nashed, M. A. E.-M.,
Mostafa, Y. M., Chawky, B. S., and Khattab, D.
(2019). Violence recognition from videos using deep
learning techniques. In 2019 Ninth International
Conference on Intelligent Computing and Information
Systems (ICICIS). IEEE.
Spremolla, I. R., Antunes, M., Aouada, D., and Ottersten,
B. (2016). RGB-D and thermal sensor fusion. In Pro-
ceedings of the 11th Joint Conference on Computer
Vision, Imaging and Computer Graphics Theory and
Applications. SCITEPRESS - Science and Technol-
ogy Publications.
Staniszewski, M., Foszner, P., Kostorz, K., Michalczuk,
A., Wereszczyński, K., Cogiel, M., Golba, D., Wojciechowski, K., and Polański, A. (2020). Application
of crowd simulations in the evaluation of tracking al-
gorithms. Sensors, 20(17):4960.
Sudhakaran, S. and Lanz, O. (2017). Learning to detect vio-
lent videos using convolutional long short-term mem-
ory. In 2017 14th IEEE International Conference
on Advanced Video and Signal Based Surveillance
(AVSS). IEEE.
Torralba, A. and Efros, A. A. (2011). Unbiased look at
dataset bias. In CVPR 2011. IEEE.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri,
M. (2015). Learning spatiotemporal features with 3d
convolutional networks. In 2015 IEEE International
Conference on Computer Vision (ICCV). IEEE.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y.,
and Paluri, M. (2018). A closer look at spatiotem-
poral convolutions for action recognition. In 2018
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition. IEEE.
Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. (2017).
Adversarial discriminative domain adaptation. In
2017 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR). IEEE.
Zhang, Y., David, P., and Gong, B. (2017). Curriculum do-
main adaptation for semantic segmentation of urban
scenes. In 2017 IEEE International Conference on
Computer Vision (ICCV). IEEE.