FiDaSS: A Novel Dataset for Firearm Threat Detection in Real-World
Scenes
Murilo S. Regio and Isabel H. Manssour
Pontifical Catholic University of Rio Grande do Sul, PUCRS, School of Technology, Porto Alegre, RS, Brazil
Keywords:
Surveillance, CCTV, Firearm Detection, Armed Person Detection.
Abstract:
For a society to thrive, people must feel safe; otherwise, fear and stress reduce the quality of life. A variety
of security measures are used, but as populations grow and firearms become more accessible, societal safety
faces new challenges. Existing works on threat detection focus primarily on security cameras but lack common
benchmarks, standard datasets, or consistent constraints, making it difficult to assess their real-world perfor-
mance, especially with low-quality footage. This work introduces a challenging dataset for Firearm Threat
Detection, comprising 7450 annotated frames across 291 videos, created under rigorous quality controls. We
also developed tools to streamline dataset creation and expansion through semi-automatic annotations. To
our knowledge, this is the largest real-world dataset with frame-level annotations in the area. Our dataset is
available online alongside the tools developed, including some to facilitate its extension. We evaluated popular
detectors and state-of-the-art transformer-based methods on the dataset to validate its difficulty.
1 INTRODUCTION
Security has always been a major concern, and
as firearms become increasingly accessible, societal
safety grows more fragile (Hurka and Knill, 2020).
Firearms allow individuals, even without advanced
training, to cause significant harm in public spaces,
leading to tragedies such as school and mass shoot-
ings (Gius, 2018; Lemieux, 2014). Several measures
exist to handle these situations, the most common be-
ing monitoring environments using security cameras.
While security cameras offer advantages (Piza
et al., 2019), such as recording events for posterity, they rely heavily on human supervision. Cameras only capture footage; they require operators to actively monitor and respond to incidents. In larger
areas or buildings, multiple cameras must be moni-
tored simultaneously, increasing the risk of distrac-
tions and human error (Darker et al., 2007).
Effective vigilance requires sustained concentra-
tion (Donald and Donald, 2015), yet CCTV operators
often struggle to maintain attention over time. Studies
show that focus declines significantly after 20 min-
utes (Velastin et al., 2006), with operators missing
45% of scene elements by 12 minutes and up to 95%
after 22 minutes (Ainsworth, 2002). This highlights
the limitations of traditional monitoring and the need
for solutions to enhance security.
Many studies have tackled firearm detection, of-
ten prioritizing model performance over the data
used (Gelana and Yadav, 2019; de Azevedo Kanehisa
and de Almeida Neto, 2019). Some focus on spe-
cific tasks, such as concealed weapon detection (Ra-
turi et al., 2019; Ineneji and Kusaf, 2019), while oth-
ers address broader security issues, including aban-
doned luggage (Loganathan et al., 2019), fire (Mehta
et al., 2020), or general violence (Pawar et al., 2019).
Despite ongoing concerns about dataset quality,
these issues are frequently left for future work (Ol-
mos et al., 2018; Lim et al., 2019). Few authors pro-
pose new datasets, and even fewer provide real-world
data with detailed object detection annotations. To
address this gap, we developed a novel and flexible
dataset for firearm threat detection, created methodi-
cally from real-world scenes using rigorous selection
and annotation processes.
The contributions of this work are threefold:

- A novel challenging dataset called FiDaSS (Firearm Dataset for Smart Surveillance) with 7450 real-world annotated images featuring diverse scenarios, cultural contexts, and detailed annotations for victims, perpetrators, and weapons.
- Tools to streamline dataset creation or expansion by using pre-existing detectors to estimate annotations, which can be manually refined afterward.
- Experiments using state-of-the-art networks to evaluate the quality and difficulty of FiDaSS.
Table 1: Most frequently used datasets in the studied literature, shown in descending order of popularity.

Dataset                  Type of Data  Amount of Data  Annotations  Task            Frame Dimension  Year
(Olmos et al., 2018)     Movie         3,000 Frames    Frame Level  Detection       Varied           2018
(IMFDB, 2015)            Movie         396,808 Frames  Frame Level  Classification  Varied           2015
(Sultani et al., 2018)   Real-world    200 Videos      Video Level  Classification  320x240          2018
(Grega et al., 2013)     Acted         7 Videos        Video Level  Classification  640x480          2013
(González et al., 2020)  Synthetic     4000 Frames     Frame Level  Detection       1920x1080        2020
(Gu et al., 2022)        Acted         5000 Frames     Frame Level  Detection       Varied           2022
(Hnoohom et al., 2022)   Acted         8319 Frames     Frame Level  Detection       1920x1080        2022
FiDaSS                   Real-world    7450 Frames     Frame Level  Detection       Varied           N/A
2 RELATED WORK
Through a literature study, we identified 34 datasets used or proposed, and in Table 1 we provide a comparison of the most popular ones. The datasets found can be roughly categorized based on the data they use. This insight is crucial for quickly filtering undesirable datasets and focusing on those adequate for our objectives. The categories identified are as follows:
Movie Data: Datasets based on movies are abun-
dant and offer plenty of data, but have a ten-
dency for lower real-world performance due to
cinematic characteristics. E.g., (IMFDB, 2015).
Enacted Data: Simulated real-life scenarios offer
better realism but are smaller due to the high effort
required for creation. E.g., (Grega et al., 2013).
Real Data: Surveillance footage datasets are rare,
small, and often subsets of broader datasets, de-
spite being the most representative of real-world
scenarios. E.g., (Sultani et al., 2018).
Although most works focused on CCTV scenar-
ios, only three public datasets are based on real-world
data. Movie-based datasets were the most prevalent,
with the two most used datasets falling into this cat-
egory. This mismatch highlights a reliance on in-
adequate data for real-world applications, likely due
to the greater availability and size of movie datasets.
Furthermore, when analyzing the datasets shown in
Table 1, we notice it is difficult to compare the meth-
ods in the area fairly, as works use diverging datasets
that focus on different categories. Similarly, there is no standard measure for comparing methods, even within the same category.
We identified that the biggest concern in the area
is the construction of representative datasets. Some
works (Sultani et al., 2018; Lim et al., 2019) stand out for presenting publicly available data from actual events captured by security cameras, and they contain exciting data. However, they are limited in amount and diversity, and they consist of sets of videos or contiguous frames marked only as containing the object of interest or not, rather than precise annotations. Considering this, we created a dataset aiming to address the identified limitations and provide a robust foundation for future research, thus filling the following gaps: (I) Real-world data to encourage practical
applications; (II) High variability in sources, video
quality, and cultural representation; (III) Frame-level
annotations, adaptable for tasks like object detection
and scene classification.
3 DATASET DESCRIPTION
We started with a literature review to identify com-
monly used datasets, their characteristics, and areas
for improvement (Section 2). Based on that, we de-
cided that our primary objective for FiDaSS was to
portray a diversity of cultures using real-world scenes,
thus minimizing regional social biases.
To facilitate the creation of FiDaSS, we implemented a set of tools providing useful scripts for manipulating videos, creating and labeling clips, creating bounding box annotations, and generating statistics. We also integrated our annotation pipeline with an open-source video object tracking framework to provide suggestions for future annotations.
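To make the statistics step concrete, below is a minimal sketch of a script in the spirit of these tools; it is not the released tooling itself, and the per-clip JSON layout ("frames" containing "objects" with a "label" field) is a hypothetical format used only for illustration.

```python
# Minimal sketch of an annotation-statistics script (illustrative only).
# The annotation layout assumed here (one JSON file per clip, with a
# "frames" -> "objects" -> "label" structure) is hypothetical.
import json
from collections import Counter
from pathlib import Path

def count_objects(annotation_dir: str) -> Counter:
    """Count annotated objects per class across all clip annotation files."""
    counts = Counter()
    for ann_file in Path(annotation_dir).glob("*.json"):
        clip = json.loads(ann_file.read_text())
        for frame in clip.get("frames", []):
            for obj in frame.get("objects", []):
                counts[obj["label"]] += 1  # e.g., "armed", "unarmed", "firearm"
    return counts

if __name__ == "__main__":
    for label, total in count_objects("annotations/").most_common():
        print(f"{label}: {total}")
```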
The following sections describe each step presented in Fig. 1. FiDaSS, the tools developed to create it, and complementary details about it (e.g., dataset splits, geographical diversity, and training configurations) are available online at https://github.com/fidass/fidass_dataset.

Figure 1: Steps followed to create our dataset.
3.1 Data Collection
To create FiDaSS, we explored a wide range of data
sources to assess the existing resources in the lit-
erature and identify gaps. We began by analyzing
the well-known datasets listed in Table 1, along with
lesser-known ones, to establish a strong foundation
for our dataset. Building on this, we sought videos
from YouTube and a local news company to enrich
FiDaSS with diverse scenarios, situations, and cul-
tures. However, obtaining a substantial amount of
varied real-world data remains a significant challenge
due to privacy concerns and the limited availability of
recordings held by security companies.
From our analysis of existing datasets, we se-
lected UCF Crime as a foundation due to its focus
on real-world security camera footage. We selected
videos containing moments that clearly displayed a
held weapon and included a visible criminal, thus ex-
cluding scenarios solely involving law enforcement.
Then, to expand beyond the literature, we gathered unexplored data from YouTube. We collected an initial pool of videos using a query-based search with the
keywords [surveillance video armed robbery, CCTV
assaults, guns in CCTV, assault caught on camera] in
over 20 languages. After filtering based on our criteria
(weapon visibility and the presence of a criminal) and
removing duplicates, this yielded 139 videos. Fur-
thermore, we used YouTube’s recommendation sys-
tem to discover more content, and by applying the
same filtering process, we gathered 162 new videos.
To ensure there were no duplicate videos, we conducted a manual verification to remove all overlapping data from the selection. Ultimately, we selected 301 videos from YouTube depicting crime scenes from different countries and cultures. The playlists with these videos are available online at https://youtube.com/@fidassdataset7285/playlists.
Finally, to expand our dataset further, we con-
tacted a local news company, requesting access to
some videos provided to them depicting recent crime
scenes from the region. Upon receiving their ap-
proval, we obtained 13 novel videos.
3.2 Dataset Annotation
After collecting the videos, we began annotating clips, ensuring each was self-contained. For that, each clip had to feature the assailant for at least five seconds, so that it contributed sufficient relevant information. Through this process, we reduced 18 hours of video to 2.8 hours of manually selected clips, each containing either unique footage or a different camera angle. To standardize the dataset, we converted all clips to a frame rate of 1 fps.
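The conversion itself is straightforward; as a rough sketch (not the exact script we used), downsampling a clip to 1 fps with OpenCV could look like the following, with file paths and naming chosen only for illustration.

```python
# Sketch of downsampling a clip to roughly 1 fps by keeping one frame per
# second of video; paths and file naming are illustrative placeholders.
import cv2
from pathlib import Path

def extract_1fps(video_path: str, out_dir: str) -> int:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(int(round(fps)), 1)            # keep one frame per ~second
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```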
After delimiting each clip, we annotated it at the frame level for object detection. Our first step was to annotate only the first appearance of each person in each clip. In addition to armed people, we also included their guns and unarmed people in the annotations. This way, models could learn more reliably the difference between armed people and people holding gun-like items (such as phones and umbrellas) in low-quality videos.
Next, we processed all clips and their unique ob-
jects using a network designed for object tracking.
This approach generated an initial approximation of
annotations for every clip with minimal manual ef-
fort. To ensure quality, we meticulously reviewed
each frame and corrected mismarked instances. This
procedure significantly reduced our workload while
expediting the creation process of our dataset.
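We do not detail the tracking framework here; the sketch below only illustrates the general idea, using an off-the-shelf OpenCV tracker (CSRT) to propagate a manually drawn first-frame box through a clip and produce suggestions that an annotator later reviews.

```python
# Illustrative sketch of propagating an initial box with an off-the-shelf
# tracker; this is not the specific framework used to build FiDaSS.
# Boxes are (x, y, w, h); frames are decoded images (NumPy arrays).
import cv2

def propagate_box(frames, init_box):
    """Return one suggested box per frame, starting from a manual annotation."""
    # Depending on the OpenCV build, the tracker may live under cv2.legacy.
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frames[0], init_box)
    suggestions = [init_box]
    for frame in frames[1:]:
        ok, box = tracker.update(frame)
        # Failed updates are marked None so the annotator can fix them manually.
        suggestions.append(tuple(int(v) for v in box) if ok else None)
    return suggestions
```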
In many recordings, we noticed that the assailant would, for example, stand perfectly still while making demands for 15-30 seconds, while the people involved stood paralyzed listening to the threats during most of this time. These scenarios would cause several clips to consist of nearly identical images with no substantial changes providing new information. Thus, while adjusting the misdetected bounding boxes, we discarded these long redundant sequences to avoid inflating the results of our experiments, as having the models predict identical frames would make them seem more accurate than they actually are. By doing so, we reduced the total duration of the clips in our dataset to 2.1 hours.
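This pruning was done by hand while reviewing the boxes; purely as an illustration of the redundancy criterion (not a step of our pipeline), near-identical consecutive frames could also be flagged automatically with a simple pixel-difference threshold, where the threshold value below is an arbitrary placeholder.

```python
# Illustration of flagging near-duplicate consecutive frames via mean
# absolute pixel difference. In FiDaSS the redundant sequences were removed
# manually; this sketch only makes the redundancy criterion explicit.
import numpy as np

def keep_informative(frames, threshold=4.0):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [frames[0]]
    for frame in frames[1:]:
        diff = np.mean(np.abs(frame.astype(np.int16) - kept[-1].astype(np.int16)))
        if diff > threshold:  # arbitrary placeholder threshold
            kept.append(frame)
    return kept
```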
Since we gathered images from several different videos available online, many of them had faces blurred for anonymity. However, as our objective with our dataset is to provide an accurate estimate of a model's performance in real-world scenarios, adding blur to all faces would provide an inaccurate representation of the data models would encounter when applied in real scenarios.
Figure 2: Example of images from our dataset with their corresponding labels, following the color scheme: green for “armed”
labels, blue for “unarmed”, and red for “firearm”.
Table 2: Total data selected from each data source.

Dataset of Origin  Videos  Clips  Frames
YouTube Playlist   301     216    4905
UCF Crime          197     144    2239
News Company       13      19     306
Total              522     379    7450
Therefore, we decided not to add any further blur to our dataset, while still using the images we gathered that were already blurred. This way, we hope those images serve as augmented data during training, teaching models that the individual's face is not as important as their posture and what they are holding.
The annotation process described involved three
contributors, each responsible for annotating a sub-
set of the collected videos. A final group revision
was conducted to ensure consistency across all an-
notations, addressing nuances in low-quality frames,
such as determining whether faintly visible individu-
als in the background should be annotated.
The tools we developed streamlined the described methodology, enabling efficient frame selection, annotation, and subsequent corrections. The resulting
dataset comprises a total of 23109 annotated objects
across 379 clips (7450 frames), representing approx-
imately two hours of annotated footage. Examples
of the dataset’s annotations are shown in Fig. 2, with
the first row illustrating higher-quality frames and the
second row showcasing lower-quality frames that re-
quire additional context for proper interpretation.
4 DATASET STATISTICS
This section discusses some properties of FiDaSS, compares it to the datasets presented in Table 1, and addresses why the original annotations of the datasets we used as a basis for ours were insufficient.
One crucial characteristic of FiDaSS is that its annotations are directed toward the task of object detection in real-life scenarios. To the best of our knowledge, considering our literature analysis, no dataset presents these characteristics while containing a substantial number of images. Although the datasets highlighted in Table 1 have a large amount of data, only one presented exclusively real-world data, and none had annotations for the object detection task, only for image or video classification.
Table 2 lists the sources used for FiDaSS, show-
ing the number of videos and selected frames from
each. Approximately 40% of FiDaSS derives from
existing datasets, but we have rigorously selected and
annotated the most relevant frames, which were pre-
viously available only as raw videos. The remaining
60% consists of novel data from diverse cultures.
After describing FiDaSS’ properties, it is essen-
tial to compare it with the datasets identified in Ta-
ble 1. While Weapons-Detection and IMFDB offer
the largest datasets, they primarily consist of movie
scenes or context-free images, limiting real-world
applicability. The Gun Movies Database provides
security camera footage, but comprises only seven
laboratory-shot videos. The UCF Crime dataset pro-
vides real-world footage, with 150 robbery and 50
shooting videos, but uses clip-level labels instead of
frame-level. FiDaSS bridges these gaps with detailed
annotations and diverse real-world scenarios.
Table 3: Comparison of the models we explored with our dataset (DINO, EVA-02, and TransVOD are the transformer-based models).

Model        Input   Backbone       mAP@50   AP@50 Armed  AP@50 Unarmed  AP@50 Firearm  #Params
DAFNe        Frames  ResNet-101     34.00%   50.07%       39.83%         12.11%         5M
Faster-RCNN  Frames  ResNet-50      40.65%   45.69%       48.83%         27.42%         42M
YOLOv10      Frames  CSPDarknet-53  44.30%   56.10%       40.50%         36.10%         24M
DINO         Frames  ResNet-50      60.60%   71.30%       66.09%         44.35%         47M
EVA-02       Frames  FPN-12         72.07%   86.47%       75.48%         54.27%         86M
TransVOD     Clips   ResNet-50      45.00%   57.11%       50.62%         27.26%         59M
5 ALGORITHMIC ANALYSIS
FiDaSS introduces a novel, challenging object detec-
tion dataset designed for adaptability to various tasks,
such as video and frame detection. To evaluate its util-
ity, we conducted experiments using both clip-based
context and individual frames. We also performed
cross-dataset evaluations to assess model generaliza-
tion when trained on our dataset. The following sec-
tions outline the setup used and discuss the results.
5.1 Experimental Setup
FiDaSS aims to evaluate how accurately models
would perform in real surveillance system applica-
tions. We tested a range of state-of-the-art architec-
tures to identify their weaknesses in this area.
Our initial experiments employed YOLO (Jocher
et al., 2023) and Faster-RCNN (Ren et al., 2015), ver-
satile models effective across diverse tasks. How-
ever, these models struggled with our dataset, es-
pecially with detecting firearms. To address this,
we tested DAFNe (Lang et al., 2021), a special-
ized architecture for detecting small objects in scenes.
We then explored two transformer-based networks,
DINO (Zhang et al., 2022) and EVA-02 (Fang
et al., 2023), as an alternative to convolutional ap-
proaches. Additionally, we evaluated sequence-
processing models using TransVOD (Zhou et al.,
2022), an enhanced version of DETR (Carion et al.,
2020), to analyze performance on short video clips
rather than isolated frames. Details of the training
configurations are available on the project’s GitHub.
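As a hedged sketch of how one of the frame-based baselines can be reproduced, the snippet below uses the Ultralytics interface (Jocher et al., 2023); the dataset description file "fidass.yaml" is hypothetical, and the hyperparameters are placeholders rather than our exact configuration (see the project's GitHub for the actual settings).

```python
# Sketch of training a frame-based baseline with the Ultralytics interface.
# "fidass.yaml" is a hypothetical dataset file listing the train/val splits
# and the three classes (armed, unarmed, firearm); hyperparameters are
# placeholders, not the configuration used in the paper.
from ultralytics import YOLO

model = YOLO("yolov10n.pt")                             # pretrained checkpoint
model.train(data="fidass.yaml", epochs=100, imgsz=640)  # fine-tune on FiDaSS
metrics = model.val()                                   # reports mAP@50 and per-class AP
```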
5.2 Experimental Results
The results of our experiments are presented in Table 3, including information such as whether the experiment focused on individual frames or a clip sequence, the average precision for each class, and the model size. Because of the low quality of the images, when a person has their arms stretched out, the models sometimes detect only the torso, which causes a disparity between the label and the detections. Considering this, we focused our analysis on a 50% IoU threshold, as we consider a person correctly detected even if their limbs are not included in the prediction.
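For reference, the matching criterion behind these numbers is the standard intersection over union (IoU) at a 0.5 threshold; a minimal sketch for axis-aligned boxes follows.

```python
# Minimal IoU computation for axis-aligned boxes given as (x1, y1, x2, y2);
# under the 50% threshold used here, a prediction counts as correct when
# it reaches IoU >= 0.5 with a ground-truth box of the same class.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def correct_at_50(pred_box, gt_box):
    return iou(pred_box, gt_box) >= 0.5
```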
From the results gathered in Table 3, we can notice a significant advantage of using transformer models. The first three models barely achieved a mAP of 40% and a per-class AP of 50%, while the transformer models achieved a mAP above 60%, with a per-class AP between 70% and 80% for the non-firearm classes. Additionally, while the firearm class is considerably lower, peaking at approximately 40%, if we can consistently identify armed individuals, we can more easily infer the presence of firearms in the scene. Thus, we will focus mainly on “Armed” and “Unarmed”.
Recurring errors emerged across all models, with
person labeling and firearm detection being the most
significant challenges. Models consistently identified
people, except in cases of heavy blur or background
occlusion. However, they struggled with labeling in-
dividuals as “armed” or “unarmed”, often fluctuating
between the two across consecutive frames, with a
slight bias toward the “armed” class. No model con-
sistently located firearms in the scene due to factors
such as image quality, variability in object shape, an-
gle, and distance.
These were the major issues identified, and they affected all models; each model's metrics reflect how well it could minimize these mistakes. Because of this, we focus our analysis on the results produced by EVA-02, which achieved the highest performance, and discuss other issues we identified in more depth while studying its results more thoroughly.
We present correct and incorrect detection samples in Fig. 3, illustrating diverse camera angles from our test set. We can see that even from a distance and with low-quality images, the model consistently located people in the scene, with few recurring exceptions. For instance, individuals with their faces hidden were often mislabeled as “armed”, leading to errors in cases of innocent people with their backs to the camera.

Figure 3: Examples of successful (left) and missed (right) detections by the EVA-02 model. The ground truth for each sample is shown on the right and the model's detection is on the left. The images presented have been zoomed for clarity.
To evaluate how representative our dataset is, we
ran inference with our best-performing model
on two datasets from the literature: Sohas-Weapons,
which has been frequently used throughout the years,
and YouTube-GDD, which is a more recent and very
promising dataset. Because of the difference in the
datasets’ objectives, we decided to focus on a qualita-
tive analysis instead of metrics. In both datasets, we
noticed instances of firearms labeled in the scenes but
not held by anyone (e.g., in gun stores), which were
not detected by the model we trained. We did not consider those as missed detections, as our objective is to
identify firearms being wielded by someone. For fur-
ther comparisons, examples of our cross-dataset ex-
periments are also available on our GitHub.
Our experiments with Sohas-Weapons showed
that, despite its interesting and diverse data (e.g.,
video game screenshots, stock photos, and selfies), it
lacks relevance for real-world applications. Their an-
notations are also limited, labeling pistols regardless
of context (e.g., images of figurines) but excluding
people or other types of firearms, which reduces the
dataset’s applicability and flexibility. EVA-02 strug-
gled most with contextless images or first-person per-
spectives, which are absent in our training set.
Similarly, experimenting with YouTube-GDD
showed that it contains more specialized data, includ-
ing shooting range videos and demonstrations. Ad-
ditionally, they provide more complete annotations
covering firearms and people, though without dis-
tinguishing between armed and unarmed individuals.
However, while aligning better with real-world sce-
narios, it also includes close-ups of firearms or indi-
viduals, limiting its practical value. The model we
trained performed well on this dataset, suffering mostly from differing firearm annotation policies between their dataset and ours.
Analyzing the results, we can observe that
transformer-based networks outperform purely con-
volutional networks. This is due to attention mech-
anisms, which likely leverage global information to classify people in ambiguous scenarios, whereas
convolutional networks rely more on the immediate
vicinity of the object. Because of this, transformer
models can perform more reliably in instances where
the presence of a firearm is not clear, either because
of its position relative to the camera or because the
image quality blurs it out. However, we were sur-
prised that our first experiments with clip sequences
performed worse than other transformer strategies.
We are unsure why that happened, given that we ex-
pected temporal information to help resolve ambigu-
ous scenes even more than spatial information.
6 DISCUSSION, LIMITATIONS,
AND FUTURE WORK
FiDaSS is one of the few datasets made entirely of real data from various cultures and, to the best of our knowledge, the only one in this context with annotations for object detection. Moreover, our experiments show that state-of-the-art methods have difficulty with it, primarily in differentiating armed from unarmed people, making it an exciting alternative for future research. Our dataset offers a rich diversity of scenes, capturing different real-world situations from security cameras with varying levels of quality. These diverse scenes, often obfuscated and ambiguous, provide exciting challenges to be tackled by future research. We also observed that only a minority of firearm objects were correctly identified, which we attribute to the high similarity between objects that the camera does not capture well due to their small size and the low image quality.
Additionally, by analyzing our results, we identified certain patterns that frequently reappear throughout many videos. The first pattern was cases where the model found the gun but labeled the person holding it as unarmed. While this is strange to observe, we must consider that the model has no “reasoning module” that would associate those two labels. Thus, we believe this can be addressed by a post-processing module in specialized solutions.
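Such a module is not part of this work; as a hedged sketch of the rule we have in mind, a detected firearm whose box intersects a person labeled as unarmed could trigger a relabeling to armed, as in the illustration below (the detection format is hypothetical).

```python
# Sketch of a possible post-processing rule (not implemented in this work):
# if a detected firearm box intersects a person box labeled "unarmed",
# relabel that person as "armed". Detections are assumed to be dicts with
# "label" and "box" = (x1, y1, x2, y2); this format is illustrative.
def boxes_intersect(a, b):
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def reconcile_labels(detections):
    firearms = [d for d in detections if d["label"] == "firearm"]
    for det in detections:
        if det["label"] == "unarmed" and any(
            boxes_intersect(det["box"], f["box"]) for f in firearms
        ):
            det["label"] = "armed"  # an overlapping firearm suggests the person holds it
    return detections
```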
The second important pattern is that the models appeared to associate a person hiding their face with being armed, which holds in many scenarios. It is not uncommon to find cases of people wearing hoods, masks, or bike helmets during robberies, so this association was mostly positive for the model. However, we also identified that, in unclear cases, a person with their back to the camera tended to be labeled as armed. This introduces several false positives into our predictions, but considering the final goal of being usable in real surveillance systems, we regard this as a lower priority compared to false negatives, i.e., cases when an armed person is not detected.
Although we sought diversity, a limitation of FiDaSS is that we still noticed some inherent biases consistently reflected in our results. One issue identified was that, in unclear scenarios where the video quality makes it hard to discern where the firearm is, the models tend to mark a man as “armed”, even if it was a woman who was holding a gun. This characteristic shows that, even after including comprehensive data from different cultures, our dataset still contains an overwhelming number of examples of men holding guns compared to a minority of instances with armed women.
For future work, we want to expand FiDaSS further with new and unexplored data from more countries and cultures to introduce even more diversity of images. Moreover, we want to provide more substantial and representative data to avoid social biases. We expect that the proposed tools will make the task of enhancing the dataset easier and more efficient. Finally, we are also interested in exploring more specialized approaches that may achieve better detection results on our task, given the low results we achieved with general-purpose models.
7 CONCLUSIONS
This work introduced a novel and challenging dataset
for firearm threat detection, focusing on object detec-
tion in real-world scenarios. Our experiments, on both individual frames and video sequences, resulted in very low AP scores, highlighting the dataset's difficulty. The dataset contains 7450 annotated frames
from diverse cultures, environments, and situations
and is easily extendable using the provided tools.
We hope our work will help stimulate the research
area and provide a challenging dataset that could as-
sist in comparing the performance of different works. Finally,
besides promoting research in the area, we hope to
contribute to security in our everyday lives.
ACKNOWLEDGEMENTS
This research is partially funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Research supported by HP Brasil Indústria e Comércio de Equipamentos Eletrônicos Ltda. using financial incentives of the IPI refund regarding the Law (Law nº 8.248 of 1991). Manssour would like to thank the financial support of the CNPq Scholarship - Brazil (303208/2023-6).
While preparing and revising this manuscript, we
used ChatGPT and Grammarly to ensure clarity and
grammatical precision, as English is our second lan-
guage. The authors are responsible for creating the
entire content and ensuring technical accuracy.
REFERENCES
Ainsworth, T. (2002). Buyer beware. Security Oz, 19:18–
26.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In European conference on
computer vision, pages 213–229. Springer.
Darker, I., Gale, A., Ward, L., and Blechko, A. (2007). Can
cctv reliably detect gun crime? In 2007 41st Annual
IEEE International Carnahan Conference on Security
Technology, pages 264–271. IEEE.
de Azevedo Kanehisa, R. F. and de Almeida Neto, A.
(2019). Firearm detection using convolutional neural
networks. In ICAART (2), pages 707–714.
Donald, F. M. and Donald, C. H. (2015). Task disengage-
ment and implications for vigilance performance in
cctv surveillance. Cognition, Technology & Work,
17(1):121–130.
Fang, Y., Sun, Q., Wang, X., Huang, T., Wang, X., and Cao,
Y. (2023). Eva-02: A visual representation for neon
genesis. arXiv preprint arXiv:2303.11331.
Gelana, F. and Yadav, A. (2019). Firearm detection from
surveillance cameras using image processing and ma-
chine learning techniques. In Smart innovations in
communication and computational sciences, pages
25–34. Springer.
Gius, M. (2018). The effects of state and federal gun control
laws on school shootings. Applied economics letters,
25(5):317–320.
González, J. L. S., Zaccaro, C., Álvarez-García, J. A., Morillo, L. M. S., and Caparrini, F. S. (2020). Real-time gun detection in cctv: an open problem. Neural networks.
Grega, M., Lach, S., and Sieradzki, R. (2013). Automated
recognition of firearms in surveillance video. In Cog-
nitive Methods in Situation Awareness and Decision
Support (CogSIMA), 2013 IEEE International Multi-
Disciplinary Conference on, pages 45–50.
Gu, Y., Liao, X., and Qin, X. (2022). Youtube-gdd: A chal-
lenging gun detection dataset with rich contextual in-
formation. arXiv preprint arXiv:2203.04129.
Hnoohom, N., Chotivatunyu, P., and Jitpattanakul, A.
(2022). Acf: an armed cctv footage dataset for en-
hancing weapon detection. Sensors, 22(19):7158.
Hurka, S. and Knill, C. (2020). Does regulation matter? a
cross-national analysis of the impact of gun policies
on homicide and suicide rates. Regulation & Gover-
nance.
IMFDB (2015). Internet movie firearms database. http://www.imfdb.org/index.php?title=Main_Page&oldid=911151. Last accessed in 18/01/2021.
Ineneji, C. and Kusaf, M. (2019). Hybrid weapon detection
algorithm, using material test and fuzzy logic system.
Computers & Electrical Engineering, 78:437–448.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). YOLO by
Ultralytics.
Lang, S., Ventola, F., and Kersting, K. (2021). Dafne: A
one-stage anchor-free approach for oriented object de-
tection. arXiv e-prints, pages arXiv–2109.
Lemieux, F. (2014). Effect of gun culture and firearm
laws on gun violence and mass shootings in the united
states: A multi-level quantitative analysis. Interna-
tional Journal of Criminal Justice Sciences, 9(1):74.
Lim, J., Al Jobayer, M. I., Baskaran, V. M., Lim, J. M.,
Wong, K., and See, J. (2019). Gun detection in
surveillance videos using deep neural networks. In
2019 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA
ASC), pages 1998–2002. IEEE.
Loganathan, S., Kariyawasam, G., and Sumathipala, P.
(2019). Suspicious activity detection in surveillance
footage. In 2019 International Conference on Elec-
trical and Computing Technologies and Applications
(ICECTA), pages 1–4. IEEE.
Mehta, P., Kumar, A., and Bhattacharjee, S. (2020). Fire
and gun violence based anomaly detection system us-
ing deep neural networks. In 2020 International Con-
ference on Electronics and Sustainable Communica-
tion Systems (ICESC), pages 199–204. IEEE.
Olmos, R., Tabik, S., and Herrera, F. (2018). Automatic
handgun detection alarm in videos using deep learn-
ing. Neurocomputing, 275:66–72.
Pawar, M., Dhanki, M., Parkar, S., Dandekar, C., and Gupta,
B. (2019). A novel approach to detect crimes and as-
sist law enforcement agency using deep learning with
cctvs and drones.
Piza, E. L., Welsh, B. C., Farrington, D. P., and Thomas,
A. L. (2019). Cctv surveillance for crime preven-
tion: A 40-year systematic review with meta-analysis.
Criminology & Public Policy, 18(1):135–159.
Raturi, G., Rani, P., Madan, S., and Dosanjh, S. (2019).
Adocw: An automated method for detection of con-
cealed weapon. In 2019 Fifth International Confer-
ence on Image Information Processing (ICIIP), pages
181–186. IEEE.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Sultani, W., Chen, C., and Shah, M. (2018). Real-world
anomaly detection in surveillance videos. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6479–6488.
Velastin, S. A., Boghossian, B. A., and Vicencio-Silva,
M. A. (2006). A motion-based image processing
system for detecting potentially dangerous situations
in underground railway stations. Transportation Re-
search Part C: Emerging Technologies, 14(2):96–113.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni,
L. M., and Shum, H.-Y. (2022). Dino: Detr with im-
proved denoising anchor boxes for end-to-end object
detection. arXiv preprint arXiv:2203.03605.
Zhou, Q., Li, X., He, L., Yang, Y., Cheng, G., Tong, Y., Ma,
L., and Tao, D. (2022). Transvod: end-to-end video
object detection with spatial-temporal transformers.
IEEE Transactions on Pattern Analysis and Machine
Intelligence.