taneously with a single CNN framework. The last ap-
proach first detects the worker and uses the cropped
image as input to a classifier that detects the presence
of PPE. The paper uses ∼1,500 self-obtained and an-
notated images to train the three approaches. The sec-
ond approach achieved the best performance, with a
mAP of 72.3% and 11 FPS on a laptop. The first ap-
proach, although having lower mAP (63.1%), attained
13 FPS.
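The detect-then-classify pipeline described above can be sketched in a few lines; the detector and classifier functions below are hypothetical stand-ins for trained models, not the paper's implementation:

```python
def crop(image, box):
    """Crop a region (x1, y1, x2, y2) from an image stored as a
    list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def detect_ppe_two_stage(image, detect_workers, classify_ppe):
    """Two-stage pipeline: first detect workers, then run a PPE
    classifier on each cropped worker region.
    `detect_workers` and `classify_ppe` are placeholders for any
    trained detection/classification models."""
    results = []
    for box in detect_workers(image):
        worker_crop = crop(image, box)
        results.append((box, classify_ppe(worker_crop)))
    return results
```

The second stage only ever sees worker crops, which is what lets the classifier specialize on PPE presence rather than person localization.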
The paper (Fang et al., 2018) proposed a system
to detect non-hard-hat use with deep learning from a
far distance. The exact range was not specified; only
small, medium, and large distances are mentioned.
Faster R-CNN (Ren et al., 2015) was used due to its ability
to detect small objects. More than 100,000 images
were taken from numerous surveillance recordings at
25 different construction sites. Out of these, 81,000
images were randomly selected for training. Object
detection is outdoors, and the images were captured
at different periods, distances, weather, and postures.
The precision and recall in the majority of conditions
are above 92%. At short distances, similar to the
setting we will employ in this work, they achieved a
precision of 98.4% and a recall of 95.9%, although
the system only checks a single class (helmet).
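For reference, the precision and recall figures reported by these works follow the standard definitions over true positives (TP), false positives (FP), and false negatives (FN); a minimal helper:

```python
def precision_recall(tp, fp, fn):
    """Standard definitions:
    precision = TP / (TP + FP), recall = TP / (TP + FN).
    Returns (0.0, 0.0) components when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A high precision with a lower recall, as in the short-distance results above, indicates few false alarms but some missed helmets.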
In (Zhafran et al., 2019), Faster R-CNN is also
used to detect helmets, masks, vests, and gloves. The
dataset consists of 14,512 images, where 13,000 were
used to train the model. The detection accuracy was
measured at various distances and lighting conditions,
against a distracting background. At a 1-meter
distance, helmets were detected with 100% accuracy,
masks with 79%, vests with 73%, and gloves with
68%. Longer distances yielded poorer results, and
detection at a 5-meter distance or beyond was deemed
unfeasible for all classes.
Reduced light intensity was also observed to have a
high impact on the smaller objects (masks, gloves).
The paper (Delhi et al., 2020) developed a frame-
work to detect the use of PPE on construction work-
ers using YOLOv3. They suggested a system where
two objects (hat, jacket) are detected and divided into
four categories: NOT SAFE, SAFE, NoHardHat, and
NoJacket. The dataset contains 2,509 images from
video recordings of construction sites. The model
reported an F1 score of 0.96. The paper also devel-
oped an alarm system to bring awareness whenever a
non-safe category was detected.
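Since the four categories are determined entirely by which of the two objects is detected, the categorization and alarm trigger reduce to a simple decision; a hypothetical sketch of one plausible reading of that logic (function names are our own, not from the paper):

```python
def safety_category(has_hardhat, has_jacket):
    """Map the presence of the two detected objects to the four
    categories used by (Delhi et al., 2020)."""
    if has_hardhat and has_jacket:
        return "SAFE"
    if not has_hardhat and not has_jacket:
        return "NOT SAFE"
    return "NoJacket" if has_hardhat else "NoHardHat"

def should_alarm(category):
    """The alarm fires for any non-safe category."""
    return category != "SAFE"
```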
In (Wójcik et al., 2021), hard hat wearing detection
is carried out based on the separate detection
of people and hard hats, coupled with person head
keypoint localization. Three different models were
tested, ResNet-50, ResNet-101, and ResNeXt-101. A
public dataset of 7,035 images was used. ResNeXt-
101 achieved the best average precision (AP) at 71%
for persons with hardhat detection, but the AP for
persons without hard hats was 64.1%.
Figure 3: Examples of images from volunteers to be used
as test images of the evaluation phases (see Section 3.3).
3 METHODOLOGY
3.1 Data Acquisition
The five object classes to be detected in this work
include hardhat, safety vest, safety gloves, safety
glasses, and hearing protection. Two data sources
have been used to gather data to train our system.
A dataset from kaggle.com¹ (Figure 2, left) with
labeled hardhats and safety vests was utilized,
containing 3,541 images, where 2,186 were positive, i.e. with
hardhats, safety vests, or both present. Negative im-
ages are pictures of people in different activities in-
doors and outdoors. An additional 2,700 images were
collected and annotated from Google Images with dif-
ferent keyword searches to complement the remain-
ing classes (Figure 2, right), given the impossibility
of finding a dataset containing all target classes. Pre-
liminary tests showed that, for example, caps could be
predicted as hardhats. This was solved by adding im-
ages of caps as negative samples of the training set to
help the model to learn the difference. The available
data has been augmented with different modifications
(saturation, brightness, contrast, blur, and sharpness)
to increase the amount of training data for some
underrepresented classes, as will be explained in the
presentation of results in Section 3.3.
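The augmentations listed above can be illustrated with simple per-pixel operations; the sketch below shows brightness and contrast adjustment on a grayscale image stored as a flat list of 0–255 values, as a stand-in for whatever image-processing library is actually used:

```python
def clamp(v):
    """Keep a pixel value inside the valid 0-255 range."""
    return max(0, min(255, int(round(v))))

def adjust_brightness(pixels, factor):
    """Multiply every pixel by `factor` (>1 brightens, <1 darkens)."""
    return [clamp(p * factor) for p in pixels]

def adjust_contrast(pixels, factor):
    """Scale each pixel's distance from the image mean by `factor`
    (>1 increases contrast, <1 flattens it)."""
    mean = sum(pixels) / len(pixels)
    return [clamp(mean + (p - mean) * factor) for p in pixels]
```

Applying several such transforms to each image of an underrepresented class multiplies its effective training samples without new annotation effort.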
3.2 System Overview
Several algorithms exist for object detection, e.g.:
RetinaNet (Lin et al., 2020), Single-Shot MultiBox
Detector (SSD) (Liu et al., 2016), and YOLOv3 (Red-
mon and Farhadi, 2018). Comparatively, RetinaNet
usually has a better mean average precision (mAP),
but it is too slow for real-time applications (Tan et al.,
2021). SSD has a mAP between RetinaNet and
YOLOv3. YOLOv3 has the worst mAP but is faster
and suits real-time applications. In addition, it is the
fastest to train, so one can adapt quickly to changes
¹ https://www.kaggle.com/datasets/johnsyin97/hardhat-and-safety-vest-image-for-object-detection
Visual Detection of Personal Protective Equipment and Safety Gear on Industry Workers