first-person cameras positioned at eye level for each of the generated workers. Once the construction site and the actors have been generated, each of the deployed cameras is activated in turn and the annotations corresponding to the entities present inside the frames are saved. The stored data comprise the 2D and 3D bounding boxes, the distance between each entity and the camera, and the body joints of the workers. Once the iteration over all the generated cameras is finished, the entities are deleted and the playable character is moved to the next position to be visited on the map, until all positions in the list have been processed.
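A minimal sketch of this acquisition loop is given below; the function and class names (e.g., teleport_player, spawn_site, deploy_cameras, save_view) are hypothetical and only illustrate the flow described above, since the plugin runs inside the game engine rather than as standalone Python code.

    # Illustrative sketch of the acquisition loop; all names are assumed.
    def acquire_dataset(positions, camera_configs):
        for position in positions:
            teleport_player(position)                 # move the playable character
            site, actors = spawn_site(position)       # generate props, vehicles and workers
            cameras = deploy_cameras(site, actors, camera_configs)
            for camera in cameras:
                camera.activate()                     # make it the active game camera
                frame = camera.capture()              # screen capture of the current view
                labels = annotate(actors, site, camera)  # 2D/3D boxes, distances, joints
                save_view(camera.id, frame, labels)   # one JPG + one annotation text file
                camera.deactivate()
            despawn(site, actors)                     # delete entities before moving on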
As described above, the generated cameras can be grouped into three categories: third-person cameras (TPV), first-person cameras (FPV), and cameras positioned on vehicles.
The third-person cameras mimic the cameras that are typically placed on the perimeter of the construction site to obtain a top view of the area (usually for surveillance purposes). These cameras are positioned at a height of 6 meters and simulate wide-angle cameras with a Field of View (FoV) of 120 degrees. The images are acquired at a resolution of 1280x720 pixels.
The first-person cameras represent the cameras that in real settings can be worn by the workers (e.g., on the helmets). A first-person camera is simulated for each worker generated within the construction site. The workers “on the ground” were not equipped with a first-person camera, since their lying posture caused the acquired frames to contain artifacts due to interpenetration with the ground. The first-person cameras, simulating the cameras mounted on the helmets, have a FoV of 64.67 degrees in order to simulate a HoloLens 2 camera. The images captured by the first-person cameras are acquired at a resolution of 1280x720 pixels.
Four cameras are positioned on each vehicle; they are placed at an elevated position and rotated so as to cover everything that surrounds the vehicle. The cameras were positioned as if they were physically mounted at the four corners of the vehicle, with the aim of understanding whether a worker is too close to a moving vehicle. The cameras positioned on the vehicles have a FoV of 120 degrees and acquire images at a resolution of 1280x720 pixels.
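The camera parameters reported above can be summarized as follows; the dictionary layout and key names are illustrative assumptions rather than the plugin's actual configuration format.

    # Camera parameters reported in the text, collected for clarity (layout is assumed).
    CAMERA_CONFIGS = {
        "TPV": {                      # perimeter surveillance-style cameras
            "height_m": 6.0,
            "fov_deg": 120.0,
            "resolution": (1280, 720),
        },
        "FPV": {                      # helmet-mounted worker cameras (HoloLens 2-like FoV)
            "fov_deg": 64.67,
            "resolution": (1280, 720),
        },
        "VEHICLE": {                  # four cameras per vehicle, one at each corner
            "cameras_per_vehicle": 4,
            "fov_deg": 120.0,
            "resolution": (1280, 720),
        },
    }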
While the plugin is running, different views of the
scene are displayed, one for each acquisition point.
These views are obtained by activating, deactivating
and moving the created virtual cameras. For each
generated view two files are created: a screen cap-
ture saved in JPG format and a text file containing the
annotations for each entity present within the view.
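As a hedged illustration of this per-view output, the sketch below writes one JPG screen capture and one annotation text file per view; the field names and the JSON-lines layout are assumptions and do not reflect the plugin's actual file format.

    # Illustrative per-view output writer; record fields and layout are assumed.
    import json

    def save_view(view_id, frame, entities, out_dir="output"):
        frame.save(f"{out_dir}/{view_id}.jpg")            # screen capture in JPG format
        with open(f"{out_dir}/{view_id}.txt", "w") as f:  # annotation text file
            for e in entities:
                record = {
                    "class": e.class_name,       # one of the 12 classes listed below
                    "bbox_2d": e.bbox_2d,        # 2D bounding box in pixel coordinates
                    "bbox_3d": e.bbox_3d,        # 3D bounding box in world coordinates
                    "distance_m": e.distance,    # camera-to-entity distance
                    "joints": e.joints,          # body joints (workers only)
                }
                f.write(json.dumps(record) + "\n")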
The 2D and 3D bounding boxes are labeled with
the following 12 classes: head with work helmet,
worker, torso with high visibility vest, pneumatic
hammer, vehicle, head without work helmet, torso
without high visibility vest, cone, worker on the
ground, shovel, wheelbarrow, concrete mixer.
For each labeled entity, the distance from the camera was measured as the length of the segment connecting the camera position to the center of the 3D bounding box of the entity.
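In other words, denoting the camera position by $\mathbf{c} = (c_x, c_y, c_z)$ and the center of the entity's 3D bounding box by $\mathbf{b} = (b_x, b_y, b_z)$, the stored distance is the Euclidean norm $d = \lVert \mathbf{c} - \mathbf{b} \rVert_2 = \sqrt{(c_x - b_x)^2 + (c_y - b_y)^2 + (c_z - b_z)^2}$.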
The worker joints that have been labeled are the
following: nose, neck, left clavicle, right clavicle,
left thigh, right thigh, left knee, right knee,
left ankle, right ankle, left wrist, right wrist,
left elbow, right elbow.
4 BENCHMARK
We tested the quality of the synthetic dataset generated by the proposed plugin by running benchmarks on three tasks: safety compliance through object detection, fall detection through pose estimation, and distance regression from monocular view. Experiments have been performed on both the synthetic dataset and annotated real data. The latter have been used for fine-tuning and for the evaluation of the algorithms.
4.1 Dataset
The dataset was collected by generating 200 build-
ing sites within the game map. In total, 76,580
frames were generated, with 44,580 in FPV (work-
ers), 16,000 frames in FPV (vehicles) and 16,000
frames in TPV (construction site corners). The
dataset contains 2,438,566 labels distributed as follows: 333,856 workers, 168,686 heads with helmet, 165,867 torsos with high visibility vest, 20,454 pneumatic hammers, 35,850 vehicles, 165,170 heads without helmet, 167,989 torsos without high visibility vest, 1,085,729 cones, 135,084 workers on the ground, 78,584 shovels, 39,999 wheelbarrows and 41,298 concrete mixers. Some of these images were discarded for occlusions or other glitches, bringing the final count to 51,081 synthetic images split into training (30,019) and validation (21,062) sets.
The final dataset also contains a total of 9,698 real images split into a training set (9,212) and a validation set (486).