Instance Segmentation and Detection of Children to Safeguard
Vulnerable Traffic User by Infrastructure
Shiva Agrawal¹, Savankumar Bhanderi¹, Sumit Amanagi¹, Kristina Doycheva² and Gordon Elger¹,²

¹Institute for Innovative Mobility (IIMo), Technische Hochschule Ingolstadt, Germany
²Fraunhofer IVI, Applied Center Connected Mobility and Infrastructure, Ingolstadt, Germany

ORCIDs: S. Agrawal: https://orcid.org/0000-0001-8633-341X; S. Bhanderi: https://orcid.org/0000-0001-7257-6736; S. Amanagi: https://orcid.org/0000-0003-0132-8115; K. Doycheva: https://orcid.org/0000-0002-3340-7048; G. Elger: https://orcid.org/0000-0002-7643-7327
Keywords: Child and Adult Detection, Classification, Intelligent Roadside Infrastructure, Image Segmentation, Mask-RCNN, Traffic Flow Optimization, Transfer Learning.
Abstract: Cameras mounted on intelligent roadside infrastructure units and vehicles can detect humans on the road using state-of-the-art perception algorithms, but these algorithms are presently not trained to distinguish between a child and an adult. However, this distinction is a crucial requirement from a safety perspective because a child may not follow all the traffic rules, particularly while crossing the road. Moreover, a child may stop or start playing on the road. In such situations, the separation of a child from an adult is necessary. The work in this paper aims to solve this problem by applying a transfer-learning-based neural network approach to classify children and adults separately in camera images. The described work comprises image data collection, data annotation, transfer-learning-based model development, and evaluation. For the work, Mask-RCNN (region-based convolutional neural network) with different backbone architectures and two different baselines is investigated, and the perception precision of the architectures after transfer learning is compared. The results reveal that the best-performing trained model is able to detect and classify children and adults separately in different road scenarios with a segmentation mask AP (average precision) of 85% and a bounding box AP of 92%.
1 INTRODUCTION
Intelligent roadside infrastructure units usually comprise one or more sensors to detect, classify and predict the motion of various road users. Among all the sensors, the camera is a very important one because it has the unique ability to distinguish different colours, shapes, sizes, textures, and types of objects. Hence, it is widely used to classify various road users such as pedestrians, bicycles, motorbikes, cars, trucks, buses, animals, static objects, etc. This classification helps intelligent roadside infrastructure units to distinguish between critical and non-critical situations arising on the road. For example, if a pedestrian is detected crossing the road when the traffic lights of the vehicle lane are
red, then it is a normal condition, but if the pedestrian is detected crossing the road when the traffic lights of the vehicle lane are green, then this is a critical situation. In such a condition, an intelligent roadside infrastructure unit has to immediately send a warning signal to passing vehicles in order to avoid accidents and save human lives. Such a situation may not arise often when an adult is crossing the road, but it is quite possible when a child is crossing the road.
A child may not follow or understand all the traffic rules. Also, a child may cross the road at any time or may stop or start playing in the middle of the road. Hence, it is important that intelligent roadside infrastructure units are able to detect and classify pedestrians separately as adults or children to increase their safety on the road and to avoid accidents. A camera can recognize various road users better than other sensors, so it is well suited to recognize a human (or pedestrian) as a child or an adult.
Figure 1: Both adults and children are detected only as persons in the camera image by different state-of-the-art AI-based detectors: (a) original image, (b) YOLO-v7, (c) SSD, (d) Faster R-CNN, (e) Mask-RCNN, (f) RetinaNet. Figure 1a is taken from an open source (Productions, 2021).
The results from traditional computer vision algorithms in this domain are limited, but artificial intelligence (AI) based computer vision algorithms are very good at detecting and classifying various road users from camera images. The latest state-of-the-art algorithms like Faster-RCNN (region-based convolutional neural network) (Ren et al., 2015), SSD (Single Shot Detector) (Liu et al., 2016), Mask-RCNN (He et al., 2017), RetinaNet (Lin et al., 2017b) and YOLOv7 (You Only Look Once, version 7) (Wang et al., 2022) are widely used for road user classification in many research and commercial applications. But as highlighted in figure 1, none of these state-of-the-art object detectors is able to recognize whether the detected person is an instance of a child or an adult.
The work in this paper focuses on solving this problem by using a transfer learning (Zhuang et al., 2020) approach. Among the state-of-the-art models, the Mask-RCNN model is selected for the work because it has the ability to generate instance segmentation (an object mask) along with the bounding box and classification output. The literature survey during the work concluded that there is no public dataset available for such a problem. Hence, images from various sources containing humans (both children and adults) were collected and annotated using a semi-automatic labelling framework. For specific images, manual labelling of the data was also performed. Thereafter, the pre-trained Mask-RCNN network is modified to adapt it for a two-class object detection and segmentation task and then trained on the dataset using six different feature extraction backbone architectures including two different baselines. All the trained models are evaluated, and the best-performing model is selected for the traffic flow optimization use case of the intelligent roadside infrastructure (Agrawal et al., 2022).
This paper is outlined as follows: section 2 provides insights into transfer learning, Mask-RCNN, and the Detectron2 framework. Section 3 describes the approach and method of child and adult detection and instance segmentation, which includes the process of data collection, data annotation, dataset generation, AI-based model development, training, and testing. Section 4 provides the results of the proposed method, and finally the conclusion is given.
2 TECHNICAL BACKGROUND
WITH RELATED WORK
2.1 Transfer Learning
Transfer learning (Zhuang et al., 2020) is an approach widely used in deep learning applications. Training deep learning architectures requires a very large amount of labelled data, and for every application, the collection of such a huge amount of data is not practically possible. With the help of transfer learning, one can use the pre-trained weights and deep learning model designed for one application in another, similar application with some modifications. In this approach, the amount of new data required for training is comparatively low and can still provide good results with high accuracy.
Figure 2 visually describes the concept of transfer learning. The source model is the original model, from either the same domain or another domain, that has previously been trained using a large amount of labelled data. The knowledge, in the form of the trained weights and architecture of this source model, is then transferred and used partially or fully to train another model, known as the target model. To train this target model, a comparatively small amount of labelled data is required to reach high performance metrics, because the target model is not trained from scratch but on top of the available knowledge of the source model.
Figure 2: Overview of transfer learning approach.
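As a minimal illustration of this idea (a sketch assuming a recent torchvision; the two-class head mirrors the child/adult task of this paper but is not the exact training setup described later), the source model's weights are frozen and only a new output layer is trained:

```python
import torch
import torch.nn as nn
from torchvision import models

# Source model: ResNet-50 pre-trained on ImageNet (knowledge from the source domain).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze all pre-trained layers so the transferred knowledge stays unchanged.
for param in model.parameters():
    param.requires_grad = False

# Target model: replace only the final fully connected layer for a new
# two-class problem (here: child vs. adult).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head is optimised, which is why a comparatively small
# labelled dataset is sufficient.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
```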
Transfer learning is applied successfully in many fields including medical imaging, aerospace, natural language processing, audio processing, autonomous vehicle development, etc. As a result, the available literature in this area is very vast, and hence only some of the work from different fields is cited here. For example, in (Liang et al., 2016) aerial images are classified using transfer learning for remote sensing image understanding tasks. The work stated in (Cao et al., 2013) used transfer learning for better and more accurate detection of pedestrians from camera images. The authors in (Hu and Yang, 2011) used this approach to identify and predict various human activities. In (Alzubaidi et al., 2020), the authors used transfer learning to classify images to detect breast cancer, and in (Kocmi and Bojar, 2018) it is used for a low-resource neural-network-based machine translation application. Similarly, the work stated in (Gáti and Kiss, 2021) represents sound signals as images and then uses transfer learning to classify sounds through an image classifier as the source model.
2.2 Mask-RCNN
Mask-RCNN (He et al., 2017) is a state-of-the-art neural network model which is widely used in image-based object detection and image segmentation applications. This model consists of a backbone network that performs feature extraction (high level to low level) from an input image. These extracted features, along with pre-defined anchors, are fed into a region proposal network (RPN) followed by an ROI (region of interest) alignment block to get fixed-size proposals. These proposals are then passed through a series of fully connected layers to generate object class probabilities and bounding box regressions, and also through a series of convolutional layers to predict a binary mask for each detected object. The backbone of Mask-RCNN that is responsible for feature extraction can be taken from multiple state-of-the-art classifier models. Among them, the most widely used are different variants of ResNet (residual network) (He et al., 2016) with an FPN (feature pyramid network) (Lin et al., 2017a).
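As an illustration of these per-instance outputs (a sketch using torchvision's reference Mask-RCNN implementation, not the Detectron2 variant used later in this paper; the input tensor is a dummy placeholder):

```python
import torch
from torchvision.models.detection import (maskrcnn_resnet50_fpn,
                                          MaskRCNN_ResNet50_FPN_Weights)

# Mask-RCNN with a ResNet-50 + FPN backbone, pre-trained on COCO.
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

# Dummy RGB image tensor in [0, 1]; in practice this is a real camera image.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])[0]

# Per detected instance: a class label with a confidence score,
# a regressed bounding box, and a per-instance segmentation mask.
print(outputs["boxes"].shape)   # (N, 4) bounding boxes
print(outputs["labels"].shape)  # (N,) class indices
print(outputs["scores"].shape)  # (N,) confidence scores
print(outputs["masks"].shape)   # (N, 1, H, W) instance masks
```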
Mask-RCNN together with transfer learning is used in many applications. For example, the work described in (Zhang et al., 2020) used transfer learning together with Mask-RCNN to detect damaged vehicles using camera images. The authors in (Doğru et al., 2020) used the same combination of Mask-RCNN with transfer learning to develop a target model with a small amount of labelled data to detect dents on the upper body of aircraft for automatic maintenance. Similarly, the authors in (Shenavarmasouleh et al., 2021) used this combination in the medical field to detect lesions in fundus images to address the diabetic retinopathy problem.
From this literature survey, it is evident that transfer learning is a very powerful method that has generated good results with many different deep learning models (as source models), including Mask-RCNN. Hence, in this paper, this powerful and proven combination of transfer learning with Mask-RCNN is used to develop the target model to classify children and adults as two distinct classes in camera images for intelligent roadside infrastructure applications. In addition, no directly related work was found in the literature that solves this specific issue, which further motivated the proposed work of this paper.
2.3 Detectron2
Detectron2 (Wu et al., 2019) is a widely used open-source modular software framework from the AI group of Facebook. It is implemented in PyTorch (Paszke et al., 2019), a Python library for AI-based development. It provides an easy-to-use interface and requires low computation time on single- or multi-GPU systems. Detectron2 is the successor of Detectron, which started with the Mask-RCNN benchmark. Hence, Mask-RCNN is available in the Detectron2 framework, and with this framework, the process of using Mask-RCNN together with transfer learning is comparatively easy. Therefore, Mask-RCNN within the Detectron2 framework is used in this work.

Figure 3: Proposed method for detection and instance segmentation of a child and an adult.
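As a brief example of how a COCO-pre-trained Mask-RCNN is obtained from the Detectron2 model zoo and run on an image (a sketch; the image path and score threshold are illustrative):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Build a config for COCO-pre-trained Mask-RCNN with a ResNet-50 + FPN backbone.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # illustrative confidence threshold

predictor = DefaultPredictor(cfg)

image = cv2.imread("street_scene.jpg")  # hypothetical input image
instances = predictor(image)["instances"]
# instances.pred_boxes, instances.pred_classes, instances.scores, and
# instances.pred_masks hold the detection and segmentation outputs.
```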
3 PROPOSED METHOD
The complete pipeline of the proposed method for detection and instance segmentation of a person as a child or an adult is highlighted in figure 3. Various images of children and adults in road environments were collected during the image data collection process. All the collected images were annotated by a semi-automatic labelling pipeline to generate a dataset for the work. The resulting dataset was then split into training and test samples for the training purpose.
In the transfer-learning-based approach, the Mask-RCNN model pre-trained on the COCO (common objects in context) dataset (Lin et al., 2014) is used as the source model. The target model, i.e. the model described in this work, is then obtained by freezing all the initial layers of the source model and modifying the classification, bounding box regression, and mask generation layers for two categories of objects, i.e. a child and an adult. The target model is then trained on a relatively small dataset of child and adult images. For each feature extraction backbone architecture used during the work, the target model is derived from the corresponding source model in the same manner for training and evaluation.
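In Detectron2, this adaptation amounts to a few config changes; a minimal sketch (the freeze level and class count follow this paper, the rest of the training setup is given in section 3.4):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
# Start from COCO-pre-trained weights (the source model).
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

# Freeze the first two stages of the backbone; later layers stay trainable.
cfg.MODEL.BACKBONE.FREEZE_AT = 2

# Re-dimension the classification, box regression, and mask heads
# for the two target categories: child and adult.
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2
```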
3.1 Data Collection
As stated before, for this work, a comparatively small
dataset is required because of the transfer learning ap-
proach, but none of the publicly available datasets has
labels for a person as an adult or a child. Hence,
for this work, at first, images containing instances
of children and adults were collected from many
open-source websites and publicly available datasets.
Please note that the images from public datasets con-
taining adults and children are available only as a per-
son.
These images were collected in a variety of set-
tings and poses for instances of adults and children
in road environments. With further refinement, a to-
tal of 506 images were selected that were categorized
as images containing instances of only children, im-
ages containing instances of only adults, and images
containing instances of both children and adults.
Figure 4: Semi-automatic labelling framework for data annotation. Panels (4a, source (Wolf, 2021)) to (4d) show the labelling process for images containing only children, panels (4e, source (Han, 2020)) to (4h) for images containing only adults, and panels (4i, source (Productions, 2021)) to (4l) for images containing both adults and children.
3.2 Data Annotation
To annotate the collected images, a semi-automatic labelling framework was developed. This framework is depicted in figure 4. Please note that figures 4d, 4h, and 4l are not outputs of the final model; rather, they show the label transfer from person to child or adult for generating the annotations using the semi-automatic labelling framework.

At first, the state-of-the-art Mask-RCNN model pre-trained on the COCO dataset with 80 different classes, including persons, is used within the Detectron2 framework to generate the bounding box, class, score, and object mask for all 80 classes. Then only the instances of persons are preserved and the rest are removed. Depending on the type of image used as input, the label of each person is then transferred to child or adult. For example, as shown in figure 4, images 4a to 4d depict this complete semi-automatic labelling pipeline for images containing only children (one or more instances). The original image in figure 4a is fed into the Mask-RCNN model, which results in figure 4b; then only the instances of persons are kept, as shown in figure 4c. Finally, the label of each person is transferred to child, as shown in figure 4d. Similarly, the pipeline for images containing only adults (one or more instances) is shown from figure 4e to figure 4h.
For the labelling of images containing instances of both children and adults, the same pipeline is used to generate person labels, but then each instance in the image is checked manually and the labels are transferred to either child or adult. Due to this manual intervention, the proposed framework is called a semi-automatic labelling framework. After generating the annotations through the aforementioned methodology, a manual refinement step is performed to account for cases where the network might have failed to detect a person. This was followed by a final manual validation step, after which the annotations were saved in the COCO format for training purposes.
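A condensed sketch of this label-transfer step follows (the person class index is Detectron2's COCO convention; the function name and the image-type flag are hypothetical helpers for illustration):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

PERSON_CLASS_ID = 0  # "person" in Detectron2's COCO metadata

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)

def propose_labels(image_path, image_type):
    """Run the COCO-pre-trained model, keep only person instances, and
    transfer the label according to the image type ('child' or 'adult').
    Images containing both classes are relabelled manually afterwards."""
    instances = predictor(cv2.imread(image_path))["instances"].to("cpu")
    persons = instances[instances.pred_classes == PERSON_CLASS_ID]
    return [{"box": box.tolist(), "mask": mask.numpy(), "label": image_type}
            for box, mask in zip(persons.pred_boxes.tensor, persons.pred_masks)]
```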
3.3 Dataset Summary
Figure 5 illustrates the summary of the generated dataset and its distribution. The entire dataset consists of 506 images, of which 454 are selected as training samples and 52 as testing samples, keeping a 90%-10% train-test split. Further, figure 5a shows the distribution of training and test samples for each type of image, and figure 5b provides the instance-wise distribution of the training and test data.

Figure 5: Dataset summary. (a) Image type distribution. (b) Class distribution.
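Since the annotations are stored in COCO format, the splits can be registered with Detectron2 directly; a short sketch (dataset names, JSON files, and directories are placeholders for the actual dataset location):

```python
from detectron2.data.datasets import register_coco_instances

# Register the 454-image training split and the 52-image test split.
register_coco_instances("child_adult_train", {},
                        "annotations/train.json", "images/train")
register_coco_instances("child_adult_test", {},
                        "annotations/test.json", "images/test")
```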
Table 1: Configuration parameters.

Parameter    | Value                  | Description
base_lr      | 0.01                   | learning rate for updating the weights
batch_size   | 4                      | number of images in a batch
freeze_at    | 2                      | freeze the first freeze_at layers
gamma        | 0.1                    | learning rate decay factor
steps        | [5000, 15000, 20000]   | iterations at which base_lr is decayed by gamma
optimiser    | SGD with 0.9 momentum  | network optimiser
max_iters    | 30000                  | maximum iterations to train
eval_period  | 500                    | evaluation every eval_period iterations
warmup_iters | 500                    | number of iterations to increase the learning rate to base_lr
warmup_type  | linear                 | ramp the learning rate linearly
checkpoint   | 500                    | save weights every checkpoint iterations
train_aug    | resize shortest edge, brightness, saturation, contrast, horizontal flip | train-time data augmentations
test_aug     | resize shortest edge, horizontal flip | test-time data augmentations
3.4 Development of Model
The Mask-RCNN model pre-trained on the COCO dataset (the source model) is imported directly from the Detectron2 model catalog, and its configuration is changed to generate detection and segmentation outputs for the two classes of interest in this work. By default, the mask branch of the model generates class-wise masks with spatial dimensions of (28, 28), which are incompatible with the size of the input image. Hence, in a post-processing step, the masks are up-sampled to match the spatial dimensions of the input image. Apart from that, other configurations like anchor labelling, loss functions, and RPN settings are preserved from the source model. The hyper-parameters configured for the training of the models are given in Table 1.
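A minimal sketch of such an up-sampling step (bilinear interpolation followed by thresholding; the 0.5 binarization threshold is an assumption, and Detectron2's own post-processing pastes each mask into its predicted box rather than scaling the full mask):

```python
import torch
import torch.nn.functional as F

def upsample_masks(masks_28x28: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """Up-sample (N, 28, 28) soft masks to the input image resolution
    and binarize them. A simplified variant of Detectron2's mask pasting."""
    masks = F.interpolate(masks_28x28.unsqueeze(1).float(),
                          size=(height, width),
                          mode="bilinear", align_corners=False)
    return masks.squeeze(1) > 0.5  # assumed binarization threshold
```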
For the target models, the first two stages of the original backbones are frozen, and training for child and adult detection and segmentation is carried out only in the later layers. The learning rate of the models is changed in steps during the training, as highlighted in Table 1: at first, the learning rate (base_lr) is set to 0.01, which is reduced to 0.001 after 15000 iterations and further to 0.0001 after 20000 iterations. For a uniform comparison of the results of all trained models, a constant seed value of 42 is used.
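Put together, the settings of Table 1 map onto Detectron2 solver options roughly as follows (a sketch; the dataset names are those assumed registered in section 3.3, and the custom augmentation hooks are omitted):

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

cfg.DATASETS.TRAIN = ("child_adult_train",)  # names as registered above
cfg.DATASETS.TEST = ("child_adult_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2
cfg.MODEL.BACKBONE.FREEZE_AT = 2

# Solver settings corresponding to Table 1.
cfg.SOLVER.BASE_LR = 0.01
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.GAMMA = 0.1
cfg.SOLVER.STEPS = (5000, 15000, 20000)
cfg.SOLVER.MOMENTUM = 0.9
cfg.SOLVER.MAX_ITER = 30000
cfg.SOLVER.WARMUP_ITERS = 500
cfg.SOLVER.WARMUP_METHOD = "linear"
cfg.SOLVER.CHECKPOINT_PERIOD = 500
cfg.TEST.EVAL_PERIOD = 500
cfg.SEED = 42  # fixed seed for a uniform comparison of results

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```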
4 EXPERIMENTS AND RESULTS
Four different backbone architectures, including ResNet50, ResNet101, ResNeXt101 (Xie et al., 2017), and ResNeXt152 (with a cascaded Mask-RCNN head), were used as source models during the training. Further, the ResNet50 and ResNet101 backbones are used with two different baselines, as highlighted in Table 2. Backbones with baseline 1 are pre-trained with a standard 3x schedule of 120 COCO epochs, whereas backbones with baseline 2 are pre-trained with a longer schedule of 400 COCO epochs using copy-paste augmentations as described in (Ghiasi et al., 2021). As a result, a total of six different backbone configurations with two baselines are used in conjunction with an FPN (feature pyramid network) for training and evaluation of the child and adult detection problem.
Each variant was trained independently on a machine with an AMD Ryzen 9 5950X 16-core 3.4 GHz processor, 128 GB RAM, and an RTX 3090 GPU with 24 GB memory. Each variant except ResNeXt152 took around 5 hours to train; ResNeXt152 took around 18 hours. Table 2 highlights the performance, in the form of the AP (average precision) of the bounding box and segmentation mask, of all models after training with the same hyper-parameters as described in section 3.4. The results clearly show that the trained model with the ResNet50 + FPN backbone and baseline 2 has the highest AP values: 92.43% for bounding box generation and 85.85% for the segmentation mask. Hence, for this best-performing model, the training and validation loss, and the performance metrics for each class with respect to training iterations, are further given in figure 6.
Table 2: Comparison of results (each backbone is with FPN).

                |               baseline 1                          | baseline 2
Metric          | ResNet50 | ResNet101 | ResNeXt101 | ResNeXt152 Cascade | ResNet50 | ResNet101
box AP          | 81.54    | 84.05     | 84.10      | 91.57              | 92.43    | 90.62
mask AP         | 82.20    | 83.16     | 84.31      | 83.95              | 85.85    | 84.28
box AP - adult  | 80.65    | 83.41     | 83.26      | 92.07              | 92.71    | 91.98
box AP - child  | 82.53    | 84.70     | 84.92      | 91.06              | 92.14    | 89.27
mask AP - adult | 81.50    | 81.68     | 82.75      | 83.00              | 86.40    | 85.37
mask AP - child | 82.80    | 85.45     | 85.87      | 84.87              | 85.30    | 83.19
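These AP values are the standard COCO metrics; in Detectron2 they can be reproduced with the built-in evaluator (a sketch; the weights path is illustrative and the test dataset is assumed registered as in section 3.3):

```python
from detectron2 import model_zoo
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2  # child and adult

# Load the trained target-model weights for evaluation.
model = build_model(cfg)
DetectionCheckpointer(model).load("output/model_final.pth")  # illustrative path
model.eval()

evaluator = COCOEvaluator("child_adult_test", output_dir="./eval")
test_loader = build_detection_test_loader(cfg, "child_adult_test")
results = inference_on_dataset(model, test_loader, evaluator)
print(results["bbox"]["AP"], results["segm"]["AP"])  # box AP and mask AP
```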
Figure 6: Performance and loss graphs of the best-performing model (Mask-RCNN with baseline 2 ResNet50). The x-axis represents iterations and the y-axis the respective metric. (6a) compares training loss and validation loss during training; a further breakdown of training loss and validation loss into sub-categories is shown in (6b) and (6c), respectively. (6d) shows the average precision of the model for boxes and masks; per-category average precision is shown in (6e) and (6f).
Figure 7: Inference results of the work. All the original images are taken from open-source websites: 7a from (Ogino, 2021), 7b from (Khac, 2020), 7c from (Lusina, 2021), 7d from (Kim, 2021), 7e from (Mak, 2021) and 7f from (Cooks, 2022).
Further analysis of the results shows that, of the two models with baseline 2, ResNet50 outperforms ResNet101 in terms of all the reported performance metrics. These models are pre-trained on a longer schedule and are therefore capable of capturing very complex patterns within an image. This explains the very high AP and low loss of these models from the beginning, as highlighted in figures 6d, 6e and 6f. However, the increased complexity of the ResNet101 model with baseline 2 negatively affects the two-category problem, which results in a lower AP compared to ResNet50 with baseline 2. Additionally, when comparing the variants from baseline 1, the cascade Mask-RCNN with the ResNeXt152 backbone provides the highest AP for both the box and the mask.
5 CONCLUSION
In the described work, an instance segmentation and detection model using an AI-based neural network is developed to separately classify and detect humans as children or adults. For this purpose, the Mask-RCNN model with the backbone architectures ResNet50, ResNet101, ResNeXt101, and ResNeXt152 cascade, together with an FPN and two different pre-trained baselines, is used with transfer learning. From the results, it is found that the Mask-RCNN model with ResNet50 and baseline 2 (i.e. the model pre-trained for 400 epochs) performs best, with a segmentation mask AP (average precision) of 85% and a bounding box AP of 92%.
In further work, the network will be trained with additional boundary cases, e.g. short adults and adults in crouched positions, to analyze and improve the classifier. However, even if the differentiation does not reach 100%, the safety of children will be improved by additionally warning the oncoming traffic that children are present, especially around bus stops, areas in front of schools, and crossings where children pass on their way to school, using intelligent roadside infrastructure.
ACKNOWLEDGEMENTS
The research work is supported by the Bavarian Min-
istry of Economic Affairs, Regional Development and
Energy (StMWi) in the project “INFRA Intelligent
Infrastructure”.
REFERENCES
Agrawal, S., Song, R., Kohli, A., Korb, A., Andre, M., Holzinger, E., and Elger, G. (2022). Concept of smart infrastructure for connected vehicle assist and traffic flow optimization. In VEHITS, pages 360–367.
Alzubaidi, L., Al-Shamma, O., Fadhel, M. A., Farhan, L.,
Zhang, J., and Duan, Y. (2020). Optimizing the perfor-
mance of breast cancer classification by employing the
same domain transfer learning from hybrid deep con-
volutional neural network model. Electronics, 9(3).
Cao, X., Wang, Z., Yan, P., and Li, X. (2013). Transfer
learning for pedestrian detection. Neurocomputing,
100:51–57. Special issue: Behaviours in video.
Cooks, J. (2022). [Online; accessed September, 2022].
Doğru, A., Bouarfa, S., Arizar, R., and Aydoğan, R. (2020). Using convolutional neural networks to automate aircraft maintenance visual inspection. Aerospace, 7(12):171.
Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.-Y., Cubuk,
E. D., Le, Q. V., and Zoph, B. (2021). Simple copy-
paste is a strong data augmentation method for in-
stance segmentation. In 2021 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 2917–2927.
Gáti, N. and Kiss, A. (2021). Sound classification with transfer learning (13th Joint Conference on Mathematics and Computer Science (the 13th MaCS), October 1–3, 2020).
Han, R. (2020). Man walking on pedestrian lane. [Online;
accessed September, 2022].
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Hu, D. H. and Yang, Q. (2011). Transfer learning for
activity recognition via sensor mapping. In Twenty-
second international joint conference on artificial in-
telligence.
Khac, A. (2020). A group of children walking hand in
hand on unpaved road. [Online; accessed September,
2022].
Kim, R. (2021). [Online; accessed September, 2022].
Kocmi, T. and Bojar, O. (2018). Trivial transfer learning
for low-resource neural machine translation. arXiv
preprint arXiv:1809.00357.
Liang, Y., Monteiro, S. T., and Saber, E. S. (2016). Trans-
fer learning for high resolution aerial image classifica-
tion. In 2016 IEEE Applied Imagery Pattern Recogni-
tion Workshop (AIPR), pages 1–8.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017a). Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. (2014). Microsoft COCO: Common objects in context.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In European conference on com-
puter vision, pages 21–37. Springer.
Lusina, A. (2021). Unrecognizable black father with son
holding hands on city road. [Online; accessed Septem-
ber, 2022].
Mak (2021). [Online; accessed September, 2022].
Ogino, K. (2021). Asian woman and girl standing near
crosswalk. [Online; accessed September, 2022].
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
Productions, P. (2021). Family crossing the street while
holding each other’s hands. [Online; accessed
September, 2022].
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. Advances in neural information
processing systems, 28.
Shenavarmasouleh, F., Mohammadi, F. G., Amini, M. H.,
Taha, T., Rasheed, K., and Arabnia, H. R. (2021).
Drdrv3: Complete lesion detection in fundus images
using mask r-cnn, transfer learning, and lstm. arXiv
preprint arXiv:2108.08095.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022).
Yolov7: Trainable bag-of-freebies sets new state-of-
the-art for real-time object detectors. arXiv preprint
arXiv:2207.02696.
Wolf, K. (2021). Kid and dog crossing the street. [Online;
accessed September, 2022].
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y.,
and Girshick, R. (2019). Detectron2.
https://github.com/facebookresearch/detectron2.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995.
Zhang, Q., Chang, X., and Bian, S. B. (2020). Vehicle-
damage-detection segmentation algorithm based on
improved mask rcnn. IEEE Access, 8:6997–7004.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H.,
Xiong, H., and He, Q. (2020). A comprehensive sur-
vey on transfer learning. Proceedings of the IEEE,
109(1):43–76.