Vision-based Detection of Humans on the Ground from Actual Aerial

Images by Informed Filters using Only Color Features

Takuro Oki

, Risako Aoki

, Shingo Kobayashi

, Ryusuke Miyamoto

Hiroyuki Yomo

and Shinsuke Hara

Department of Computer Science, Graduate School of Science and Technology, Japan

Department of Computer Science, School of Science and Technology,

Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Japan

Department of Electrical and Electronic Engineering, Faculty of Engineering Science,

Kansai University, 3-3-35 Yamate-cho, Suita-shi, Japan

Graduate School of Engineering, Osaka City University,

3-3-138 Sugimoto Sumiyoshi-ku, Osaka-shi, Japan

Keywords:

Player Detection, Actual Aerial Images, Informed-ﬁlters.

Abstract:

In this paper, we construct a novel sensor network system that can measure real-time vital signs of the human

body during exercise even at high speeds and in crowded regions. The sensor network estimates locations of

sensor nodes using image processing to extract locations of humans wearing sensors. This paper evaluates the

accuracy of human detection by informed-ﬁlters using only color features for actual aerial images. To carry

out the evaluation, a novel dataset composed of actual images captured using a camera mounted on a drone

was created. Experimental results show that signiﬁcant accuracy can be achieved by the detector. In addition,

the number of weak classiﬁers in a strong classiﬁer can be reduced to 125 without signiﬁcant degradation of

the detection accuracy.

1 INTRODUCTION

Real-time sensing of the vital signs of the human body

during exercise can improve the effectiveness of ex-

ercise and prevention of sudden illness. Some re-

searchers have attempted to construct a novel sensor

network system (Hara et al., 2017). To construct such

a sensor network. the most important challenge is the

location estimation of sensor nodes. In this applica-

tion, widely used routing schemes based on Received

signal-strength indicator (RSSI) or global positioning

system (GPS) do not work well because sensor nodes

attached to exercisers moves quickly and their density

often becomes prohibitively high.

To solve this problem, image-assisted routing that

estimates locations of sensor nodes using image pro-

cessing was proposed as a solution (Miyamoto and

Oki, 2016). Image-assisted routing uses images cap-

tured from a few cameras mounted on drones and

ﬁxed tripods as input. Then, vision-based human de-

tection and tracking processes are applied to obtain

the location of humans wearing sensor nodes. In this

process, human detection using aerial images cap-

tured using drones is a challenging problem because

of the required high accuracy of operation on embed-

ded systems.

Previous research in (Oki et al., 2019) evaluated

the accuracy of player detection from aerial images

using a computer graphics (CG) dataset that was cre-

ated from actual locations of players manually anno-

tated using a video sequence of a soccer game held

in an indoor soccer ﬁeld (Miyamoto et al., 2017).

In (Oki et al., 2019), several drone locations were

evaluated in terms of their detection accuracy to ﬁnd

good positions of a drone to record sports scenes. Re-

sults showed that for best player detection accuracy,

the drone should be located where the angle between

the view direction and the ground only approaches 90

degrees. The detector used in the evaluation was a

classiﬁer constructed with informed-ﬁlters using only

color features showing better accuracy than state-of-

the-art schemes based on deep neural networks (Red-

mon and Farhadi, 2017; Girshick et al., 2014; Zoph

et al., 2018) in the CG dataset (Miyamoto et al.,

2019). Considering the detection accuracy and pro-

cessing speed on GPUs (Oki and Miyamoto, 2017),

Oki, T., Aoki, R., Kobayashi, S., Miyamoto, R., Yomo, H. and Hara, S.

Vision-based Detection of Humans on the Ground from Actual Aerial Images by Informed Filters using Only Color Features.

DOI: 10.5220/0008166000840089

In Proceedings of the 7th International Conference on Sport Sciences Research and Technology Support (icSPORTS 2019), pages 84-89

ISBN: 978-989-758-383-4

this scheme is practical for this application.

Previous research has shown that a classiﬁer based

on informed-ﬁlters using only color features works

well in aerial images, but detection accuracy using

actual images recorded by a camera mounted on a

drone has not been evaluated. To verify the detec-

tion performance of the informed-ﬁlters using actual

aerial images, this paper constructs an actual dataset

using aerial images and evaluates the detection accu-

racy of the informed-ﬁlters. In the evaluation, the

relationship between the detection accuracy and the

number of weak classiﬁers in a strong classiﬁer is in-

vestigated in order to determine the possibility of re-

ducing the computational cost while maintaining the

detection accuracy.

The rest of this paper is organized as follows.

Section II summarizes the evaluation of detection ac-

curacy at several view points using the CG dataset

in (Oki et al., 2019) and Section III discusses how to

construct a novel dataset using actual aerial images of

humans during exercise. Section IV discusses how to

train a detector and reduce the number of weak clas-

siﬁers. Section V evaluates the detection accuracy us-

ing the new dataset and the paper is concluded in sec-

tion VI.

2 ACCURACY OF PLAYER

DETECTION IN AERIAL

IMAGES USING THE CG

DATASET

This section summarizes the previousresearch in (Oki

et al., 2019) that evaluates the detection accuracy

from aerial images using the CG dataset.

2.1 The CG Dataset

The CG dataset was created to evaluate the detection

accuracy of humans during exercise at several view-

points in aerial images (Miyamoto et al., 2019). In

the dataset, UnityChan, which is a freely usable 3D

model of a character, is used to represent humans

on the soccer ﬁeld. The locations of the 3D charac-

ters correspond to the locations of humans determined

manually from an actual image sequence during exer-

cise. The dataset is constructed as a virtual 3D space

using the Unity 3D engine so that we can easily gen-

erate two-dimensional images from arbitrary view-

points. Fig. 1 shows examples of the CG dataset cor-

responding to several viewpoints. Fig. 2 and 3 show

examples of positive and negative samples of the CG

dataset.

2.2 Detection Accuracy for Several

Viewpoints

Fig. 4 shows the heat map representing accuracy of

detection at many viewpoints. The deep red color re-

gions correspond to a lower miss rate but the miss rate

is greater where the red color is lighter. When the an-

gle between the view direction and the soccer ﬁeld

(or ground) approaches 90 degrees, the detection ac-

curacy is maximized. However, when the drone is

located right above the center of the soccer ﬁeld, the

detection accuracy is poor. Based on these simulation

results, the drone is located at moderate positions as

described in the following section.

3 A DATASET COMPOSED OF

ACTUAL IMAGES CAPTURED

FROM A CAMERA MOUNTED

ON A DRONE

This section details a novel dataset created using ac-

tual images recorded from a camera mounted on a

drone. The dataset is used to evaluate the accuracy

of human detection during exercise.

3.1 An Experimental Setup for

Capturing Aerial Images using a

Drone

Fig. 5 shows the experimental setup used to capture

aerial images using a drone. Here, a mini game of

soccer was performed by eight humans in the soccer

ﬁeld. The ﬁled is marked by white lines on the ground

and it has a dirt surface. DJI Phantom 4, shown in

Fig. 6, was used to capture the aerial images. It can

record 3840 × 2160 video sequences.

3.2 How to Create Ground Truth

To create ground truth for the recorded image se-

quences, we developed a novel software named

QtKukeiKakuKun using python and Qt. Fig. 7 shows

a screenshot of QtKukeiKakuKun. The software can

initialize the ground truth of the current working

frame by the ground truth of the previous frame. This

function can drastically reduce the human load when

ground truth is created from images corresponding to

successive frames of a video sequence. In addition,

the software was designed to enable easy cooperation

because annotations about ground truth were usually

created by many people. Using QtKukeiKakuKun,

Vision-based Detection of Humans on the Ground from Actual Aerial Images by Informed Filters using Only Color Features

Figure 1: Examples of the CG dataset created in previous research(Oki et al., 2019).

Figure 2: Examples of positive samples in the CG dataset.

Figure 3: Examples of negative samples in the CG dataset.

we created a novel dataset composed of 5000 images.

Fig. 8 and 9 show the positive and negative samples

included in the novel dataset.

4 CONSTRUCTION OF A

CLASSIFIER FOR PLAYER

DETECTION IN ACTUAL

IMAGES

This section describes template design for informed-

ﬁlters that is necessary for this scheme and how to

construct a strong classiﬁer using the designed tem-

plates.

4.1 Template Design for

Informed-ﬁlters

In this paper, we applied previously designed tem-

plates to create a feature pool. We used the template

designed for the VSPETS dataset (VS-PETS, 2001).

Fig. 10 shows an edge map computed using positive

samples of the VSPETS dataset and 11 shows cell di-

vision and labels assigned using the edge map. Ac-

cording to the assigned labels, feature templates are

shown in Fig. 12.

Figure 4: Heatmap.

Figure 5: Experimental setup for capturing aerial images

using a drone.

4.2 How to Construct a Strong

Classiﬁer using the Designed

Templates

After the design of feature templates, a strong clas-

siﬁer is trained by the AdaBoost algorithm (Freund

and Schapire, 1997). In the boosting process, effec-

tive features are selected from a feature pool where

icSPORTS 2019 - 7th International Conference on Sport Sciences Research and Technology Support

Figure 6: The drone used in the experiment.

Figure 7: Screenshot of QtKukeiKakuKun developed for

this experiment.

Figure 8: Examples of positive samples in the new dataset.

Figure 9: Examples of negative samples in the new dataset.

many features are generated randomly using the de-

signed templates. After the selection of effective fea-

tures as weak classiﬁers, a strong classiﬁer composed

of cascaded weak classiﬁers is obtained as a soft cas-

cade classiﬁer constructed by Multiple-Instance Prun-

ing (MIP) (Zhang and Viola, 2007).

In our experiment, 3000 features were generated

randomly using the designed templates. To determine

the detection accuracy for reduced weak classiﬁers,

the number of weak classiﬁers in a strong classiﬁer

was varied from 5 to 200.

5 EVALUATION

This section discusses the results of player detection

in aerial images captured using a camera mounted on

a drone.

Figure 10: Edge map for

PETS samples.

Figure 11: Labeling for

PETS samples.

Figure 12: Examples of designed templates.

5.1 Detection Error Tradeoff Curves

Fig. 13 shows the detection error tradeoffcurves when

the number of weak classiﬁers used in a strong clas-

siﬁer is varied from 5 to 200. The detection perfor-

mance improves as the number of weak classiﬁers in-

creases. The best detection performance is achieved

when the number of weak classiﬁers is 175, though

200 weak classiﬁers should be most powerful to rep-

resent target features. This is because randomness

Figure 13: Detection error trade off curves for several num-

bers of weak classiﬁers.

Vision-based Detection of Humans on the Ground from Actual Aerial Images by Informed Filters using Only Color Features

Figure 14: Detection examples for several numbers of weak classiﬁers.

was included in the training process to construct a

strong classiﬁer.

At the 10

−1

false positive per image (FPPI) that

is often used to compare the detection performance,

125, 150, 175, and 200 weak classiﬁers showed simi-

lar performance. This result means that the number of

weak classiﬁers can be reduced by up to 37.5% with-

out any degradation in detection performance.

5.2 Detection Examples

Fig. 14 shows the detection examples when the

number of classiﬁers used in a strong classiﬁer are

changed. The number of weak classiﬁers were 10,

30, 50, 100, 150, and 200 for the top left, top center,

top right, bottom left, bottom center, and bottom right

images, respectively.

Surprisingly, in the top left regions of the image,

false positives rarely occurred even though only color

features were adopted to construct a strong classiﬁer

for detection. In these examples, a false positive oc-

curred only in the case of 10 weak classiﬁers.

6 CONCLUSION

This paper presented a novel dataset composed of ac-

tual images captured by a camera mounted on a drone.

The dataset includes 5000 images of a mini game of

soccer involving eight people. Locations of humans

in these images were manually marked using a novel

software tool named QtKukeiKakuKun that was de-

veloped using python and Qt for this research.

Experimental results using the actual dataset

showed that excellent detection accuracy could be

achieved by a detector composed of informed-ﬁlters

using only color features. The detection accuracy

is maintained even when the number of weak clas-

siﬁers are reduced from 200 to 125. This indicates

that the processing speed of detection by a classiﬁer

composed of informed-ﬁlters can be improved with-

out any degradation of detection accuracy.

icSPORTS 2019 - 7th International Conference on Sport Sciences Research and Technology Support

ACKNOWLEDGMENT

The research results have been partly achieved by

“Research and development of Innovative Network

Technologies to Create the Future”, the Commis-

sioned Research of National Institute of Information

and Communications Technology (NICT), JAPAN.

REFERENCES

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic

generalization of on-line learning and an application

to boosting. Journal of Computer and System Sci-

ences, 55(1):119–139.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).

Rich feature hierarchies for accurate object detection

and semantic segmentation. Proc. IEEE Conf. Com-

put. Vis. Pattern Recognit., pages 580–587.

Hara, S., Yomo, H., Miyamoto, R., Kawamoto, Y., Okuhata,

H., Kawabata, T., and Nakamura, H. (2017). Chal-

lenges in Real-Time Vital Signs Monitoring for Per-

sons during Exercises. International Journal of Wire-

less Information Networks, 24:91–108.

Miyamoto, R., Nakamura, Y., Ishida, H., Nakamura, T.,

and Oki, T. (2019). Comparison of object detection

schemes using datasets of sports scenes.

Miyamoto, R. and Oki, T. (2016). Soccer Player Detection

with Only Color Features Selected Using Informed

Haar-like Features. In Advanced Concepts for Intel-

ligent Vision Systems, volume 10016 of Lecture Notes

in Computer Science, pages 238–249.

Miyamoto, R., Yokokawa, H., Oki, T., Yomo, H., and Hara,

S. (2017). Human detection in top-view images using

only color features. The Journal of the Institute of

Image Electronics Engineers of Japan(in Japanese),

46(4):559–567.

Oki, T. and Miyamoto, R. (2017). Efﬁcient GPU imple-

mentation of informed-ﬁlters for fast computation. In

Image and Video Technology, pages 302–313.

Oki, T., Miyamoto, R., Yomo, H., and Hara, S. (2019).

Detection accuracy of soccer players in aerial images

captured from several viewpoints. MDPI J. Funct.

Morphol. Kinesiol., 4(1).

Redmon, J. and Farhadi, A. (2017). YOLO9000: Better,

Faster, Stronger. pages 6517–6525.

VS-PETS (2001). Proc. IEEE international workshop on vi-

sual surveillance and performance evaluation of track-

ing and surveillance. http://ftp.pets.rdg.ac.uk/.

Zhang, C. and Viola, P. (2007). Multiple-instance pruning

for learning efﬁcient cascade detectors. In Proc. Ad-

vances in Neural Information Processing Systems.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. (2018).

Learning transferable architectures for scalable image

recognition. In Proc. IEEE Conf. Comput. Vis. Pattern

Recognit., pages 8697–8710.

Vision-based Detection of Humans on the Ground from Actual Aerial Images by Informed Filters using Only Color Features