Table 6: Comparison of the mAP scores of different neural networks in object detection.

Method                            mAP (IoU ≥ 0.5)
SSD 300 (Liu et al., 2015)        0.412
Faster R-CNN (Ren et al., 2015)   0.427
Inception-v3 with CAM             0.285 (fire: 0.23, smoke: 0.42)
Although its overall mAP is lower than that of the two detectors, the proposed network is still competitive. The fact that it is trained without any bounding box annotations underlines this performance even more.
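For reference, the IoU criterion in Table 6 counts a detection as correct when the predicted box overlaps the ground-truth box by at least 50%. The following minimal Python sketch shows this computation; the (x1, y1, x2, y2) box format is an illustrative assumption, not taken from the paper.

def iou(box_a, box_b):
    # Intersection-over-Union of two boxes given as (x1, y1, x2, y2).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive only at IoU >= 0.5:
print(iou((10, 10, 60, 60), (30, 30, 80, 80)) >= 0.5)  # False, IoU is about 0.22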
The quality of the situation description depends on the accuracy of both the classification and the localization outputs. Good object detection leads to a better situation description, as can be seen in Figure 11 and Figure 12; poor detection, in contrast, yields an incorrect description with the wrong spatial expression (see Figure 13). Additionally, the model cannot differentiate whether a detected object is positioned in the foreground or in the background, which leads to wrong spatial relationship descriptions. This can be seen in the description in Figure 13b: the smoke should be “BEHIND the car” and not “TOP left of the car”.
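The spatial expressions in such descriptions can only be derived from 2-D box geometry. The sketch below illustrates how a relation could be computed from two bounding box centers; the function and its thresholds are illustrative assumptions, not the paper's exact rule.

def spatial_relation(subj, ref):
    # Rough 2-D relation of subj to ref; boxes as (x1, y1, x2, y2).
    # Depth relations such as BEHIND cannot be inferred from planar
    # boxes, which is exactly the failure shown in Figure 13b.
    sx, sy = (subj[0] + subj[2]) / 2, (subj[1] + subj[3]) / 2
    rx, ry = (ref[0] + ref[2]) / 2, (ref[1] + ref[3]) / 2
    vert = "TOP" if sy < ry else "BOTTOM"  # image y grows downwards
    horiz = "left" if sx < rx else "right"
    return vert + " " + horiz

print(spatial_relation((0, 0, 40, 40), (50, 60, 120, 140)) + " of the car")
# -> "TOP left of the car", even when the smoke is actually behind the car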
The underlying cause is missing depth information: the model cannot separate foreground from background in ordinary planar images, which is the problem visible in the description of Figure 13b. To obtain depth information for the objects, the drones can be equipped with additional sensors such as time-of-flight cameras or LiDAR.
Finally, the developed model is intended for detecting situations in still images and is therefore not well suited for real-time video detection, owing to the complexity of the large Inception-v3 network. Possible ways to increase speed and reduce complexity are network pruning (Molchanov et al., 2016) or lightweight networks such as SSD or YOLO for detection and MobileNet for classification. Typically, a little accuracy is then sacrificed for faster inference.
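As one illustration of the pruning route, the sketch below uses PyTorch's built-in magnitude pruning. This is generic L1 weight pruning, not the Taylor-expansion criterion of Molchanov et al. (2016); the torchvision model and the 30% sparsity level are assumptions for demonstration.

import torch
import torch.nn.utils.prune as prune
import torchvision.models as models

# Stand-in for the modified Inception-v3 used in the paper
# (requires torchvision >= 0.13 for the weights argument).
model = models.inception_v3(weights=None)

# Zero out the 30% smallest-magnitude weights in every conv layer.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

zeros = sum((p == 0).sum().item() for p in model.parameters())
print("zeroed parameters:", zeros)

Note that unstructured zeroing mainly shrinks the effective parameter count; actual speed-ups on dense hardware additionally require structured (filter-level) pruning or sparse inference kernels.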
7 CONCLUSION AND FUTURE WORK
The results show that the proposed model is able to recognize, locate, and describe fire and smoke with CAM in the modified Inception-v3 network. Compared to other related works, only a small image dataset is required for this model, because only the last layer of the network is retrained. For the classification task, the retrained Inception-v3 achieves results similar to those of fully trained networks. The high classification performance allows the model to generate CAMs precisely and enables more than acceptable results in the localization task, even outperforming Faster R-CNN for smoke objects. The performance of the situation description is strongly affected by the accuracy of the object classification and localization parts, because they provide the information needed for the situation analysis.
Furthermore, a thorough search of the relevant literature yielded almost no papers that can detect fire and smoke from a small image-level dataset and describe the underlying dangerous situation at the same time. In this respect, the proposed model provides a valuable contribution to solving these different tasks. In future work, the model performance can be improved with a larger training dataset containing different object classes (e.g., material spills, hazard symbols, etc.), sizes, lighting conditions, and viewing angles. To apply the model on a UAV for real-time situation detection tasks, Inception-v3 needs to be pruned or exchanged for smaller networks such as MobileNets. Finally, further research can be conducted on the explanatory power of the situation description by evaluating the generated sentences against human-annotated descriptions.
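One way to operationalize that last point is a sentence-level BLEU comparison against human references, sketched below with NLTK; the example sentences are invented placeholders.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated description and human-annotated references.
candidate = "the smoke is behind the car".split()
references = [
    "the smoke is behind the car".split(),
    "smoke rises behind a car".split(),
]

# Smoothing avoids zero scores when short sentences miss higher n-grams.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print("BLEU: %.2f" % score)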
REFERENCES
Ahsan, U., Sun, C., Hays, J., and Essa, I. A. (2017). Com-
plex event recognition from images with few training
examples. CoRR, abs/1701.04769.
Arriaga, O., Plöger, P., and Valdenegro-Toro, M. (2017). Image captioning and classification of dangerous situations. CoRR, abs/1711.02578.
Chino, D. Y. T., Avalhais, L. P. S., Rodrigues Jr., J. F., and Traina, A. J. M. (2015). BowFire: Detection of fire in still images by integrating pixel color and texture analysis. CoRR, abs/1506.03495.
Frizzi, S., Kaabi, R., Bouchouicha, M., Ginoux, J.-M.,
Moreau, E., and Fnaiech, F. (2016). Convolutional
neural network for video fire and smoke detection. In
Industrial Electronics Society, IECON 2016-42nd An-
nual Conference of the IEEE, pages 877–882. IEEE.
Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin,
I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M.,
Duerig, T., and Ferrari, V. (2018). The Open Im-
ages Dataset v4: Unified image classification, object
detection, and visual relationship detection at scale.
arXiv:1811.00982.
Lin, T. T. (2015). LabelImg - git code.
https://github.com/tzutalin/labelImg.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S. E., Fu, C.-Y., and Berg, A. C. (2015). SSD: Single shot multibox detector. CoRR, abs/1512.02325.