inside cars. For this purpose, a dataset was created
with images of the interior of 78 cars (1861 images
in total), captured from different perspectives. The
good, damage, stain, and dirt classes were labeled
manually at pixel level for all dataset images.
Once the dataset was created, two segmentation
methods, DeepLabV3+ and U-Net, were trained and
evaluated on it.
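The Tiled-patches input strategy evaluated below amounts to cutting each image into fixed-size square patches. The sketch below is a minimal illustration, not the authors' exact pipeline: the 512-pixel tile size mirrors the paper's 512x512x3 input, and the zero-padding of edge tiles is an assumption.

```python
import numpy as np

def tile_image(image, tile_size=512):
    """Split an H x W x C image into non-overlapping square tiles.

    Edge tiles are zero-padded so every tile has the same shape,
    giving the segmentation network a fixed input size.
    (Padding behavior is an assumption, not the paper's stated method.)
    """
    h, w, c = image.shape
    tiles = []
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = np.zeros((tile_size, tile_size, c), dtype=image.dtype)
            patch = image[y:y + tile_size, x:x + tile_size]
            tile[:patch.shape[0], :patch.shape[1]] = patch
            tiles.append(tile)
    return tiles

# Example: a 1024 x 768 RGB image yields 2 x 2 = 4 tiles of 512 x 512 x 3.
image = np.zeros((768, 1024, 3), dtype=np.uint8)
tiles = tile_image(image, tile_size=512)
print(len(tiles), tiles[0].shape)  # 4 (512, 512, 3)
```

Each tile is then segmented independently, and the per-tile predictions are stitched back together to form the full-image prediction.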
Initially, the two networks were trained under two
primary approaches: one with the Full image as input,
the other with Tiled patches of an image. Results (Ta-
ble 4) showed that DeepLabV3+ achieves higher ac-
curacy with the Full image (EV1), whereas U-Net
performed better with Tiles (EV4). Starting from these
two methodologies, an ablation and hyperparameter
study was carried out for each (Tables 5 and 7) to
reach the best possible accuracy. Results (Table 8)
showed that U-Net achieved its highest accuracy in
EV4.4, with Tiled input at 512x512x3, depth 3, and
batch 8, reaching 28.37%, 71.50%, 19.24%, 18.43%,
and 4.15% for mean, good, damage, stain, and dirt ac-
curacy, respectively. Moreover, DeepLabV3+ consid-
erably outperformed U-Net in EV1.4 (Table 6), with
Full image input at 512x512x3 and batch 8, reaching
67.60%, 77.17%, 59.60%, 66.81%, and 68.82% for
mean, good, damage, stain, and dirt accuracy, respec-
tively. Regarding DeepLabV3+, a brief comparison of
input resolutions showed that 512x512x3 obtained
better metrics than 1024x1024x3: although reducing
the resolution loses some class pixel information, the
lower resolution helps training convergence (Figure 9).
Although class estimation is generally good, the defect
classes are sometimes swapped (Figure 8) when the
distinction among them is not apparent; in reality,
even the visual distinction is difficult for humans,
since the appearance of some classes can be very sim-
ilar depending on the type of fabric.
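The loss of class pixel information when reducing resolution can be illustrated with nearest-neighbour downsampling of a label map: thin defect regions lose a large share of their pixels. This is a hedged sketch; the class ids and the 2-pixel-wide stain region are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical label ids mirroring the paper's classes (an assumption).
GOOD, DAMAGE, STAIN, DIRT = 0, 1, 2, 3

def downsample_labels(labels, factor):
    """Nearest-neighbour downsampling of a per-pixel label map."""
    return labels[::factor, ::factor]

# A mostly 'good' 1024x1024 mask with a thin, 2-pixel-wide stain stripe.
labels = np.full((1024, 1024), GOOD, dtype=np.uint8)
labels[500:502, :] = STAIN

small = downsample_labels(labels, 2)  # 1024x1024 -> 512x512
print(np.count_nonzero(labels == STAIN))  # 2048 stain pixels at full resolution
print(np.count_nonzero(small == STAIN))   # 512 stain pixels after halving
```

Here halving the resolution removes three quarters of the stain pixels, since one of the two stripe rows is dropped entirely and the surviving row keeps only every other column.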
In the case of U-Net, although an ablation study
and different training configurations were also carried
out, this network presented much lower accuracy than
DeepLabV3+ in this type of approach.
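The per-class and mean accuracies compared throughout this section can be computed from a predicted and a ground-truth label map. A minimal sketch, assuming "accuracy" here means per-class pixel recall (the paper does not spell out the definition) and using hypothetical class ids:

```python
import numpy as np

# Hypothetical class ids for the paper's four labels (an assumption).
CLASSES = {"good": 0, "damage": 1, "stain": 2, "dirt": 3}

def per_class_accuracy(pred, target):
    """Per-class pixel recall: correctly labelled pixels divided by the
    pixels of that class in the ground truth; 'mean' averages the classes."""
    accs = {}
    for name, cid in CLASSES.items():
        mask = target == cid
        accs[name] = float((pred[mask] == cid).mean()) if mask.any() else float("nan")
    accs["mean"] = float(np.nanmean([accs[n] for n in CLASSES]))
    return accs

# Tiny toy example: good is right in 1 of 3 pixels, the other classes in all.
target = np.array([[0, 0, 1], [2, 3, 0]])
pred   = np.array([[0, 1, 1], [2, 3, 3]])
acc = per_class_accuracy(pred, target)
# good = 1/3, damage = stain = dirt = 1.0, mean ~ 0.83
print(acc)
```

Averaging over classes rather than pixels keeps the dominant good class from masking poor performance on the rarer defect classes.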
6 CONCLUSIONS AND FUTURE
WORK
In this paper, we have shown how to repurpose two
deep learning segmentation methods for the task of
estimating in-vehicle defects. The objective of this
study is to investigate and monitor the integrity of
the car interior in terms of Damage, Stain, and Dirt
that may appear as passengers use the interior space.
This paper presents the creation of an in-car dataset,
Mola-VI, with images of the interior of cars.
For this purpose, DeepLabV3+ and U-Net were
trained. DeepLabV3+ showed the best results, with
67.60% mean accuracy, making it a good candidate
for future implementations of in-vehicle defect de-
tection. U-Net proved more difficult to adapt to this
use case, with mean accuracy values around 28%
across all evaluation scenarios.
For future work, we intend to expand the in-car
dataset, adding more samples and more diversity in
the cars and classes found in this context. In addition,
we also intend to evaluate other networks and meth-
ods on our in-car dataset.
ACKNOWLEDGEMENTS
This work is supported by: European Structural
and Investment Funds in the FEDER component,
through the Operational Competitiveness and Interna-
tionalization Programme (COMPETE 2020) [Project
nº 039334; Funding Reference: POCI-01-0247-
FEDER-039334].
ICAART 2021 - 13th International Conference on Agents and Artificial Intelligence