
5 CONCLUSIONS
This paper investigates object detection and instance segmentation for real-time traffic surveillance applications. To this end, we have adapted existing instance segmentation models by training them on a proprietary dataset. Since instance-segmentation annotations are not available for this dataset, two novel methods are proposed for generating these annotations in a semi-automated procedure. The first procedure utilizes existing pre-trained models, while the second procedure employs box-supervised models that are first fine-tuned on the proprietary dataset.
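To illustrate the first procedure, the sketch below shows how instance masks could be generated from existing bounding-box annotations with a pre-trained Segment Anything Model via its box-prompt interface. This is a minimal sketch only; the checkpoint path, model variant and helper function are assumptions for illustration and do not reflect the exact pipeline used in this work.

    # Minimal sketch: derive instance-mask annotations from existing
    # bounding-box labels with a pre-trained Segment Anything Model (SAM).
    # Checkpoint path and "vit_h" variant are assumptions for illustration.
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    predictor = SamPredictor(sam)

    def boxes_to_masks(image: np.ndarray, boxes_xyxy: list) -> list:
        """Prompt SAM with each annotated box and return one binary mask per box."""
        predictor.set_image(image)          # RGB uint8 image of shape HxWx3
        masks = []
        for box in boxes_xyxy:
            mask, _, _ = predictor.predict(
                box=np.array(box),          # [x1, y1, x2, y2] box prompt
                multimask_output=False,     # keep the single best mask per prompt
            )
            masks.append(mask[0])           # boolean mask of shape HxW
        return masks

The resulting masks can then be manually verified and exported in the annotation format used for training, which is where the semi-automated character of the procedure comes from.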
The YOLACT-YOLOv7 model is evaluated as optimal for traffic surveillance applications because of its high performance and low latency. Fraction-training experiments on the COCO dataset show that 90% of the instance-segmentation performance can be achieved when only 70% of the dataset contains instance-segmentation annotations. Besides this, the YOLACT-YOLOv7 detection and segmentation performance increases significantly when it is trained on the proprietary dataset containing automatically generated instance segmentations. The instance-segmentation performance is highest when YOLACT-YOLOv7 is trained on the segmentation dataset generated by the Segment Anything Model (87.6% mAP). Fine-tuning a box-supervised model to generate the instance-segmentation ground truth for the proprietary dataset does not result in higher performance (85.5% mAP for BoxLevelSet). Visual inspection of the results shows that future research should focus on improving instance segmentation for partially occluded objects, for example by further improving the quality of the automatically generated dataset.
Training YOLACT-YOLOv7 on a semi-automatically annotated segmentation dataset forms an attractive solution, since it requires little manual annotation effort while the quality of the generated data is suitable for training. The trained YOLACT-YOLOv7 model achieves a high detection and instance-segmentation performance of 94.6% and 87.6% mAP, respectively, while maintaining real-time inference speed.
REFERENCES
Bolya, D., Zhou, C., Xiao, F., and Lee, Y. J. (2019). YOLACT: Real-time instance segmentation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9156–9165.
Chen, H., Sun, K., Tian, Z., Shen, C., Huang, Y., and Yan, Y. (2020). BlendMask: Top-down meets bottom-up for instance segmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8570–8578.
Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., and Schwing, A. G. (2021). Mask2Former for video instance segmentation. CoRR, abs/2112.10764.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929.
Getreuer, P. (2012). Chan-Vese segmentation. Image Processing On Line, 2:214–224.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. (2023). Segment Anything.
Kuhn, H. (2012). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2.
Li, W., Liu, W., Zhu, J., Cui, M., Hua, X.-S., and Zhang, L. (2022a). Box-supervised instance segmentation with level set evolution. In Avidan, S., Brostow, G., Cissé, M., Farinella, G. M., and Hassner, T., editors, Computer Vision – ECCV 2022, pages 1–18, Cham. Springer Nature Switzerland.
Li, W., Liu, W., Zhu, J., Cui, M., Yu, R., Hua, X., and Zhang, L. (2022b). Box2Mask: Box-supervised instance segmentation via level set evolution. arXiv.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. CoRR, abs/1405.0312.
Munawar, M. R. and Hussain, M. Z. (2023). Train YOLOv7 segmentation on custom data.
Sharma, R., Saqib, M., Lin, C. T., and Blumenstein, M. (2022). A survey on object instance segmentation. SN Computer Science, 3(6):499.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2023). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7464–7475.
Wang, X., Kong, T., Shen, C., Jiang, Y., and Li, L. (2020a). SOLO: Segmenting objects by locations. In Proceedings of the European Conference on Computer Vision (ECCV).
Wang, X., Zhang, R., Kong, T., Li, L., and Shen, C. (2020b). SOLOv2: Dynamic and fast instance segmentation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS).
Zwemer, M. H., Scholte, D., Wijnhoven, R. G. J., and de With, P. H. N. (2022). 3D detection of vehicles from 2D images in traffic surveillance. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, pages 97–106. INSTICC, SciTePress.