Table 3: Average Precision (AP) per class.

Category                        AP       Category                     AP
power supply                    80.18    working area                 90.18
oscilloscope                    90.12    welder base                  88.82
welder station                  89.87    socket                       90.27
electric screwdriver            81.45    left red button              100
screwdriver                     58.73    left green button            100
pliers                          79.18    right red button             81.82
welder probe tip                50.63    right green button           90.91
oscilloscope probe tip          51.72    power supply cables          41.34
low voltage board               88.53    ground clip                  44.84
high voltage board              61.44    battery charger connector    15.91
register                        71.07    panel A                      89.77
electric screwdriver battery    51.72
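As a reading aid for Table 3, the following minimal sketch shows how the per-class values could be summarised. It is not part of the HERO pipeline: the function names `mean_ap` and `classes_below` and the 50.0 threshold are hypothetical choices made for illustration, and the unweighted mean it prints is computed here from the table rather than quoted from the evaluation.

```python
# Per-class AP values copied from Table 3.
AP_PER_CLASS = {
    "power supply": 80.18,
    "oscilloscope": 90.12,
    "welder station": 89.87,
    "electric screwdriver": 81.45,
    "screwdriver": 58.73,
    "pliers": 79.18,
    "welder probe tip": 50.63,
    "oscilloscope probe tip": 51.72,
    "low voltage board": 88.53,
    "high voltage board": 61.44,
    "register": 71.07,
    "electric screwdriver battery": 51.72,
    "working area": 90.18,
    "welder base": 88.82,
    "socket": 90.27,
    "left red button": 100.0,
    "left green button": 100.0,
    "right red button": 81.82,
    "right green button": 90.91,
    "power supply cables": 41.34,
    "ground clip": 44.84,
    "battery charger connector": 15.91,
    "panel A": 89.77,
}


def mean_ap(ap_per_class):
    """Unweighted mean of the per-class AP values (hypothetical helper)."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def classes_below(ap_per_class, threshold):
    """Class names whose AP is below `threshold`, lowest AP first."""
    low = [name for name, ap in ap_per_class.items() if ap < threshold]
    return sorted(low, key=ap_per_class.get)


if __name__ == "__main__":
    # The 50.0 threshold is an assumption chosen only for illustration.
    print(f"classes: {len(AP_PER_CLASS)}")
    print(f"unweighted mean AP: {mean_ap(AP_PER_CLASS):.2f}")
    print("AP below 50:", ", ".join(classes_below(AP_PER_CLASS, 50.0)))
```

Sorting the low-scoring classes makes it easy to see that the weakest results in the table belong to small or thin objects such as connectors, cables and clips.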
6 CONCLUSION
We presented HERO, a Conversational Intelligent Assistant that supports
workers through natural language and observes the surrounding environment
to resolve ambiguities in their requests. Experiments in the considered
industrial laboratory show that HERO performs well at intent and entity
prediction when both text and visual inputs are taken into account. Future
work will consider integrating wearable devices such as Microsoft HoloLens 2
and a speech-to-text module, so that workers can keep their hands free
while they work.
ACKNOWLEDGEMENTS
This research is supported by Next Vision³ s.r.l. and
by the project MEGABIT - PIAno di inCEntivi per la
RIcerca di Ateneo 2020/2022 (PIACERI) – linea di
intervento 2, DMI - University of Catania.
³ Next Vision: https://www.nextvisionlab.it/
SIGMAP 2022 - 19th International Conference on Signal Processing and Multimedia Applications