
vento 3 "Starting Grant" - University of Catania, and by the project Future Artificial Intelligence Research (FAIR) – PNRR MUR Cod. PE0000013 - CUP: E63C22001940006.