
body gestures and speech. Doctoral Consortium of
ACII, Lisbon.
Ding, W., Xu, M., Huang, D., Lin, W., Dong, M., Yu, X.,
and Li, H. (2016). Audio and face video emotion
recognition in the wild using deep neural networks
and small datasets. In Proceedings of the 18th ACM
International Conference on Multimodal Interaction,
pages 506–513.
Franco, A., Magnani, A., and Maio, D. (2020). A multi-
modal approach for human activity recognition based
on skeleton and rgb data. Pattern Recognition Letters,
131:293–299.
Kessous, L., Castellano, G., and Caridakis, G. (2010). Mul-
timodal emotion recognition in speech-based interac-
tion using facial expression, body gesture and acous-
tic analysis. Journal on Multimodal User Interfaces,
3(1):33–48.
Li, H., Shrestha, A., Fioranelli, F., Le Kernec, J., Heidari,
H., Pepa, M., Cippitelli, E., Gambi, E., and Spinsante,
S. (2017). Multisensor data fusion for human activ-
ities classification and fall detection. In 2017 IEEE
SENSORS, pages 1–3. IEEE.
Lin, Y.-S., Gau, S. S.-F., and Lee, C.-C. (2020). A mul-
timodal interlocutor-modulated attentional blstm for
classifying autism subgroups during clinical inter-
views. IEEE Journal of Selected Topics in Signal Pro-
cessing, 14(2):299–311.
Masmoudi, M., Jarraya, S. K., and Hammami, M. (2019).
Meltdowncrisis: Dataset of autistic children during
meltdown crisis. In 15th International Conference on
Signal-Image Technology and Internet-Based Systems
(SITIS), pages 239–246. IEEE.
Masurelle, A., Essid, S., and Richard, G. (2013). Multi-
modal classification of dance movements using body
joint trajectories and step sounds. In 2013 14th inter-
national workshop on image analysis for multimedia
interactive services (WIAMIS), pages 1–4. IEEE.
Metri, P., Ghorpade, J., and Butalia, A. (2011). Facial emo-
tion recognition using context based multimodal ap-
proach.
Ouyang, X., Kawaai, S., Goh, E. G. H., Shen, S., Ding,
W., Ming, H., and Huang, D.-Y. (2017). Audio-visual
emotion recognition using deep transfer learning and
multiple temporal models. In Proceedings of the 19th
ACM International Conference on Multimodal Inter-
action, pages 577–582.
Pandeya, Y. R. and Lee, J. (2021). Deep learning-based late
fusion of multimodal information for emotion classi-
fication of music video. Multimedia Tools and Appli-
cations, 80(2):2887–2905.
Pimpalkar, A., Nagalkar, C., Waghmare, S., and Ingole, K.
(2014). Thin slices of expressive behavior as predic-
tors of interpersonal consequences: A meta-analysis.
International Journal of Computing and Technology
(IJCAT), 1(2).
Psaltis, A., Kaza, K., Stefanidis, K., Thermos, S., Apos-
tolakis, K. C., Dimitropoulos, K., and Daras, P.
(2019). Multimodal affective state recognition in se-
rious games applications. In 2019 IEEE International
Conference on Imaging Systems and Techniques (IST),
pages 435–439. IEEE.
Radoi, A., Birhala, A., Ristea, N.-C., and Dutu, L.-C.
(2021). An end-to-end emotion recognition frame-
work based on temporal aggregation of multimodal
information. IEEE Access, 9:135559–135570.
Tian, J., Cheng, W., Sun, Y., Li, G., Jiang, D., Jiang,
G., Tao, B., Zhao, H., and Chen, D. (2020). Ges-
ture recognition based on multilevel multimodal fea-
ture fusion. Journal of Intelligent & Fuzzy Systems,
38(3):2539–2550.
Yu, J., Gao, H., Yang, W., Jiang, Y., Chin, W., Kubota, N.,
and Ju, Z. (2020). A discriminative deep model with
feature fusion and temporal attention for human action
recognition. IEEE Access, 8:43243–43255.
Zhang, S., Zhang, S., Huang, T., and Gao, W. (2016). Mul-
timodal deep convolutional neural network for audio-
visual emotion recognition. In Proceedings of the
2016 ACM on International Conference on Multime-
dia Retrieval, pages 281–284.
Zhu, H., Wang, Z., Shi, Y., Hua, Y., Xu, G., and Deng,
L. (2020). Multimodal fusion method based on self-
attention mechanism. Wireless Communications and
Mobile Computing, 2020.
ICSOFT 2024 - 19th International Conference on Software Technologies
350