is the most probable cause; however, since the dataset is large, we believe this is unlikely and that the inference drawn holds.
5 CONCLUSIONS
In this work, we show that a personality recognition model benefits from additional modalities and data as input. We propose a new handcrafted behaviour encoding in which each element is the probability of a low-level action relevant to the task; a minimal illustration of this encoding is given below.
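The following sketch shows how such an encoding could be built from per-frame action detections. The action names, the detection format, and the averaging scheme are hypothetical placeholders for illustration, not the actual annotations or procedure used in this work.

```python
# A minimal sketch of a handcrafted behaviour encoding: each element of the
# vector is the empirical probability of one low-level action over the clip.
# The action list and per-frame detections below are hypothetical.
import numpy as np

# Hypothetical low-level actions relevant to the task.
ACTIONS = ["smile", "head_nod", "gaze_at_camera", "hand_gesture"]

def behaviour_encoding(frame_detections: np.ndarray) -> np.ndarray:
    """frame_detections: (num_frames, num_actions) binary matrix where
    entry (t, a) is 1 if action a is detected in frame t.
    Returns a (num_actions,) vector of empirical action probabilities."""
    return frame_detections.mean(axis=0)

# Example: a 5-frame clip with the 4 actions above.
detections = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
])
print(behaviour_encoding(detections))  # [0.6, 0.2, 0.8, 0.2]
```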
We demonstrate the contribution of each input through ablation studies and offer our interpretation of the trends they reveal. Owing to the interdisciplinary nature of the project, there are numerous extensions that could further improve performance, and we expect some to yield larger gains than others.
Using better backbones for feature extraction would be a natural first step: we use the same backbones as the baseline we build on, but models with stronger performance on similar tasks already exist and could be substituted. Transformers have also been shown to outperform LSTMs. In future work, we will try to increase the temporal scale of attention within the transformer itself, rather than using a separate module to combine information across chunks, as sketched below; this may address the problem observed with neuroticism, discussed in Section 4.3.
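The sketch below illustrates this idea under stated assumptions: per-chunk token embeddings are concatenated into a single sequence so that self-attention spans chunk boundaries, removing the need for a separate cross-chunk fusion module. It uses PyTorch's standard transformer encoder with illustrative dimensions, not the architecture of this paper.

```python
# A minimal sketch (not this paper's implementation) of widening the
# temporal scale of attention: tokens from all chunks are concatenated
# so that self-attention spans the full video. Dimensions are illustrative.
import torch
import torch.nn as nn

d_model, n_chunks, tokens_per_chunk = 128, 8, 16

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

# Hypothetical per-chunk token embeddings: (batch, chunks, tokens, d_model).
chunk_tokens = torch.randn(2, n_chunks, tokens_per_chunk, d_model)

# Flatten chunks into one long sequence so attention crosses chunk boundaries.
full_sequence = chunk_tokens.flatten(1, 2)        # (batch, chunks*tokens, d)
video_level = encoder(full_sequence).mean(dim=1)  # (batch, d_model)
print(video_level.shape)                          # torch.Size([2, 128])
```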
One of the major drawbacks of multimodal data is that preprocessing is time-consuming. It would therefore be interesting to explore knowledge distillation, so that a model using only one modality, or a subset of modalities, can approach the performance of the full multimodal model with fewer inputs; a sketch of such a distillation objective follows.
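A minimal sketch of this objective is shown below, assuming a frozen multimodal teacher and a single-modality student (here, audio). The networks, feature sizes, trait count, and loss weighting are all hypothetical stand-ins, not the architecture or training setup of this paper.

```python
# A minimal sketch of the knowledge-distillation idea: a single-modality
# student is trained to match both the ground-truth traits and the
# predictions of the full multimodal teacher.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 5))
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 5))

mse = nn.MSELoss()
alpha = 0.5  # hypothetical weight balancing supervised and distillation terms

def distillation_loss(audio_feat, multimodal_feat, traits):
    with torch.no_grad():  # the multimodal teacher is frozen
        teacher_pred = teacher(multimodal_feat)
    student_pred = student(audio_feat)
    # Supervised term against labels plus distillation term against teacher.
    return alpha * mse(student_pred, traits) + (1 - alpha) * mse(student_pred, teacher_pred)

# Example step with random features and trait labels in [0, 1].
loss = distillation_loss(torch.randn(4, 64), torch.randn(4, 512), torch.rand(4, 5))
loss.backward()
```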
We would also like to test our approach on other large-scale multimodal datasets as they become available. This line of work has many applications in healthcare, which we are exploring, and we hope that it advances the area and motivates others to work on this interesting problem.