distinguish the training set and the test set. The model proposed in this paper is compared in the experiments with single-modal ResNet, ConvNet, and CNN-Transformer baselines.
Figure 4: The average test accuracy of emotion detection.
The average test results of these methods on the annotated datasets are shown in Figure 4. As the figure shows, the accuracies of ConvNet and ResNet on the Sad and Angry classes are relatively low, because these two emotions are easily confused with each other. CNN-Transformer has a clear advantage on the Happy class over the former two, but it still struggles to separate Sad from Angry. In contrast, the multi-modal transformer structure proposed in this paper achieves a modest improvement in accuracy on both Sad and Angry. We attribute this to the fusion of multimodal information, the transformer's ability to map both modalities into a shared space in which their similarity is computed, and the abstraction provided by the semantic topology, which together further improve the discrimination among the four emotion classes.
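For illustration only, the sketch below shows one plausible way the cross-modal mapping described above could be realised: audio and visual features are projected into a shared embedding space and fused with cross-attention before classification. The dimensions, layer choices, and four-class head are assumptions made for this example, not the exact configuration of the proposed model.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal fusion block (assumed configuration).

    Audio and visual features are projected into a shared space and fused
    with multi-head cross-attention, so similarity between modalities is
    computed in the same space before emotion classification.
    """

    def __init__(self, audio_dim=128, visual_dim=512, d_model=256,
                 n_heads=4, n_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)    # map audio features to shared space
        self.visual_proj = nn.Linear(visual_dim, d_model)  # map visual features to shared space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)    # e.g. Happy / Sad / Angry / Neutral (assumed)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_a, audio_dim), visual_feats: (batch, T_v, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        # visual tokens attend to audio tokens; the attention weights act as a
        # cross-modal similarity computed in the shared space
        fused, _ = self.cross_attn(query=v, key=a, value=a)
        # pool over time and classify
        return self.classifier(fused.mean(dim=1))

# minimal usage example with random features
model = CrossModalFusion()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 4])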
5 CONCLUSION
The field of deep learning for computer audio and video is developing rapidly, and combining language and visual information for emotion detection and recognition is a new research hotspot in artificial intelligence. Beyond detecting and recognizing human emotions, emotion recognition and management for pets have begun to receive attention: by collecting and processing pet sounds and the corresponding facial expressions, it is possible to understand a pet's current condition and respond to it appropriately. This paper proposes a pre-trained multi-modal transformer emotion detection system. The model is first pre-trained on a human emotion detection dataset that includes speech and facial expression data, and the labelled animal speech and expression data are then treated as small-sample task data. Pre-training on an unlabelled corpus allows the model parameters to be trained adequately while preventing overfitting, and the resulting representations are finally used for the few-shot tasks. Experimental results on video datasets show that the proposed multimodal Transformer structure achieves good accuracy compared with the other algorithms.
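As a rough illustration of this pre-train-then-adapt procedure (not the paper's released code), the following sketch pre-trains a multi-modal encoder on human emotion data and then fits only a small classification head on the labelled animal data. The encoder, the data loaders, the class counts, and the hyper-parameters are all hypothetical placeholders.

import torch
import torch.nn as nn

def pretrain(encoder, human_loader, epochs=10, lr=1e-4):
    # Pre-train the (assumed) multi-modal encoder on human emotion data.
    head = nn.Linear(encoder.out_dim, 7)           # e.g. 7 human emotion classes (assumed)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for audio, video, label in human_loader:
            opt.zero_grad()
            loss = loss_fn(head(encoder(audio, video)), label)
            loss.backward()
            opt.step()
    return encoder

def few_shot_finetune(encoder, animal_loader, n_classes=4, epochs=20, lr=1e-3):
    # Reuse the pre-trained representations for the few-shot animal task.
    for p in encoder.parameters():                 # freeze pre-trained weights to limit overfitting
        p.requires_grad = False
    head = nn.Linear(encoder.out_dim, n_classes)   # only this small head is trained
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for audio, video, label in animal_loader:
            opt.zero_grad()
            loss = loss_fn(head(encoder(audio, video)), label)
            loss.backward()
            opt.step()
    return head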