Deep Learning-Based Multimodal Sentiment Analysis
Zijian Wang
Mathematics BSc, University College London, London, U.K.
Keywords: Deep Learning, Sentiment Analysis, Multimodal, Natural Language Processing.
Abstract: Multimodal sentiment analysis is an active research area within natural language processing. It studies how people express their feelings through different modalities of communication and can be applied in many areas, such as e-commerce and film and TV reviews. With the advent of technologies such as machine learning and deep learning, significant progress has been achieved in multimodal sentiment analysis. This paper first introduces multimodal sentiment analysis and then divides it into narrative and interactive settings according to the presence or absence of dialogue. The characteristics and distinctions of these two approaches are then introduced with respect to data, algorithms, and applications, by reviewing pertinent recent domestic and international research. Lastly, this work addresses the current drawbacks of multimodal sentiment analysis as well as directions for future research. In doing so, the paper provides a reference for sentiment analysis researchers and outlines future work in this dynamic topic.
1 INTRODUCTION
Sentiment analysis, which is also called opinion or emotion mining, is considered a key area of NLP (natural language processing) that identifies and categorizes subjective information about products, services, organizations, events, and subjects (Melville et al. 2009). This field uses methods from NLP, statistics, and machine learning to uncover the meaning behind spoken or written language, revealing how people feel, what they think, and what they believe. In today’s data-driven
society, sentiment analysis is vital for boosting
customer service and leading product development.
The rise of social media has significantly altered
communication patterns, resulting in a need for
multimodal sentiment analysis (Liu & Zhang 2012,
S.L.C. & Sun 2017). This newer approach seeks to
understand sentiments expressed not only through
words but also through images, audio, and video. This
comprehensive method aims to capture the complex
and multifaceted nature of sentiment expression,
which often includes a mix of verbal as well as non-
verbal cues. The transition to multimodal sentiment
analysis is a significant step in the area, which holds
the promise of advancements in emotion
identification technology that are both more
sophisticated and accurate.
However, the move to multimodal sentiment
analysis brings its own set of challenges. The main
difficulty lies in effectively combining and
interpreting the varied data types involved in
multimodal communication. This includes the
challenge of aligning and merging information from
different sources, each with its unique characteristics,
and dealing with the dynamic nature of interactive
communications. These hurdles underscore the need
for innovative solutions in areas like multimodal
representation learning, alignment, and fusion to push
the field forward.
Despite these challenges, the potential advantages
of successful multimodal sentiment analysis are
significant. It can lead to more precise and nuanced
interpretations of emotions, enhancing customer
insights, media experiences, and communication
strategies across various platforms. By addressing the
limitations of unimodal analysis and tapping into the
rich potential of multimodal data, researchers aim to
deepen our understanding of human sentiment. This
progress is not just a leap forward in natural language
processing and artificial intelligence but also marks a
step towards creating more empathetic and human-
centered technology (Zhang et al. 2018).
This research delves into the complexities of
multimodal sentiment analysis, examining its
challenges and the opportunities it presents. It adopts
a multidisciplinary approach, drawing from
linguistics, psychology, computer science, and data
science, to develop new computational models and
algorithms. These models aim to better capture how
humans express and perceive emotions across
different contexts and cultures. Through this
exploration, the study identifies crucial challenges in
the field and suggests innovative solutions, shedding
light on the evolution from text-only methods to
multimodal approaches. It also evaluates the strengths
and weaknesses of current technological
methodologies, providing insights into potential
applications. The dissertation concludes by
summarizing the findings and looking ahead to future
research directions in multimodal sentiment analysis.
This includes considering the impact on natural
language processing and proposing a roadmap for
further research to enhance sentiment analysis
technologies.
2 DATASET
2.1 Data Requirements and Datasets
Multimodal sentiment analysis stands at the
intersection of various data types, each contributing a
unique perspective to the understanding of sentiments
and emotions. This analysis relies heavily on the
amalgamation of text, images, audio, and video data
to offer a multidimensional view of sentiment
expression. Textual data, ranging from concise tweets
to detailed product reviews, serves as a direct
articulation of thoughts and opinions, providing clear
indicators of sentiment polarity. Images, whether they
are standalone pictures or part of video content,
convey emotions through visual elements such as
colors, expressions, and symbols, offering insights
into the sentiment without the need for words. Audio
data adds another layer, with the tone, pace, and pitch
of voice carrying subtle cues about the speaker's
emotional state. Video data combines these elements,
presenting a rich narrative of sentiment through
dynamic interactions, facial expressions, and verbal
communication, encapsulated in a temporal
sequence.
To facilitate research and development in this
area, several public datasets have become invaluable
resources, each characterized by its multimodal
content and annotations:
CMU-MOSI: This dataset serves as a benchmark
and contains video clips. Each clip is annotated
with sentiment scores across multiple modalities,
making it a comprehensive resource for analyzing
opinion dynamics (Zadeh et al. 2016).
IEMOCAP: Short for Interactive Emotional Dyadic Motion Capture, this dataset contains video and audio recordings of dialogues acted out by professional actors and annotated with a variety of emotional states. It is particularly useful for studies
focusing on emotional expressions in
conversational contexts.
SEMAINE: A collection of audio-visual
recordings from interactions with a Sensitive
Artificial Listener (SAL) system, designed to
elicit emotional responses. Annotations include
dimensional and categorical emotion labels,
facilitating research into affective computing.
YouTube-8M: A large-scale dataset that offers a
wide array of YouTube videos tagged with labels,
including topics and sentiments. While it
primarily serves as a resource for video
understanding tasks, its extensive collection
allows for sentiment analysis across diverse video
content.
These datasets originate from varied sources,
including social media platforms, dedicated research
efforts, and public contributions, encompassing a
wide range of subjects, contexts, and emotional
expressions. They contain raw multimodal data and
annotations that mark emotional states, providing a ground truth for training and evaluating sentiment analysis models.
The variety and depth of these datasets reflect how multimodal sentiment analysis is evolving over time. They also give researchers and practitioners a foundation for building more sophisticated and accurate models. By leveraging these resources, the field
continues to advance, enhancing our ability to
decipher the complex tapestry of human emotions as
expressed through the myriad channels of
communication in the digital age.
2.2 Data Processing Techniques
The efficacy of multimodal sentiment analysis hinges
on sophisticated data processing techniques tailored
to prepare and integrate diverse data types—text,
images, audio, and video—into a coherent framework
that models can analyze. Each modality undergoes
specific preprocessing steps to transform raw data
into structured forms suitable for sentiment analysis
(Soleymani et al. 2017, Poria et al. 2017).
2.2.1 Text Data Processing
Textual information, with its rich semantic and
syntactic diversity, is preprocessed through
tokenization and vectorization. Advanced language
models like BERT or Word2Vec are employed to
encode text into numerical vectors. These models
capture the nuances of language, including context,
sentiment, and the relationships between words,
converting unstructured text into a structured form
that sentiment analysis algorithms can interpret. This
process involves natural language processing
techniques, which include lemmatization, stemming,
and the removal of stop words to refine the text data
further before vectorization.
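As an illustration of this pipeline, the sketch below encodes raw sentences into fixed-size vectors with a pretrained BERT model from the Hugging Face transformers library. The model name, maximum length, and mean-pooling strategy are illustrative assumptions rather than the setup of any particular study.

```python
# Minimal sketch: encoding raw text into sentence-level vectors with a
# pretrained BERT model (Hugging Face transformers). Model name and
# pooling choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_texts(texts):
    """Tokenize a batch of sentences and return one vector per sentence."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool token embeddings, masking out padding tokens.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

vectors = encode_texts(["The movie was surprisingly moving.",
                        "Terrible battery life, would not recommend."])
print(vectors.shape)  # (2, 768) for bert-base models
```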
2.2.2 Image Data
Images are processed using techniques that allow for
the extraction of emotional cues embedded in visual
content. Convolutional Neural Networks (CNNs)
then analyze these preprocessed images, extracting
features that reflect visual sentiments, such as colors,
textures, and facial expressions. This process enables
the model to understand and interpret the sentiment
conveyed through visual information directly.
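A minimal sketch of this kind of visual feature extraction is shown below, using a pretrained ResNet-18 from torchvision as the CNN backbone. The backbone choice, input resolution, and normalization constants are standard ImageNet defaults assumed here for illustration.

```python
# Minimal sketch: extracting visual features with a pretrained CNN
# (torchvision ResNet-18). Backbone and input size are assumptions; any
# CNN feature extractor could stand in here.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep 512-d features
backbone.eval()

def image_features(path):
    """Return a 512-dimensional feature vector for one image file."""
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)   # shape (1, 3, 224, 224)
    with torch.no_grad():
        return backbone(batch).squeeze(0)  # shape (512,)
```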
2.2.3 Audio Data
Audio data requires conversion into a format that
highlights features relevant to sentiment analysis,
such as tone, pitch, and rhythm. Preprocessing steps
include sampling, noise reduction, and the extraction
of features like Mel-Frequency Cepstral Coefficients
(MFCCs) or spectrograms. These features
encapsulate the emotional nuances present in audio
data, preparing it for analysis by models capable of
processing sequential and time-series data, such as
Recurrent Neural Networks (RNNs) or LSTMs.
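The following sketch, assuming the librosa library and a 16 kHz sampling rate, illustrates how a waveform might be converted into an MFCC sequence for sequence models, or summarized into a single vector for simpler classifiers.

```python
# Minimal sketch: turning a waveform into MFCC features with librosa.
# Sampling rate and number of coefficients are illustrative choices.
import librosa
import numpy as np

def audio_features(path, sr=16000, n_mfcc=13):
    """Load an audio file and return a (frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)   # resample to 16 kHz
    y, _ = librosa.effects.trim(y)      # trim leading/trailing silence
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                       # time-major, ready for RNN/LSTM input

def summarize(mfcc_frames):
    """Collapse the frame sequence into summary statistics for simple classifiers."""
    return np.concatenate([mfcc_frames.mean(axis=0), mfcc_frames.std(axis=0)])
```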
2.2.4 Video Data
Video, as a combination of audio and visual data
along with potential textual components (like
subtitles), undergoes a composite preprocessing
routine. Frames extracted from videos are processed
in a manner similar to images, while the audio track
is treated as standalone audio data. Additionally,
textual information embedded in videos is extracted
and processed using text data techniques. The
challenge lies in effectively synchronizing these
modalities to maintain the temporal coherence of
sentiment expressions throughout the video.
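As a rough illustration of frame sampling with timestamps, so that visual features can later be aligned with the audio and subtitle streams, the sketch below uses OpenCV; the sampling interval is an arbitrary assumption, and the audio track would be demuxed separately (for example with ffmpeg) and processed as in the audio example above.

```python
# Minimal sketch: sample video frames at a fixed interval, keeping timestamps
# so that visual, audio, and subtitle features can be aligned on one timeline.
import cv2

def sample_frames(path, every_n_seconds=1.0):
    """Yield (timestamp_seconds, frame) pairs sampled at a fixed interval."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / fps, frame              # timestamp keeps modalities alignable
        index += 1
    cap.release()
```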
By employing these data processing techniques,
multimodal sentiment analysis models are equipped
to handle the complexities and subtleties of human
emotions as conveyed through multiple modes of
communication. This comprehensive approach to
data preparation and integration is crucial for the
development of accurate and effective tools to
analyze sentiment.
3 NARRATIVE MULTIMODAL
SENTIMENT ANALYSIS
The goal of narrative multimodal sentiment analysis
is to classify subjective attitudes into categories such
as Positive, Negative, and Neutral. This approach
stands apart from unimodal sentiment analysis by not
only necessitating feature learning but also requiring
the process of information fusion (Qian et al. 2019,
Verma et al. 2019). This process integrates data from
different modalities which include text, images, or
videos, in what is referred to as multimodal
interaction or multimodal fusion. The prevalent
methods for multimodal fusion are categorized into
three main types: feature-level fusion, decision-level
fusion, and hybrid fusion, each offering unique
advantages and facing distinct challenges (Atrey et al.
2010, Zhang et al. 2020), as shown in Figure 1.
Feature-level fusion primarily combines feature
vectors from each modality, such as textual and
visual feature vectors, into a singular multimodal
feature vector for subsequent decision-making
analysis. Its strength lies in capturing the
intermodal feature correlations, facilitating a
richer sentiment analysis. This method requires
the early fusion of modal features, simplifying the
classification process to a single classifier.
However, the challenge arises from the need to
map features from differing semantic spaces and
dimensions into a shared space, considering their
variance in time and semantic dimensions.
Decision-level fusion, on the other hand, operates
by independently extracting and classifying
features from each modality to achieve local
decisions, which are then merged to form the final
decision vector. This method offers simplicity and
flexibility, allowing each modality to utilize the
most suitable feature extractors and classifiers for
optimal local decisions. Despite its advantages,
the necessity to learn classifiers for all modalities
elevates the time cost of the analysis process.
Hybrid fusion combines feature-level and decision-level fusion, aiming to harness their benefits while mitigating the drawbacks of each. This approach endeavors to
provide a comprehensive and efficient strategy for
multimodal sentiment analysis, optimizing the
fusion process for improved sentiment
classification.
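To make the contrast concrete, the sketch below implements toy versions of the feature-level (early) and decision-level (late) fusion strategies described above for a text-image pair in PyTorch; the feature dimensions, classifier heads, and logit-averaging rule are placeholder assumptions.

```python
# Minimal sketch contrasting the two basic fusion strategies. Feature
# dimensions and classifier heads are placeholder assumptions.
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 768, 512, 3   # Positive / Negative / Neutral

class FeatureLevelFusion(nn.Module):
    """Early fusion: concatenate modality features, classify once."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(TEXT_DIM + IMAGE_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES))

    def forward(self, text_feat, image_feat):
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

class DecisionLevelFusion(nn.Module):
    """Late fusion: classify each modality separately, then average the logits."""
    def __init__(self):
        super().__init__()
        self.text_head = nn.Linear(TEXT_DIM, NUM_CLASSES)
        self.image_head = nn.Linear(IMAGE_DIM, NUM_CLASSES)

    def forward(self, text_feat, image_feat):
        return 0.5 * (self.text_head(text_feat) + self.image_head(image_feat))
```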
Moreover, multimodal sentiment analysis extends
beyond textual analysis to include images, audio, and
video, capturing the dynamic nature of speech or
movement across different time frames. By
classifying text and images as static documents and
audio and video as dynamic documents, this paper
explores both static and dynamic multimodal
sentiment analysis. It emphasizes the development
and current state of technology-driven static and
dynamic analyses, respectively, highlighting the
significance of combining textual, visual, and
auditory data to discern subjective tendencies.
Figure 1: Multimodal Fusion Strategies, showing (a) feature-level, (b) decision-level, and (c) hybrid fusion (Atrey et al. 2010).
3.1 Static Multimodal Emotion
Analysis
As social media platforms grow in popularity, interest in digital photography has risen sharply; usage has soared, and ever more images are shared on the web. These images accompany text to express the author's emotional state, and the combination of images and text has become a channel for exploring public opinions, preferences, and emotions. Given users' growing demand for emotional expression, analyzing multimodal content at the semantic level, that is, the emotional level, has become increasingly urgent, and image-text (i.e., static) sentiment analysis has attracted the attention of more and more researchers. There is now a substantial body of work on image-text emotion analysis, and among the various techniques, machine learning and deep learning have made considerable progress (Yuan et al. 2013).
3.2 Machine Learning and Deep
Learning Approaches
With the introduction of machine learning, static
multimodal sentiment analysis has been greatly
advanced. This has been accomplished through the
utilization of statistical algorithms such as SVM, RF,
and NB (Cao et al. 2014, Wagner et al. 2011). By
approaching graphic emotion analysis as a supervised
classification task, these methods have made it
simpler to investigate the complex link that exists
between emotions expressed in written language and
visual representations. They do this by using features
like ANP to improve emotion inference and adding
textual titles for a more complete analysis. However,
despite their high recognition rates, these techniques
heavily depend on the painstaking process of feature
engineering, making them labor-intensive and time-
consuming.
Deep learning has revolutionized static
multimodal sentiment analysis, offering end-to-end
solutions that circumvent the need for manual feature
engineering (Devlin et al. 2019, You et al. 2016, Poria
et al. 2017). Models like CNNs and LSTM networks
have demonstrated superior performance across
various tasks, including image processing and natural
language processing. By employing techniques such
as attention mechanisms and tensor fusion networks,
deep learning approaches have achieved significant
improvements in classifying emotions from static
multimodal data. Nevertheless, these methods require
extensive data and computational resources, posing
challenges in terms of training time and
computational cost.
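As one example of such a fusion technique, the sketch below shows outer-product (tensor-style) fusion of two modality vectors in PyTorch, loosely in the spirit of tensor fusion networks; the dimensions and the single linear classifier are illustrative simplifications, not a reproduction of any published model.

```python
# Minimal sketch of outer-product (tensor) fusion for two modalities, in the
# spirit of tensor fusion networks; dimensions are placeholder assumptions.
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    def __init__(self, text_dim=64, image_dim=64, num_classes=3):
        super().__init__()
        # The +1 accounts for the appended constant, so unimodal terms survive fusion.
        self.classifier = nn.Linear((text_dim + 1) * (image_dim + 1), num_classes)

    def forward(self, text_feat, image_feat):
        ones = torch.ones(text_feat.size(0), 1, device=text_feat.device)
        t = torch.cat([text_feat, ones], dim=-1)           # (B, Dt+1)
        v = torch.cat([image_feat, ones], dim=-1)          # (B, Dv+1)
        fused = torch.bmm(t.unsqueeze(2), v.unsqueeze(1))  # (B, Dt+1, Dv+1) outer product
        return self.classifier(fused.flatten(start_dim=1))
```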
3.3 Dynamic Multimodal Emotion
Analysis
Massive amounts of video data are uploaded to the Internet every day. Research is therefore no longer content with images and text alone, but has begun to pay attention to related media resources such as accompanying speech and audio. The goal is to integrate various kinds of media information to achieve a more complete, comprehensive, and accurate understanding of the emotion in a video, while drawing on psychology, philosophy, linguistics, and other disciplines to enrich the research area (S.L.C. & Sun 2017). Compared with image-text emotion analysis, video emotion analysis typically involves motion and change across different time frames and is therefore dynamic. Dynamic emotion analysis has thus become the inevitable next step in the development of static emotion analysis.
3.4 Machine Learning and Deep
Learning Approaches
Before deep learning gained prominence, machine
learning algorithms were the mainstay for analyzing
video emotions, focusing on integrating textual,
visual, and auditory cues for sentiment analysis (Tao
2009, Sebe et al. 2006). Techniques such as the HMM
and SVM have been employed to analyze emotions
through multimodal feature fusion, demonstrating the
effectiveness of combining different modalities for
enhanced emotion recognition. However, similar to
static analysis, these methods require significant
efforts in feature engineering.
Deep learning methods, like CNNs, LSTMs, and
GANs, are being used more and more in dynamic
multimodal sentiment analysis because they can
perform end-to-end learning. These models excel in
analyzing the complex interplay of textual, visual,
and auditory information in videos, employing
strategies like bidirectional LSTMs and feature fusion
based on Gaussian kernels for emotion recognition
(Zhang et al. 2009). Deep learning-based models have
set new benchmarks in accuracy for video sentiment
analysis, albeit with the challenges of requiring large
datasets and extensive computational resources.
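A minimal sketch of the bidirectional-LSTM idea is given below: a BiLSTM runs over a sequence of per-frame (or per-utterance) multimodal feature vectors and the pooled output is classified. The feature size, hidden size, and mean pooling are assumptions made for illustration.

```python
# Minimal sketch: a bidirectional LSTM over a sequence of per-frame (or
# per-utterance) multimodal feature vectors, as used in dynamic analysis.
import torch
import torch.nn as nn

class BiLSTMSentiment(nn.Module):
    def __init__(self, feat_dim=1024, hidden=128, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        outputs, _ = self.lstm(frames)    # (batch, time, 2*hidden)
        pooled = outputs.mean(dim=1)      # average over the time axis
        return self.head(pooled)

logits = BiLSTMSentiment()(torch.randn(4, 30, 1024))   # 4 clips, 30 frames each
```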
4 INTERACTIVE MULTIMODAL
SENTIMENT ANALYSIS
Interactive multimodal sentiment analysis, which is also interpreted as multimodal conversational sentiment analysis, aims to determine how people's feelings change during conversations. This field extends beyond the scope of narrative multimodal sentiment analysis by focusing on the fluidity and
evolution of emotional states among participants in a
dialogue. It presents unique challenges distinct from
narrative multimodal sentiment analysis because of
several factors:
Interactions with other people can affect each participant's emotional state, which can result in shifts in sentiment as the conversation progresses.
Conversations encapsulate hidden layers of
information, such as cultural backgrounds,
professional settings, and the nature of social
relationships, which can significantly impact the
emotional undertones of the dialogue.
The thought process of speakers during a chat may
not follow a linear trajectory, resulting in non-
coherent discourse and sudden shifts in the topic
of discussion.
These complexities necessitate a nuanced approach to
sentiment analysis that not only processes textual and
multimodal data but also deciphers the intricate web
of interactions among speakers. The purpose of
interactive multimodal sentiment analysis is to
determine whether or not a chat session contains
subjective information and, if it does, to identify the
emotional state of each message while also tracking
how the emotions of each participant changed
throughout the conversation.
Addressing these challenges requires advanced
models capable of capturing the nuanced interplay of
emotions in conversational contexts. This task is not
only pivotal for refining sentiment analysis
techniques but also contributes to broader
advancements in artificial intelligence by enhancing
the understanding of human-machine interactions.
4.1 Multimodal Conversational
Emotion Dataset
Over the years, researchers have built various types
of multimodal emotion datasets, providing
experimental data for multimodal emotion analysis
models. Table 1 below shows the frequently used
datasets.
Table 1: Multi-Modal Sentiment Datasets.

Type                    | Dataset      | Modality           | Link
Narrative Multi-modal   | T4SA         | Image, Text        | http://www.t4sa.it/
Narrative Multi-modal   | CHEAVD2.0    | Video, Audio       | http://www.chineseldc.org/emotion.html
Narrative Multi-modal   | Multi-ZOL    | Image, Text        | https://github.com/xunan0812/MIMN
Narrative Multi-modal   | SEED         | Brainwave          | http://bcmi.sjtu.edu.cn/~seed/
Narrative Multi-modal   | Yelp         | Image, Text        | https://www.yelp.com/dataset/challenge
Narrative Multi-modal   | HUMAINE      | Video, Audio       | http://emotion-research.net/download/pilot-db
Narrative Multi-modal   | Belfast      | Video, Audio       | http://belfast-naturalistic-db.sspnet.eu
Narrative Multi-modal   | YouTube      | Text, Video, Audio | E-mail request
Narrative Multi-modal   | CMU-MOSI     | Image, Text        | https://www.amir-zadeh.com/datasets
Interactive Multi-modal | EmotionLines | Text               | https://academiasinicanlplab.github.io/#download
Interactive Multi-modal | DailyDialog  | Text               | http://yanran.li/dailydialog
Interactive Multi-modal | ScenarioSA   | Text               | https://github.com/anonymityanonymity/
Interactive Multi-modal | SEMAINE      | Video, Audio       | http://semaine-db.eu
Interactive Multi-modal | IEMOCAP      | Video, Audio       | http://sail.usc.edu/iemocap/
Interactive Multi-modal | MELD         | Text, Video, Audio | https://affective-meld.github.io/
Interactive Multi-modal | EmoContext   | Text               | http://humanizing-ai.com/emocontext.html
These datasets provide resources for studying
human interaction related to various emotions, with
each dataset designed to capture different aspects of
conversational sentiment.
4.2 Multimodal Conversational
Emotion Analysis Model
Interaction is an indirect and often unseen way of changing the actions or thoughts of other entities. Given
the complexity and concealment of the interaction
mechanism, understanding and computing
interlocutor interactions has been a challenging field
in social sciences. Researchers have made numerous
attempts to address this issue, with early models like
HMM and influence models attempting to formalize
and calculate interpersonal interactions.
In sentiment analysis, it has been difficult to model how utterances in a conversation interact with one another. Early work on conversational sentiment analysis treated each utterance in isolation, without accounting for how utterances relate to each other. However, as deep learning has grown, researchers have started to pay more attention to how utterances interact and affect each other, and have created multimodal conversational emotion analysis models that can capture this information. These
models are based on deep learning technology and
aim to understand the complex and concealed
interactions within conversations, although research
in this area is still relatively sparse.
Models like the contextual LSTMs designed by
Poria et al. (2017), and DialogueRNN by Majumder
et al. (2019) represent efforts to trace each
conversationalist's emotional state and model the
evolution of emotion throughout the conversation.
These and other models aim to advance the area of
interactive multimodal sentiment analysis by
capturing the nuanced interactions that occur within
conversations.
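The sketch below captures the general idea behind such speaker-aware models, keeping one recurrent hidden state per speaker and updating it as each utterance arrives. It is a deliberate simplification for illustration, not a reproduction of DialogueRNN or the contextual LSTM architecture.

```python
# Minimal sketch of the idea behind speaker-state models such as DialogueRNN:
# keep one recurrent state per speaker and update it with each utterance.
import torch
import torch.nn as nn

class SpeakerStateTracker(nn.Module):
    def __init__(self, utt_dim=768, state_dim=128, num_classes=3):
        super().__init__()
        self.cell = nn.GRUCell(utt_dim, state_dim)
        self.head = nn.Linear(state_dim, num_classes)
        self.state_dim = state_dim

    def forward(self, utterances, speakers):
        """utterances: (T, utt_dim) tensor; speakers: list of T speaker ids."""
        states = {}                                   # one hidden state per speaker
        logits = []
        for utt, spk in zip(utterances, speakers):
            prev = states.get(spk, torch.zeros(self.state_dim))
            states[spk] = self.cell(utt.unsqueeze(0), prev.unsqueeze(0)).squeeze(0)
            logits.append(self.head(states[spk]))
        return torch.stack(logits)                    # per-utterance emotion logits

out = SpeakerStateTracker()(torch.randn(5, 768), ["A", "B", "A", "B", "A"])
```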
In conclusion, interactive multimodal sentiment
analysis is gaining popularity as researchers produce
cutting-edge results and advance the discipline.
5 CHALLENGES FACED IN THE INTERACTIVE CONSTRUCTION OF MULTI-MODAL SENTIMENT ANALYSIS
Multimodal sentiment analysis, with its convergence
of disciplines like linguistics, computer science, and
cognitive science, faces significant challenges in
deciphering the complex interplay of sentiments
across different communication modalities. As
technology becomes ever more entwined with our
daily communication, the ability to accurately
interpret sentiments from varied data sources—text,
images, and videos—becomes increasingly critical.
This analysis involves not just the examination of
sentiment within individual modalities but also
understanding how these different forms of
expression interact and combine to convey
comprehensive emotional narratives.
5.1 Lexical Interaction Problem in
Modalities
The first challenge pertains to the lexical interactions
within individual modalities, particularly text. Words
in a sentence are not isolated entities but are closely
interconnected, influencing each other to convey
comprehensive semantic meanings. One-hot
encoding, bag-of-words, and N-gram models have been successful but struggle with polysemy, where a word's meaning changes with context. Dynamic word
representation techniques like ELMo and BERT,
which consider the context on both sides of a word,
have shown improvement in various natural language
processing tasks (Peters et al. 2018). However, these
methods primarily capture proximate contextual
information, leaving the modeling of long-distance
lexical interactions as an area ripe for exploration.
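The toy example below illustrates what such contextual representations provide: the same surface word receives different vectors in different sentences, which static embeddings cannot offer. The model name and the probed word are illustrative assumptions, and the snippet assumes the word maps to a single subtoken.

```python
# Minimal sketch: contextual models assign different vectors to the same word
# in different contexts, which is how they cope with polysemy.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual embedding of `word` (assumed to be one subtoken)."""
    enc = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    position = tokens.index(word)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    return hidden[position]

a = word_vector("She sat by the river bank.", "bank")
b = word_vector("He deposited cash at the bank.", "bank")
print(torch.cosine_similarity(a, b, dim=0))   # noticeably below 1.0
```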
5.2 Inter-Modal Multimodal
Interaction Problems
The second challenge focuses on the interactions
between different modalities, aiming to integrate
information from various sources like text, images,
and audio to model their associations and
interactions. This integration is crucial for generating
richer and more accurate multimodal outputs. The
main issues here involve aligning and fusing features
from different modalities, each extracted from
distinct semantic spaces and temporal instances.
Researchers have experimented with several
approaches, such as feature concatenation, deep
network-based shared latent learning, tensor fusion,
and attention mechanism-based fusion. Despite these
efforts, achieving effective alignment and fusion
within a shared space remains a significant challenge,
necessitating further research to develop a unified
theory of inter-modal interaction.
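As a sketch of the attention-mechanism-based fusion mentioned above, the snippet below lets textual features attend over audio features with multi-head attention after projecting both modalities to a shared width; all dimensions are placeholder assumptions rather than a specific published design.

```python
# Minimal sketch of attention-based cross-modal fusion: text queries attend
# over audio keys/values. Dimensions are placeholder assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim=768, audio_dim=40, shared=128, heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared)    # map text into the shared space
        self.audio_proj = nn.Linear(audio_dim, shared)  # map audio into the shared space
        self.attn = nn.MultiheadAttention(shared, heads, batch_first=True)

    def forward(self, text_seq, audio_seq):
        """text_seq: (B, Lt, text_dim); audio_seq: (B, La, audio_dim)."""
        q = self.text_proj(text_seq)
        kv = self.audio_proj(audio_seq)
        fused, _ = self.attn(q, kv, kv)   # text positions attend to audio frames
        return fused                      # (B, Lt, shared)

out = CrossModalAttention()(torch.randn(2, 12, 768), torch.randn(2, 50, 40))
```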
5.3 Discourse Flow Interaction Beyond
Modalities
The third challenge emerges from the discourse flow
interactions that extend beyond the modalities
themselves, highlighting the importance of
conversational context in sentiment analysis. The
same phrase can have different meanings and
emotional connotations depending on the dialogue
context, making the analysis of utterance interactions
crucial. Conversational sentiment analysis requires
understanding the nuanced implications of statements
within various dialogic backgrounds, particularly in
chat scenarios where statements entail strong and
repetitive interactions. Current research often treats
utterances in isolation, lacking a systematic study of
their interplay. Developing models that can
comprehend and represent these discourse flow
interactions remains a core challenge, pointing
toward the need for a generalized framework for
modeling discourse context interactions.
Each of these interaction challenges represents a
distinct layer of complexity in multimodal sentiment
analysis, from intra-modal lexical dependencies to
inter-modal dynamics and beyond to the overarching
conversational context. Solving these problems requires new methods that can properly capture and combine the complex ways people express their feelings within and across different modes of communication, advancing both sentiment analysis and artificial intelligence.
6 CONCLUSION
Multimodal sentiment analysis is valued for its ability to determine how people feel by analyzing multiple modes of communication. This paper has
delved into the essentials of multimodal sentiment
analysis, covering its research background, problem
definitions, and current advancements. It has
highlighted the challenges in this field and outlined
potential future research directions.
As we look to the future, several areas are ripe for
exploration. One such area is the integration of
unconventional data types like touch feedback and
physiological signals, which could offer new insights
into emotional analysis. Despite this growth, comprehensive datasets for interactive multimodal sentiment analysis remain scarce. Existing datasets, such as MELD and IEMOCAP, are mostly based on scripted scenarios, which may not fully reflect how people behave in real life. This
gap indicates a need for more authentic datasets that
reflect genuine human communication.
Future research should also focus on developing
lightweight models that prioritize data privacy and
require less data to operate effectively. This approach
addresses practical challenges like model deployment
on limited-resource devices and ensures user privacy.
Additionally, finding ways to enhance model
performance with minimal data could solve the
problem of data scarcity and reduce reliance on
extensive datasets.
Another important direction is the exploration of
general interaction theories that can manage the
complex interactions within discourse flows,
especially in settings involving multiple speakers.
Creating a systematic framework for modeling these relationships could improve multimodal sentiment analysis.
In summary, multimodal sentiment analysis holds the potential for major advances. By addressing these problems and pursuing the future directions suggested here, researchers can develop more sophisticated, ethical, and practical technology-based solutions for understanding human emotions.
REFERENCES
P. Melville, W. Gryc, R.D. Lawrence. "Sentiment analysis
of blogs by combining lexical knowledge with text
classification." ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
pp. 1275–1284 (2009).
B. Liu, L. Zhang. "A survey of opinion mining and
sentiment analysis." Mining Text Data, Springer U.S.,
pp. 415–463 (2012).
S. L. C. & C.J. Sun. "A review of natural language
processing techniques for opinion mining systems."
Inform. Fusion, 36, 10-25 (2017).
Y. Z. Zhang, D. W. Song, P. Zhang, et al. "A Quantum-
Inspired Multimodal Sentiment Analysis Framework."
Theoretical Computer Science, 752, 21-40 (2018).
A. Zadeh, R. Zellers, E. Pincus, L.P. Morency. "Mosi:
multimodal corpus of sentiment intensity and
subjectivity analysis in online opinion videos." arXiv
preprint arXiv:1606.06259 (2016).
M. Soleymani, D. Garcia, B. Jou, B. Schuller, S.F. Chang,
M. Pantic. "A survey of multimodal sentiment
analysis." Image Vis. Comput., 65, 3-14 (2017).
S. Poria, E. Cambria, R. Bajpai, A. Hussain. "A review of
affective computing: from unimodal analysis to
multimodal fusion." Inform. Fusion, 37, 98-125 (2017).
Y. F. Qian, Y. Zhang, X. Ma, et al. "EARS: Emotion-Aware
Recommender System Based on Hybrid Information
Fusion." Information Fusion, 46, 141-146 (2019).
S. Verma, C. Wang, L. M. Zhu, et al. "DeepCU: Integrating
Both Common and Unique Latent Information for
Multimodal Sentiment Analysis." Proceedings of the
28th International Joint Conference on Artificial
Intelligence, New York, USA: ACM, 3627-3634
(2019).
P. K. Atrey, M. A. Hossain, A. El Saddik, et al.
"Multimodal Fusion for Multimedia Analysis: A
Survey." Multimedia Systems, 16(6), 345-379 (2010).
Y. Zhang, L. Rong, D. Song, P. Zhang. "A Survey on
Multimodal Sentiment Analysis." Pattern Recognition
and Artificial Intelligence, 33(5), 426-438 (2020).
J. B. Yuan, S. McDonough, Q. Z. You, et al. "Sentribute:
Image Sentiment Analysis from a Mid-level
Perspective." Proceedings of the 2nd International
Workshop on Issues of Sentiment Discovery and
Opinion Mining, pp. 10-12 (2013).
D. L. Cao, R. R. Ji, et al. "Visual Sentiment Topic Model Based Microblog Image Sentiment Analysis." Multimedia Tools and Applications, 75(15), 8955-8968.
J. Wagner, E. Andre, F. Lingenfelser, et al. "Exploring
Fusion Methods for Multimodal Emotion Recognition
with Missing Data." IEEE Transactions on Affective
Computing, 2(4), 206-218 (2011).
J. Devlin, M. W. Chang, K. Lee, et al. "BERT: Pre-training
of Deep Bidirectional Transformers for Language
Understanding." Proceedings of the Annual Conference
of the North American Chapter of the Association for
Computational Linguistics, pp. 4171-4186 (2019).
Q. Z. You, L. L. Cao, H. L. Jin, et al. "Robust Visual-
Textual Sentiment Analysis: When Attention Meets
Tree-Structured Recursive Neural Networks."
Proceedings of the 24th ACM International Conference
on Multimedia, pp. 1008-1017 (2016).
S. Poria, E. Cambria, D. Hazarika, et al. "Multi-level
Multiple Attentions for Contextual Multimodal
Sentiment Analysis." Proceedings of the IEEE
International Conference on Data Mining, pp. 1033-
1038 (2017).
J. H. Tao. "A Novel Prosody Adaptation Method for
Mandarin Concatenation-Based Text-to-Speech
System." Acoustical Science and Technology, 30(1),
33-41 (2009).
N. Sebe, I. Cohen, T. Gevers, et al. "Emotion Recognition
Based on Joint Visual and Audio Cues." Proceedings of
the 18th International Conference on Pattern
Recognition, pp. 1136-1139 (2006).
X. Y. Zhang, C. S. Xu, J. Cheng, et al. "Effective
Annotation and Search for Video Blogs with
Integration of Context and Content Analysis." IEEE
Transactions on Multimedia, 11(2), 272-285 (2009).
S. Poria, E. Cambria, D. Hazarika, et al. "Context-
Dependent Sentiment Analysis in User-Generated
Videos." Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics,
Stroudsburg, USA: ACL, 873-883 (2017).
N. Majumder, S. Poria, D. Hazarika, et al. "DialogueRNN:
An Attentive RNN for Emotion Detection in
Conversations." Proceedings of the AAAI Conference
on Artificial Intelligence, Palo Alto, USA: AAAI Press,
6818-6825 (2019).
M. E. Peters, M. Neumann, M. Iyyer, et al. "Deep
Contextualized Word Representations." Proceedings of
the Conference of the North American Chapter of the
Association for Computational Linguistics,
Stroudsburg, USA: ACL, 2227-2237 (2018).