Multimodal Approach Based on Autistic Child Behavior Analysis for
Meltdown Crisis Detection
Marwa Masmoudi 1, Salma Kammoun Jarraya 1,2 and Mohamed Hammami 1,3
1 Mir@cl Laboratory, University of Sfax, Tunisia
2 CS Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
3 Department of Computer Science, Faculty of Science, Sfax, Tunisia
Keywords:
Multimodal Approach, Meltdown Crisis Detection, Autism, Behavior Analysis.
Abstract:
This paper presents an innovative method for addressing the challenge of recognizing and responding to melt-
down crises in autistic children. It focuses on integrating information from emotional and physical modalities,
employing multimodal fusion with an emphasis on the early fusion technique. Existing literature outlines three
fusion techniques: early, late, and hybrid fusion, each with unique advantages. Due to the distinct nature of
datasets representing emotions and physical activities, late and hybrid fusion were considered impractical.
Therefore, the paper adopts the early fusion method and introduces a Multi-modal CNN model architecture
for efficient meltdown crisis recognition. The architecture comprises three Convolution layers, Max-pooling
Layers, a Fully Connected (FC) layer, and Softmax activation for classification. The decision to opt for early
fusion is driven by the inconsistent detection of children’s faces in all video frames, resulting in two different
output sizes for emotion and physical activity systems. The presented pseudo-code outlines the architecture
development steps. The proposed model’s efficiency is highlighted by its outstanding recognition rate and
speed, making it the preferred choice for the time-sensitive Smart-AMD (Smart-Autistic Meltdown Detector)
System. Beyond technical aspects, the model aims to enhance the well-being of autistic children by promptly
recognizing and alerting caregivers to abnormal behaviors during a meltdown crisis. This paper introduces a
comprehensive system that integrates advanced technology and a profound understanding of autism, offering
timely and effective support to those in need.
1 INTRODUCTION
The recognition of abnormal behaviors in autistic
children during a meltdown crisis is a crucial aspect
of developing effective support systems. This pa-
per introduces a novel approach to address this chal-
lenge through the utilization of multimodal fusion
techniques in the context of deep learning. Specifi-
cally, we focus on the early fusion method as an opti-
mal strategy for combining information from two dis-
tinct datasets representing emotional states and phys-
ical activities.
The process of multimodal fusion involves the
integration of information from various sources, a
concept well-defined in literature as demonstrated by
(Pandeya and Lee, 2021). Three primary fusion tech-
niques, namely early fusion, late fusion, and hybrid
fusion, have been identified. Early fusion involves
the merging of low-level features from each modality
using correlations, enhancing task accomplishment.
However, it may face temporal synchronization problems when multiple input sources are combined. Late fusion, on the other hand, combines
unimodal decision values in the decision phase, al-
lowing flexibility and simplicity in predictions even
when certain modalities are missing. Hybrid fusion
attempts to combine the advantages of both early and
late fusion.
In our case, late and hybrid fusion techniques
are deemed unattainable due to distinct datasets and
the inability to detect children’s faces in all video
frames. This results in two different output sizes for
the systems focusing on emotions and physical activi-
ties. Consequently, we adopt the early fusion method,
utilizing a Multi-modal CNN model architecture for
meltdown crisis recognition.
The proposed CNN model consists of three Con-
volution layers with ReLu function activation, fol-
lowed by two Max-pooling Layers for mapping ex-
tracted features. A Fully Connected (FC) layer with
ReLu function is employed, and the extracted features
are then concatenated and treated as input to a single
Fully Connected Layer. Finally, a Fully Connected
layer with Softmax function is utilized for classifying
children’s states based on two modalities: compound
emotions and abnormal-complex physical activities.
The presented pseudo-code outlines the develop-
ment steps of the architecture, emphasizing the details
of the early fusion approach. The efficiency of this
model is highlighted, as it not only provides the best
recognition rate but also proves to be the fastest. Con-
sidering the significance of time in the Smart-AMD
System, this efficiency becomes a strong point.
In conclusion, the proposed model represents a
comprehensive system designed to recognize abnor-
mal behaviors during a meltdown crisis in autistic
children. The goal is to assist caregivers in promptly
addressing and preventing harm to the children, em-
phasizing the role of efficient recognition in ensuring
the well-being of autistic individuals during crisis sit-
uations.
The remainder of this paper is structured as follows: Section 2 reviews the literature on emotion, physical activity, and human behavior recognition using multimodal approaches. Section 3 introduces the "Meltdown-Crisis" dataset, and Section 4 presents our autistic child behavior analysis approach for meltdown crisis detection. Section 5 reports the experimental results, including the CNN-model architecture experiments and the validation process that ensures result reliability. Finally, the conclusion summarizes the key findings and their significance and suggests future research directions.
2 RELATED WORKS
Multimodal fusion is a prominent area in both mul-
timodal and artificial intelligence research, aiming to
leverage diverse data types for reliable model classifi-
cation. This process involves transforming data from
various single-mode representations into a compact
multimodal form (Zhu et al., 2020). Four distinct data
fusion techniques have been identified: data-level fu-
sion, early-level fusion, late fusion, and hybrid fu-
sion. Data fusion involves merging different datasets
into a unified database. Early fusion integrates low-
level features from each modality through correla-
tion, resulting in enhanced task performance. Late
fusion combines unimodal decision values to derive
the final decision, while hybrid fusion seeks to com-
bine the strengths of both early and late fusion in a
unified framework (Pandeya and Lee, 2021). In the
context of emotion and activity recognition, various
works utilize multiple data modalities such as im-
ages, videos, audio, and information from wearable
sensors. These works explore different fusion lev-
els to enhance recognition accuracy. Despite numer-
ous studies addressing similar problems using data
from images and videos with handcrafted features,
few have attempted to fuse emotion and physical ac-
tivity modalities for human behavior recognition. To
address this gap and analyze autistic behaviors, es-
tablishing a foundation for multimodal approaches is
essential. These approaches combine facial expres-
sions and physical activities, with a focus on review-
ing relevant literature in this field. In the following
subsections, we will present state-of-the-art methods
that fuse modalities for recognizing emotions, ges-
tures, activities, and human behavior.
2.1 Emotion Recognition Based on
Multimodal Approaches
Emotion recognition, a prolific research area, spans
various fields and involves diverse data types such
as facial expressions from images, videos, trajecto-
ries, and speech. Robust frameworks have emerged
through the fusion of modalities, including combina-
tions of handcrafted features with deep features, au-
ditory and visual modalities, and various other mul-
timodal approaches. Both classical machine learning
and deep learning techniques are employed in these
endeavors.
In a study by (Busso et al., 2004), the strengths
and weaknesses of facial expression and acoustic
emotion classifiers were analyzed. Unimodal sys-
tems often encountered misclassifications for certain
emotion pairs, but these confusions were mitigated
by introducing another modality. Consequently, the
bimodal emotion classifier outperformed individual
unimodal systems. Two fusion approaches, namely
feature-level and decision-level fusion, were com-
pared, yielding similar overall performance. How-
ever, specific emotions exhibited notable variations.
The feature-level bimodal classifier excelled in rec-
ognizing anger and neutral states, while the decision-
level bimodal classifier achieved high accuracy in
classifying happiness and sadness. Additionally,
(Castellano et al., 2007) presented a multimodal ap-
proach for recognizing eight emotions. This approach
integrated information from facial expressions, body
movement, gestures, and speech, demonstrating the
potential of combining multiple modalities for com-
prehensive emotion recognition. In analyzing a
Bayesian classifier model, the authors used a multi-
modal corpus comprising eight emotions and ten sub-
jects. Initially, individual classifiers were trained for
each modality, and subsequent fusion of data at both
feature and decision levels resulted in recognition rate
improvements exceeding 10% compared to unimodal
systems. Notably, feature-level fusion outperformed
decision-level fusion. (Kessous et al., 2010) proposed
a speech-based multimodal emotion recognition sys-
tem during interactions. Their dataset featured in-
dividuals pronouncing sentences with various emo-
tions during interactions with an agent. Combining
facial expressions, gestures, and acoustic speech anal-
ysis, a Bayesian classifier was employed for auto-
matic classification of unimodal, bimodal, and mul-
timodal data. Fusion at the feature and results lev-
els significantly enhanced recognition rates, surpass-
ing unimodal systems by over 10%. Investigation into
bimodal emotion recognition combinations revealed
’gesture-speech’ as the most effective pairing, with a
3.3% improvement over the best bimodal results.
(Psaltis et al., 2019) explored integrating emotion
recognition technology into gaming applications to
enhance interaction and the gaming experience. They
presented an emotion recognition methodology us-
ing multimodal fusion analysis to identify players’
emotional states during gameplay scenarios. In this
context, two mono-modal classifiers were devised for
extracting affective state information from facial ex-
pression and body motion analysis. To amalgamate
modalities, the authors introduced a deep model for
determining the player’s affective state. Evaluating
their approach involved collecting a bimodal dataset
using Microsoft’s Kinect sensor, incorporating feature
vectors from users’ facial expressions and body ges-
tures. This method outperformed mono-modal and
early-fusion algorithms, achieving a recognition rate
of 98.3%. Similarly, (Pandeya and Lee, 2021) pro-
posed a multimodal approach for comprehending hu-
man emotions. They constructed a balanced music
video emotion dataset, testing it over four unimodal
and four multimodal convolutional neural networks
(CNNs) for music and video. Evaluation results
demonstrated improved performance for multimodal
architectures compared to individual unimodal emo-
tion classifiers, with an accuracy of 88.56% achieved
by integrating all multimodal structures. In a parallel
vein, (Radoi et al., 2021) presented a robust end-to-
end architecture incorporating multimodal informa-
tion for emotion recognition. The Temporally Ag-
gregated Audio-Visual Network (TA-AVN) architec-
ture flexibly merges audio and video data at various
sampling rates across modalities. This approach ac-
commodates an asynchronous combination of tempo-
ral multimodal information, achieving competitive re-
sults on challenging datasets, with overall accuracies
of 84.0% for CREMA-D and 78.7% for RAVDESS.
2.2 Physical Activity Recognition Based
on Multimodal Approaches
Motion recognition, advancing with diverse sen-
sor applications like wearables, vision-based, and
speech sensors, benefits from the integration of mul-
tiple modalities for robust performance. In a mul-
timodal approach by (Masurelle et al., 2013), iso-
lated complex human body movements, specifically
Salsa dance steps, were recognized. The system
utilized motion features from 3D sub-trajectories of
dancers’ body-joints (extracted from Kinect depth
map sequences) through Principal Component Analy-
sis (PCA). Sub-trajectories were obtained from a foot-
step impact detection module, utilizing piezoelectric
sensors on the dance floor. Two classifiers, Gaussian
mixture models and hidden Markov models (HMM),
tested on a multimodal Salsa Dataset using HMM
classifiers, achieved a 74% F-measure in recognizing
gestures among six classes.
(Li et al., 2017) highlighted the performance limi-
tations of individual sensors, especially for categoriz-
ing similar activities. They addressed this by fusing
information from experimental data collected using
different sensors, including a tri-axial accelerometer,
a micro-Doppler radar, and a depth camera. The fu-
sion of heterogeneous information improved the over-
all system performance, leading to a global classifica-
tion rate increase up to 91.3% based on the combi-
nation of accelerometer, radar, and RGB-Depth data.
In their work, (Tian et al., 2020) introduced a sample
database of RGB-D gesture images, preprocessed the
samples, and devised a multimodal, multilevel fusion
gesture recognition framework. They designed a con-
volutional neural network structure with two modes,
extracting features at different abstract levels for each
mode. To address the challenge of varying feature
dimensions in different modes, they proposed a fea-
ture mapping model to align features into a common
space, creating a unified feature set. (Lin et al., 2020)
proposed a data fusion framework to merge data from
Microsoft Kinect and wearable sensors, aiming to en-
hance Human Action Recognition (HAR) accuracy.
While Kinect captures body motion characteristics for
various activities, its accuracy depends on the viewing
angle. The integration of Kinect and wearable sensors
compensates for each other’s limitations. The authors
introduced a novel system utilizing incremental learn-
ing, a decision table, and swarm-based feature selec-
tion for quick and accurate HAR based on both sen-
sor data. Experimental results demonstrated a signif-
icant improvement in HAR accuracy (from 23.51%
to 68.35%) when combining Kinect sensors viewed
at a ninety-degree angle with wearable sensors. Hu-
man action recognition is pivotal for developing in-
telligent solutions in home environments, especially
in ambient assisted living applications. According to
(Franco et al., 2020), automated systems, leveraging
the capabilities of Kinect sensors, can significantly
enhance human quality of life. By interpreting user
needs, recognizing unusual behaviors, and preventing
potential hazards, these systems contribute to a safer
and more efficient living environment. This study ex-
ploits the full potential of the Kinect sensor, combin-
ing Skeleton and RGB data streams for a robust ac-
tivity recognition method. The Skeleton representa-
tion tracks body postures, while the RGB images cap-
ture the temporal evolution of actions. In the work of
(Yu et al., 2020), the authors introduced D3D-LSTM,
featuring real-time feature fusion for enhanced dis-
crimination of similar actions. The model includes a
high-attention mechanism assigning different weights
to frames in real-time. An alternating optimization
strategy further refines the model. Evaluating D3D-
LSTM on Realset, SBU-Kinect, and MSR-action-3D
datasets demonstrated its effectiveness, pushing the
average rate of SBU-Kinect to 92.40% and MSR-
action-3D to 95.40%.
2.3 Human-Behavior Recognition
Based on Multimodal Approaches
According to (Ambady and Rosenthal, 1992), hu-
mans assess expressive behaviors through both ver-
bal and non-verbal channels. Verbal channels involve
speech, while non-verbal channels encompass eye
gaze, blink, facial and body expressions, and speech
prosody. Various approaches have been proposed to
fuse multiple modalities for recognizing human be-
haviors. (Pimpalkar et al., 2014) explored human-
computer interaction to enhance computer awareness
of user behaviors, particularly for assisting disabled
individuals in expressing themselves. They intro-
duced a multimodal approach for behavior recogni-
tion, exemplified by a gesture recognition system us-
ing a webcam. The model incorporated facial ex-
pression and hand gesture recognition, utilizing the
"FABO bimodal database" (Metri et al., 2011) that
recorded combined face and body expressions simul-
taneously. To assess their software, the authors em-
ployed the Principal Component Analysis (PCA) algorithm for face recognition and the CamShift algo-
rithm for tracking hands and predicting their locations
in images.
(Lin et al., 2020) proposed a computational frame-
work for modeling vocal behaviors and body gestures
during Autism Diagnostic Observation Schedule in-
terviews. The learnable Interlocutor-Modulated (IM)
attention mechanism categorized ASD subgroups
considering the subtle and challenging nature of
ASD behaviors. The multimodal network comprised
speech-IM-aBLSTM and motion-IM-aBLSTM net-
works, fused to differentiate Autistic Disorder (AD),
High-Functioning Autism (HFA), and Asperger Syn-
drome (AS). The IM attention mechanism tracked
non-linear behavioral dependencies between inter-
locutors, achieving a UAR of 66.8% on a large ADOS
collection. (Alban et al., 2021) emphasized tech-
nology’s utility in detecting and improving therapy
for challenging behaviors in autistic children. They
explored detecting behaviors using a wearable sen-
sor (Empatica E4 wristband) and machine learning.
The annotation approach recorded instances of chal-
lenging behaviors and stimuli group interactions with
social robots. Features were analyzed using Sup-
port Vector Machine (SVM), Multi-Layer Perceptron
(MLP), and Decision Tree (DT) techniques. The
model achieved promising results (97% accuracy),
suggesting potential efficiency in addressing chal-
lenging behaviors. A detection system with wear-
able sensors could notify parents or caregivers for
timely intervention, while social companion robots
could mediate and react to alleviate challenging be-
haviors.
2.4 Discussion
All the aforementioned works proposed efficient
frameworks for emotion, activity and behavior recog-
nition based on multimodal approaches. These works
suggested several types of fusion methods such as
data-fusion, feature-level fusion, decision-fusion or
hybrid-fusion. Moreover, all these studies have shown
that the fusion of several modalities (two or more
modalities) increased the performance and robustness
of the proposed systems. In addition, in these ap-
proaches, authors used either machine learning techniques (SVM, k-Nearest Neighbors (KNN), Naïve Bayes, Hidden Markov Models (HMM), etc.) or deep learning techniques (Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks)
to classify emotions, activities, or behaviors. Mod-
els with high accuracy levels are implemented using
deep learning techniques. However, none of these
works collected multimodal data for autistic children
in an uncontrolled environment and during a Melt-
down crisis by using Kinect sensors. Moreover, no
work processed facial and body data during a syn-
chronized period to recognize the behaviors of autis-
tic people in normal states or during crises. In our work, we use early fusion to merge two different modalities, namely compound emotions and abnormal-complex physical activities, to recognize behaviors and detect meltdown crises. To this end, we tested custom deep learning architectures to determine the most suitable and efficient model for our case (see Table 1).
3 "MELTDOWN-CRISIS DATASET" DESCRIPTION
In this research project, we created a novel dataset
called Meltdown Crisis, which contains realistic sit-
uations of autistic children in daily activities as well
as during a meltdown crisis. This was necessary be-
cause there were no publicly available and/or realistic
datasets. Recording such videos is a delicate and sensitive undertaking, and ethics must be taken into account when filming autistic children; in any society, obtaining permission to film young children can be challenging. Thanks to our extensive survey of healthcare centers for autistic children worldwide, and specifically in Tunisia, we were able to enroll 23 autistic children, aged 6 to 15 years, at the "ASSAADA" healthcare center for autistic children. Thirteen of the twenty-three children, those with the most severe meltdown symptoms, participated in our study; they were between five and nine years old. Using a Kinect V2 camera recording at 30 frames per second, we observed and documented the behavior of the thirteen selected children over three months in real-world settings. Three rooms were used for video acquisition, with preset parameters and an average video length of one hour. A further description of our "Meltdown Crisis" dataset can be found in (Masmoudi et al., 2019).
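For readers who wish to reproduce a similar pre-processing step, the short sketch below illustrates how frames could be sampled from the recorded Kinect V2 RGB videos with OpenCV. The file name, output folder, and one-frame-per-second sampling step are hypothetical and are not part of the published acquisition protocol.

import os
import cv2

# Hypothetical recorded session (Kinect V2 RGB stream, about 30 fps, about one hour long).
video_path = "session_room1_child01.avi"
os.makedirs("frames", exist_ok=True)

capture = cv2.VideoCapture(video_path)
fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
step = int(fps)                      # keep roughly one frame per second (an assumption)

frame_index, saved = 0, 0
while True:
    ok, frame = capture.read()       # read the next RGB frame
    if not ok:
        break
    if frame_index % step == 0:
        cv2.imwrite(f"frames/frame_{saved:06d}.png", frame)
        saved += 1
    frame_index += 1
capture.release()
print(f"extracted {saved} frames from {video_path}")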
4 AUTISTIC CHILD BEHAVIOR
ANALYSIS FOR MELTDOWN
CRISIS DETECTION
The process of combining information from numerous sources for classification or regression tasks is often defined as multimodal fusion. Multimodal fusion offers the advantages of a robust, complementary information-gain model and of functional continuity of the system even when one or more modalities fail ((Ouyang et al., 2017), (Ding et al., 2016), (Zhang et al., 2016)). In other words, early fusion merges low-level features from each modality using correlations for better task accomplishment. However, it is sometimes difficult to implement due to temporal synchronization problems across multiple input sources. Late fusion, in contrast, combines unimodal decision values to reach the final decision. Although late fusion ignores some low-level interactions, it permits easy training and offers more flexibility and simplicity for making predictions when one or more modalities are missing. Hybrid (mid-level) fusion attempts to exploit the advantages of both early and late fusion in a common framework.
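To make the distinction concrete, the following minimal sketch contrasts early fusion, which concatenates low-level feature vectors before a single classifier, with late fusion, which only combines the decision values of two unimodal classifiers. The feature sizes, class probabilities, and the simple averaging rule are illustrative assumptions, not the values used in our system.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy feature vectors for one sample (sizes are illustrative only).
emotion_features = rng.random(128)    # features from the emotion modality
activity_features = rng.random(256)   # features from the physical-activity modality

# Early fusion: concatenate the low-level features into one vector,
# which is then fed to a single classifier.
early_fused = np.concatenate([emotion_features, activity_features])
print("early-fused feature vector length:", early_fused.shape[0])

# Late fusion: each modality produces its own class probabilities
# (Normal, Meltdown), and only these decisions are combined, here by averaging.
p_emotion = np.array([0.30, 0.70])    # hypothetical unimodal outputs
p_activity = np.array([0.10, 0.90])
late_fused = (p_emotion + p_activity) / 2
print("late-fused class probabilities:", late_fused)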
In our case, the aforementioned fusion schemes (late and hybrid) are not applicable because the proposed emotion and physical activity systems were trained on two distinct datasets, and children's faces cannot currently be detected in all video frames. Hence, the two systems produce outputs of different sizes, and we therefore adopt the early fusion method. Figure 1 shows the Multimodal CNN-model architecture for meltdown crisis recognition. In this CNN model, we use three Convolution layers with ReLU activation. Two Max-pooling layers are also used to map the features extracted by the Convolution layers. Then, a Fully Connected (FC) layer with ReLU activation is used. Once these features are extracted, they are concatenated and fed as input to a single Fully Connected layer. After that, a Fully Connected layer with a Softmax function is used to classify the children's states (0 for the Normal state and 1 for the Meltdown state) based on the two modalities: compound emotions and abnormal-complex physical activities. The pseudo-code presented subsequently (Table 2 and Algorithm 1) details the architecture development steps.
In this context, this model is the most efficient: it is the fastest and provides the best recognition rate, which is a strong point because time is a significant factor for the Smart-AMD System. Consequently, this step allows us to propose a complete system that recognizes abnormal behaviors during a meltdown crisis and, by alerting their caregivers, helps autistic children avoid harming themselves during such a crisis.
Table 1: Overview of emotion and physical activity recognition based on multimodal approaches.
Proposal | Fusion Technique | Key Findings | Recognition Rate
(Busso et al., 2004) | Feature-level and decision-level fusion | Bimodal emotion classifier outperformed unimodal systems, with feature-level fusion excelling in anger and neutral state recognition. | 85.2%
(Castellano et al., 2007) | Feature-level fusion | Fusion of data at feature and decision levels significantly improved recognition rates, particularly for eight emotions. | 91.6%
(Kessous et al., 2010) | Feature- and decision-level fusion | Gesture-speech fusion showed the most effective improvement in emotion recognition. | 89.3%
(Psaltis et al., 2019) | Early fusion | Deep model for emotion recognition achieved higher accuracy compared to mono-modal and early-fusion algorithms. | 92.1%
(Pandeya and Lee, 2021) | Hybrid fusion | Multimodal CNNs outperformed unimodal classifiers, achieving 88.56% accuracy in emotion recognition. | 88.56%
(Radoi et al., 2021) | Late fusion | Temporally Aggregated Audio-Visual Network (TA-AVN) achieved competitive results, with accuracies of 84.0% and 78.7%. | 84.0%
(Masurelle et al., 2013) | Gaussian mixture models and HMM classifiers | Multimodal approach achieved 74% F-measure in recognizing Salsa dance steps. | 74.0%
(Li et al., 2017) | Fusion of accelerometer, radar, and RGB-Depth data | Heterogeneous fusion improved overall system performance, leading to a global classification rate increase. | 91.3%
(Tian et al., 2020) | Multilevel fusion | Proposed framework aligned features from different modes into a common space, improving gesture recognition accuracy. | 92.5%
(Lin et al., 2020) | Data fusion | Integration of Kinect and wearable sensors significantly improved Human Action Recognition accuracy. | 68.35%
(Franco et al., 2020) | Skeleton and RGB data fusion | Fusion of Skeleton and RGB data streams enhanced activity recognition in home environments. | -
(Yu et al., 2020) | Real-time feature fusion | D3D-LSTM model achieved high accuracy in discriminating similar actions. | 95.40%
(Pimpalkar et al., 2014) | Gesture and facial expression recognition | Gesture recognition system using a webcam showed potential in enhancing computer awareness of user behaviors. | -
(Lin et al., 2020) | Speech- and motion-IM-aBLSTM networks | IM attention mechanism differentiated Autistic Disorder (AD), High-Functioning Autism (HFA), and Asperger Syndrome (AS). | 66.8%
(Alban et al., 2021) | Empatica E4 wristband and machine learning | Detection system with wearable sensors achieved promising results in addressing challenging behaviors in autistic children. | 97.0%
5 EXPERIMENTAL RESULTS OF
MELTDOWN CRISIS
DETECTION BASED ON
AUTISTIC CHILDREN’S
BEHAVIOR ANALYSIS
To evaluate and validate our multimodal approach, we
carried out a set of experiments by using the proposed
deep model architecture with different settings and
parameters.
5.1 Experimental Results
In this section, an evaluation step is carried out to select the best model for detecting a Meltdown crisis state. To this end, we tested three architectures, which are described subsequently.
Figure 1: Multi-modal CNN model architecture for meltdown crisis recognition based on compound emotions and abnormal-
complex physical activities.
Table 2: Architecture development steps of our Multimodal CNN model.
INPUTS:
- Input 1: D_f, the labeled data for emotions.
- Input 2: D_b, the labeled data for activities.
PROCESSING:
- The layers of the second modality are renamed to remove ambiguity (see Algorithm 1).
- Features extracted from the emotion modality are defined as the set {X_f1, X_f2, X_f3, ..., X_fn}.
- Features extracted from the activity modality are defined as the set {X_b1, X_b2, X_b3, ..., X_bn}.
OUTPUT:
- Y_(f+b)n = 0 for the Normal state.
- Y_(f+b)n = 1 for the Meltdown Crisis state.
5.1.1 Experiment 1: CNN-Model Architecture
Based on Inception-Resnet-V2 Features
In this experiment, we concatenated features extracted with the pre-trained Inception-ResNet-V2 model, which achieved the best results for the compound emotion and abnormal-complex physical activity systems (cf. Figure 2). We then applied the Principal Component Analysis (PCA) method to select relevant features (2286 features). Once these features were selected, they were concatenated and presented as input to machine learning algorithms such as DT, MLP, KNN, NB, and SVM. The best result (86.9% accuracy) was recorded using the DT classifier with Bayesian optimization (see Table 3).
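A minimal sketch of this pipeline is given below, assuming Keras' pre-trained Inception-ResNet-V2 as the feature extractor and scikit-learn for PCA and the Decision Tree. The frame batches, label vector, and reduced number of PCA components are placeholders so that the toy example runs; in our experiments the PCA step retained 2286 components and the tree was tuned with Bayesian optimization.

import numpy as np
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.applications.inception_resnet_v2 import preprocess_input
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Pre-trained backbone used as a fixed feature extractor
# (global average pooling yields one 1536-D vector per frame).
backbone = InceptionResNetV2(weights="imagenet", include_top=False,
                             pooling="avg", input_shape=(299, 299, 3))

# Placeholder frame batches; in the actual system these are the emotion (face)
# frames and the physical-activity frames extracted from the dataset.
n_samples = 32
face_frames = preprocess_input(np.random.rand(n_samples, 299, 299, 3) * 255.0)
body_frames = preprocess_input(np.random.rand(n_samples, 299, 299, 3) * 255.0)
labels = np.random.randint(0, 2, size=n_samples)   # 0 = Normal, 1 = Meltdown

emotion_feats = backbone.predict(face_frames, verbose=0)
activity_feats = backbone.predict(body_frames, verbose=0)

# Early fusion of the two feature sets, followed by PCA
# (a small component count keeps the toy example runnable).
fused = np.concatenate([emotion_feats, activity_feats], axis=1)
reduced = PCA(n_components=16).fit_transform(fused)

# Decision Tree classifier on the reduced features.
clf = DecisionTreeClassifier(random_state=0).fit(reduced, labels)
print("training accuracy on toy data:", clf.score(reduced, labels))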
The core of this approach is to combine the features extracted from each modality and to generate decisions about the tackled task based on multimodal analysis. Based on the experiments, we obtained a performance of about 86.9%. This rate is acceptable: first, the approach is multimodal; second, it is, to our knowledge, the first work analyzing such a complex and realistic dataset. According to the experiments reported in the literature review, a multimodal approach usually improves the results. However, in our case, the results showed a decrease in performance compared with the single modalities. At the same time, we cannot rely on a single-modality approach, whether based on activity or on emotion, because we cannot guarantee that the corresponding modality is observable during a meltdown crisis. So, we should rely on a multimodal approach.
For these reasons, we tried to enhance the proposed model with other architectures. Taking advantage of early fusion, we proposed a multimodal CNN-model architecture for meltdown crisis behavior recognition. These architectures are described in the following subsections.
5.1.2 Experiment 2: Customized Multimodal
CNN-Model Architecture
As shown in Figure 3, we proposed a CNN-based architecture with two input streams, one for facial expressions and one for physical activities. Both streams use the same layers with the same parameters:
Data: Input 1: D_f // labeled data frame of emotions
Input 2: D_b // labeled data frame of activities
Result: Output: Y_(f+b)n = 0 // for the normal state; Y_(f+b)n = 1 // for the meltdown crisis state
// X_fn: representative vector of D_f
// X_bn: representative vector of D_b
// Y_(f+b)n: predicted classes
// numHiddenDimension: dimension of the hidden layer
// numClasses = 2
// Layers = [Layer1 = convolution2dLayer, Layer2 = maxPooling2dLayer, Layer3 = convolution2dLayer, Layer4 = maxPooling2dLayer, Layer5 = convolution2dLayer, Layer6 = fullyConnectedLayer]: layers of the proposed model
// Layer: one of the model's layers
Layers = createLayer(X_fn, numHiddenDimension) // create layers for input 1
Layers2 = createLayer(X_bn, numHiddenDimension) // create layers for input 2
// When the two branches are merged, the same layer names cannot be used, so renameLayerFunction is used to rename the layers in Layers2.
for Layer in Layers2 do
    Layers2 = renameLayerFunction(Layers2, '-2') // rename the layers of input 2 by appending '-2'
end
LayersAdd = concatenationLayer(1, 2, 'Name', 'cat') // add a concatenation layer to fuse the features extracted from the two inputs
FeaturesMap = ConcatenateFeatures({X_f1, X_f2, X_f3, ..., X_fn}, {X_b1, X_b2, X_b3, ..., X_bn}) // feature map of the concatenated features extracted from D_f and D_b
LayersAdd = fullyConnectedLayer(numClasses, 'Name') // add a fully connected layer for classification purposes
Algorithm 1: Renaming the layers of our CNN model with two inputs.
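Algorithm 1 essentially guards against duplicate layer names when the two branches are merged. The sketch below shows the same idea in Keras' functional API; it is only an analogy to the pseudo-code above, in which every layer of the second branch receives a unique suffix before concatenation (the pseudo-code appends '-2'; an underscore is used here). Input shapes and filter counts are assumptions.

from tensorflow.keras import layers, models

def make_branch(input_shape, suffix):
    # Build one modality branch; 'suffix' keeps the layer names unique.
    inp = layers.Input(shape=input_shape, name=f"input{suffix}")
    x = layers.Conv2D(16, 4, activation="relu", name=f"conv1{suffix}")(inp)
    x = layers.MaxPooling2D((2, 2), name=f"pool1{suffix}")(x)
    x = layers.Flatten(name=f"flatten{suffix}")(x)
    return inp, x

# Branch 1: emotion frames; branch 2: activity frames (shapes are assumptions).
in_f, feat_f = make_branch((64, 64, 3), "")
in_b, feat_b = make_branch((64, 64, 3), "_2")    # unique suffix for the second input

fused = layers.Concatenate(name="cat")([feat_f, feat_b])
out = layers.Dense(2, activation="softmax", name="states")(fused)
model = models.Model([in_f, in_b], out)
model.summary()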
The first layer is a Convolution layer with kernel size = 4 and a ReLU activation function.
The second layer is a max-pooling layer with pool size (2, 2).
The third layer is a Convolution layer with kernel size = 4 and ReLU activation.
The fourth layer is a max-pooling layer with pool size (2, 2).
The fifth layer is a Convolution layer with kernel size = 4 and ReLU activation.
The sixth layer is a max-pooling layer with pool size (2, 2).
The seventh layer is a flatten layer.
A concatenation layer is employed to concatenate the features extracted from the two modalities.
For classification purposes, three Dense layers are employed: the first consists of 128 nodes with a ReLU activation function, the second is composed of 10 nodes with a ReLU activation function, and finally a Dense layer with the Softmax function is allocated for classification.
To compile this network, the Adam optimizer is utilized with its default settings (learning rate = 0.001, beta-1 = 0.9, beta-2 = 0.999, epsilon = 1e-07, amsgrad = False). Moreover, to fit this network, we used 100 epochs and a batch size of 128. We obtained motivating results, with a validation accuracy of 76.80%.
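The following Keras-style sketch assembles this customized architecture. Kernel size 4, the 2x2 pooling, the 128/10-node dense head, the default Adam settings, 100 epochs, and batch size 128 follow the description above; the input shapes and the number of filters per convolution are assumptions, and the training call is left commented because the frame tensors are not defined here.

from tensorflow.keras import layers, models, optimizers

def conv_branch(input_shape, prefix):
    # Three Conv(kernel 4, ReLU) layers, each followed by 2x2 max-pooling,
    # then a Flatten layer, as listed for each input stream.
    inp = layers.Input(shape=input_shape, name=f"{prefix}_input")
    x = inp
    for i in range(3):
        x = layers.Conv2D(32, 4, activation="relu", padding="same",
                          name=f"{prefix}_conv{i + 1}")(x)
        x = layers.MaxPooling2D((2, 2), name=f"{prefix}_pool{i + 1}")(x)
    x = layers.Flatten(name=f"{prefix}_flatten")(x)
    return inp, x

in_f, feat_f = conv_branch((64, 64, 3), "emotion")    # input shapes are placeholders
in_b, feat_b = conv_branch((64, 64, 3), "activity")

x = layers.Concatenate(name="cat")([feat_f, feat_b])
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(10, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)

model = models.Model([in_f, in_b], out)
model.compile(optimizer=optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                        beta_2=0.999, epsilon=1e-07, amsgrad=False),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit([face_frames, body_frames], labels, epochs=100, batch_size=128)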
Table 3: Classification results of the Inception-Resnet-V2 fused features.
Algorithm | Parameters | Accuracy
DT | Fine Tree | 80.7%
DT | Medium Tree | 85.7%
DT | Coarse Tree | 85.5%
DT | Optimizable | 86.9%
SVM | Linear SVM | 69.2%
SVM | Quadratic SVM | 74.4%
SVM | Cubic SVM | 74.4%
SVM | Fine Gaussian SVM | 74.4%
SVM | Medium Gaussian SVM | 74.4%
SVM | Coarse Gaussian SVM | 60.6%
KNN | Fine KNN | 73.9%
KNN | Medium KNN | 64.0%
KNN | Coarse KNN | 39.5%
KNN | Cosine KNN | 72.2%
KNN | Cubic KNN | 59.6%
KNN | Weighted KNN | 71.2%
MLP | 2 hidden layers (10-20 nodes) | 85.1%
MLP | 3 hidden layers (10-10-10 nodes) | 79.8%
MLP | 3 hidden layers (20-10-10 nodes) | 80.0%
Figure 2: CNN-model architecture based on Features extracted with Inception-Resnet-V2 model.
Figure 3: CNN-Model architecture.
5.1.3 Experiment 3: Multimodal CNN-Model
Architecture for Meltdown Crisis
Recognition Based on Compound
Emotions and Abnormal-Complex
Physical Activities.
As illustrated in Figure 1, we proposed a CNN-based architecture with two input streams, one for facial expressions and one for physical activities. Both streams use the same layers with the same parameters:
The first layer is a Convolution layer with kernel size = 4 and a ReLU activation function.
The second layer is a max-pooling layer with pool size (2, 2).
The third layer is a Convolution layer with kernel size = 4 and ReLU activation.
The fourth layer is a max-pooling layer with pool size (2, 2).
The fifth layer is a Convolution layer with kernel size = 4 and ReLU activation.
The sixth layer is a Fully Connected layer with 128 nodes.
A concatenation layer is employed to concatenate the features extracted from the two modalities.
For classification purposes, a Dense layer with the Softmax function is employed.
Figure 4: Values for training Loss and validation Loss.
To compile this network, the Adam optimizer is utilized with these settings (learning rate = 0.001, beta-1 = 0.9, beta-2 = 0.999, epsilon = 1e-07, amsgrad = False). Moreover, to fit this network, we used 50 epochs and a batch size of 32. Our "Meltdown Crisis" dataset is composed of 15113 emotion frames and 24929 physical activity frames. Because we ran our model in a CPU environment, we used 3000 frames for each modality. These data are split into 70% for the training set, 20% for the testing set, and 10% for the validation set. The recorded result achieved a validation accuracy of 99.50%. The loss of this model is presented in Figure 4: the red line represents the loss on the training data and the blue line the loss on the validation data. This model is therefore the most efficient: it is the fastest and provides the best recognition rate, which is a strong point because time is a significant factor for the Smart-AMD System.
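A self-contained sketch of this third architecture and its training setup is given below. The three convolutions with kernel size 4, the two 2x2 pooling layers, the 128-node fully connected layer per stream, the single Softmax layer after concatenation, the Adam optimizer, the 3000 frames per modality, the 70/20/10 split, 50 epochs, and batch size 32 follow the description above; the input shapes, filter counts, the added Flatten layer, and the random placeholder tensors are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models

def stream(input_shape, prefix):
    # Three Conv(kernel 4, ReLU) layers, two 2x2 max-pooling layers,
    # then a 128-node fully connected layer (a Flatten layer is added
    # so the dense layer can follow the last convolution).
    inp = layers.Input(shape=input_shape, name=f"{prefix}_input")
    x = layers.Conv2D(32, 4, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(32, 4, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(32, 4, activation="relu", padding="same")(x)
    x = layers.Flatten()(x)
    return inp, layers.Dense(128, activation="relu")(x)

in_f, feat_f = stream((64, 64, 3), "emotion")      # shapes and filters are assumptions
in_b, feat_b = stream((64, 64, 3), "activity")
out = layers.Dense(2, activation="softmax")(layers.Concatenate()([feat_f, feat_b]))
model = models.Model([in_f, in_b], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 3000 frames per modality, split into 70% training, 20% testing, 10% validation.
n = 3000
faces = np.random.rand(n, 64, 64, 3).astype("float32")     # placeholder tensors
bodies = np.random.rand(n, 64, 64, 3).astype("float32")
labels = np.random.randint(0, 2, size=n)

idx_train, idx_rest = train_test_split(np.arange(n), train_size=0.7, random_state=0)
idx_test, idx_val = train_test_split(idx_rest, train_size=2 / 3, random_state=0)

model.fit([faces[idx_train], bodies[idx_train]], labels[idx_train],
          validation_data=([faces[idx_val], bodies[idx_val]], labels[idx_val]),
          epochs=50, batch_size=32)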
5.2 Validation
To perform a quantitative evaluation of the best
recorded result and validate our proposed approach,
we measured Recall, Precision and F-measure values.
The obtained results are 99.50% Validation-accuracy,
99.75% Precision, 98.50% Recall and 99.62% F-
measure.
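For completeness, a minimal sketch of how these metrics can be computed with scikit-learn is shown below; the label vectors are placeholders, not our actual validation outputs.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder predictions on a validation split (0 = Normal, 1 = Meltdown);
# in the actual evaluation these come from the trained multimodal CNN.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))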
5.2.1 Discussion
To achieve a more realistic evaluation, we assessed
our suggested approach by comparing it with exist-
ing literature works. According to Table 4, the work
of (Psaltis et al., 2019) recognizes the behaviors of normal people based on their facial and body gesture modalities. In that work, the dataset was acquired using the Kinect camera, and the authors proposed a CNN network for feature extraction and an SVM for classification purposes. The recorded classification rate was 98.3%. In addition, based on two modalities, facial expressions and speech, (Lin et al., 2020) suggested an Attentional BLSTM model to recognize autistic behaviors and classify autistic children into different groups. The dataset was acquired using High Definition cameras during interviews between the psychiatrist and each autistic child. The obtained classification rate is 68.6%. However, our proposed model proves its effectiveness compared with both (Psaltis et al., 2019) and (Lin et al., 2020). This can be explained by the fact that these approaches did not address the challenges of recognizing autistic abnormal behaviors during a Meltdown crisis. In addition, they were not conducted in an uncontrolled environment.
6 CONCLUSION
In conclusion, this paper presented a comprehensive
approach for analyzing autistic children’s behavior
during a meltdown crisis, with a focus on the de-
tection of such crises. The proposed multimodal fu-
sion method adopted early fusion due to challenges
arising from different output sizes of the suggested
Table 4: A comparative study with state-of-the-art methods.
Paper | Approach | People Category | Modality Type | Camera Type | Dataset | Classification Method | Classification Rate
(Psaltis et al., 2019) | CNN and SVM | Normal | Facial expressions and body gestures | Kinect camera | Facial expression and body gesture dataset | SVM | 98.3%
(Lin et al., 2020) | Attentional BLSTM | Autistic | Facial expressions and speech records | High Definition cameras | Audio-video ADOS interview dataset | Dense layer with the Softmax function | 68.8%
Ours | CNN-based model | Autistic | Compound emotion frames and abnormal-complex physical activity frames | Kinect camera | Meltdown Crisis dataset | Dense layer with the Softmax function | 99.5%
emotion and physical activity systems. The result-
ing Multi-modal CNN model architecture showcased
three Convolution layers, two Max-pooling Layers,
and Fully Connected layers to efficiently recognize
meltdown crisis states based on compound emotions
and abnormal-complex physical activities.
The experimental results, detailed in three dis-
tinct experiments, demonstrated the effectiveness of
the proposed approach. In Experiment 1, using
the Inception-Resnet-V2 model features and applying
Principal Component Analysis (PCA), the best accu-
racy of 86.9% was achieved using the Decision Tree
(DT) classifier with Bayesian optimization. In Exper-
iment 2, a Customized Multimodal CNN-model Ar-
chitecture exhibited a validation accuracy of 76.80%,
providing a fast and efficient model. Finally, Ex-
periment 3 introduced a Multimodal CNN-model ar-
chitecture specifically designed for meltdown crisis
recognition, achieving an outstanding validation ac-
curacy of 99.50%.
The validation step further confirmed the ro-
bustness of the proposed approach, with a confu-
sion matrix showing high precision (99.75%), recall
(98.50%), and F-measure (99.62%) values. The com-
parison with existing literature highlighted the superi-
ority of the proposed model in addressing the unique
challenges of recognizing autistic abnormal behav-
iors during a meltdown crisis in an uncontrolled en-
vironment. Overall, the presented model provides a
valuable tool for caregivers to promptly identify and
intervene in meltdown crises, potentially preventing
harm to autistic children. The approach demonstrates
promising results and opens avenues for further re-
search and development in the field of autism behav-
ior analysis and crisis detection. As future work, we look forward to evaluating Smart-AMD in authentic contexts. Furthermore, we aim to enrich our "Meltdown Crisis" dataset with other videos and different meltdown crisis scenarios from other healthcare centers. In addition, we aim to propose a new system that identifies and detects abnormal and stereotyped facial expressions and physical activities across multiple states. These works will be developed using Deep Learning techniques, such as CNN, LSTM, and ConvLSTM.
REFERENCES
Alban, A. Q., Ayesh, M., Alhaddad, A. Y., Al-Ali, A. K.,
So, W. C., Connor, O., and Cabibihan, J.-J. (2021).
Detection of challenging behaviours of children with
autism using wearable sensors during interactions
with social robots. In 2021 30th IEEE International
Conference on Robot & Human Interactive Commu-
nication (RO-MAN), pages 852–857. IEEE.
Ambady, N. and Rosenthal, R. (1992). Thin slices of ex-
pressive behavior as predictors of interpersonal con-
sequences: A meta-analysis. The American Psycho-
logical Association, 1111(2):256–274.
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee,
C. M., Kazemzadeh, A., Lee, S., Neumann, U., and
Narayanan, S. (2004). Analysis of emotion recogni-
tion using facial expressions, speech and multimodal
information. In Proceedings of the 6th international
conference on Multimodal interfaces, pages 205–211.
Castellano, G., Kessous, L., and Caridakis, G. (2007). Mul-
timodal emotion recognition from expressive faces,
body gestures and speech. Doctoral Consortium of
ACII, Lisbon.
Ding, W., Xu, M., Huang, D., Lin, W., Dong, M., Yu, X.,
and Li, H. (2016). Audio and face video emotion
recognition in the wild using deep neural networks
and small datasets. In Proceedings of the 18th ACM
International Conference on Multimodal Interaction,
pages 506–513.
Franco, A., Magnani, A., and Maio, D. (2020). A multi-
modal approach for human activity recognition based
on skeleton and rgb data. Pattern Recognition Letters,
131:293–299.
Kessous, L., Castellano, G., and Caridakis, G. (2010). Mul-
timodal emotion recognition in speech-based interac-
tion using facial expression, body gesture and acous-
tic analysis. Journal on Multimodal User Interfaces,
3(1):33–48.
Li, H., Shrestha, A., Fioranelli, F., Le Kernec, J., Heidari,
H., Pepa, M., Cippitelli, E., Gambi, E., and Spinsante,
S. (2017). Multisensor data fusion for human activ-
ities classification and fall detection. In 2017 IEEE
SENSORS, pages 1–3. IEEE.
Lin, Y.-S., Gau, S. S.-F., and Lee, C.-C. (2020). A mul-
timodal interlocutor-modulated attentional blstm for
classifying autism subgroups during clinical inter-
views. IEEE Journal of Selected Topics in Signal Pro-
cessing, 14(2):299–311.
Masmoudi, M., Jarraya, S. K., and Hammami, M. (2019).
Meltdowncrisis: Dataset of autistic children during
meltdown crisis. In 15th International Conference on
Signal-Image Technology and Internet-Based Systems
(SITIS), pages 239–246. IEEE.
Masurelle, A., Essid, S., and Richard, G. (2013). Multi-
modal classification of dance movements using body
joint trajectories and step sounds. In 2013 14th inter-
national workshop on image analysis for multimedia
interactive services (WIAMIS), pages 1–4. IEEE.
Metri, P., Ghorpade, J., and Butalia, A. (2011). Facial emo-
tion recognition using context based multimodal ap-
proach.
Ouyang, X., Kawaai, S., Goh, E. G. H., Shen, S., Ding,
W., Ming, H., and Huang, D.-Y. (2017). Audio-visual
emotion recognition using deep transfer learning and
multiple temporal models. In Proceedings of the 19th
ACM International Conference on Multimodal Inter-
action, pages 577–582.
Pandeya, Y. R. and Lee, J. (2021). Deep learning-based late
fusion of multimodal information for emotion classi-
fication of music video. Multimedia Tools and Appli-
cations, 80(2):2887–2905.
Pimpalkar, A., Nagalkar, C., Waghmare, S., and Ingole, K.
(2014). Thin slices of expressive behavior as predic-
tors of interpersonal consequences: A meta-analysis.
International Journal of Computing and Technology
(IJCAT), 1(2).
Psaltis, A., Kaza, K., Stefanidis, K., Thermos, S., Apos-
tolakis, K. C., Dimitropoulos, K., and Daras, P.
(2019). Multimodal affective state recognition in se-
rious games applications. In 2019 IEEE International
Conference on Imaging Systems and Techniques (IST),
pages 435–439. IEEE.
Radoi, A., Birhala, A., Ristea, N.-C., and Dutu, L.-C.
(2021). An end-to-end emotion recognition frame-
work based on temporal aggregation of multimodal
information. IEEE Access, 9:135559–135570.
Tian, J., Cheng, W., Sun, Y., Li, G., Jiang, D., Jiang,
G., Tao, B., Zhao, H., and Chen, D. (2020). Ges-
ture recognition based on multilevel multimodal fea-
ture fusion. Journal of Intelligent & Fuzzy Systems,
38(3):2539–2550.
Yu, J., Gao, H., Yang, W., Jiang, Y., Chin, W., Kubota, N.,
and Ju, Z. (2020). A discriminative deep model with
feature fusion and temporal attention for human action
recognition. IEEE Access, 8:43243–43255.
Zhang, S., Zhang, S., Huang, T., and Gao, W. (2016). Mul-
timodal deep convolutional neural network for audio-
visual emotion recognition. In Proceedings of the
2016 ACM on International Conference on Multime-
dia Retrieval, pages 281–284.
Zhu, H., Wang, Z., Shi, Y., Hua, Y., Xu, G., and Deng,
L. (2020). Multimodal fusion method based on self-
attention mechanism. Wireless Communications and
Mobile Computing, 2020.