AI-Driven Early Mental Health Screening: Analyzing Selfies of Pregnant Women

Gustavo A. Basílio (1,a), Thiago B. Pereira (1,b), Alessandro L. Koerich (2,c), Hermano Tavares (3,d), Ludmila Dias (1,e), Maria G. S. Teixeira (4,f), Rafael T. Sousa (5,g), Wilian H. Hisatugu (4,h), Amanda S. Mota (3,i), Anilton S. Garcia (6), Marco Aurélio K. Galletta (7,j) and Thiago M. Paixão (1,k)

1 Federal Institute of Espírito Santo (IFES), Campus Serra, Serra, Brazil
2 École de Technologie Supérieure (ÉTS), Montreal, Canada
3 Department of Psychiatry, University of São Paulo Medical School (FMUSP), São Paulo, Brazil
4 Department of Computing and Electronics, Federal University of Espírito Santo (UFES), Campus São Mateus, São Mateus, Brazil
5 Federal University of Mato Grosso (UFMT), Barra do Garças, Brazil
6 Federal University of Espírito Santo (UFES), Campus Goiabeiras, Vitória, Brazil
7 Department of Obstetrics and Gynecology, University of São Paulo Medical School (FMUSP), São Paulo, Brazil
Keywords:
Mobile Health, Mental Health, Pregnancy Healthcare, Affective Computing, Facial Analysis, Convolutional
Neural Networks, Visual-Language Models, Deep Learning.
Abstract:
Major Depressive Disorder and anxiety disorders affect millions of people globally, contributing significantly to the burden of mental health issues. Early screening is crucial for effective intervention, as timely identification can significantly improve treatment outcomes. Artificial intelligence (AI) can be valuable for improving the screening of mental disorders, and AI-driven screening can leverage the analysis of multiple data sources, including facial features in digital images. However, existing methods often rely on controlled environments or specialized equipment, limiting their broad applicability. This study explores the potential of AI models for ubiquitous depression-anxiety screening given face-centric selfies. The investigation focuses on high-risk pregnant patients, a population particularly vulnerable to mental health issues. To cope with the limited training data resulting from our clinical setup, pre-trained models were utilized in two different approaches: fine-tuning convolutional neural networks (CNNs) originally designed for facial expression recognition, and employing vision-language models (VLMs) for zero-shot analysis of facial expressions. Experimental results indicate that the proposed VLM-based method significantly outperforms the CNNs, achieving an accuracy of 77.6%. Although there is significant room for improvement, the results suggest that VLMs can be a promising approach for mental health screening.
a https://orcid.org/0009-0008-6329-9966
b https://orcid.org/0009-0009-6652-2913
c https://orcid.org/0000-0001-5879-7014
d https://orcid.org/0000-0002-6632-2745
e https://orcid.org/0009-0007-6486-6259
f https://orcid.org/0000-0003-0398-8029
g https://orcid.org/0000-0001-5998-046X
h https://orcid.org/0000-0001-8333-0539
i https://orcid.org/0000-0003-1553-5297
j https://orcid.org/0000-0002-0016-1803
k https://orcid.org/0000-0003-1554-6834
1 INTRODUCTION
Major depressive disorder (depression) is a prevalent
mental health disorder affecting over 280 million peo-
ple globally and a leading cause of disability world-
wide. It contributes significantly to the global dis-
ease burden (World Health Organization, 2023). Sim-
ilarly, anxiety disorders, marked by excessive worry
and fear, impact millions of lives, often leading to
debilitating effects on daily functioning and quality
of life. In particular, there is a recognized vulnerabil-
ity to mood disorders during pregnancy, with possible
worsening of previous emotional conditions and the
emergence of new altered mental states that increase
the risk of depression and anxiety (Biaggi et al.,
2016). This vulnerability is certainly influenced by
the significant hormonal changes of pregnancy. How-
ever, it is also important to recognize that this is a time
of great physical, psycho-emotional, cultural, and so-
cial changes, generating psychic stress and increased
anxiety.
Several studies have pointed to the association
between depression and anxiety during pregnancy
with unfavorable obstetric and neonatal outcomes.
These disorders can increase the risk of obstetric
complications such as cesarean delivery, preeclamp-
sia, preterm birth, low birth weight, small for ges-
tational age newborns, and newborns with low Ap-
gar scores, indicating lower oxygenation at birth (Li
et al., 2021; Nasreen et al., 2019; Dowse et al., 2020;
Kurki et al., 2000). These findings underline the im-
portance of making a correct and early diagnosis, al-
lowing for appropriate psychiatric and psychological
follow-up during pregnancy to improve both maternal
and neonatal outcomes.
For early intervention and improved treatment
outcomes, screening processes are essential for iden-
tifying individuals who may have conditions such as
depression or anxiety disorders before they present
symptoms or seek treatment (Thombs et al., 2023).
Various methods can be employed for screening, in-
cluding self-reported questionnaires, clinical inter-
views with mental health professionals, and observ-
ing behaviors and physical symptoms. Additionally,
studies in affective computing, a field that explores
the interaction between human emotions and compu-
tational systems, have demonstrated the potential of
artificial intelligence (AI) in screening mental disor-
ders (Kumar et al., 2024). AI-driven technologies can
enhance screening by analyzing data from multiple
sources, such as text inputs, voice recordings, or fa-
cial expressions (Liu et al., 2024c), which is the very
focus of this work.
In this work, we address the challenge of ubiq-
uitous depression-anxiety screening from face-centric
selfie images in a real-world clinical setting involving
high-risk pregnant patients. This effort is part of a
broader research project led by our group, aimed at
developing mobile applications for mental health as-
sessment. In our application scenario, the user takes
a selfie with a smartphone front-facing camera. The
application sends the image to a server, where an
AI model analyzes it and provides a label indicat-
ing whether the patient is normal or has symptoms of
depression-anxiety. The server returns the label to the
application, providing feedback to the user. To train
and evaluate the models, we use image data (selfies)
and responses to the Patient Health Questionnaire-4
(PHQ-4), a brief screening tool for anxiety and de-
pression. The PHQ-4 responses are used to derive im-
age labels – normal or abnormal (depression-anxiety)
–, which play the role of supervisory signals during
training and ground truth for evaluation. To the best
of our knowledge, this is the first study that investi-
gates the use of face-centric selfies for screening anx-
iety and depression in high-risk pregnant patients in a
clinical context.
In line with the state-of-the-art, deep models are
employed in facial analysis. Pre-trained models are
leveraged using two distinct approaches to address
the challenge of limited training data (most partici-
pants contributed with only a single photo). In a more
traditional approach, we employ a transfer learning
strategy where convolutional neural networks (CNNs)
originally trained for facial expression recognition
are fine-tuned for depression-anxiety detection. In a
more innovative approach, we propose using power-
ful vision-language models (VLMs) as a facial ana-
lyzer. While VLMs are not suitable for directly as-
sessing depression-anxiety, as demonstrated in our
experiments, they excel in zero-shot detailed descrip-
tions of facial expressions that correspond to basic
emotions (e.g., anger, happiness, and sadness). This
approach detects depression-anxiety by classifying
the generated text with simple neural networks in-
stead of directly classifying the image. Experiments conducted under a strict Leave-One-Subject-Out protocol revealed that our VLM-based approach outperformed the traditional CNNs, achieving an accuracy of 77.6%, a gain of approximately 10.0 percentage points (p.p.) over the CNNs, and an F1-score of 56.0%, an improvement of approximately 11.0 p.p.
In summary, the main contributions of this work
are:
- A novel VLM-based approach for ubiquitous mental-health screening from selfies.
- A study on the use of AI for depression-anxiety screening in high-risk pregnant patients.
- Collection of a dataset comprising selfies and PHQ-4 responses from high-risk pregnant patients (anonymized descriptions produced by the VLMs and corresponding PHQ-4 responses are publicly available at https://bit.ly/3E1EPKw).
- Comprehensive assessment of VLMs and CNNs for depression-anxiety screening with limited data.
The rest of this paper is organized as follows. The
next section discusses the related works. Section 3
describes the two approaches of the AI-driven screening methodology. Section 4 presents the experimental
setup, while Section 5 discusses the results. Finally,
Section 6 concludes the paper and discusses future
work opportunities.
2 RELATED WORK
The use of facial features to identify non-basic emo-
tions (depression, stress, engagement, shame, guilt,
envy, among others) is becoming increasingly popular
in various applications. Nepal et al. (2024) highlight
several elements explored in this type of analysis,
such as facial expression itself, gaze direction, as well
as general image characteristics like luminosity and
background settings. In this domain, machine learn-
ing (ML) plays an essential role, particularly with the
use of deep learning (DL) models and related tech-
niques, such as transfer learning and attention mech-
anisms. As highlighted by Kumar et al. (2024), DL
models have become the primary choice for detect-
ing non-basic emotions since 2020, surpassing tradi-
tional machine learning and image processing algo-
rithms. Regarding models, convolutional neural net-
works (CNNs) are the most popular choice for image-
based analysis of non-basic emotions. For instance,
Gupta et al. (2023) used CNNs to quantify student en-
gagement in online learning, while Zhou et al. (2018)
applied them for detecting depression based on facial
images. Transfer learning plays a crucial role as pre-
trained models for tasks like basic facial expression
recognition can be fine-tuned for specific targets like
detecting stress (Voleti et al., 2024). Usually, pre-
training leverages larger datasets, which is particu-
larly useful when the target dataset is small, as is of-
ten the case in mental health applications. Attention
mechanisms are also relevant in this context, as they
can help models focus on specific regions of the face
that are more informative for the target task (Viegas
et al., 2018; Belharbi et al., 2024).
Despite the promising results, most studies on de-
pression detection through facial expressions involve
capturing images in controlled environments where
individuals follow a predetermined script, which lim-
its the realism of the facial expressions (Liu et al.,
2022; Kong et al., 2022) and the broad applicabil-
ity of pre-trained models. Nonetheless, a few stud-
ies focus on natural images captured by smartphone
cameras for ubiquitous screening. Darvariu et al.
(2020) developed an application where users record
their emotional state through photographs taken by
the rear camera of smartphones. Alternatively, front-
facing cameras can facilitate the capture of facial im-
ages, whose analysis can offer valuable insights into
a person’s emotional state (Wang et al., 2015). In re-
cent work, Nepal et al. (2024) introduced a depression
screening approach named MoodCapture, which was
evaluated on a large dataset of facial images collected
in the wild (over 125,000 photos). A key aspect of
MoodCapture is the passive collection of images in
multiple shots while users complete a self-assessment
depression questionnaire on their smartphones. The
authors argue that these images are preferable to tra-
ditional selfies as they capture more authentic and un-
guarded facial expressions.
Similarly to MoodCapture, the proposed work fo-
cuses on analyzing selfies captured by smartphone
cameras for anxiety and depression screening. How-
ever, our study differs from MoodCapture in two ma-
jor aspects. First, our work targets a specific popu-
lation: high-risk pregnant patients accompanied by a
team of healthcare professionals. This imposes a lim-
itation on the dataset size when compared to crowd-
sourced data. Second, our methodology focuses on
face-centric selfies rather than analyzing general im-
age aspects. To the best of our knowledge, this is the
first study to explore the use of face-centric selfies for
anxiety and depression screening in high-risk preg-
nant patients. Another relevant aspect of our work is
the use of vision-language models (VLMs). Unlike
traditional models that predict a limited set of basic
emotions, VLMs can capture more nuanced emotions
such as awe, shame, and emotional suppression (Bian
et al., 2024). Although VLMs have been used for general emotion analysis, their particular application to mental health screening is another contribution of our work.
3 AI-DRIVEN SCREENING VIA
SELFIE ANALYSIS
This study relies on selfies taken by pregnant pa-
tients paired with their responses to the Patient Health
Questionnaire-4 (PHQ-4), a four-item instrument for
brief screening of anxiety and depression. The PHQ-
4 was chosen for this study due to its efficiency and
brevity in screening for both anxiety and depression,
making it particularly useful in clinical settings where
time is limited. Comprising only four items (Figure
1), it integrates questions from the PHQ-2 and GAD-
2, which are validated tools for detecting depression
and anxiety, respectively. This dual focus allows for a
rapid assessment of two of the most prevalent mental
health conditions, while maintaining strong psycho-
metric properties, making it a reliable and practical
tool for screening purposes in a clinical sample. The
Figure 1: Patient Health Questionnaire-4 (PHQ-4) items and the respective frequency scores. Over the last 2 weeks, how often have you been bothered by the following problems? Each item is scored 0 (Not at all), 1 (Several days), 2 (More than half the days), or 3 (Nearly every day):
- Feeling nervous, anxious or on edge
- Not being able to stop or control worrying
- Little interest or pleasure in doing things
- Feeling down, depressed, or hopeless
Figure 2: Overview of the proposed methodology for depression-anxiety detection using selfies and PHQ-4 responses. The
training pipeline (top flow) involves filtering invalid selfies, cropping face regions with MTCNN, and labeling images based
on PHQ-4 scores. The model (CNN or FFNN) is trained either directly from images (CNN-based) or text descriptions (VLM-
based). The test pipeline (bottom flow) uses the trained model to classify new selfies.
total score, calculated as the sum of individual scores,
ranges from 0 to 12, with higher scores indicating
greater severity of anxiety and depression symptoms.
A score of 6 or higher on the PHQ-4 is typically used
as a cut-off point for identifying cases where either
anxiety or depression (or both) may be present and
warrant further clinical evaluation (Caro-Fuentes and
Sanabria-Mazo, 2024). This cut-off point helps to
identify individuals with moderate to severe symp-
toms of either condition, ensuring efficient screening,
including mixed anxiety-depression states, which are
fairly common among pregnant women. It allows for
initial assessment without the need to differentiate be-
tween the two disorders (Javadekar et al., 2023).
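To make the labeling rule concrete, the following Python sketch derives a binary label from the four item scores using this cut-off; the function name and the example scores are illustrative only.

    # Sketch: derive a binary depression-anxiety label from the four PHQ-4 item scores.
    # Each item is scored 0-3; a total score >= 6 is treated as positive (abnormal).
    from typing import Sequence

    PHQ4_CUTOFF = 6

    def phq4_label(item_scores: Sequence[int]) -> int:
        """Return 1 (abnormal) if the PHQ-4 total reaches the cut-off, else 0 (normal)."""
        assert len(item_scores) == 4 and all(0 <= s <= 3 for s in item_scores)
        total = sum(item_scores)  # ranges from 0 to 12
        return int(total >= PHQ4_CUTOFF)

    print(phq4_label([2, 2, 1, 2]))  # total 7 -> 1 (abnormal)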
To enable AI-driven screening, we propose a
methodology that leverages selfies and PHQ-4 re-
sponses to train machine learning (ML) models for
depression-anxiety detection. Figure 2 provides a
joint overview of the VLM- and CNN-based ap-
proaches addressed in this work. There are two
pipelines: the training pipeline (top flow) and the test
pipeline (bottom flow). The training pipeline lever-
ages image and PHQ-4 data to train an ML model
for depression-anxiety detection. The training re-
quires an image dataset comprising faces and the re-
spective labels (0-normal, 1-abnormal). To assem-
ble this dataset, the first step is manually filtering out
invalid selfies, which are those taken with the rear-
facing camera or with faces covered by a mask. The
face region of the remaining samples is subsequently
Figure 3: Zero-shot description generation with VLMs. The VLM prompt consists of an image (cropped face from a selfie) and a text instruction: “Describe in detail the emotional state of the person in the photo based on her/his facial expression. Provide straight sentences in your answers.” Example of a generated description: “The person in the photo appears to be experiencing a negative emotional state. Her facial expression suggests a sense of displeasure or dissatisfaction. Her eyebrows are slightly furrowed, which typically indicates concern or frustration.” A description is generated for each face image in the image dataset. The label from each source image is transferred to the respective generated description, giving rise to an annotated dataset of textual descriptions.
cropped by using a multi-task cascaded convolutional
network (MTCNN) (Zhang et al., 2016). Selfies with
multiple detected faces are discarded, as the PHQ-4
responses are individual. As previously explained, a
label is derived for each face image by thresholding
the PHQ-4 overall score.
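A minimal sketch of this filtering and cropping step is shown below; it assumes the MTCNN implementation from the facenet-pytorch package (the paper does not state which implementation was used) and discards selfies in which zero or more than one face is detected.

    # Sketch of face detection and cropping with MTCNN (facenet-pytorch assumed).
    # Selfies with no face or multiple faces are discarded, since PHQ-4 responses are individual.
    from facenet_pytorch import MTCNN
    from PIL import Image

    mtcnn = MTCNN(keep_all=True)  # keep all detections so multi-face selfies can be rejected

    def crop_single_face(path):
        img = Image.open(path).convert("RGB")
        boxes, _ = mtcnn.detect(img)            # bounding boxes (or None if no face found)
        if boxes is None or len(boxes) != 1:    # discard invalid selfies
            return None
        x1, y1, x2, y2 = (int(v) for v in boxes[0])
        return img.crop((x1, y1, x2, y2))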
With a valid set of face images and labels, an ML
model can be trained. In the CNN-based approach,
the face images are directly input into the classifi-
cation model. The VLM-based approach, by con-
trast, involves generating descriptions of emotional
states through face analysis and then extracting fea-
tures from these textual descriptions. These features
are subsequently used as inputs for the classifica-
tion head, specifically a feed-forward neural network
(FFNN). Once the model (CNN or FFNN) is trained,
it can be used to classify an input selfie, as illustrated
in the bottom flow of Figure 2. The face region in the
selfie is cropped using the MTCNN, as in the top flow.
The face image (or text features) is then input to the
trained model, which outputs class probabilities. The
depression-anxiety condition is predicted if, and only if, Pr(abnormal | sample) > 0.5. More details of the
two detection approaches are provided in the follow-
ing sections.
3.1 CNN-Based Approach
This approach employs CNNs for predicting
depression-anxiety directly from image data, as
illustrated in Figure 2. Training from scratch is infeasible in our context due to the reduced number of selfies/PHQ-4 responses: 147 samples from a total of 108 participants.
we start with CNN models pre-trained on the Im-
ageNet dataset and fine-tune them on large-scale
facial expression recognition (FER) datasets, which
are closely related to our target task. A second
fine-tuning is performed to adapt the FER pre-trained
models to our task.
This work investigates the adaptation of models pre-trained on the FER2013 (Goodfellow et al., 2013) and RAF-DB (Li et al., 2017) datasets, both
of which encompass seven basic emotions (classes):
anger, disgust, fear, happiness, sadness, surprise, and
neutral state. Four CNN architectures were investi-
gated: EfficientNetV2 (Tan and Le, 2021), ResNet-
18 and ResNet-50 (He et al., 2016), and VGG11 (Si-
monyan and Zisserman, 2014). In the second fine-
tuning stage, the classifier consists of a CNN back-
bone pre-trained on FER or RAF-DB appended with
a 2-output fully connected layer (classification head).
The backbone is frozen during training to prevent
overfitting, ensuring that only the classification head
is trained.
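A sketch of the second fine-tuning stage in PyTorch is given below; for brevity, the backbone is torchvision's ImageNet ResNet-18, whereas in the actual pipeline the backbone weights come from the FER2013/RAF-DB pre-training, and the hyperparameters mirror the defaults reported in Section 4.2.

    # Sketch: freeze a pre-trained backbone and train only a 2-output classification head.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)  # stand-in backbone
    for param in model.parameters():
        param.requires_grad = False                    # frozen backbone to prevent overfitting
    model.fc = nn.Linear(model.fc.in_features, 2)      # trainable 2-class head (normal/abnormal)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4, weight_decay=1e-5)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()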
3.2 VLM-Based Approach
This approach leverages large generative vision-
language models (VLMs) for facial analysis. Mod-
ern VLMs (Bordes et al., 2024) combine visual en-
coders with large language models (LLMs) and can
simultaneously learn from images and text. This en-
ables them to perform various tasks, such as answer-
ing visual questions and captioning images. In partic-
ular, we benefit from the VLM zero-shot instruction-
following ability to produce high-quality descrip-
tions when provided with an image and text prompt.
Three modern VLMs were investigated in this work:
LLaVA-NeXT (Liu et al., 2024a), which is an im-
provement on LLaVA (Liu et al., 2024b), Kosmos-2
(Peng et al., 2023), and the proprietary GPT-4o.
Figure 3 illustrates the intended usage of VLMs
in this work. Instead of using the image-labeled
dataset directly, textual descriptions (in the form of
sentences) are generated by following the instruction
Figure 4: Data distribution after removing invalid samples.
Most subjects contributed with a single sample, while one
contributed with nine. The imbalance is evident, with a bias
towards negative samples (PHQ-4 < 6).
in the input prompt: “Describe in detail the emotional state of the person in the photo based on her/his facial expression. Provide straight sentences in your answers.” It is worth mentioning that the prompt does not address the target task directly, i.e., detecting anxiety and/or depression. In fact, experimental results (Section 5) reveal that the pre-trained VLMs were unable to directly predict the depression-anxiety condition. This motivated using VLMs as analysts rather than judges, delegating the final decision to a secondary model, specifically an FFNN.
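As a rough sketch, description generation with the open-source LLaVA-NeXT checkpoint listed in Section 4.4 could look as follows; the chat template, dtype, and generation length are assumptions, and greedy decoding stands in for the deterministic inference described in Section 4.2.

    # Sketch of zero-shot description generation with LLaVA-NeXT (Hugging Face transformers).
    # Only the instruction text and the checkpoint id come from the paper; the rest is assumed.
    import torch
    from PIL import Image
    from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

    MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
    processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
    model = LlavaNextForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16).to("cuda")

    INSTRUCTION = ("Describe in detail the emotional state of the person in the photo based on "
                   "her/his facial expression. Provide straight sentences in your answers.")

    def describe_face(face: Image.Image) -> str:
        prompt = f"[INST] <image>\n{INSTRUCTION} [/INST]"   # Mistral-style chat template (assumed)
        inputs = processor(images=face, text=prompt, return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
        return processor.decode(out[0], skip_special_tokens=True)  # includes the prompt; strip it in practice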
To build the classification model, features are
extracted from the text descriptions by using a
pre-trained Sentence-BERT (Reimers and Gurevych,
2019) model called all-MiniLM-L6-v2, which is de-
signed to embed sentences and paragraphs into a 384-
dimensional representation. Two FFNN architectures
(default and alternative) are investigated in this work,
as expressed in Equation (1):

    f_net(x) = W x + b                              (default)
    f_net(x) = W_2 ReLU(W_1 x + b_1) + b_2          (alternative)        (1)

where W_i and b_i denote a weight matrix and a bias vector, respectively. The default model is a simple linear classifier, while the alternative model includes a hidden layer and is explored in a sensitivity analysis in the experiments. The logits of the network, f_net(x), are used to calculate the class probabilities through softmax normalization: Pr(· | x) = softmax(f_net(x)).
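A minimal sketch of the feature-extraction and default classification steps, using the sentence-transformers package for all-MiniLM-L6-v2 and PyTorch for the linear head (variable names and example descriptions are illustrative):

    # Sketch: embed VLM-generated descriptions with Sentence-BERT and apply the default linear head.
    import torch
    import torch.nn as nn
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional sentence embeddings

    descriptions = ["The person appears calm and relaxed.",
                    "Her eyebrows are furrowed, suggesting concern or distress."]
    x = torch.tensor(encoder.encode(descriptions))      # shape: (N, 384)

    head = nn.Linear(384, 2)                            # default model: f_net(x) = Wx + b
    probs = torch.softmax(head(x), dim=-1)              # Pr(class | x) via softmax normalization
    print(probs[:, 1])                                  # probability of the abnormal class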
4 EXPERIMENTAL
METHODOLOGY
This section outlines the elements of the experimental
methodology: the data collection procedure involving
selfies and PHQ-4 responses, the experiments con-
ducted, and, finally, the hardware and software setup.
4.1 Data Collection
Selfies and PHQ-4 responses constitute a subset of
data collected for a research project (local ethics
board consent: CAAE 64158717.9.0000.0065) in-
volving high-risk pregnant patients from the Clinical
Hospital of the University of São Paulo (São Paulo,
Brazil). The research focused on pregnant women
aged 18 and above who could provide informed con-
sent, had at least an elementary education, and owned
a smartphone. Data collection utilized smartphones,
with the participants capturing natural photos (selfies)
in such a way that their faces were visible. They were
responsible for uploading image data to the REDCap
platform (Harris et al., 2009) instance hosted at the
University of São Paulo, which was also utilized for
completing the PHQ-4 questionnaire.
For this study, 108 participants were recruited and
instructed to submit bi-weekly reports, each consist-
ing of the PHQ-4 questionnaire (the Portuguese version of the PHQ-4 used in this work is provided by Pfizer, Copyright © 2005 Pfizer Inc.) and a selfie. Figure 4
illustrates the data distribution after manually remov-
ing invalid selfies. The left chart displays a total of
147 selfies from the 108 participants. Although each
participant was expected to contribute multiple sam-
ples (11 in total), most participants submitted only a
single selfie. The right-side chart highlights the im-
balance in the data, with a bias towards negative sam-
ples (PHQ-4 < 6): 106 negative versus 41 positive
samples.
4.2 Comparative Evaluation
The comparative evaluation aims to assess the perfor-
mance of the CNN- and VLM-based approaches for
detecting depression-anxiety in selfies. Recall that while a CNN architecture is trained in the first approach, an FFNN architecture is trained in the second. The performance of pre-trained VLMs for
zero-shot screening was also evaluated to serve as a
baseline for the VLM-based approach. To conduct
the experiments, the collected data was processed to
create an image dataset consisting of cropped faces
and labels, as discussed in Section 3. The CNN and
FFNN models were trained and tested under a Leave
One Subject Out (LOSO) cross-validation protocol
to avoid performance overestimation. This protocol
ensures that each subject’s data is used as a unique
test set while the remaining data forms the training
set, providing a robust measure of model performance
across different individuals.
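The protocol can be reproduced with scikit-learn's LeaveOneGroupOut splitter using subject identifiers as groups; in the sketch below, a logistic-regression classifier stands in for the actual CNN/FFNN models, and the arrays are synthetic placeholders.

    # Sketch of Leave-One-Subject-Out (LOSO) cross-validation with scikit-learn.
    # All samples from one subject form the test fold; the remaining samples form the training set.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneGroupOut

    rng = np.random.default_rng(0)
    X = rng.random((147, 384))                 # placeholder features (e.g., sentence embeddings)
    y = rng.integers(0, 2, size=147)           # placeholder binary labels
    groups = rng.integers(0, 108, size=147)    # placeholder subject ids

    predictions = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        predictions[test_idx] = clf.predict(X[test_idx])
    # precision, recall, F1-score, AUC, and accuracy are then computed over all predictions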
In each training-testing session, the samples of
a single subject are classified, and the predictions
(normal or abnormal) are recorded. Once all pre-
dictions are available, performance metrics are cal-
culated. Traditional metrics in binary classification
are utilized: precision, recall, F1-score, area under
ROC curve (AUC), and accuracy. F1-score is par-
ticularly relevant because it accounts for the dataset
imbalance, a characteristic of our data. Specific im-
plementation details for each depression-anxiety de-
tection approach are provided below.
CNN-Based Approach. The pre-training on both FER2013 and RAF-DB was conducted for 30 epochs with a fixed learning rate of 10^-4 and an ℓ2-regularization factor of 10^-5. In the default settings, fine-tuning was conducted for 30 epochs, with a learning rate of 10^-4 and an ℓ2-regularization factor of 10^-5. Different parameters were used based on empirical evidence of overfitting during training. ResNet-18 (RAF-DB): 50 epochs, a learning rate of 2 × 10^-5, and a weight decay of 10^-6. ResNet-50 (RAF-DB): dropout regularization with a probability of 50%. Pre-training and fine-tuning utilized Adam optimization.
VLM-Based Approach. VLMs were configured
for deterministic inference, employing a strategy that
selects the most probable next token at each step of
the generation process. The default (linear) model
(Equation 1) was adopted as the classifier. To ad-
dress the dataset imbalance, the descriptions associ-
ated with positive samples (minority class) were up-
sampled to match the number of negative samples.
For each VLM, the classification head underwent
training using the Adam optimizer for 15 epochs. Additional training parameters included a learning rate of 10^-4, a batch size of 2, and an ℓ2-regularization factor of 10^-4
. During each training session, a ran-
domly selected subset comprising 10% of the training
data was reserved for validation purposes. The best-
epoch checkpoint was determined based on achieving
the highest F1-score on the validation set.
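The class-balancing step can be sketched as simple random duplication of the minority-class descriptions; the paper does not name a specific resampling routine, so the procedure below is an assumption.

    # Sketch: upsample positive (minority) descriptions until they match the number of negatives.
    # Assumes at least one positive sample is present.
    import random

    def upsample_minority(samples, seed=0):
        """samples: list of (description, label) pairs, label 1 = positive, 0 = negative."""
        rng = random.Random(seed)
        pos = [s for s in samples if s[1] == 1]
        neg = [s for s in samples if s[1] == 0]
        extra = [rng.choice(pos) for _ in range(max(0, len(neg) - len(pos)))]  # duplicate positives
        balanced = pos + extra + neg
        rng.shuffle(balanced)
        return balanced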
Zero-Shot Screening with VLMs. A natural question arises when using powerful models such as VLMs: “Are pre-trained VLMs capable of zero-shot screening depression-anxiety?” To answer this ques-
tion, we conducted an experiment in which VLMs
were asked to classify an input image based on the
following instruction prompt: “Describe in detail the
emotional state of the person in the photo based on
her/his facial expression. Provide straight sentences
in your answers. Based on your description, clas-
sify the emotional state as either ’normal’, ’anxi-
ety’, or ’depression’. The output must be exactly
one of these words. Follow the template: Output:
{result}”. In preliminary tests, only GPT-4o strictly followed the instruction prompt, while LLaVA-NeXT
and Kosmos-2 generated longer and more complex
descriptions. Therefore, this experiment was per-
formed only for GPT-4o. GPT-4o classified each of
the 147 samples into one of the three classes: normal,
anxiety, or depression. The samples classified as anx-
iety or depression were considered positive, while the
normal samples were considered negative.
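A sketch of this zero-shot call through the OpenAI Python client is shown below; the instruction text is the prompt reported above, while the base64 image handling and output parsing are assumptions.

    # Sketch of zero-shot screening with GPT-4o via the OpenAI Python client.
    # Requires OPENAI_API_KEY in the environment; parsing of "Output: {result}" is assumed.
    import base64
    from openai import OpenAI

    client = OpenAI()
    PROMPT = ("Describe in detail the emotional state of the person in the photo based on her/his "
              "facial expression. Provide straight sentences in your answers. Based on your "
              "description, classify the emotional state as either 'normal', 'anxiety', or "
              "'depression'. The output must be exactly one of these words. "
              "Follow the template: Output: {result}")

    def zero_shot_screen(image_path):
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]}],
        )
        answer = response.choices[0].message.content          # e.g., "Output: anxiety"
        return answer.split("Output:")[-1].strip().lower()    # 'normal', 'anxiety', or 'depression'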
4.3 Sensitivity Analysis
This additional experiment explores the effects of
incorporating a hidden layer into the FFNN of the
VLM-based approach (alternative model, Equation
1). The model was evaluated with various hidden
units: h = 4, 8, 16, ..., 256. This investigation was
motivated by the promising results observed in the
VLM-based approach, as elaborated further in Sec-
tion 5. Furthermore, understanding the impact of
the classification head is relevant because it is the
only trainable component in the VLM-based pipeline.
The LOSO protocol was also employed in this in-
vestigation. Additionally, during model training, a
dropout layer was incorporated after the ReLU acti-
vation function to prevent overfitting.
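The alternative head with a hidden layer and post-ReLU dropout can be sketched as follows; the dropout probability is not reported for this experiment, so the value below is an assumption.

    # Sketch of the alternative FFNN head of Equation (1): 384-d embedding -> h hidden units
    # (ReLU) -> dropout -> 2 logits. The dropout probability is an assumption.
    import torch.nn as nn

    def build_alternative_head(h, p_drop=0.5):
        return nn.Sequential(
            nn.Linear(384, h),     # W_1 x + b_1
            nn.ReLU(),
            nn.Dropout(p=p_drop),  # inserted after the ReLU activation, as described above
            nn.Linear(h, 2),       # W_2 (.) + b_2 -> class logits
        )

    heads = {h: build_alternative_head(h) for h in (4, 8, 16, 32, 64, 128, 256)}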
4.4 Hardware-Software Setup
Hardware: Intel(R) Xeon(R) CPU @ 2.20GHz with
32GB of RAM, running Linux Ubuntu 22.04.4
LTS, and equipped with an NVIDIA L4 GPU
with 24GB of memory. Software: The source
code was written in Python 3.10, mostly using
PyTorch 2.3 for model training and inference.
The Sentence-BERT (Reimers and Gurevych, 2019) model is implemented in the SentenceTransformers library (https://sbert.net). The LLaVA-NeXT and Kosmos-2 models, llava-hf/llava-v1.6-mistral-7b-hf and microsoft/kosmos2-patch14-224, respectively, are available on HuggingFace (https://huggingface.co), while GPT-4o was accessed via the OpenAI API.
5 RESULTS AND DISCUSSION
This section presents the results of the conducted
experiments: comparative evaluation and sensitivity
analysis. Limitations and challenges are also dis-
cussed at the end of this section.
Table 1: Overall results for the comparative evaluation (%).

Model           | Pre-train | Prec. | Rec. | F1-score | AUC  | Acc.
CNN-based
ResNet-18       | FER2013   | 36.2  | 51.2 | 42.4     | 62.7 | 61.2
ResNet-50       | FER2013   | 36.5  | 46.3 | 40.9     | 60.1 | 62.6
VGG11           | FER2013   | 39.0  | 39.0 | 39.0     | 57.1 | 66.0
EfficientNetV2  | FER2013   | 39.6  | 51.2 | 44.7     | 65.7 | 64.6
ResNet-18       | RAF-DB    | 37.9  | 53.7 | 44.4     | 64.5 | 62.6
ResNet-50       | RAF-DB    | 28.0  | 34.1 | 30.8     | 52.1 | 57.1
VGG11           | RAF-DB    | 28.6  | 43.9 | 34.6     | 54.8 | 53.7
EfficientNetV2  | RAF-DB    | 41.7  | 48.8 | 45.0     | 68.4 | 66.7
VLM-based (with default linear classifier)
GPT-4o          | -         | 61.8  | 51.2 | 56.0     | 72.6 | 77.6
Kosmos-2        | -         | 40.0  | 53.7 | 45.8     | 72.0 | 64.6
LLaVA-NeXT      | -         | 41.2  | 51.2 | 45.7     | 66.2 | 66.0
Zero-shot screening
GPT-4o          | -         | 53.9  | 34.2 | 41.8     | 61.4 | 73.5

Bold values indicate the highest metric value within each approach.
5.1 Comparative Evaluation
Table 1 shows the results of the comparative eval-
uation experiment. Overall, EfficientNetV2 outper-
formed the compared models among the CNNs for
both FER2013 and RAF-DB pre-training, while GPT-
4o yielded the best performance among the VLMs.
Pre-training on RAF-DB improved most metrics for
EfficientNetV2 and ResNet-18, although a notable
decrease in performance was observed for ResNet-
50 and VGG11. Notably, ResNet-18 yielded better
performance than ResNet-50, despite its reduced size.
This suggests that larger models might require more
data to achieve better performance.
Regarding the F1-score, using RAF-DB resulted
in only a 0.3 p.p. improvement over FER2013 for
EfficientNetV2, while recall decreased by 2.4 p.p. In
medical applications, lower recall indicates the risk of
missing a significant number of actual cases, which
can lead to undiagnosed conditions and potentially
severe consequences for patient health and safety.
In this context, a remarkable gain was observed for
ResNet-18, whose recall rose from 46.3% to 53.7%
with RAF-DB. Despite its lower precision compared
to EfficientNetV2, ResNet-18 presents itself as a vi-
able alternative given the critical importance of the
recall metric in this context. Moreover, its F1-score is
only marginally lower by less than 1 p.p. compared to
EfficientNetV2. The use of CNNs represents a more
traditional way to address this problem. The obtained
results for these models reveal how challenging the
task is for the addressed scenario, specifically when
using a small dataset.
As an alternative, we proposed using pre-trained
VLMs due to their ability to analyze faces. Table 1
shows the results for the VLM-based approach with
the classification head (FFNN) in its default configu-
Figure 5: Sensitivity analysis. The FFNN classifier (VLM-based) was evaluated with various hidden units: h = 4, 8, 16, ..., 256. The dashed line in each chart represents the highest value for the respective metric as reported in Table 1.
ration. Overall, the F1-score for the three evaluated
VLMs surpassed the CNN-based models, with GPT-
4o achieving the highest value. GPT-4o outperformed
the open-source models, Kosmos-2 and LLaVA-NeXT,
across all metrics, except for the recall metric, where
Kosmos-2 achieved the same 53.7% as ResNet-18.
LLaVA-NeXT and Kosmos-2 showed a similar F1-
score, with the most significant difference observed
in the AUC metric: nearly 6 p.p. in favor of Kosmos-
2. A particularly notable result is the F1-score ob-
tained with GPT-4o, which surpasses its competitors
by a large margin, almost 10 p.p. higher, being the
only model to perform above 50% in this metric. In
summary, GPT-4o yielded the best performance con-
sidering both CNN- and VLM-based approaches.
The last row of Table 1 shows the results for the
zero-shot screening with GPT-4o. While the accu-
racy in this scenario was superior to that obtained with
LLaVA-NeXT and Kosmos-2, the recall and F1-score
demonstrated the opposite trend. Remarkably, the re-
call in zero-shot screening was 17 p.p. lower than that
obtained with the VLM-based classification models.
Comparing with GPT-4o in the VLM-based
approach, we conclude that the proposed formulation
based on description generation and classification is
crucial for achieving high performance.
5.2 Sensitivity Analysis
While Table 1 focused on the default FFNN model
(VLM-based), this experiment investigated the alter-
native (single hidden layer) FFNN model (Equation
(1)). More specifically, we analyzed the impact of increasing the model complexity by varying the number of hidden units (h = 4, 8, 16, ..., 256). Figure 5
shows the results of this experiment. The dashed line
in each chart represents the highest value for the re-
spective metric as reported in Table 1. Notably, the F1-score achieved with GPT-4o and the default classification model (56.0%), indicated by the dashed
Figure 6: Challenging cases that resulted in misclassification: (a) PHQ-4 = 8, (b) PHQ-4 = 2, (c) PHQ-4 = 7, (d) PHQ-4 = 2. The Grad-CAM attention maps highlight the areas around the mouth as the most influential for EfficientNetV2's prediction in (a) and (b).
line in the third chart (from left to right), demarcates an empirical upper bound for this metric.
The F1-score curves follow a similar pattern to the
recall curves, both exhibiting a ‘V’ shape from h = 4
to 64. Kosmos-2 outperformed the other VLMs in
recall in most cases, achieving a maximum recall of
85.3% for h = 4 (with an F1-score of 52.2%). The
high recall indicates a solid ability to flag more po-
tential cases, which is highly desirable for this type of
application. Kosmos-2 (with h = 256) and LLAvA-
NExT (with h = 32) were able to match the F1-score
of GPT-4o with the default model, however, with sig-
nificantly higher recall: 65.3 and 68% for Kosmos-2
and LLAvA-NExT, respectively, compared to 51.2%
achieved by GPT-4o. This shows that open-source
VLMs can yield competitive performance when com-
bined with a more complex classification model.
5.3 Limitations and Challenges
Learning the relationship between facial expressions
and self-reported PHQ-4 scores from single selfies
can be challenging with small datasets. When us-
ing CNNs, prediction is mainly influenced by fea-
tures around the mouth, as evidenced by the Grad-
CAM (Selvaraju et al., 2017) attention maps for Ef-
ficientNetV2 in Figures 6a and 6b. In Figure 6a,
the slight smile led the model to classify a positive
sample (PHQ-4 ≥ 6) as negative. However, facial
cues around the eyes might suggest a posed (non-
Duchenne) smile, which is not necessarily a sign of
happiness or well-being. This issue could be miti-
gated by incorporating facial action units (AUs) into
the learning process, enhancing attention guidance to
other critical regions, as recently proposed in a study
on basic emotions (Belharbi et al., 2024).
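For reference, attention maps such as those in Figure 6 can be produced with the pytorch-grad-cam package; the sketch below assumes torchvision's EfficientNetV2-S with its last feature block as the target layer, which may differ from the exact setup used in this work.

    # Sketch: Grad-CAM attention map using the pytorch-grad-cam package.
    # Model variant, weights, and target layer are assumptions; in the actual pipeline the model
    # would be the fine-tuned 2-class classifier and index 1 its abnormal class.
    import torch
    from torchvision import models
    from pytorch_grad_cam import GradCAM
    from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

    model = models.efficientnet_v2_s(weights=models.EfficientNet_V2_S_Weights.IMAGENET1K_V1)
    model.eval()
    target_layers = [model.features[-1]]           # last convolutional block (assumed)

    input_tensor = torch.randn(1, 3, 224, 224)     # placeholder for a preprocessed face crop
    cam = GradCAM(model=model, target_layers=target_layers)
    heatmap = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(1)])[0]  # HxW map
    # overlaying `heatmap` on the face image shows which regions (e.g., the mouth) drive the prediction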
Approximately 12% of the errors with GPT-4o
involve positive samples misclassified as negative
where the individual is smiling. Figure 6c shows an
example of a genuine (Duchenne) smile associated
with a PHQ-4 score of 7 (close to the threshold) that
was misclassified as negative. The description gen-
erated by GPT-4o includes terms such as “The per-
son in the photo appears to be happy”, “She is smil-
ing broadly”, and “Her eyes are slightly squinted”,
which accurately reflect the visible facial expressions.
This situation is potentially misleading even for expe-
rienced face analysts, as the individual’s mood is pos-
itive, but the PHQ-4 score is close to the threshold.
Nearly 18% of the errors with GPT-4o are related
to negative samples with PHQ-4 score lower than 3
described by the model as “neutral”. The sample in
Figure 6d (PHQ-4 = 2) was described by GPT-4o as
“neutral or somewhat weary expression”, “corners of
their mouth are slightly downturned”, “lack of en-
thusiasm”. This suggests that these negative elements in the description critically influenced the positive (abnormal) prediction, despite the overall neutral emotional state described by the VLM. From
a data perspective, multiple captures or fusion with
complementary data modalities could yield a signifi-
cant improvement in the overall performance for both
smiling and neutral expressions scenarios.
AI systems are notorious for exhibiting errors and biases across different racial groups (Nazer et al., 2023).
Despite the significance of this issue and its implica-
tions for fairness and equity, this sensitive topic was
not explored in the present study.
6 CONCLUSION
This work addressed AI-driven mental health screening in mobile applications using face-centric selfies. The scope of the study included collect-
ing a dataset of selfies paired with responses to a
self-reported questionnaire (PHQ-4) and evaluating
two depression-anxiety detection approaches. In the
comparative evaluation, the proposed VLM-based ap-
proach yielded better results than the typical trans-
fer learning with CNNs or zero-shot screening with
GPT-4o. Notably, GPT-4o achieved the best F1-score
(56%) and accuracy (77.6%), while Kosmos-2 at-
tained the highest recall (53.7%).
The sensitivity analysis demonstrated that open-
source VLMs can yield competitive performance,
nearing the F1-score of GPT-4o. When combined with more complex FFNNs, Kosmos-2 and LLaVA-NeXT achieved F1-scores of 56% but with significantly higher recall, which is particularly important in
this context. Specifically, Kosmos-2’s recall increased
to 85.3% (with an F1-score of 52.2%) by adding a 4-
unit hidden layer.
Future work will explore two main directions.
From a data perspective, we plan to collect a larger
dataset with more reliable data by developing a smart-
phone application for multiple-shot passive capture
of face images. From a methodological perspective,
smile analysis and action unit information will be in-
vestigated to address the limitations of the current
approach. Furthermore, fusion with complementary
data modalities, such as audio and text transcriptions,
will be investigated to enhance the screening perfor-
mance.
ACKNOWLEDGEMENTS
We acknowledge the support of FAPES and CAPES
(Process 2021-2S6CD, FAPES No. 132/2021)
through the PDPG (Programa de Desenvolvimento da Pós-Graduação – Parcerias Estratégicas nos Estados).
We sincerely thank Vera Lucia Carvalho Tess, Kelly
dos Santos Prado and the team at the Clinical Outpa-
tient Department for Pregnancy and Postpartum Care
at the Institute of Psychiatry at HCFMUSP for their
invaluable assistance in the recruitment and evalua-
tion of participants. Our gratitude also extends to Ana
Maria S. S. Oliveira and Luciana Cristina Ranhel for
their support in participant recruitment, and to Lucas
Batini Araujo and Herbert Dias de Souza for their
help in facilitating communication with participants
and acquiring the necessary data.
REFERENCES
Belharbi, S., Pedersoli, M., Koerich, A. L., Bacon, S., and
Granger, E. (2024). Guided interpretable facial ex-
pression recognition via spatial action unit cues. In
2024 IEEE 18th International Conference on Auto-
matic Face and Gesture Recognition (FG), pages 1–
10.
Biaggi, A., Conroy, S., Pawlby, S., and Pariante, C. M.
(2016). Identifying the women at risk of antenatal
anxiety and depression: A systematic review. Jour-
nal of affective disorders, 191:62–77.
Bian, Y., Küster, D., Liu, H., and Krumhuber, E. G.
(2024). Understanding naturalistic facial expressions
with deep learning and multimodal large language
models. Sensors, 24(1):126.
Bordes, F., Pang, R. Y., Ajay, A., Li, A. C., Bardes, A.,
Petryk, S., Mañas, O., Lin, Z., Mahmoud, A., Jayara-
man, B., et al. (2024). An introduction to vision-
language modeling. arXiv preprint arXiv:2405.17247.
Caro-Fuentes, S. and Sanabria-Mazo, J. P. (2024). A
systematic review of the psychometric properties of
the patient health questionnaire-4 (phq-4) in clinical
and non-clinical populations. J. of the Academy of
Consultation-Liaison Psychiatry.
Darvariu, V.-A., Convertino, L., Mehrotra, A., and Mu-
solesi, M. (2020). Quantifying the relationships be-
tween everyday objects and emotional states through
deep learning based image analysis using smart-
phones. Proc. of the ACM on Interactive, Mobile,
Wearable and Ubiquitous Tech., 4(1):1–21.
Dowse, E., Chan, S., Ebert, L., Wynne, O., Thomas, S.,
Jones, D., Fealy, S., Evans, T.-J., and Oldmeadow, C.
(2020). Impact of perinatal depression and anxiety on
birth outcomes: a retrospective data analysis. Mater-
nal and child health journal, 24:718–726.
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A.,
Mirza, M., Hamner, B., Cukierski, W., Tang, Y.,
Thaler, D., Lee, D.-H., et al. (2013). Challenges in
representation learning: A report on three machine
learning contests. In Neural Inf. Process.: 20th Int.
Conf., Daegu, Korea, November 3-7, 2013. Proc., Part
III 20, pages 117–124. Springer.
Gupta, S., Kumar, P., and Tekchandani, R. K. (2023). Fa-
cial emotion recognition based real-time learner en-
gagement detection system in online learning context
using deep learning models. Multimedia Tools and
Applicat., 82(8):11365–11394.
Harris, P. A., Taylor, R., Thielke, R., Payne, J., Gonza-
lez, N., and Conde, J. G. (2009). Research electronic
data capture (redcap)—a metadata-driven methodol-
ogy and workflow process for providing translational
research informatics support. J. of Biomed. Informat.,
42(2):377–381.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proc. of the
IEEE Conf. on Comput. Cision and Pattern Recogni-
tion, pages 770–778.
Javadekar, A., Karmarkar, A., Chaudhury, S., Saldanha,
D., and Patil, J. (2023). Biopsychosocial correlates
of emotional problems in women during pregnancy
and postpartum period. Industrial Psychiatry Journal,
32(Suppl 1):S141–S146.
Kong, X., Yao, Y., Wang, C., Wang, Y., Teng, J., and Qi, X.
(2022). Automatic identification of depression using
facial images with deep convolutional neural network.
Medical Science Monitor: Int. Med. J. of Experimen-
tal and Clinical Research, 28:e936409–1.
Kumar, P., Vedernikov, A., and Li, X. (2024). Measur-
ing non-typical emotions for mental health: A sur-
vey of computational approaches. arXiv preprint
arXiv:2403.08824.
Kurki, T., Hiilesmaa, V., Raitasalo, R., Mattila, H., and
Ylikorkala, O. (2000). Depression and anxiety in early
pregnancy and risk for preeclampsia. Obstetrics &
Gynecology, 95(4):487–490.
Li, H., Bowen, A., Bowen, R., Muhajarine, N., and Bal-
buena, L. (2021). Mood instability, depression, and
anxiety in pregnancy and adverse neonatal outcomes.
BMC Pregnancy and Childbirth, 21:1–9.
Li, S., Deng, W., and Du, J. (2017). Reliable crowdsourcing
and deep locality-preserving learning for expression
recognition in the wild. In Proc. of the IEEE Conf. on
Comput. Vision and Pattern Recognition, pages 2852–
2861.
Liu, D., Liu, B., Lin, T., Liu, G., Yang, G., Qi, D., Qiu,
Y., Lu, Y., Yuan, Q., Shuai, S. C., et al. (2022). Mea-
suring depression severity based on facial expression
and body movement using deep convolutional neural
network. Frontiers in psychiatry, 13:1017064.
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee,
Y. J. (2024a). Llava-next: Improved reasoning, ocr,
and world knowledge.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2024b). Visual
instruction tuning. Advances in Neural Inf. Proc. Syst.,
36.
Liu, Y., Wang, K., Wei, L., Chen, J., Zhan, Y., Tao, D., and
Chen, Z. (2024c). Affective computing for healthcare:
Recent trends, applications, challenges, and beyond.
arXiv preprint arXiv:2402.13589.
Nasreen, H. E., Pasi, H. B., Rifin, S. M., Aris, M. A. M.,
Rahman, J. A., Rus, R. M., and Edhborg, M. (2019).
Impact of maternal antepartum depressive and anxiety
symptoms on birth outcomes and mode of delivery:
a prospective cohort study in east and west coasts of
malaysia. BMC pregnancy and childbirth, 19:1–11.
Nazer, L. H., Zatarah, R., Waldrip, S., Ke, J. X. C.,
Moukheiber, M., Khanna, A. K., Hicklen, R. S.,
Moukheiber, L., Moukheiber, D., Ma, H., et al.
(2023). Bias in artificial intelligence algorithms
and recommendations for mitigation. PLOS Digital
Health, 2(6):e0000278.
Nepal, S., Pillai, A., Wang, W., Griffin, T., Collins, A. C.,
Heinz, M., Lekkas, D., Mirjafari, S., Nemesure, M.,
Price, G., et al. (2024). Moodcapture: Depression de-
tection using in-the-wild smartphone images. In Proc.
of the CHI Conf. on Human Factors in Comput. Syst.,
pages 1–18.
Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma,
S., and Wei, F. (2023). Kosmos-2: Grounding mul-
timodal large language models to the world. arXiv
preprint arXiv:2306.14824.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. arXiv
preprint arXiv:1908.10084.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2017). Grad-cam: Visual
explanations from deep networks via gradient-based
localization. In Proc. of the IEEE Int. Conf. on Com-
put. Vision, pages 618–626.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Tan, M. and Le, Q. (2021). EfficientNetV2: Smaller Models
and Faster Training. In Meila, M. and Zhang, T., edi-
tors, Proc. of the 38th Int. Conf. on Machine Learning,
volume 139 of Proc. of Machi. Learning Research,
pages 10096–10106. PMLR.
Thombs, B. D., Markham, S., Rice, D. B., and Ziegelstein,
R. C. (2023). Screening for depression and anxiety in
general practice. BMJ, 382.
Viegas, C., Lau, S.-H., Maxion, R., and Hauptmann, A.
(2018). Towards independent stress detection: A de-
pendent model using facial action units. In 2018 Int.
Conf. on Content-based Multimedia Indexing, pages
1–6. IEEE.
Voleti, S., NagaRaju, M. S., Kumar, P. V., and Prasanna, V.
(2024). Stress detection from facial expressions us-
ing transfer learning techniques. In 2024 Int. Conf. on
Distrib. Comput. and Optimization Tech., pages 1–6.
IEEE.
Wang, R., Campbell, A. T., and Zhou, X. (2015). Using
opportunistic face logging from smartphone to infer
mental health: challenges and future directions. In
Adjunct Proc. of the 2015 ACM Int. Joint Conf. on
Pervasive and Ubiquitous Comput. and Proc. of the
2015 ACM Int. Symp. on Wearable Comput., pages
683–692.
World Health Organization (2023). Depressive disorder
(depression). Accessed: 2024-04-07.
Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint
face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Let-
ters, 23(10):1499–1503.
Zhou, X., Jin, K., Shang, Y., and Guo, G. (2018). Visu-
ally interpretable representation learning for depres-
sion recognition from facial images. IEEE Trans. on
Affective Comput., 11(3):542–552.