MentalRAG: Developing an Agentic Framework for Therapeutic Support Systems
Francisco R. E. Silva 1,2,a, Pedro A. Santos 1,2,b and João Dias 2,3,4,c
1 Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, Lisboa, Portugal
2 INESC-ID, Rua Alves Redol, 9, Lisboa, Portugal
3 Faculty of Science and Technology, University of Algarve, Campus de Gambelas, Faro, Portugal
4 CISCA, Campus de Gambelas, Faro, Portugal
a https://orcid.org/0009-0005-9834-8615
b https://orcid.org/0000-0002-1369-0085
c https://orcid.org/0000-0002-1653-1821
{francisco.r.e.silva, pedro.santos}@tecnico.ulisboa.pt, jmdias@ualg.pt
Keywords:
Artificial Intelligence, Large Language Models, Mental Health, Retrieval Augmented Generation, Agentic
Workflow.
Abstract:
This paper introduces MentalRAG, a multi-agent system built upon an agentic framework designed to sup-
port mental health professionals through the automation of patient data collection and analysis. The system
effectively gathers and processes high-sensitivity mental health data from users. It employs locally run open-
source models for most tasks, while leveraging advanced state-of-the-art models for more complex analyses,
ensuring the maintenance of data anonymity. The system’s models have shown improvements in delivering
empathetic and contextually adaptive responses, particularly in sensitive contexts such as emotional distress
and crisis management. Notably, an integrated agent for detecting levels of suicidal ideation allows the sys-
tem to assess and respond sensitively to diverse levels of risk, promptly alerting mental health professionals
as needed. This innovation represents a stride towards creating a more reliable, efficient, and ethically re-
sponsible mental health support tool, capable of addressing both patient and doctor needs effectively while
minimizing associated risks.
1 INTRODUCTION
Mental health is a global issue, with millions affected
around the world. In the United States, one in five
adults experience mental illness annually, and seri-
ous conditions such as schizophrenia or bipolar dis-
order affect 1 in 25. The economic impact of mental
health disorders, such as depression and anxiety, leads
to global productivity losses estimated at $1 trillion
annually (National Alliance on Mental Illness, 2023).
As patient numbers increase, mental health profes-
sionals face overwhelming caseloads, highlighting the
urgent need for innovative solutions that can improve
care delivery while easing the workload on healthcare
providers.
Mental health, as defined by the World Health
Organization (WHO), extends beyond the absence
of mental disorders to encompass positive attributes
such as emotional resilience, psychological function-
ing, and the ability to manage everyday stresses.
It influences how individuals think, feel, and be-
have, affecting their capacity to navigate relation-
ships, make decisions, and contribute productively to
society. Given the rising rates of mental health is-
sues worldwide, understanding the complex interac-
tion between biological, psychological, and environ-
mental factors is essential. Research in psychology
seeks to identify these underlying processes, enabling
professionals to diagnose mental illnesses accurately
and develop effective interventions aimed at improv-
ing individuals’ quality of life.
In response to these challenges, we present a
system using large language models (LLMs) to as-
sist mental health professionals by automating patient
data collection and analysis. The system incorporates
LLM-powered agents that handle tasks such as chat-
ting with patients, anonymizing data, detecting sui-
cide ideation risk, and generating detailed reports.
By leveraging patient history, previous session data,
and psychologist feedback, the system personalizes
responses and clinical recommendations. It focuses
on key areas such as suicide risk assessment, diag-
nostics, and generating actionable insights, while en-
suring that final decision-making remains in the hands
of human experts. Local execution of critical system
components guarantees data security, addressing both
privacy and performance concerns.
Understanding the complexity of mental health
conversations requires advanced approaches, as stan-
dard models fail to grasp the emotional depth and sub-
tle nuances involved. Our solution uses natural lan-
guage processing techniques to handle the intricate
nature of these discussions. It combines a RAG pro-
cess, which retrieves relevant information to enrich
context, with specialized mental health datasets to im-
prove analysis. This ensures a more accurate interpre-
tation of patient emotions and intentions, providing
nuanced insights for psychological care.
The lack of tools that integrate seamlessly into
psychologists’ workflows is a challenge in mental
health. Our system overcomes this by fine-tuning
LLMs with mental health data and incorporating a
user-friendly interface that adapts to different stages
of care, from diagnosis to progress monitoring. The
integration of RAG technology enriches the model’s
understanding, providing contextually relevant infor-
mation. This allows for more effective clinical deci-
sions while keeping psychologists at the center of the
process, ensuring both efficiency and the ethical man-
agement of patient care.
Figure 1 shows an overview of our proposed sys-
tem.
Our contributions are:
- We propose, implement, and evaluate an agentic framework to collect and analyze highly sensitive mental health data from users, where most of the agents run “small” open-source models locally, while more demanding tasks are performed by powerful state-of-the-art models over anonymized data.
- We tested several open-source models and, in particular, found that the LLaMA-3-8B-Instruct (Dubey et al., 2024) model and Mental-LLaMA-3-8B-Instruct, our fine-tuned model, outperform the ChatCounselor model (Liu et al., 2023) and existing open-source models on the Counselling Bench benchmark.
This paper is organized as follows: Section 2 con-
tains an overview of the State of the Art; Section 3 de-
scribes the datasets used to fine tune and validate our
model; in Section 4 the approach is presented with de-
tailed description of the agents, techniques and solu-
tion; Section 5 shows the evaluation of the presented
approach; in Section 6 we present the conclusions.
2 RELATED WORK
In this section we present an overview of related
work, focusing on recent advances in large language models for the mental health domain and an overview of agent-centric approaches.
2.1 Mental-LLM
Recent studies have initiated evaluations of LLMs
on mental health-related tasks, with a focus on zero-
shot settings and simple prompt engineering. Prelim-
inary findings suggest that LLMs can predict men-
tal health disorders using natural language, although
with performance limitations compared to state-of-
the-art domain-specific NLP models. This perfor-
mance gap is attributed to the general-purpose nature
of existing LLMs, which are not specifically trained
for mental health tasks.
2.1.1 Mental-RoBERTa
Mental-RoBERTa and its counterpart MentalBERT
(Xu et al., 2024) are masked language models specif-
ically tailored for the domain of mental healthcare.
These models represent a significant contribution to
the field, as they are the first attempt to pre-train lan-
guage models for mental healthcare applications.
MentalBERT and MentalRoBERTa are pre-
trained on a corpus collected from Reddit, which in-
cludes discussions from various mental health-related
subreddits. This approach aims to capture the nuances
and linguistic characteristics of language used in the
context of mental health discussions, making these
models particularly suited for analyzing and under-
standing text data related to mental health conditions.
The performance of MentalRoBERTa was eval-
uated across several mental health detection tasks,
demonstrating superior results in comparison to both
general-domain models like BERT (Devlin et al.,
2019) and RoBERTa (Liu, 2019), and other domain-
specific models such as BioBERT (Lee et al., 2019)
and ClinicalBERT (Huang et al., 2019). This in-
dicates the significant advantage of domain-specific
pretraining for mental health applications.
2.1.2 Mental-FLAN-T5
The FLAN-T5 model (Chung et al., 2024) repre-
sents a significant advancement in the field of natu-
ral language processing through instruction finetun-
ing. The FLAN-T5 models have been fine-tuned on a
diverse set of instructions to enhance their zero-shot,
few-shot, and chain-of-thought reasoning capabili-
ties, outperforming their T5 (Raffel et al., 2020) coun-
Figure 1: Architecture of MentalRAG.
terparts and even surpassing the performance of larger
models like PaLM 62B (Chowdhery et al., 2023) on
certain benchmarks.
Mental-FLAN-T5 was developed by instruction
finetuning FLAN-T5 on the six tasks across the four
datasets simultaneously. This approach aimed to en-
hance the model’s capability in the mental health do-
main. The instruction finetuning involved updating
model parameters over 3 epochs using Adam opti-
mizer, with a learning rate of 2e-5 and a cosine sched-
uler warmup ratio of 0.03, utilizing eight Nvidia A100
GPUs.
Compared to larger models like GPT-3.5 (Brown
et al., 2020) and GPT-4 (OpenAI et al., 2023),
Mental-FLAN-T5 achieved significantly better results
when fine-tuned for mental health-related tasks, ex-
hibiting a 10.9% improvement in balanced accuracy
over GPT-3.5, despite being 15 times smaller. This
outcome highlights the capabilities of Mental-FLAN-
T5 in managing complex, multitask scenarios, rein-
forcing its suitability for nuanced tasks such as sui-
cide risk prediction on mental health datasets contain-
ing user-level annotations based on the Columbia Sui-
cide Severity Rating Scale (C-SSRS) guidelines (Xu et al., 2024).
2.1.3 Mental-Alpaca
Alpaca (Taori et al., 2023) is a fine-tuned language
model developed from Meta’s LLaMA 7B model
(Touvron et al., 2023a). Created by researchers
at the Center for Research on Foundation Models
(CRFM) at Stanford University, Alpaca stands out
for its cost-effective and efficient training process,
aimed at fostering academic research in instruction-
following models.
The creation of Mental-Alpaca involved
instruction-based fine-tuning on mental health-
specific tasks. This model was further refined using
a combination of mental health datasets to handle
tasks such as depression, stress, and suicidal ideation
detection. Fine-tuning with instruction learning
techniques enables Mental-Alpaca to excel in com-
plex, multi-task environments, particularly in binary
suicide ideation prediction (Xu et al., 2024).
2.2 Agentic Workflow
The Agentic Workflow is based on the concept
of multi-agent systems (MAS), where autonomous
agents collaborate to solve complex tasks by leverag-
ing their specialized capabilities. Each agent focuses
on a specific task, working in parallel or indepen-
dently to contribute to the overall objective, similar
to human teamwork. This modular approach breaks
down intricate problems into smaller tasks, allowing
agents to efficiently handle subtasks. Agents are de-
signed to dynamically adjust their actions based on
ongoing feedback, ensuring a flexible and robust sys-
tem (Singh et al., 2024). For example, agents in a
travel planning system can specialize in areas like
data collection or itinerary optimization, streamlining
the process and improving efficiency.
A key aspect of the Agentic Workflow is the in-
tegration of external tools, allowing agents to make
more informed decisions and mitigate issues like lan-
guage model hallucinations. By connecting to APIs
or specialized software, agents can retrieve accu-
rate data, enhancing their problem-solving capabil-
ities. Studies using tool-enhanced language mod-
els, such as “Gorilla” (Patil et al., 2024) and “MM-
REACT” (Yang et al., 2023), demonstrate the effec-
tiveness of this approach in expanding the range of
tasks agents can handle, from visual intelligence to
API-based queries. Additionally, agent interactions
are optimized through techniques like paralleliza-
tion and feedback loops, where agents iteratively im-
prove their performance, leading to more adaptive and
precise systems (Capital, 2024). This collaborative
and decentralized decision-making model has appli-
cations across various domains, from smart grids to
urban traffic management, and represents a step to-
ward more autonomous and sophisticated AI systems.
3 DATASETS
Since the primary distinctions between the LLaMA-2 (Touvron et al., 2023b) and LLaMA-3 models were attributable to the quality and scale of their training data, we adopted a comparable approach
by concentrating on identifying a suitable dataset to
enhance counselling-style responses. The Psych8k
dataset, developed by Liu et al. (2023), was used.
It comprises 260 English-language counseling con-
versations, extracted from recorded clinical sessions
between patients and certified professionals. The
aforementioned conversations encompass five princi-
pal subject areas: mental health concerns, stress, family and relationships, minority groups, and others.
Furthermore, four additional datasets were gath-
ered for the purpose of evaluating the models em-
ployed in a variety of tasks, see Table 1. The UMR
Suicidality Dataset (Shing et al., 2018), an effort from
the University of Maryland, represents a significant
contribution to the field of suicide risk assessment, as
it is the first to rely solely on social media content.
This dataset was developed using the r/SuicideWatch
subreddit. It enables researchers to identify anony-
mous users who may show signs of suicidality. The
data was annotated at four levels. The framework of-
fers a comprehensive evaluation of suicide risk, cat-
egorising posts as either “No Risk”, “Low Risk”,
“Moderate Risk”, or “Severe Risk”. Furthermore,
crowdsourced annotations were compared with expert
evaluations and baseline predictive modeling experi-
ments were conducted to demonstrate the potential of
this dataset to advance suicide prevention through so-
cial media analysis.
The Crisis Response QA dataset (Jin et al., 2023)
comprises a series of pivotal inquiries pertaining to
the administration of mental health crises, meticu-
lously assembled from esteemed sources such as the
“Responding to Mental Health Crisis” and “Navigat-
ing a Mental Health Crisis” manuals. These guides
provide an invaluable framework for the effective
management of mental health crises, offering best
practice guidance. The dataset comprises question-
answer pairs that have been reviewed by medical stu-
dents to ensure quality assurance.
The USMLE-Mental dataset (Jin et al., 2023) rep-
resents a specialized subset of the MedQA dataset,
comprising multiple-choice questions sourced from
professional medical board examinations, including
the United States Medical Licensing Examination
(USMLE). This dataset is centred on mental health-
related questions from the USMLE and the Step 2
Clinical Knowledge (CK) sections. A manual review
was conducted to confirm the relevance of the ex-
tracted questions to mental health. This process re-
sulted in the annotation of 727 questions, which pro-
vide valuable insights into mental health knowledge
within the medical field.
The final dataset is the Counselling Bench, which
is a comprehensive framework designed to evaluate
the performance of chatbots in psychological coun-
selling contexts. This dataset extends beyond the as-
sessment of basic question-answering capabilities to
evaluate the capacity of chatbots to replicate the sub-
tleties of counselling strategies employed in profes-
sional mental health support. The dataset comprises
229 open-ended queries, which simulate real-world
counselling conversations. This provides a robust
testing ground for chatbots to demonstrate their ca-
pability in delivering therapeutic support.
4 APPROACH
Our approach was based on the agentic workflow,
which entails equipping each agent with specialized
knowledge and distinct tasks. This permits greater
flexibility in the selection of the large language mod-
els (LLMs) for each agent and the fine-tuning of the
models for specific tasks. In our case, we performed
fine-tuning using auto-regressive training exclusively
on the Conversation Agent, thereby enhancing its
conversational and counselling capabilities. Further-
Table 1: Datasets used for model evaluation.
Task | Dataset | Format | Data Size | Language
Mental Health QA | USMLE-mental | Question-Answering | 727 | English
Mental Health QA | Crisis Response QA | Question-Answering | 153 | English
Suicide Risk Classification via Online Text | UMR Suicidality Dataset | Classification | 1000 | English
Psychological Counseling | Counseling Bench | Generation | 229 | English
more, a RAG process in the Proposal Agent was im-
plemented, enabling it to generate therapy propos-
als based on mental health clinical guidelines. This
allows the agent to update its context in real time.
We also implemented an agent that detects suicide
ideation levels, enabling the system to assess and re-
spond with sensitivity to varying levels of ideation
risk and warn the mental health professionals. Finally,
the system architecture solution was implemented,
creating each agent along with the interaction flow be-
tween them.
4.1 Solution
The Conversation Agent is the primary point of inter-
action between the user and our system, as shown in
Figure 1. It communicates with the user via a chatbot
interface or potentially a general companion agent, al-
lowing for real-time exchanges where the user sends
messages and receives responses. The agent’s pri-
mary function is to engage with the user on a subject
of their choice, supporting them through a conversa-
tional process that builds trust and facilitates the de-
tection of mental health issues.
Once the user has finished the session, the
Anonymization Agent takes over, responsible for
anonymizing the messages exchanged by the user
and the Conversation Agent. This involves replac-
ing identifiable information such as names, emails,
phone numbers, and addresses with tokens, ensuring
that personal data remain securely stored on the server
and is not shared over the Internet. By anonymizing
the data, this approach significantly improves the se-
curity and confidentiality of user data.
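For illustration, the sketch below shows one way the token-replacement step described above could work. In MentalRAG the Anonymization Agent is LLM-based (see Table 2), so the regex patterns, token format, and locally stored mapping here are simplifying assumptions rather than the actual implementation.

```python
import re

# Hypothetical, regex-based illustration of the anonymization step: identifiable
# details are swapped for tokens, and the token-to-value mapping stays on the
# local server so that only anonymized text is shared with remote models.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def anonymize(message: str) -> tuple[str, dict[str, str]]:
    """Replace PII with tokens such as [EMAIL_1]; return the text and local map."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        for i, value in enumerate(pattern.findall(message), start=1):
            token = f"[{label}_{i}]"
            mapping[token] = value            # kept locally, never sent out
            message = message.replace(value, token)
    return message, mapping

if __name__ == "__main__":
    text = "You can reach me at ana.silva@example.com or +351 912 345 678."
    anonymized, local_map = anonymize(text)
    print(anonymized)   # "You can reach me at [EMAIL_1] or [PHONE_1]."
    print(local_map)    # stored server-side, e.g. for later de-anonymization
```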
The Expert Agents play a crucial role in the
diagnosis of mental health disorders. The Symp-
toms Detection Agent focuses on identifying specific
symptoms of mental health disorders by analyzing
anonymized messages from the conversation, produc-
ing a symptom-based diagnosis along with a detailed
rationale for each symptom detected. In contrast, the
Clinical Issues Detection Agent takes a broader ap-
proach, assessing the conversation to identify over-
arching clinical mental health issues. Although both
agents process the same conversation, the Symptoms
Detection Agent focuses on individual symptoms,
whereas the Clinical Issues Detection Agent provides
a comprehensive and evidence-based evaluation of
potential clinical issues as a whole. The Suicide
Detection Agent detects whether suicide ideation is
present or not, by processing anonymized messages.
If it detects that suicide ideation is present, it assigns
a risk level between 1 and 5. The Report Agent acts as the consensus agent: its primary function is to re-
ceive diagnostics from the expert agents, compare
these diagnostics, and extract only the overlapping or
common information to produce the report. The Pro-
posal Agent is responsible for generating comprehen-
sive final proposals consisting of both a report and
suggested therapeutic interventions for the detected
symptoms, generated via the RAG process described above.
This multi-agent system enables our system to de-
tect mental health issues with greater accuracy, pro-
vide personalized support, and facilitate evidence-
based treatment plans while maintaining user privacy
and security.
Although this system is operated by independent
agents, the sensitive nature of the domain (mental
health) requires the integration of a human valida-
tor into the workflow, a technique known as “human-in-the-loop”. In this particular case, the psychologist
serves the function of validating the content that is to
be provided to the Conversation Agent as context, see
Figure 2. This specifically pertains to the therapy pro-
posal and, in addition, the psychologist offers feed-
back based on the session conducted. This approach
guarantees that the agent engaged in communication
with the patient does not receive information that has
been hallucinated or otherwise erroneous, generated
by previous agents. This ensures the accuracy and
quality of the content delivered. By incorporating the
psychologist into this process, we ensure the depend-
ability of the system in a crucial and sensitive domain
such as mental health.
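To make the overall flow concrete, the following sketch shows one possible orchestration of the agents in Figures 1 and 2, including the suicide-ideation alert and the psychologist's validation gate. The function names, risk threshold, and return structure are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable

# Each agent is modeled as a plain callable so that any local or remote LLM can
# back it (see Table 2); the wiring mirrors the flow of Figures 1 and 2.
@dataclass
class MentalRAGPipeline:
    anonymize: Callable[[str], str]
    detect_symptoms: Callable[[str], str]
    detect_clinical_issues: Callable[[str], str]
    detect_suicide_ideation: Callable[[str], int]   # 0 = absent, 1-5 = risk level
    write_report: Callable[[list[str]], str]        # consensus over expert outputs
    propose_therapy: Callable[[str], str]           # RAG over clinical guidelines
    alert_clinician: Callable[[int, str], None]
    risk_alert_threshold: int = 3                   # assumption, not stated in the paper

    def run_session(self, transcript: str,
                    psychologist_validates: Callable[[str], bool]) -> dict:
        anon = self.anonymize(transcript)
        expert_outputs = [self.detect_symptoms(anon),
                          self.detect_clinical_issues(anon)]
        risk = self.detect_suicide_ideation(anon)
        if risk >= self.risk_alert_threshold:
            self.alert_clinician(risk, anon)        # prompt human intervention
        report = self.write_report(expert_outputs)
        proposal = self.propose_therapy(report)
        # Human-in-the-loop: only a validated proposal becomes context for the
        # Conversation Agent in the next session.
        approved = psychologist_validates(proposal)
        return {"report": report, "risk": risk,
                "proposal": proposal if approved else None}
```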
4.2 Conversational Agent Fine-Tuning
To optimize the LLaMA-3-8B-Instruct model for psy-
chological counselling, the Quantized Low-Ranking
Adaptation (QLoRA) (Dettmers et al., 2023) method
was employed to facilitate efficient fine-tuning on a
single NVIDIA GeForce RTX 4090 GPU with 24GB
of memory. The model was quantized using 4-
Figure 2: Human-in-the-Loop solution for MentalRAG system.
bit precision with the NormalFloat (NF4) data type,
which enabled retention of near full-precision accu-
racy while significantly reducing memory usage. Ad-
ditionally, bfloat16 was used as the compute data
type, optimizing the model’s efficiency during train-
ing by leveraging the GPU’s tensor core.
The fine-tuning process involved training the
model for 5 epochs using a batch size of 4 per de-
vice with gradient accumulation steps set to 1 and a
learning rate of 2e-5 to optimize convergence. The
model was trained on the Psych8k dataset, compris-
ing authentic counselling conversations conducted by
licensed professionals, which was tokenized using a
maximum-length truncation strategy to ensure consis-
tency. To manage memory usage more effectively,
gradient checkpointing was implemented, allowing
the model to train on larger sequences without ex-
ceeding the memory limitations of the GPU. The au-
toregressive instruction tuning objective was used to
enhance the model’s ability to generate responses that
are professional and empathetic, with real-time mon-
itoring using wandb tracking providing insights into
the model’s performance during training.
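The configuration above can be approximated with the Hugging Face transformers and peft libraries, as sketched below. The LoRA rank, alpha, dropout, and target modules are assumptions (they are not reported here), and the training dataset is assumed to be Psych8k already tokenized with max-length truncation.

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

def build_trainer(train_dataset, data_collator):
    """Sketch of the 4-bit NF4 + QLoRA setup described in Section 4.2."""
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit weights
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute on tensor cores
    )
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, quantization_config=bnb, device_map="auto")
    model = prepare_model_for_kbit_training(model)

    # LoRA hyperparameters below are illustrative; the paper does not state them.
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    args = TrainingArguments(
        output_dir="mental-llama-3-8b-instruct",
        num_train_epochs=5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        learning_rate=2e-5,
        gradient_checkpointing=True,            # fit longer sequences in 24 GB
        bf16=True,
        report_to="wandb",                      # real-time training monitoring
        logging_steps=10,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset, data_collator=data_collator)
```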
4.3 Retrieval Augmented Generation
(RAG)
The RAG process began with the collection of clini-
cal practice guidelines for prevalent mental health dis-
orders, including bipolar disorder, eating disorders,
psychosis and schizophrenia, major depressive disor-
der, post-traumatic stress disorder, substance use dis-
order, and suicide risk. Following this, an ingestion
phase was conducted during which each document
was split into smaller chunks. These segments
were subsequently transformed into high-dimensional
vectors using the LLaMA-3 embedding model and
stored in the ChromaDB vector database for further
analysis. To generate a comprehensive final proposal
with suggested therapeutic interventions, the system
performs similarity searches to retrieve the most rel-
evant clinical information. This information is used
to contextually enrich the input report and detected
symptoms, enabling the system to generate therapy
recommendations that are well informed and aligned with established clinical practices.
Figure 3: Overview of the RAG process implemented.
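A minimal sketch of this ingestion-and-retrieval flow with ChromaDB is given below; the chunk size, collection name, and the use of ChromaDB's default embedding function (standing in for the LLaMA-3 embeddings used in the system) are illustrative assumptions.

```python
import chromadb

# Ingestion: split each clinical practice guideline into chunks and index them.
client = chromadb.PersistentClient(path="guidelines_db")
collection = client.get_or_create_collection("clinical_guidelines")

def ingest_guideline(doc_id: str, text: str, chunk_size: int = 1000) -> None:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    collection.add(
        ids=[f"{doc_id}-{n}" for n in range(len(chunks))],
        documents=chunks,
        metadatas=[{"source": doc_id}] * len(chunks),
    )

# Retrieval: enrich the Proposal Agent's input with the most similar chunks.
def build_proposal_prompt(report: str, symptoms: str, k: int = 4) -> str:
    hits = collection.query(query_texts=[f"{report}\n{symptoms}"], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    return (f"Clinical guideline excerpts:\n{context}\n\n"
            f"Report:\n{report}\n\nDetected symptoms:\n{symptoms}\n\n"
            "Suggest evidence-based therapeutic interventions.")
```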
4.4 Agents Implementation
The implementation of a variety of agents, each de-
signed to perform specific tasks within the therapeu-
tic support framework, is described in Table 2. Each
agent utilizes a large language model, such as Mental-
LLaMA-3-8B-Instruct and GPT-4o, to address a va-
riety of functions. These include conversation han-
dling, anonymization, suicide ideation detection and
clinical issue assessment. It should be noted that
small open-source models are employed in expert
agents primarily for the purpose of evaluating their
performance in specific tasks. However, these mod-
els are fully interchangeable with state-of-the-art, off-
the-shelf models, such as Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet) or GPT-4o (https://platform.openai.com/docs/models/gpt-4o), thereby providing flexibility for enhanced func-
tionality as needed. This structured approach enables
each agent to leverage model strengths that are rele-
vant to its functional requirements.
Table 2: Overview of agents and corresponding models.
Agent | Model
Conversation Agent | Mental-LLaMA-3-8B-Instruct
Anonymization Agent | LLaMA-3-8B-Instruct
Suicide Ideation Detection Agent | GPT-4o
Clinical Issues Detection Agent | LLaMA-3-8B-Instruct
Symptoms Detection Agent | Mistral-7B
Report Agent | LLaMA-3-8B-Instruct
Proposal Agent | LLaMA-3-8B-Instruct
5 EVALUATION
The evaluation process was divided into two com-
ponents: a models evaluation and an expert evalua-
tion conducted by a PhD-qualified psychiatrist. The
models evaluation was structured around three dis-
tinct tasks, namely a mental health knowledge task, a
suicide risk classification task and an emotional sup-
port task. In order to complete this assessment, four
datasets were used. Subsequently, the psychiatrist
conducted a series of manual tests followed by an in-
depth interview to gain insights into the system’s per-
formance.
5.1 Models Results
In order to evaluate the system, which follows an
agentic workflow, all agents, whether running different LLMs or sharing the same one, were tested in accordance with their predefined
tasks. To illustrate, the Conversation Agent utilizes
the instruction-tuned model, Mental-LLaMA-3-8B-
Instruct, which is evaluated in accordance with psy-
chological counselling metrics.
Models such as GPT-4o and GPT-4o-mini (Ope-
nAI et al., 2024), LLaMA-3-8B-Instruct, Mistral-7B
(Jiang et al., 2023) and our fine-tuned model, Mental-
LLaMA-3-8B-Instruct, were evaluated on most of the
tasks. A summary of the model details is shown in
Table 3.
5.1.1 Mental Health Knowledge Task
In evaluating this task, the prompt comprises the question itself and the available an-
swer options, presented in a multiple-choice question-
answering format. Given the objective nature of the
questions, accuracy was used as the evaluation met-
ric.
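A sketch of this multiple-choice evaluation loop is shown below; the prompt wording and answer-parsing rule are assumptions, and ask_model stands in for whichever agent model is being tested.

```python
import re
from typing import Callable

def format_mcq(question: str, options: dict[str, str]) -> str:
    """Build the prompt: the question followed by lettered answer options."""
    lines = [question] + [f"{letter}. {text}" for letter, text in options.items()]
    lines.append("Answer with the letter of the single best option.")
    return "\n".join(lines)

def evaluate_accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """items: [{'question': ..., 'options': {'A': ...}, 'answer': 'B'}, ...]"""
    correct = 0
    for item in items:
        reply = ask_model(format_mcq(item["question"], item["options"]))
        match = re.search(r"\b([A-E])\b", reply.upper())   # first option letter found
        predicted = match.group(1) if match else None
        correct += predicted == item["answer"]
    return correct / len(items)
```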
It is crucial to consider both the model size and
the fine-tuning techniques applied when analysing the
results from the Crisis Response QA and USMLE-
mental datasets (see Table 4). The largest model,
GPT-4o demonstrated superior performance on both
tasks, achieving a score of 93.46% on Crisis Response
QA and 81.16% on USMLE-mental. The model’s
substantial size enables it to process intricate tasks
and generalise effectively across a range of domains,
including medical reasoning and crisis intervention.
In comparison, GPT-4o-mini (6-13B parameters)
performs almost as well on Crisis Response QA with
91.50%, but exhibits a more pronounced decline on
the USMLE-mental task (66.30%). This demon-
strates that although smaller models can still demon-
strate proficiency in general tasks, their reduced ca-
pacity constrains their ability to process the complex-
ity required for specialised medical tasks.
LLaMA-3-8B-Instruct (8B parameters) demon-
strates a moderate performance, achieving 70.59%
on Crisis Response QA and 52.27% on USMLE-
mental. This indicates that mid-sized models are ca-
pable of handling general tasks but face challenges in
more specialized domains, such as medical question-
answering. Similarly, Mistral-7B (7B parameters)
achieves 81.70% on Crisis Response QA but performs
poorly on USMLE-mental (34.94%), indicating that
models of this size are less well-suited to domain-
specific tasks.
A noteworthy aspect for examination is Mental-
LLaMA-3-8B-Instruct, which was tailored to meet
the requirements of psychological counselling tasks
through the utilization of a QLoRA adapter-based
methodology (Dettmers et al., 2023). Although
the model was optimized for mental health coun-
selling, it exhibits suboptimal performance on the
USMLE-mental dataset, achieving a score of 23.11%.
This suboptimal performance can be attributed to
the QLoRA quantisation method employed for fine-
tuning. Although QLoRA is highly memory-efficient,
the process of quantizing the model’s weights to 4-bit
precision introduces quantization errors, particularly
in the presence of outliers or complex scenarios, such
as medical question-answering. Such errors impact
the model’s capacity to generalize across domains
beyond the one for which it was fine-tuned. This
is particularly relevant in the context of the Mental-
LLaMA-3-8B-Instruct model, which was fine-tuned
on a psychological counselling instruction dataset.
Consequently, it is more specialized and less capable
of handling medical reasoning tasks, such as those in
USMLE-mental.
In contrast, models such as GPT-4o-mini, which
were not subjected to QLoRA, demonstrate consider-
ably superior performance across a range of diverse
tasks. This comparison highlights the trade-off be-
tween the efficiency gained through techniques such
as QLoRA and the loss in generalization.
5.1.2 Suicide Risk Classification Task
The UMR Suicidality Dataset (Shing et al., 2018) was
used to evaluate this task and the evaluation prompt is
composed of the text to classify and the corresponding
dataset labels. The metrics used were accuracy, recall,
precision and F1-score.
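These metrics can be computed with scikit-learn as sketched below; the weighted averaging is an assumption (consistent with accuracy equalling recall in Table 5), and the model's predictions are supplied as plain label strings.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

RISK_LABELS = ["No Risk", "Low Risk", "Moderate Risk", "Severe Risk"]

def score_predictions(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Accuracy, recall, precision, and F1 over the four UMR risk labels."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=RISK_LABELS, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "recall": recall, "precision": precision, "f1": f1}
```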
This evaluation reveals significant performance
variations across models in terms of their overall
accuracy, recall, precision and F1-score, and label-
specific performance. The models evaluated, GPT-4o,
GPT-4o-mini, LLaMA-3-8B-Instruct, and Mistral-
7B, demonstrated strengths in different areas but also
showed notable shortcomings that limit their readi-
ness for real-world application.
From the general metrics (Accuracy, Recall, F1-
Score), Mistral-7B outperformed the other models
with a higher score in accuracy (37.51%) and F1-
score (35.19%), indicating its relatively better ability
to correctly classify cases. GPT-4o and GPT-4o-mini
showed lower overall performance across all metrics,
with GPT-4o scoring the lowest at 19.61% in F1-score
and GPT-4o-mini only slightly better. The low results
Table 3: Details of the models used in the evaluation phase.
Model | Modality | Parameters (Billions) | Context Length (Tokens) | Language | Access
GPT-4o | Pretrained | 175 | 32000 | English | API
GPT-4o-mini | Pretrained | 6-13 | 8000-16000 | English | API
LLaMA-3-8B-Instruct | Pretrained | 8 | 8192 | English | Weights
Mental-LLaMA-3-8B-Instruct | Fine-tuned | 8 | 8192 | English | Weights
Mistral-7B | Pretrained | 7 | 8192 | English | Weights
Table 4: PsyEval benchmark evaluation results (accuracy, %).
Model | Crisis Response QA | USMLE-mental
Mental-LLaMA-3-8B-Instruct | 71.24 | 23.11
LLaMA-3-8B-Instruct | 70.59 | 52.27
GPT-4o | 93.46 | 81.16
GPT-4o-mini | 91.50 | 66.30
Mistral-7B | 81.70 | 34.94
Figure 4: Models performance.
Table 5: Model Performance Metrics for Suicide Risk Classification.
Model | Accuracy (%) | Recall (%) | F1-Score (%) | Precision (%)
Mistral-7B | 37.51 | 37.51 | 35.19 | 36.31
LLaMA-3-8B-Instruct | 32.20 | 32.20 | 32.94 | 32.57
GPT-4o | 20.40 | 20.40 | 19.61 | 20.00
GPT-4o-mini | 19.10 | 19.10 | 19.21 | 19.15
may be attributed to the presence of OpenAI safe-
guards that prohibit the evaluation of text that may
evoke suicidal ideation. LLaMA-3-8B-Instruct per-
formed reasonably well in precision but showed room
for improvement in overall classification effectiveness
compared to Mistral-7B.
A review of the accuracy of the individual risk la-
bels (see Figure 5) revealed the following:
- The LLaMA-3-8B-Instruct model demonstrated a note-
worthy degree of accuracy in identifying cases
classified as “Severe Risk” (43.71%) and “Mod-
erate Risk” (38.29%). However, its performance
in the “No Risk” (15.70%) and “Low Risk”
(11.76%) categories was significantly lower, in-
dicating a difficulty in detecting non-risk cases.
- Mistral-7B also demonstrated proficiency in identi-
fying “Moderate Risk” and “Severe Risk” cases,
with accuracy rates of 51.59% and 25.53%, re-
spectively. However, it exhibited notable chal-
lenges in accurately distinguishing “Low Risk”
and “No Risk” scenarios, with accuracy rates of
2.12% and 15.24%, respectively.
- GPT-4o demonstrated a high degree of accuracy in
identifying individuals not at risk, with a score of
74.11% in the No Risk category. However, its per-
formance in other risk categories was markedly
poor, with low risk at 4.93% and severe risk at
9.74%. This suggests that it is unable to cope with
more subtle forms of classification.
- GPT-4o-mini showed a better overall balance than GPT-4o, with comparable strength in the “No Risk” category (63.49%) and marginally better performance in the “Low Risk” and “Moderate Risk” categories.
In general, Mistral-7B and LLaMA-3-8B-Instruct
demonstrated superior performance in identifying in-
dividuals belonging to higher-risk categories, whereas
GPT-4o exhibited a greater capacity for detecting
those with no risk.
Figure 5: Models’ performance on the suicide risk classification task (metric: accuracy, %).
Despite these insights, the overall accuracy scores
across all models remain relatively low, with Mistral-
7B reaching around 37.51% total accuracy, the high-
est among the models. LLaMA-3-8B-Instruct follows
closely behind, while GPT models lag with an overall
accuracy below 21%. These results suggest that the
overall performance is not yet robust enough for de-
ployment in real-world mental health scenarios, par-
ticularly where the stakes are high, such as suicide
prevention.
Given the sensitivity of mental health appli-
cations, where misclassification could lead to ei-
ther missed interventions or unnecessary escalations,
these models need further refinement. Additionally,
the relatively low performance of even the best mod-
els emphasizes the need for more training data and
more comprehensive fine-tuning to enhance their abil-
ity to distinguish between subtle differences in risk
levels, especially when applied to real-world suicide
ideation detection tasks.
In summary, while the models evaluated in these
classification tasks show potential, particularly in
their specialized domains, their current accuracy lev-
els are insufficient for real-world use in critical mental
health settings.
5.1.3 Emotional Support Task
The Counselling Bench dataset (Liu et al., 2023) and
its prompt design were used to evaluate this task,
structured around seven core counseling skills that
challenge large language models (LLMs) to respond
empathetically and effectively in various scenarios
such as anxiety, depression, and addiction (Jin et al.,
2023). With 229 open-ended queries, it serves as
a robust measure of LLM performance in delivering
therapeutic support. The performance of Psych8k-
7B, Mental-LLaMA-3-8B-Instruct, and LLaMA-3-
8B-Instruct was compared across key counselling
metrics, including information, direct guidance, ap-
proval and reassurance, and restatement, reflection,
and listening (see Table 6). Despite Psych8k-7B be-
ing identified as the state-of-the-art model in the orig-
inal paper (Liu et al., 2023), both LLaMA-3-based
models have since surpassed it, with notable differ-
ences in performance.
1. Information: Mental-LLaMA-3-8B-Instruct out-
performs with a score of 8.00, compared to
LLaMA-3-8B-Instruct (7.50) and Psych8k-7B
(6.25), indicating its superior accuracy in provid-
ing relevant psychological data.
2. Direct Guidance: Both Mental-LLaMA-3-8B-
Instruct and LLaMA-3-8B-Instruct score 8.67,
showing equal effectiveness in offering actionable
advice, while Psych8k-7B lags with 6.67.
3. Approval and Reassurance: Mental-LLaMA-
3-8B-Instruct (7.80) and LLaMA-3-8B-Instruct
(7.60) perform similarly, with Psych8k-7B scor-
ing 5.80, indicating a weaker capacity to provide
emotional support.
4. Restatement, Reflection, and Listening:
LLaMA-3-8B-Instruct (7.84) slightly surpasses
Mental-LLaMA-3-8B-Instruct (7.68), with
Psych8k-7B showing deficiencies at 6.12.
5. Interpretation: LLaMA-3-8B-Instruct leads
with 8.00, followed by Mental-LLaMA-3-8B-
Instruct (7.27) and Psych8k-7B (6.27), highlight-
ing the latter’s lower interpretive abilities.
6. Self-Disclosure: LLaMA-3-8B-Instruct scores
7.26, slightly higher than Mental-LLaMA-3-8B-
Instruct (7.04), while Psych8k-7B trails at 5.61,
indicating less effectiveness in personal insight.
7. Obtain Relevant Information: LLaMA-3-8B-
Instruct leads with 8.00, followed by Mental-
LLaMA-3-8B-Instruct (7.68), and Psych8k-7B
underperforms with 6.39.
In conclusion, despite previously being the state-of-
the-art model, Psych8k-7B has been surpassed by
both Mental-LLaMA-3-8B-Instruct and LLaMA-3-
8B-Instruct across almost all categories. This shift
in performance demonstrates the impact of more ef-
fective fine-tuning and model development. In par-
ticular, the LLaMA-3-based models, notably Mental-
LLaMA-3-8B-Instruct, exhibit enhanced capabilities
in critical areas such as providing accurate infor-
mation, offering actionable guidance, and provid-
ing emotional support.
Figure 6: Models’ performance on the Emotional Support Task (metric: LLM-as-judge, 0-10 scale).
In consideration of the Conversation Agent’s role in interacting with patients,
which entails the utilization of personal details, ther-
apy proposals and feedback, it is evident that Mental-
LLaMA-3-8B-Instruct is the optimal selection. It is
particularly adept at providing information and di-
rect guidance, which are essential for advising pa-
tients based on their specific therapy context. The
model’s efficacy in providing approval and reassur-
ance renders it an effective conduit for emotional sup-
port, which is a crucial element in sensitive counsel-
ing conversations.
The model’s proficiency in restatement, reflection,
listening, and interpretation ensures its capacity to re-
spond empathetically, thereby fostering rapport with
patients through the reflection of their concerns. In
comparison to Psych8k-7B, which it outperforms in
nearly all metrics, our model is better equipped to ad-
Table 6: Counselling-bench evaluation (Scale: 0 to 10).
Metric | Mental-LLaMA-3-8B-Instruct | LLaMA-3-8B-Instruct | Psych8k-7B
Information | 8.00 | 7.50 | 6.25
Direct Guidance | 8.67 | 8.67 | 6.67
Approval & Reassurance | 7.80 | 7.60 | 5.80
Restatement, Reflection & Listening | 7.68 | 7.84 | 6.12
Interpretation | 7.27 | 8.00 | 6.27
Self-disclosure | 7.04 | 7.26 | 5.61
Obtain Relevant Information | 7.68 | 8.00 | 6.39
dress the subtleties of psychological counselling in a
personalized and empathetic manner.
5.2 Expert Evaluation
The expert evaluation was conducted by a doctoral-
level psychiatrist who began by simulating patient in-
teractions with the system, addressing a range of gen-
eral mental health issues. In this process, the psychi-
atrist engaged with the system’s responses and ana-
lyzed its diagnostic assessments, reporting, and ther-
apeutic recommendations. Following this simulation,
an in-depth interview was conducted with the expert,
focusing on their overall experience and insights into
the system’s efficacy and reliability. The key findings
from this interview form the basis of the conclusions
presented in this subsection, covering aspects such
as empathy in responses, sensitivity to patient needs,
human-AI collaboration, and clinical usability.
Empathy and Response Guidelines: The psychia-
trist noted that while the system demonstrated em-
pathy, responses should be more concise to avoid
overwhelming distressed users. Emphasis should be
placed on brief, active listening interactions, incorpo-
rating reflexive listening and validation techniques to
balance information collection with reassurance.
Sensitivity and Quality of Responses: Sensitivity
in addressing patients, especially those with suicidal
ideation, was highlighted as critical. The agent should
help patients express emotions and avoid providing
unrealistic hope, instead offering grounded guidance
and promoting calming strategies.
Human-AI Collaboration: A collaborative ap-
proach to suicide risk assessment was recommended,
with both AI and clinicians evaluating the risk inde-
pendently. The integration of established risk scales
and attention to GDPR compliance were emphasized,
particularly regarding when alerts are triggered.
Suicide Risk Detection Actions: The system should
involve human intervention in high-risk cases, send-
ing alerts to clinicians and directing patients to ap-
propriate support services, while providing immedi-
ate emotional support and empathetic follow-up.
Clinical Usability: The interface was deemed intu-
itive, but improvements were suggested, such as bet-
ter organization of patient records and quicker access
to contact information for urgent cases.
Legal and Regulatory Issues: The psychiatrist
raised concerns about the system’s classification un-
der EU medical device regulations, especially in high-
risk areas like suicide prevention. Legal safeguards
were recommended to clarify responsibility in cases
of misclassification.
Future Improvements: Suggested enhancements in-
cluded adding voice interaction and a clear introduc-
tion to the system’s capabilities and limitations. The
system could also be useful in academic settings to
meet the growing demand for mental health services.
5.3 Discussion
The fine-tuned Mental-LLaMA-3-8B-Instruct model
outperformed the original LLaMA-3-8B-Instruct in
emotional support tasks, highlighting the effective-
ness of task-specific fine-tuning for therapeutic com-
munication. However, this specialization resulted in
decreased performance on general medical knowl-
edge tasks, illustrating the trade-off between special-
ization and generalization.
The GPT-4o model proved to be the best per-
former in both general knowledge tasks and the spe-
cialized domain of suicide risk detection, making it
ideal for high-stakes tasks. Meanwhile, GPT-4o-mini
offered a cost-effective alternative for routine tasks,
maintaining competitive performance in less complex
areas.
Improving the system’s efficiency by optimiz-
ing model architectures and exploring model com-
pression techniques will help reduce resource con-
sumption and improve response times. Future ef-
forts should also address regulatory concerns in clin-
ical applications, ensuring compliance with emerg-
ing AI frameworks and providing transparency in AI
decision-making.
Enhancing the RAG process and expanding the
system’s knowledge base in ChromaDB will further
improve the accuracy and relevance of proposed ther-
apies, making the system more robust in real-world
mental health scenarios.
Finally, our framework has a plug-and-play de-
sign that facilitates the use of both small open-source
models and larger, state-of-the-art models. While the
current implementation uses lightweight open source
models to keep the system efficient and accessible, the
framework is compatible with more advanced, pre-
trained models that can be integrated off-the-shelf as
needed. This flexibility allows for scalable model up-
grades, providing a pathway to incorporate more pow-
erful agents as needed without changing the system
architecture. Importantly, the expert agents process
only anonymized data, ensuring that any model re-
placement or upgrade respects user privacy and data
security.
6 CONCLUSION
This paper presents the development and evaluation
of the proof of concept MentalRAG system, an agen-
tic framework for therapeutic support systems, with a
particular focus on suicide risk assessment and psy-
chological counselling. The system’s refined models
exhibited notable enhancements in the provision of
empathetic and contextually tailored responses, par-
ticularly in delicate scenarios such as emotional dis-
tress and crisis management.
The MentalRAG system employs a multi-agent
structure that leverages distinct specialized agents to
handle various aspects of mental health support, from
symptom identification to therapeutic proposal gen-
eration. The architecture of the system begins with
the Conversation Agent, which interacts with users
in real time, followed by the Anonymization Agent
to protect user privacy. Expert agents, including
the Symptoms Detection, Clinical Issues Detection,
and Suicide Detection Agents, provide diagnostic in-
sights by analyzing conversation data. The Report
Agent synthesizes these diagnostics into a cohesive
report, while the Proposal Agent generates personal-
ized therapeutic recommendations. This multi-agent
approach, combined with a human-in-the-loop vali-
dation process, ensures accurate, secure, and ethically
guided support in mental health applications.
Despite its potential, several limitations remain.
The evaluation was conducted by a single psychiatrist; future work should involve additional specialists to reduce the possibility of bias. Also, the system’s “black box”
nature poses a risk in mental health applications, ne-
cessitating further safeguards to ensure safe and em-
pathetic interactions and to mitigate biases and errors. Future work should focus on incorporat-
ing multimodal inputs (e.g., voice, video) to enhance
interaction depth and integrating additional datasets
for domain-specific fine-tuning.
By addressing these areas, the MentalRAG system
can evolve into a more reliable, efficient, and ethically
sound mental health support tool, better equipped to
meet user needs while minimizing risks.
ACKNOWLEDGEMENTS
The authors are deeply grateful to Dr. Daniel Neto,
whose contributions to the expert evaluation were
invaluable. His detailed feedback and specialized
knowledge as a PhD psychiatrist provided critical in-
sights that significantly enhanced the quality and rel-
evance of this project. This work was supported by
national funds through Fundação para a Ciência e a
Tecnologia (FCT) with reference UIDB/50021/2020
and FCT Project WSMART ROUTE+ reference
2022.04180.PTDC.
REFERENCES
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan,
J., Dhariwal, P., Neelakantan, A., Shyam, P., Sas-
try, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A.,
Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen,
M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark,
J., Berner, C., McCandlish, S., Radford, A., Sutskever,
I., and Amodei, D. (2020). Language models are few-
shot learners. In Proceedings of the 34th Interna-
tional Conference on Neural Information Processing
Systems, NIPS ’20, Red Hook, NY, USA. Curran As-
sociates Inc.
Capital, S. (2024). What’s next for AI Agentic Workflows
ft. Andrew Ng of AI Fund.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra,
G., Roberts, A., Barham, P., Chung, H. W., Sutton,
C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko,
S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer,
N., Prabhakaran, V., Reif, E., Du, N., Hutchinson,
B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-
Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat,
S., Dev, S., Michalewski, H., Garcia, X., Misra, V.,
Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan,
D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R.,
Dohan, D., Agrawal, S., Omernick, M., Dai, A. M.,
Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E.,
Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X.,
Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J.,
Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and
Fiedel, N. (2023). Palm: scaling language modeling
with pathways. J. Mach. Learn. Res., 24(1).
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fe-
dus, W., Li, Y., Wang, X., Dehghani, M., Brahma,
S., et al. (2024). Scaling instruction-finetuned lan-
guage models. Journal of Machine Learning Re-
search, 25(70):1–53.
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer,
L. (2023). Qlora: efficient finetuning of quantized
llms. In Proceedings of the 37th International Con-
ference on Neural Information Processing Systems,
NIPS ’23, Red Hook, NY, USA. Curran Associates
Inc.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In North Amer-
ican Chapter of the Association for Computational
Linguistics.
Dubey, A., Jauhri, A., et al. (2024). The llama 3
herd of models. ArXiv, abs/2407.21783.
Huang, K., Altosaar, J., and Ranganath, R. (2019). Clinical-
bert: Modeling clinical notes and predicting hospital
readmission. arXiv preprint arXiv:1904.05342.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C.,
Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel,
G., Lample, G., Saulnier, L., et al. (2023). Mistral 7b.
arXiv preprint arXiv:2310.06825.
Jin, H., Chen, S., Wu, M., and Zhu, K. Q. (2023). Psye-
val: A comprehensive large language model evalu-
ation benchmark for mental health. arXiv preprint
arXiv:2311.09189.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,
and Kang, J. (2019). Biobert: a pre-trained biomedi-
cal language representation model for biomedical text
mining. Bioinformatics, 36(4):1234–1240.
Liu, J. M., Li, D., Cao, H., Ren, T., Liao, Z., and Wu, J.
(2023). Chatcounselor: A large language models for
mental health support. ArXiv, abs/2309.15461.
Liu, Y. (2019). Roberta: A robustly optimized bert pretrain-
ing approach. arXiv preprint arXiv:1907.11692.
National Alliance on Mental Illness (2023). Mental health
by the numbers. https://nami.org/mhstats. Last
checked on 2023-12-12.
OpenAI et al. (2024). Hello gpt-4o. https://openai.com/
index/hello-gpt-4o/. Last checked on 2024-10-12.
OpenAI et al. (2023). GPT-4 technical report. ArXiv, abs/2303.08774.
Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E.
(2024). Gorilla: Large language model connected
with massive APIs. In The Thirty-eighth Annual Con-
ference on Neural Information Processing Systems.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020).
Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of machine learning
research, 21(140):1–67.
Shing, H.-C., Nair, S., Zirikly, A., Friedenberg, M.,
Daumé III, H., and Resnik, P. (2018). Expert, crowd-
sourced, and machine assessment of suicide risk via
online postings. In Loveys, K., Niederhoffer, K.,
Prud’hommeaux, E., Resnik, R., and Resnik, P., ed-
itors, Proceedings of the Fifth Workshop on Compu-
tational Linguistics and Clinical Psychology: From
Keyboard to Clinic, pages 25–36, New Orleans, LA.
Association for Computational Linguistics.
Singh, A., Ehtesham, A., Kumar, S., and Khoei, T. T.
(2024). Enhancing ai systems with agentic workflows
patterns in large language model. In 2024 IEEE World
AI IoT Congress (AIIoT), pages 527–532.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li,
X., Guestrin, C., Liang, P., and Hashimoto, T. B.
(2023). Stanford alpaca: An instruction-following
llama model. https://github.com/tatsu-lab/stanford_alpaca.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro,
E., Azhar, F., et al. (2023a). Llama: Open and ef-
ficient foundation language models. arXiv preprint
arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K. R., et al. (2023b).
Llama 2: Open foundation and fine-tuned chat mod-
els. ArXiv, abs/2307.09288.
Xu, X., Yao, B., Dong, Y., Gabriel, S., Yu, H., Hendler,
J., Ghassemi, M., Dey, A. K., and Wang, D. (2024).
Mental-LLM: Leveraging large language models for
mental health prediction via online text data. Proceed-
ings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies, 8(1):1–32.
Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed,
F., Liu, Z., Liu, C., Zeng, M., and Wang, L. (2023).
Mm-react: Prompting chatgpt for multimodal reason-
ing and action. ArXiv.