An AI-Based Virtual Client for Educational Role-Playing in the Training
of Online Counselors
Eric Rudolph (https://orcid.org/0009-0003-0615-4780), Natalie Engert (https://orcid.org/0009-0001-2493-0208) and Jens Albrecht (https://orcid.org/0000-0003-4070-1787)
Technische Hochschule Nürnberg Georg Simon Ohm, Nürnberg, Germany
Keywords:
Counseling, Chatbot, Large Language Model, Persona, Role-Play.
Abstract:
The paper presents the Virtual Client for Online Counseling (VirCo), a novel system for training online coun-
selors through simulated client interactions. Addressing the rising need for digital communication skills in
counseling, VirCo leverages large language models (LLMs) to create a chatbot that generates realistic re-
sponses from different client personas. The approach complements traditional role-playing methods in aca-
demic training, offering independent practice opportunities without direct supervision. To ensure privacy,
VirCo’s chatbot interface uses an open-source LLM for response generation. The system’s dataset comprises
detailed persona descriptions and transcripts from role-play sessions, contributing to the authenticity of the
training experience. The evaluation of the quality of the conversations utilized both human evaluators and
LLMs. Results show a high degree of coherence and persona alignment in responses, highlighting VirCo’s
effectiveness as a training tool. The paper concludes by showcasing features of the VirCo learning platform,
such as compulsory assignments and multiple feedback mechanisms.
1 INTRODUCTION
Many people seek professional help in personal cri-
sis situations. In this context, online counseling ser-
vices play a vital role. At best, well-trained coun-
selors can clarify the causes of problems in a dialogue
and provide individual assistance. The American 988
crisis hotline, for example, is contacted over 400,000
times per month (Center for Behavioral Health Statis-
tics and Quality, 2023). While telephone counseling
is still dominant, many organizations also support
the growing demand for text-based online counseling
via email or chat.
The use of digital communication through online
platforms presents unique challenges that require on-
line counselors to receive specialized training before
providing counseling to individuals seeking help in
a digital environment. To ensure sufficient hands-
on experience alongside theoretical learning, trainees
should interact with clients in virtual environments
as part of their education (DeMasi et al., 2020).
Nonetheless, in academic settings, managing actual
clients is uncommon. Thus, role-playing is frequently
employed because it offers several advantages over traditional case studies. In addition to providing immediate and immersive experiences that allow participants to embody and understand perspectives other than their own, it improves communication skills and empathy (Mianehsaz et al., 2023; Bharti, 2023; Sai Sailesh Kumar Goothy et al., 2019). Role-playing is notably effective because it immerses individuals in another's perspective, and it provides learners with a safe environment in which no real-world clients can be harmed (Kerr et al., 2021). Despite these advantages, role-playing can be difficult to implement because it requires extensive supervision by trainers and depends on the availability of role-play partners. As a result, course participants are restricted in their ability to practice independently and gain practical experience.
In this paper, we introduce VirCo, the Virtual
Client for Online Counseling, a system to simulate
clients for the training of online counselors (see fig-
ure 1 for a sample dialog). With VirCo, learners are
able to engage autonomously with different clients
and gain initial counseling experiences that are free
from direct trainer involvement. Some effects, such
as the emotional impact of recommended actions on
the client, can only be observed in dialog with a real
person. Thus, VirCo is not intended to replace role-
playing but to complement it.
The idea of creating a chatbot for counselor
training has been explored before, as discussed in sec-
tion 2. Our solution is based on ideas from previous
work, but makes use of the recent advances in natu-
ral language processing with large language models
(LLMs). Thus, the core of the system is a chatbot
which uses an LLM to generate answers. The chatbot
is part of a comprehensive learning platform which
includes functionality to practice with different per-
sonas (problem cases) and to provide automated and
manual feedback to students. To represent the per-
sonas, we created transcripts of role-plays based on
different problem scenarios like drug abuse of a child
or quarrels between parents. The persona descriptions
and the transcripts are used to shape the answers of the
LLM. This paper gives an overview of the architecture of the whole learning platform and presents first evaluation results for the virtual client chatbot. Our contributions are as follows:
- We demonstrate how large language models can be utilized for educational role-play.
- We present a methodology to simulate different personas with large language models.
- We introduce an architecture for a chatbot-based learning platform for educational role-play.
- We evaluate the conversation coherence and persona consistency of a persona-based client chatbot.
Figure 1: Excerpt of an example conversation with the vir-
tual client (VirCo). VirCo simulates a concerned mother
who assumes that her son smokes marijuana.
2 RELATED WORK
The famous ELIZA, developed in 1966, is a pioneer-
ing chatbot that served as the model for many others.
It simulates a therapist by posing and responding to
particular queries from the individual interacting with
it (Weizenbaum, 1966). Since then, many works have
been published in the area of chatbots for counseling,
especially in psychotherapy and mental health; see
(Boucher et al., 2021) and (Xu and Zhuang, 2022)
for an overview. These counseling
chatbots are mainly intended to support diagnostics
and behavior change, or simply to deliver support-
ive content. However, there are ongoing discussions
about the feasibility of using AI-powered chatbots as
a substitute for counselors or therapists.
Thus, several studies have sought to simulate the
character of the client rather than the therapist. These
chatbots are intended to be used for training aspir-
ing professionals. Tanana et al. (2019) introduced
ClientBot, a patient-like chatbot imitating a visitor
in a psychotherapy session. DeMasi et al. made a
significant contribution to the research on chatbots in
counseling contexts through the creation of the Crisis-
bot (DeMasi et al., 2019; DeMasi et al., 2020). Cri-
sisbot simulates a caller to a suicide prevention hot-
line. In order to provide different problem scenar-
ios of clients (personas), Crisisbot also introduced a
multi-task training framework to construct persona-
specific responses (DeMasi et al., 2020). An overview
of persona-based conversational AI was given in (Liu
et al., 2022).
One approach to provide consistent answers is to
retrieve candidate responses from a corpus of proto-
type conversations as input for the generation of the
actual response (Tanana et al., 2019; DeMasi et al.,
2019; Liu et al., 2022). To further improve the re-
trieval, several works used an utterance classifier to
predict the type of the next response (Cao et al., 2019;
Park et al., 2019; DeMasi et al., 2020). Genera-
tive models like Seq2Seq were then used to generate
responses based on the selected candidates (Tanana
et al., 2019; DeMasi et al., 2020; Liu et al., 2022).
Generating responses that are both consistent with a given persona and coherent with the course of the conversation is a crucial prerequisite for a realistic learning experience in such a setting. This is a challenging requirement, because the conversations are basically open-domain and do not follow a clear structure. The state of the art in 2019/20 did not produce satisfying results in this regard: the models sometimes generated unrealistic, distracting or irrelevant responses (Tanana et al., 2019) or were not reliably consistent (DeMasi et al., 2020). In Crisisbot, the
generated responses were also generally shorter than
the real responses extracted from the corpus (DeMasi
et al., 2020). This had a negative impact on the learn-
ing experience of aspiring counselors.
The situation changed with the rapid development of large language models since the launch of ChatGPT. LLMs are very good at generating coherent dialogues. A typical drawback, hallucination, might even be beneficial in our setting, since invented details can plausibly enrich a fictional client's story. Lee et al. introduced a methodology to prompt LLMs for long open-domain conversations utilizing few-shot in-context learning and chain-of-thought (Lee et al., 2023).
Chen et al. showed that LLMs can effectively be
used for counselor and client simulation without fine-
tuning (Chen et al., 2023). Our approach combines
the power of LLMs with ideas from the discussed pre-
vious work on client-simulating counseling chatbots.
3 DATASET
The basis for the Virtual Client evaluation is a dataset
consisting of two parts: the persona descriptions and,
for each persona, a set of simulated conversations. The
persona descriptions were created by domain experts
based on documented email counseling sessions and
public forum posts. To simulate a variety of counseling
settings, we have so far created seven persona descrip-
tions. A key aspect of these personas is the defini-
tion of a main concern, which represents the client’s
motivation for using chat counseling. In particular,
problems in the area of addiction counseling or ed-
ucational counseling, such as the following problem
description, were defined:
“Elke is worried about her 16-year-old son
Lukas, who is in the 10th grade. She suspects
that he is using drugs or smoking marijuana
because of his circle of friends. He was
generally not a bad student, but unfortunately
his grades have deteriorated recently. She has
already tried to talk to Lukas about it, both
about his school performance and his drug
use, but he keeps blocking it and doesn’t want
to talk about it. Elke suspects that her son’s
changed behavior is due to his circle of
friends, as they have a bad influence on him.”
Additionally, the linguistic characteristics of the
personas are carefully considered to capture the com-
munication preferences and styles of the clients.
(Note: All persona descriptions, conversations, and prompts were originally created in German, as the Virtual Client is intended for German-speaking students. We translated the examples into English for this publication.)
Online counselor trainees were then asked to use
these persona descriptions to role-play chat coun-
seling conversations. The role-plays lasted approxi-
mately one hour, based on the typical length of chat
counseling training sessions. Further dataset statistics
can be found in Table 1.
Table 1: Dataset statistics.
Dataset component Count
Number of conversations 56
Average messages per conversation 40.39
Counselor messages 1125
Client messages 1137
The data was collected in four phases, with the
participation of different trainees. Although all con-
versations relate to the defined persona descriptions,
the actual conversations include some variations. For
instance, simulated clients contacted the wrong coun-
seling centre and had to be referred to the correct cen-
tre by the counselor. The writing style also varies sig-
nificantly. During counseling sessions, clients may
exhibit varying levels of cooperation. Some may pro-
vide detailed answers to the counselor’s questions,
while others may be uncooperative, providing brief
responses or struggling to articulate their problems
clearly.
4 SYSTEM OVERVIEW
The Virtual Client can be accessed by online coun-
selor trainees through a software portal which in-
cludes a learning management system and an inter-
face for answer generation (Figure 2).
VirCo Learning Platform. The VirCo Learning
Platform presents itself to trainees as a professional
chat interface for online counseling, similar to what a
trained counselor might use for counseling via a com-
puter or mobile phone. The unique aspect of this plat-
form is that trainees interact with a simulated client,
programmed to respond to their statements using a
large language model. The learning platform incor-
porates gamification elements to improve the learning
experience for users. The platform itself is further de-
scribed in section 6.
VirCo Bot. After a learner sends a message to the
client on the learning platform, it is forwarded to
the response generation interface. Due to privacy
concerns, we prefer not to use cloud-based LLMs
like ChatGPT. Instead, the open-source LLM Vicuna-
Figure 2: Overview of the VirCo architecture, which com-
prises two independent components: the learning platform
and the chatbot interface. These components can function
separately and communicate with each other via REST.
13B-1.5 (Chiang et al., 2023) is used to generate re-
sponses based on the previously described personas
and dataset. The model is hosted on a private High
Performance Computing (HPC) cluster. However,
switching to OpenAI models for evaluation purposes
is also possible.
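As Figure 2 indicates, the two components communicate via REST. The following is a minimal sketch of what the chatbot interface's generation endpoint could look like, assuming a FastAPI service; the endpoint path, field names, and helper function are our own illustrations, not VirCo's actual API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    persona_id: str     # which persona description to role-play
    chat_history: str   # the logged conversation so far

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    # In the real system, the role-play prompt is assembled from the persona
    # description and the chat history and sent to the hosted Vicuna model.
    answer = run_llm(req.persona_id, req.chat_history)
    return {"answer": answer}

def run_llm(persona_id: str, chat_history: str) -> str:
    raise NotImplementedError  # stands in for the LLM call sketched below
```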
The Vicuna 13B model was selected because a preliminary test showed that it fulfills the following requirements:
- It is able to carry out role-plays and respond appropriately to the previous conversation.
- It is open source and can therefore be hosted in a private environment.
- With 13 billion parameters, it is comparatively small for the described capabilities.
- It has an understanding of the German language and is able to respond in German.
For prompting the model, a role-play prompt was
created, which contains a role assignment, a place-
holder for the description of the persona and a place-
holder for a conversation history. Role-play prompting
often enhances an LLM's capabilities (Shana-
han et al., 2023; Kong et al., 2023). The structure
of such prompts often follows a pattern in which the
persona of the character is described, followed by the
inclusion of the conversation history, as shown in the
following example in the context of online counsel-
ing:
“Pretend you are {name}. {name} is cur-
rently in counseling and is writing with
her/his social counselor. Give statements that
{name} would give. {problem description}
Chat history: {chat history} {name} answers
briefly and concisely”
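A minimal sketch of how this template could be assembled in Python follows; the variable names and data structures are our own, and the actual template is in German and may differ in detail.

```python
# Illustrative reconstruction of the role-play prompt assembly.
ROLE_PLAY_TEMPLATE = (
    "Pretend you are {name}. {name} is currently in counseling and is "
    "writing with her/his social counselor. Give statements that {name} "
    "would give. {problem_description}\n"
    "Chat history: {chat_history}\n"
    "{name} answers briefly and concisely"
)

def build_prompt(persona: dict, history: list[tuple[str, str]]) -> str:
    """Fill the placeholders with persona data and the logged conversation."""
    # Serialize the history as alternating "Client:"/"Counselor:" turns and
    # end with "Client:" so the model continues in the client role.
    chat_history = " ".join(f"{speaker}: {text}" for speaker, text in history)
    chat_history += " Client:"
    return ROLE_PLAY_TEMPLATE.format(
        name=persona["name"],
        problem_description=persona["problem_description"],
        chat_history=chat_history,
    )
```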
The placeholders in the prompt are then replaced by
the persona (e.g. Elke from above) and conversation-
specific data. The name and problem description are
derived from the persona description (see section 3),
while the conversation history is retrieved from the
database. An example of a conversation history was
already shown in Figure 1. For generation, the maximum token length was set to 256, as a message in the chat should not be too long, and a stop sequence was added that interrupts generation at the phrase “Counselor:”, since the language model would otherwise often unintentionally generate the counselor's next reply as well. An example replacement for the placeholder chat history from our dataset looks like this:
Client: Hello, I’m Elke and my child Lukas is
taking drugs ... Can you help me here? Coun-
selor: Hello Elke. Great that you got in touch.
I’m Marie and I’m a counselor. I’d be happy
to try and help you, but first I’d like to clarify
the technical and organizational framework
with you. Is that okay? Client:
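Putting this together, the generation call could look roughly as follows, assuming the self-hosted model is served behind an OpenAI-compatible completion endpoint; the URL and model identifier are placeholders, not the authors' actual configuration.

```python
import requests

API_URL = "https://hpc.example.org/v1/completions"  # placeholder endpoint

def generate_client_reply(prompt: str) -> str:
    response = requests.post(API_URL, json={
        "model": "vicuna-13b-v1.5",
        "prompt": prompt,
        "max_tokens": 256,        # chat messages should not be too long
        "stop": ["Counselor:"],   # keep the model from writing the counselor's reply
    }, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"].strip()
```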
Once the response has been generated by the LLM, it
is sent back to the learning platform and displayed to
the user. All conversations are logged to the database
as input for the feedback module.
5 EVALUATION
For the purpose of evaluation, we segmented each of the 56 conversations into individual conversational turns. Initially, we analyzed the first two messages if the conversation was initiated by the client (as illustrated at the end of section 4), or the opening message if it was initiated by the counselor. Subsequently, we incrementally included additional turns. To illustrate, the initial sequence comprises alternating turns between client and counselor, progressively building up to sequences involving multiple exchanges (e.g., Client - Counselor - Client - Counselor). This process resulted in a dataset comprising 1,125 entries; that is, we generated an answer for each counselor's message. The data was then rated by human evaluators as well as automatically by GPT-3.5 and GPT-4 on two tasks: dialogue coherence (task 1) and persona consistency (task 2). We used GPT-3.5 and GPT-4 because they often achieve competitive correlation with human judgment on natural language generation tasks (Wang et al., 2023). The task for dialogue coherence was to evaluate whether the generated answer fits the conversation history; the task for consistency was to evaluate whether the generated answer fits the persona. The integration of human and LLM evaluations serves distinct purposes:
LLM-Empowered Evaluation: enables the analysis of large datasets with consistency and efficiency. LLMs can process extensive conversations, identifying patterns and metrics at scale. Since conventional metrics such as ROUGE or BLEU do not provide reliable results for natural language generation tasks, LLMs have become the standard for the automatic evaluation of dialogues (Chang et al., 2023, p. 26), e.g., in (Lin and Chen, 2023; Liu et al., 2023).
Human Evaluation: often provides a more robust result with a nuanced understanding of conversational context and emotional undertones. Reliable human evaluators are essential for identifying complex conversational dynamics in the context of counseling sessions that AI might overlook.
Conducting both evaluation approaches enables more comprehensive insight. In addition, the two approaches are compared to assess their degree of overlap and to analyze the differences. We start by explaining the automatic evaluation procedure, as the human evaluators received essentially the same instructions as the LLMs.
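To make the segmentation procedure described at the beginning of this section concrete, the following sketch shows how the evaluation entries could be constructed; the data structures are our own assumptions.

```python
def build_evaluation_prefixes(conversation: list[tuple[str, str]]) -> list[list[tuple[str, str]]]:
    """Return one conversation prefix per counselor message; for each
    prefix, a client answer is generated and then rated."""
    prefixes = []
    for i, (speaker, _) in enumerate(conversation):
        if speaker == "Counselor":
            prefixes.append(conversation[: i + 1])  # history up to this counselor turn
    return prefixes

# Example: a Client-Counselor-Client-Counselor conversation yields two
# prefixes; applied to all 56 conversations, this procedure yields the
# 1,125 evaluation entries reported above.
```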
5.1 Automatic Evaluation by LLMs
The automatic evaluation was carried out with the OpenAI models “GPT-3.5 1106 Turbo” (hereafter abbreviated as GPT-3.5) and “GPT-4 1106 Preview” (hereafter abbreviated as GPT-4). The evaluation prompts are inspired by G-Eval, a framework for the evaluation of natural language generation tasks (Liu et al., 2023). Both prompts begin with a task description of the evaluation, followed by task-specific evaluation criteria and a scoring description, and conclude with conversation-specific information.
5.1.1 Task 1: Conversation History Rating
The following prompt is used for the automated generation of ratings and evaluations with GPT. The first section familiarizes the model with the instruction. Since GPT tends to evaluate the course of the conversation as a whole instead of just the generated response in the context of the conversation, this is explicitly described here:
“The following conversation shows a chat
counseling session between a client and a
counselor. The context of the chat between
these two people is online social counseling.
Your task is to evaluate to what extent the gen-
erated message matches the previous conver-
sation. Please evaluate only whether the gen-
erated message matches the previous conver-
sation.”
The next part outlines the criteria for evaluation. The main focus lies on Coherence with the conversation history, ensuring the chat maintains logical consistency; Content Accuracy, requiring that the answers are correct and relevant to the counselor's message; and Flow of the Conversation, emphasizing the need for a natural and uninterrupted progression of the chat without sudden topic shifts or confusing elements.
“Evaluation criteria:
Coherence: The chat should be coherent.
Content accuracy: The content of the answers must be correct and appropriate to the counselor's message.
Flow of the conversation: The continuation of the chat should have a natural and smooth flow, without abrupt changes of topic or confusing contexts. Please keep in mind that this is a chat. Short, precise and direct answers can occur here.”
The third section provides a scoring system for the evaluation. A score of 0 signals a fully coherent response that is contextually accurate and flows well with the previous conversation. A score of 1 indicates basic coherence and content alignment but with minor grammatical or spelling errors. A score of 2 denotes a lack of coherence, confusion, or repetitive words or sentences, indicating a mismatch with the ongoing conversation. The human evaluators rate these scores on a website with green (0: fits well), yellow (1: fits moderately) and red (2: does not fit) radio buttons.
“Score rating:
0: The generated answer is coherent with the
previous conversation. The content of the gen-
erated answer is correct and the flow of the
conversation is appropriate to the previous
course.
1: The generated answer is basically coher-
ent with the course of the conversation so far
and the content matches the course of the con-
versation, but the generated answer contains
grammatical errors or spelling mistakes
2: The generated answer does not match the
course of the conversation so far. It is con-
fused or there are many repetitions of words
or sentences”
The following section outlines the structure for conducting the evaluation. It states that the evaluation is based on the previously mentioned criteria. The format includes placeholders for the course of the conversation (history) and the generated response (answer) from the LLM. Note that the text in curly brackets serves as placeholders for these values.
“The evaluation is based on the evaluation criteria.
Course of the conversation:
{history}
Generated response:
{answer}”
The last block specifies the format in which the eval-
uation results are to be displayed. This is added for
post-processing purposes.
“Please structure your answer as JSON with the attributes rating and reason. Output only the JSON format:”
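A sketch of the automatic rating call is shown below, assuming the official OpenAI Python SDK; the prompt constant abbreviates the full prompt quoted above, and the temperature setting is our assumption.

```python
import json
from openai import OpenAI

# Abridged version of the task 1 prompt quoted above; {history} and {answer}
# are the placeholders described in the text.
COHERENCE_PROMPT = (
    "The following conversation shows a chat counseling session between a "
    "client and a counselor. [... evaluation criteria and score rating as "
    "quoted above ...]\n"
    "Course of the conversation:\n{history}\n"
    "Generated response:\n{answer}\n"
    "Please structure your answer as JSON with the attributes rating and "
    "reason. Output only the JSON format:"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_coherence(history: str, answer: str, model: str = "gpt-4-1106-preview") -> dict:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": COHERENCE_PROMPT.format(history=history, answer=answer)}],
        response_format={"type": "json_object"},  # enforce the requested JSON output
        temperature=0,  # assumed setting for reproducible ratings
    )
    return json.loads(completion.choices[0].message.content)  # {"rating": ..., "reason": ...}
```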
5.1.2 Task 2: Persona Consistency Rating
The second task is centered on evaluating the consistency of a generated answer with a predefined character persona. The prompt, akin to the first one, shifts focus towards assessing persona consistency. The differences between this prompt and the previous one lie in the task description, the evaluation criteria, and the score descriptions, quoted below:
“The following conversation shows a chat
counseling session between a client and a
counselor. The context of the chat between
these two people is online social counseling.
Your task is to evaluate whether the character described would write the generated response.”
Persona consistency is a direct evaluation criterion. An indirect way to evaluate it is to assess the flow of the conversation, which may extend beyond the persona description but should still follow a consistent problem case:
“Evaluation criteria:
Character consistency: The content of the answers must be consistent with the character description or expand on it in a meaningful and realistic way.
Flow of conversation: The continuation of the chat should be natural and appropriate to the character described. Please keep in mind that this is a chat. Short, precise and direct answers can occur here.”
The score descriptions were also adjusted for task two:
“Score:
0: The generated answer is very realistic and is consistent with the character’s description or expands on it in a meaningful way
1: The generated answer fits the character, but the person would probably express themselves differently based on the character description and previous history
2: The generated answer does not match the character. The person would definitely not express themselves in this way”
After the score descriptions, a character description is added. To prevent the model from rating the generated answer only on the basis of the course of the conversation, a sentence was added requesting that only the relation to the character description be rated. The rest remains the same:
“The evaluation is based on the evaluation criteria.
Character description:
{personality condition}
Course of the conversation:
{history}
Generated answer:
{answer}
Please structure your answer as JSON with
the attributes Rating and Reason. Please only
evaluate the generated answer in relation to
the character description and the course of
the conversation. Only output the JSON for-
mat:”
5.2 Manual Evaluation by Humans
The manual evaluation was carried out by five evaluators with an academic background in the social sciences (hereinafter referred to as raters). It relies on the two tasks outlined in the previous section. A specialized web application was created for the rating, which enabled evaluators to rate dialogues based on the criteria outlined in section 5.1. Evaluators can view dialogue histories, read task descriptions and generated answers, and score each task using a set of radio buttons. Additionally, for the second task, a persona description was provided on the right-hand side for reference. To ensure comparability, all raters were provided with the same conversations.
5.3 Results
5.3.1 Results of Task 1
Table 2 provides a comparison of the evaluation scores assigned by GPT-3.5 and GPT-4, alongside the five individual raters. These scores are presented as percentages, reflecting the frequency with which each rater deemed the generated responses coherent with the preceding discourse. Additionally, the table includes the consensus assessment derived through a majority vote among raters 1 through 5. The data illustrates general agreement among all parties, both human raters and AI, that at least 89% of the generated answers maintained coherence (label 0) with the prior course of the conversation.
Table 2: Comparative Rating Percentages by GPT Versions
and Raters for task 1.
Rater Label 0 Label 1 Label 2
GPT-3.5 91.91 2.22 5.86
GPT-4 89.59 0.81 9.61
Rater 1 89.26 5.8 4.94
Rater 2 89.51 5.57 4.93
Rater 3 89.81 5.81 4.39
Rater 4 95.92 4.08 0
Rater 5 92.38 3.52 4.11
Majority (1-5) 92.01 3.54 4.45
Figure 3 shows the pairwise percentage agreement between all raters, including the GPT models, for each label on task 1 in the form of a heatmap. Each cell in the heatmap represents the pairwise agreement between two raters (e.g., User 1 and User 2, or User 1 and GPT-3.5) for the respective label. We used a Jaccard-
like metric to evaluate the agreement: let $\mathit{Resp}_l(r)$ denote the set of responses that rater $r$ has rated with label $l$. The agreement $A_l(r_1, r_2)$ between two raters $r_1$ and $r_2$ with regard to label $l$ is then calculated by:

$$A_l(r_1, r_2) = \frac{|\mathit{Resp}_l(r_1) \cap \mathit{Resp}_l(r_2)|}{|\mathit{Resp}_l(r_1) \cup \mathit{Resp}_l(r_2)|}$$
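In code, this metric can be computed per label from the raters' label assignments, for example as follows; the data structures are our own assumptions.

```python
import math

def agreement(ratings: dict[str, dict[int, int]], r1: str, r2: str, label: int) -> float:
    """Jaccard-like agreement A_l(r1, r2): responses both raters gave label l,
    divided by responses at least one of them gave label l."""
    resp1 = {rid for rid, lab in ratings[r1].items() if lab == label}
    resp2 = {rid for rid, lab in ratings[r2].items() if lab == label}
    union = resp1 | resp2
    return len(resp1 & resp2) / len(union) if union else math.nan
```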
The diagonal cells, which are all 100%, represent the agreement of each rating entity with itself. For label 0 (”coherent”), there are high levels of agreement among all entities, mostly above 87%, indicating that the users and AI models tend to agree on this classification. The second heatmap, for label 1 (”moderately coherent”), shows very low agreement levels, indicated by cooler colors like blue and light red. Most values are below 20%, suggesting a high level of disagreement or variability in how label 1 is classified. The agreement on label 2 (”not coherent”) ranges from 14.3% to 55.3%.
All in all, there is little difference between the human raters and the GPT models. While raters often agree when a generated answer is coherent with the respective course of the conversation, they disagree about which answers are only moderately coherent and only occasionally agree when a generated answer does not fit the course of the conversation at all.
5.3.2 Results of Task 2
Table 3 summarizes the percentage of each label as-
signed by GPT-3.5, GPT-4, the five human raters, and
the majority decision among human raters. Similar
to task 1, raters 1 to 5 and GPT-4 predominantly as-
signed label 0 (”consistent with the persona”). The
exception here is GPT-3.5, because it often takes the persona description very literally and argues that an upset persona responds directly with an emotional answer about the problem, which is not necessarily the case in a chat counseling session. For example, the generated answer “yes that's okey” to the conversation depicted at the end of section 4 was rated by GPT-3.5 with label 2 for the following reason:
“The generated answer does not match the
character. Elke is very worried about her
son and suspects that he is taking drugs. She
would therefore probably not simply answer
’Yes, that’s okay’, but would rather talk about
her worries or ask for specific help.”
All other raters, including GPT-4, considered the gen-
erated answer to be consistent with the persona. GPT-
4’s reasoning was as follows:
“The generated answer is coherent with the
previous course of the conversation. The
client agrees with the technical and organiza-
tional framework, which is a logical and ap-
propriate next step in the conversation. There
are no grammatical or content errors, and the
flow of the conversation remains natural and
coherent.”
Table 3: Comparative Rating Percentages by GPT Versions
and Raters for task 2.
Rater Label 0 Label 1 Label 2
GPT-3.5 38.27 13.4 48.32
GPT-4 89.59 0.81 9.61
Rater 1 92.22 3.58 4.2
Rater 2 93.15 5.78 1.07
Rater 3 96.39 0.52 3.1
Rater 4 95.92 3.86 0.21
Rater 5 93.99 2.35 3.67
Majority (1-5) 94.34 2.53 3.13
Figure 4 shows the heatmaps for task 2. The left-hand map again shows a high percentage of agreement (mostly in red shades, indicating percentages from the high 80s to 100%) among raters 1 to 5 and GPT-4 for label 0, which means there is general consensus that the answers fit the persona description well. The agreement with GPT-3.5 is significantly lower, with percentages ranging from around 31.8% to 42.5%. This is in line with Table 3.
In summary, the responses of the Vicuna model in the virtual client were assessed as both coherent and consistent in approximately 90% of cases. In the remaining cases, the raters often disagreed with one another.
Figure 5: Userflow diagram of the learning platform.
6 LEARNING PLATFORM
The evaluated virtual client chatbot is the core of a
comprehensive learning platform. In section 4 we already introduced the overall architecture and the interaction between the various components. This section focuses on the learning platform in the upper part of figure 2. To give an overview of its functionality, figure 5 shows a userflow diagram. After login, course enrollment and course selection, which are common practice in a learning platform, an overview of the compulsory tasks and voluntary exercises completed so far with the virtual clients is displayed.
While a compulsory task involves a chat with a
persona selected by the trainer and serves as an exam-
ination, the exercise serves as a voluntary opportunity
to improve one’s chat counseling skills and as prepa-
ration for the compulsory task. In the latter case, the
student can choose the persona from a number of per-
sonas pre-selected by the trainer. Optionally, a techni-
cal difficulty can also be configured, for example that
the persona has internet problems.
During the chat session, a note field offers the opportunity to make notes on each message. This is intended to improve the student's ability to reflect. On the left-hand side, users have the possibility to reflect on the chat and rate the AI-generated messages (thumbs up/down), which helps to further improve the VirCo architecture in the future. Various options for requesting or generating feedback are displayed on the right-hand side:
Request Trainer Feedback. To receive feedback from the trainer, a student can actively request it by clicking a button. The trainer can configure, for each course, how much feedback he or she would like to give per student.
Request Peer Feedback. To encourage student participation in providing feedback, our approach employs a coin-based incentive system (see the sketch after this list). Initially, students are allocated a specific number of coins. Requesting feedback costs one coin, which is replenished by providing feedback to peers. This cycle aims to foster continuous engagement and contribution.
AI-Generated Feedback. AI-generated feedback
is another promising attempt to improve counsel-
ing skills. The advantage here is that the feedback
is provided directly after the counseling session.
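The coin mechanic behind the peer feedback option can be summarized in a few lines. This is an illustrative sketch, not the platform's actual implementation, and the initial allocation of three coins is our assumption.

```python
class CoinAccount:
    """Coin-based incentive: requesting feedback costs a coin,
    providing feedback earns it back."""

    def __init__(self, initial_coins: int = 3):  # initial allocation assumed
        self.coins = initial_coins

    def request_peer_feedback(self) -> None:
        if self.coins < 1:
            raise ValueError("No coins left; provide feedback to earn more.")
        self.coins -= 1

    def provide_peer_feedback(self) -> None:
        self.coins += 1
```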
7 SUMMARY AND FUTURE WORK
In this study, we explore the application of large
language models (LLMs) in the realm of education,
specifically focusing on their potential for facilitat-
ing educational role-play to improve the training of
online counselors. We introduce a comprehensive
methodology designed to enable LLMs to simulate
diverse personas. Our work includes the development
of a novel architecture for a chatbot-based platform
specifically tailored for educational role-play. This
platform leverages the capabilities of LLMs to de-
liver interactive learning experiences and personal-
ized feedback opportunities.
We also evaluated the persona consistency and conversation coherence of the client chatbot. The input data for the natural language generation task was created through chat counseling role-plays in which one person played the client and one person played the counselor. The evaluation showed that the answers generated by the model are realistic, generally correspond to the course of the conversation, and are consistent with the persona. However, it also showed that the human raters disagree about when a generated answer does not match the course of the conversation. It is therefore not surprising that GPT-4 and GPT-3.5 do not agree with the human raters here either. Similar results were obtained when measuring persona consistency, with the difference that GPT-3.5 shows strong deviations from the other raters and from GPT-4.
Future improvements could concentrate on ex-
panding the diversity and complexity of the per-
sonas and scenarios included in the dataset to cover
a broader range of counseling situations. This expan-
sion would demonstrate the system’s adaptability and
improve its utility as a practical training tool. Addi-
tionally, exploring how the virtual client responds to
non-serious counselor messages could unveil limita-
tions and inform necessary adjustments to its archi-
tecture, ensuring effective responses across a wider
range of interaction types.
Another way to improve the virtual client is to compare different LLMs in this scenario. For this, a ranking task could be used instead of a rating
task, as comparing different answers is often easier
for both humans and language models than evaluat-
ing a single answer. The ranking can then be used
as a preference dataset to further improve the virtual
client through Direct Preference Optimization or Re-
inforcement Learning from Human Feedback.
We also described the feedback functionality of
the learning platform, but this has not yet been evalu-
ated. In addition, we want to delve deeper into the
automatic generation of feedback and compare different feedback methods, such as the “Situation, Behavior, Impact” method or the “sandwich” method, with different LLM architectures. It is currently not possible
to add further personas in the learning platform front
end. We plan to add such functionality, whereby the
course trainer is asked specific questions and a per-
sona is then created on this basis. This will make it
very easy to use the learning platform in areas other
than online counseling.
In-depth research on the long-term impact of
training with the platform will provide valuable in-
sights into its efficacy, benefits, and limitations. An
additional area of interest is the exploration of user
interaction with the system, with a particular focus
on user experience and interface design. Optimizing
these aspects could significantly enhance engagement
and the overall effectiveness of training sessions. Fi-
nally, given the rapid development of LLMs, ongoing
updates and comparisons with new models will en-
sure that the platform remains at the forefront of tech-
nology, continuously improving its realism and effec-
tiveness as a training tool.
REFERENCES
Bharti, R. K. (2023). Contribution of Medical Education
through Role Playing in Community Health Promo-
tion: A Review. Iranian Journal of Public Health.
Boucher, E. M., Harake, N. R., Ward, H. E., Stoeckl, S. E.,
Vargas, J., Minkel, J., Parks, A. C., and Zilca, R.
(2021). Artificially intelligent chatbots in digital men-
tal health interventions: A review. Expert Review of
Medical Devices, 18(sup1):37–49.
Cao, J., Tanana, M., Imel, Z. E., Poitras, E., Atkins, D. C.,
and Srikumar, V. (2019). Observing Dialogue in Ther-
apy: Categorizing and Forecasting Behavioral Codes.
arXiv:1907.00326.
Center for Behavioral Health Statistics and Quality (2023).
2023 National Survey on Drug Use and Health (NS-
DUH): Prescription drug images for the 2023 ques-
tionnaires. Technical report, Substance Abuse and
Mental Health Services Administration.
Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu,
K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W.,
Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., and Xie,
X. (2023). A survey on evaluation of large language
models. arXiv:2307.03109.
Chen, S., Wu, M., Zhu, K. Q., Lan, K., Zhang, Z., and Cui,
L. (2023). LLM-empowered Chatbots for Psychiatrist
and Patient Simulation: Application and Evaluation.
arXiv:2305.13614.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H.,
Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E.,
Stoica, I., and Xing, E. P. (2023). Vicuna: An open-
source chatbot impressing gpt-4 with 90%* chatgpt
quality. https://lmsys.org/blog/2023-03-30-vicuna/.
DeMasi, O., Hearst, M., and Recht, B. (2019). Towards
augmenting crisis counselor training by improving
message retrieval. In Proceedings of the Sixth Work-
shop on Computational Linguistics and Clinical Psy-
chology, pages 1–11, Minneapolis, Minnesota. As-
sociation for Computational Linguistics.
DeMasi, O., Li, Y., and Yu, Z. (2020). A multi-persona
chatbot for hotline counselor training. In Find-
ings of the Association for Computational Linguistics:
EMNLP 2020, pages 3623–3636, Online. Associa-
tion for Computational Linguistics.
Kerr, A., Strawbridge, J., Kelleher, C., Barlow, J., Sulli-
van, C., and Pawlikowska, T. (2021). A realist eval-
uation exploring simulated patient role-play in phar-
macist undergraduate communication training. BMC
Medical Education, 21(1):325.
Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., and
Zhou, X. (2023). Better zero-shot reasoning with role-
play prompting. arXiv:2308.07702.
Lee, G., Hartmann, V., Park, J., Papailiopoulos, D., and
Lee, K. (2023). Prompted LLMs as Chatbot Modules
for Long Open-domain Conversation. In Findings of
the Association for Computational Linguistics: ACL
2023, pages 4536–4554, Toronto, Canada. Associa-
tion for Computational Linguistics.
Lin, Y.-T. and Chen, Y.-N. (2023). Llm-eval: Uni-
fied multi-dimensional automatic evaluation for open-
domain conversations with large language models.
arXiv:2305.13711.
Liu, J., Symons, C., and Vatsavai, R. R. (2022). Persona-
Based Conversational AI: State of the Art and Chal-
lenges. In 2022 IEEE International Conference on
Data Mining Workshops (ICDMW), pages 993–1001. IEEE.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C.
(2023). G-eval: Nlg evaluation using gpt-4 with better
human alignment. arXiv:2303.16634.
Mianehsaz, E., Saber, A., Tabatabaee, S. M., and Faghihi,
A. (2023). Teaching Medical Professionalism with
a Scenario-based Approach Using Role-Playing and
Reflection: A Step towards Promoting Integration of
Theory and Practice. Journal of Advances in Medical
Education and Professionalism, 11(1).
Park, S., Kim, D., and Oh, A. (2019). Conversation model
fine-tuning for classifying client utterances in counsel-
ing dialogues. In Proceedings of the 2019 Conference
of the North, pages 1448–1459. Association for Com-
putational Linguistics.
Sai Sailesh Kumar Goothy, Sirisha D, and Movva Swathi
(2019). Effectiveness of Academic Role-play in Un-
derstanding the Clinical Concepts in Medical Educa-
tion. International Journal of Research in Pharma-
ceutical Sciences, 10(2):1205–1208.
Shanahan, M., McDonell, K., and Reynolds, L.
(2023). Role-Play with Large Language Mod-
els. arXiv:2305.16367.
Tanana, M., Soma, C., Srikumar, V., Atkins, D., and Imel,
Z. (2019). Development and evaluation of clientbot:
Patient-like conversational agent to train basic coun-
seling skills. Journal of Medical Internet Research, 21(7).
Wang, J., Liang, Y., Meng, F., Sun, Z., Shi, H., Li, Z., Xu,
J., Qu, J., and Zhou, J. (2023). Is chatgpt a good nlg
evaluator? a preliminary study. arXiv:2303.04048.
Weizenbaum, J. (1966). Eliza—a computer program for
the study of natural language communication between
man and machine. Communications of the ACM, 9(1):36–45.
Xu, B. and Zhuang, Z. (2022). Survey on psychotherapy
chatbots. Concurrency and Computation: Practice
and Experience, 34(7):e6170.