PARL: A Dialog System Framework with Prompts as Actions for Reinforcement Learning

Tao Xiang ᵃ, Yangzhe Li, Monika Wintergerst ᵇ, Ana Pecini, Dominika Młynarczyk and Georg Groh ᶜ

Department of Informatics, Technical University of Munich, Munich, Germany

ᵃ https://orcid.org/0000-0001-6217-6560
ᵇ https://orcid.org/0000-0002-9244-5431
ᶜ https://orcid.org/0000-0002-5942-2297
Keywords:
Open-Domain Dialog Systems, Prompting, Reinforcement Learning, Conversational AI.
Abstract:
The performance of most current open-domain dialog systems is limited by their (training) dialog corpora, as both generation-based and retrieval-based learning patterns depend on them. To circumvent this limitation, we propose
PARL, an open-domain dialog system framework using Prompts as Actions for Reinforcement Learning.
This framework requires a (fixed) open-domain dialog system as the backbone and trains a behavior policy
using reinforcement learning to guide the backbone system to respond appropriately with respect to a given
conversation. The action space is defined as a finite set of behaviors in the form of natural language prompts.
Preliminary results show that with the guidance of the behavior policy, the backbone system could generate
more engaging and empathetic responses.
1 INTRODUCTION
Open-domain dialog systems are a popular natural language processing (NLP) research topic because of their po-
tential in real-life applications, such as Google Meena
(Adiwardana et al., 2020) or Facebook Blenderbot
(Roller et al., 2021). Current methods for open-
domain dialog systems can be generally catego-
rized into retrieval-based and generation-based meth-
ods, where both require high-quality dialog corpora:
Retrieval-based systems need a pre-collected paired
conversation dataset for retrieving responses, while
generation-based systems need a large amount of
training data for supervised learning (Ni et al., 2022).
Therefore, the performance of dialog systems de-
pends heavily on the quality of the dialog corpus and
in theory, it is difficult for dialog systems trained with
these methods to exceed the quality of the training set.
Inspired by reinforcement learning (RL) appli-
cations surpassing human performance, such as Al-
phaGo Zero (Silver et al., 2017), our research objec-
tive in this work is to explore whether RL can further improve the performance of dialog systems beyond the quality of the training set, or even to a human-like level. To this end, we train a behavior policy
with RL that decides which system action to perform according to the current dialog history. The sys-
tem actions are defined as general human behaviors
during one-on-one conversations in the form of nat-
ural language prompts, such as “greeting the other”
or “comforting the other”. After the system action is
confirmed, it is fed together with the dialog history
to a fixed dialog system, which then generates a re-
sponse.
In this position paper, we introduce PARL, an open-domain dialog system framework using Prompts as Actions for Reinforcement Learning.¹ It
is considered a framework because the definition of
the action space, the backbone dialog system, and the
training of the policy network can all be modified in
future work. Its general pipeline is given as follows (a minimal code sketch of it is shown after the list):
1. Define the actions as natural language prompts.
An example is “comfort me”.
2. Train or use a pre-trained open-domain dialog
system as the fixed backbone.
3. Train a policy network that maps dialog history to
actions with reinforcement learning.
4. Feed the dialog history and action prompt to the
backbone to generate responses.
¹ Code and models: https://github.com/TUM-NLPLab-2022/PARL-A-Dialog-System-Framework-with-Prompts-as-Actions-for-Reinforcement-Learning
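To make this pipeline concrete, the following minimal Python sketch shows how the four steps could fit together at inference time. All names (encode_history, policy, to_prompts, backbone) are illustrative assumptions and not the released implementation.

# Minimal sketch of the PARL inference pipeline; every callable below is a
# stand-in for the corresponding component (ConveRT encoder, RL policy,
# prompt lookup, fixed backbone dialog system).
def parl_respond(dialog_history, encode_history, policy, to_prompts, backbone):
    """Select action prompts for the current dialog history and let the
    fixed backbone generate the next system response."""
    state = encode_history(dialog_history)        # step 3 input: dialog embedding
    action = policy(state)                        # RL-trained behavior policy
    qst_prompt, bhv_prompt = to_prompts(action)   # step 1: actions as prompts
    return backbone.generate(dialog_history, qst_prompt, bhv_prompt)  # steps 2 & 4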
The remainder of this paper is organized as fol-
lows: In section 2 we review related work in recent
years, in section 3 we dive into details of our frame-
work, in section 4 we introduce the evaluation meth-
ods used in this work, in section 5 we demonstrate
and analyze the evaluation results, in section 6 we
highlight advantages and limitations of the proposed
framework, and in section 7 we conclude our work
and discuss directions for future work.
2 RELATED WORK
2.1 Open-Domain Dialog Systems
In the past few years, the area of open-domain dia-
log systems has achieved significant progress with the
development of deep learning. Typically, deep learn-
ing methods for open-domain dialog systems can be
categorized into retrieval-based and generation-based
approaches.
A retrieval-based open-domain dialog system matches user utterances against existing queries in a pre-collected human conversation dataset and retrieves the responses of similar queries as candidate responses.
Then a scoring algorithm scores these candidates and
the response with the highest score is selected. Re-
cent work on retrieval-based systems includes (Zhou
et al., 2016), (Zhou et al., 2018), and (Gu et al.,
2020). One drawback of retrieval-based systems is
their dependence on the pre-collected dataset, which
is difficult to construct. Additionally, the pre-existing responses can only cover a limited scope of conversations, which limits their applicability to real-world open-domain conversations spanning an arbitrarily wide range of topics.
Generation-based dialog systems, on the other
hand, possess the potential of generating unseen re-
sponses. Recent work on generation-based systems
focuses on fine-tuning pre-trained language models
on dialog datasets (Saleh et al., 2020; Wolf et al.,
2019; Zhang et al., 2020; Adiwardana et al., 2020).
While generation-based dialog systems alleviate the
limited scope problem of retrieval-based systems,
their performance still heavily depends on the qual-
ity of training corpora.
2.2 RL for Dialog Systems
Compared to supervised learning, reinforcement
learning in the area of open-domain dialog systems
is still in the exploratory stage. One popular RL
research direction is to optimize a dialog system
pre-trained with supervised learning. For example,
(Jaques et al., 2019) optimize for sentiment and sev-
eral other conversation metrics by learning from a
static batch of human-bot conversations using Batch
RL. (Saleh et al., 2020) propose using RL to reduce
toxicity in an open-domain dialog setting in order to
ensure the model produces more appropriate and safe
conversations. In these settings, the action space is
usually infinite with actions being system responses
of various lengths. In contrast, (Xu et al., 2018) ex-
plicitly define an action space consisting of dialog
acts that represent human behaviors during conversa-
tion and train a policy model that decides appropriate
dialog acts with respect to dialog history.
2.3 Prompting Language Models
Prompting, which can for instance be used to steer
multi-task generalist agents (Reed et al., 2022), has
recently also been explored as a way to enhance the
performance of language models. (Radford et al.,
2019) employ prompts to guide zero-shot generation
for tasks such as translation. (Raffel et al., 2020) use
task-specific prefixes as prompts in their text-to-text
framework for various NLP tasks. (Lee et al., 2021)
use natural language descriptions for requested do-
mains and slots as prompts to guide the generation of
a slot value for the requested domain and slot in the
dialog state tracking task.
Inspired by this body of work, we propose to use
natural language prompts as actions in RL to guide
the backbone dialog system to behave accordingly.
We explicitly define an action space similarly to (Xu
et al., 2018), but with actions as natural language
prompts. To the best of our knowledge, we are the
first to propose using natural language prompts as ac-
tions in RL for optimizing dialog systems.
3 FRAMEWORK DESIGN
3.1 Problem Statement & Notation
In this section, we introduce the primary notations
used in this paper and formulate the task briefly. The
main task of PARL is to train, with RL, a behavior policy that takes the dialog history as input and outputs a behavior action. We then combine the dialog history and the behavior action as input to our fixed pre-trained dialog system, which we call the backbone. The backbone then generates a system response. PARL's framework structure can be found in Figure 1.
Figure 1: Framework structure of PARL.

Figure 2: Input representation for the backbone dialog system: Special tokens [usr], [sys], [qst], and [bhv] represent a following user utterance, system utterance, question-or-not action, and behavior action, respectively.

We consider a dialog as a sequence of utterances alternating between two parties, U_1, S_1, . . . , U_T, S_T, with U as the user utterance and S as the system response. In a turn t, the user produces the new utterance U_t and the system replies with the utterance S_t. Then we denote the dialog history at turn t as H_t = {U_1, S_1, . . . , S_{t-1}, U_t}, which excludes the latest system response S_t. Furthermore, we denote the action space in RL as A, and the policy network is defined as π = P(A_t | H_t) with A_t ∈ A.
3.2 Action Space Definition
As discussed above, we define the action space A as
a finite set of general human behaviors during one-
on-one conversations. For our experiments, we de-
fine two types of actions: question-or-not (QST) ac-
tions and behavior (BHV) actions. QST actions in-
dicate whether the next system response should be a
question, whereas BHV actions represent behaviors
the backbone system should perform next. For QST actions, we define two values: "Ask me for further details." and "[None]". As for the BHV actions, we define four human behaviors: 1) to congratulate the user, 2) to comfort the user, 3) to give the user advice, and 4) no behavior. We believe these behaviors represent actions desirable for an empathetic conversation partner, supportively reacting to both positive and negative emotions, and providing advice when needed. The action values are defined as "I'm in a positive mood, please congratulate me and praise me.", "I'm in a negative mood, please comfort me.", "give me some advice." and "[None]", respectively. The action
space A is then a two-dimensional space combining
QST actions and BHV actions.
Note that the values are in the form of natural lan-
guage prompts, and in particular, they are phrased as
requests from the user’s perspective. We define them
as such because we concatenate these action values
with the last user utterance U_t from the dialog history H_t, and feed it as input to the backbone system. An example input representation can be found in
Figure 2. Such processing simulates the user saying
these prompts, and should guide the backbone system
to better understand the user sentiment and generate
more appropriate and empathetic responses.
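As an illustration, the following Python sketch encodes the two action dimensions and concatenates the chosen prompts with the dialog history using the special tokens shown in Figure 2. The helper name build_backbone_input and the exact serialization are assumptions for illustration, not the authors' implementation.

# Sketch of the 2D action space (QST x BHV) and of the prompt concatenation
# described above; the token layout follows Figure 2, everything else is
# illustrative.
QST_ACTIONS = ["Ask me for further details.", "[None]"]
BHV_ACTIONS = [
    "I'm in a positive mood, please congratulate me and praise me.",
    "I'm in a negative mood, please comfort me.",
    "give me some advice.",
    "[None]",
]

def build_backbone_input(history, qst_action, bhv_action):
    """history is a list of (speaker, utterance) pairs, speaker in {"usr", "sys"};
    the action prompts are appended after the last user utterance."""
    parts = [f"[{speaker}] {utterance}" for speaker, utterance in history]
    parts.append(f"[qst] {QST_ACTIONS[qst_action]}")
    parts.append(f"[bhv] {BHV_ACTIONS[bhv_action]}")
    return " ".join(parts)

# Example: the policy asks the backbone to request details and to comfort the user.
print(build_backbone_input([("usr", "I just failed my driving test again.")], 0, 1))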
3.3 Backbone Dialog System
For the backbone dialog system, we use Blenderbot-
400M-distill (Roller et al., 2021), a generation-based
model that has shown generally good conversational
skills. To let the backbone system better understand
the prompts, we first augmented the EmpatheticDi-
alogues dataset (Rashkin et al., 2019) by appending
suitable prompts to each last user utterance U_t in dialog history H_t, and then we fine-tuned Blenderbot on
this dataset for only 10 epochs to avoid overfitting. To
assign a proper question-or-not prompt to each dialog
in the augmentation step, we simply checked whether
the system response has a question mark. For be-
havior prompts, we trained a sentiment classifier to
tell whether the user is in a positive/negative/neutral
mood and then added prompts according to the clas-
sification. To train this classifier, the first author
manually labeled 500 dialog samples from the Empa-
theticDialogues dataset with the label set defined as
{positive, negative, neutral}. Then we used this clas-
sifier to tag the whole EmpatheticDialogues dataset
and finally, we manually reviewed the entire dataset
and revised obvious misclassifications. We have made the augmented EmpatheticDialogues dataset public.²
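A minimal sketch of this augmentation step is given below, assuming the mapping positive → congratulate, negative → comfort, neutral → no behavior prompt; the exact mapping and all helper names are illustrative assumptions.

# Sketch of the EmpatheticDialogues augmentation: derive a question-or-not
# prompt from the reference system response and a behavior prompt from the
# classified user sentiment, then append both to the last user utterance.
def qst_prompt(system_response):
    # The reference response is treated as a question if it contains "?".
    return "Ask me for further details." if "?" in system_response else "[None]"

def bhv_prompt(user_sentiment):
    # Assumed sentiment-to-prompt mapping (illustrative).
    return {
        "positive": "I'm in a positive mood, please congratulate me and praise me.",
        "negative": "I'm in a negative mood, please comfort me.",
        "neutral": "[None]",
    }[user_sentiment]

def augment_turn(last_user_utterance, system_response, user_sentiment):
    """Return the augmented user turn used to fine-tune the backbone."""
    return (f"{last_user_utterance} [qst] {qst_prompt(system_response)} "
            f"[bhv] {bhv_prompt(user_sentiment)}")

print(augment_turn("My dog passed away last week.",
                   "I'm so sorry to hear that. How old was he?", "negative"))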
The purpose of the fine-tuning is to make sure the
backbone system can understand action prompts and
respond accordingly. This consistency between action
prompts and system responses is necessary for later
reinforcement learning.
² https://huggingface.co/datasets/Adapting/empathetic dialogues with special tokens
3.4 Policy Network
3.4.1 Model Architecture
To select appropriate actions according to different
conversation situations, we design a policy network
that takes the dialog history as input and outputs the
action prompts defined in subsection 3.2. The policy
network consists of only fully-connected layers. The
input is the embedding of the dialog history and the
output is two-dimensional logits, where the first di-
mension represents behavior actions and the second
represents question-or-not actions.
To obtain the embedding of the dialog history,
we use the pre-trained conversational representation
model ConveRT (Henderson et al., 2020) and keep
it fixed. ConveRT is a specialized encoder that can
compress the dialog history into a 512-dimensional
embedding. We further apply the arctan function to
this embedding element-wise so that each dimension
is restricted to (−1, 1). We believe such processing
can improve exploration efficiency for later training
while ensuring distinguishability of the embeddings
around the origin.
To map the logits to corresponding action values,
we further employ activation functions to restrict the
logits into a fixed interval. Then we slice this interval
into subintervals and each subinterval corresponds to
a certain action value. For example, for the second-dimension logit we apply tanh so that the interval is (−1, 1). Then we divide the interval into two subintervals, (−1, 0] and (0, 1), where (−1, 0] corresponds to the QST action value "Ask me for further details." and (0, 1) corresponds to the QST action value "[None]".
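The following PyTorch sketch illustrates this architecture: a fixed 512-dimensional ConveRT embedding is bounded element-wise with arctan, passed through fully-connected layers, and the two resulting logits are mapped to discrete actions by slicing a tanh-bounded interval. The hidden size and the exact interval splits are illustrative assumptions.

# PyTorch sketch of the policy head: ConveRT embedding -> arctan -> MLP ->
# two bounded logits -> discrete (BHV, QST) actions via interval slicing.
# Hidden size and interval splits are illustrative assumptions.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),   # [behavior logit, question-or-not logit]
        )

    def forward(self, convert_embedding):
        x = torch.arctan(convert_embedding)   # bound the fixed ConveRT embedding
        return torch.tanh(self.mlp(x))        # both logits restricted to (-1, 1)

def decode_actions(logits):
    """Slice each bounded logit into subintervals, one per action value."""
    bhv_logit, qst_logit = logits[0].item(), logits[1].item()
    # Four equal BHV subintervals of (-1, 1): congratulate / comfort / advice / none.
    bhv_action = min(int((bhv_logit + 1.0) / 0.5), 3)
    # Two QST subintervals: (-1, 0] -> "Ask me for further details.", (0, 1) -> "[None]".
    qst_action = 0 if qst_logit <= 0.0 else 1
    return bhv_action, qst_action

policy = PolicyNetwork()
embedding = torch.randn(512)   # stand-in for a ConveRT dialog-history embedding
print(decode_actions(policy(embedding)))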
3.4.2 Training
In order to train the policy network with reinforce-
ment learning, we choose the Soft Actor-Critic (SAC)
algorithm, which is an off-policy actor-critic deep RL
algorithm based on maximum entropy reinforcement
learning (Haarnoja et al., 2018). The reason we use
SAC is that it can explore very diverse policies and
preserve near-optimal policies while pursuing conver-
gence as much as possible. This fits our behavior
policy well, since human behaviors can be very com-
plex and different people might react differently to the
same conversation situation.
The process of training the policy network can be
divided into the following steps:
1) Action Decision & System Response Gener-
ation. First, the embedding of the dialog history H_t = {U_1, S_1, . . . , S_{t-1}, U_t} is fed to the policy network, which then selects an appropriate action. The action (prompt) is then concatenated with the original dialog history and fed as input to the backbone dialog system, which then generates a new system response S_t.
An input example for the backbone dialog system can
be found in Figure 2.
2) Reward Calculation. Once we have the generated system response S_t, we compute a reward for it. We use the metric model DYnamic MEtric for dialog modeling (DYME) (Unold et al., 2021), which we trained on the EmpatheticDialogues (Rashkin et al., 2019) and DailyDialog (Li et al., 2017) datasets. DYME predicts utterance metrics for the next sentence based on a given dialog history. In total, it considers 15 metrics, such as repetition metrics, sentiment, coherence metrics, empathy-based metrics, and utterance length. We first use DYME to predict the ground-truth utterance metrics of an ideal next system response given the current dialog history H_t, denoted as m_t ∈ ℝ^15. Then we use the same metric algorithms as in DYME to compute the utterance metrics of the generated response S_t, denoted as m̂_t ∈ ℝ^15. The reward function is defined as a distance function l : ℝ^15 × ℝ^15 → ℝ that measures the similarity between m_t and m̂_t. In this work, we use the negative mean squared error as the reward function.
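A sketch of this reward computation is shown below; the two metric vectors are random stand-ins for DYME's prediction and the metrics computed on the generated response.

# Negative-MSE reward between DYME's predicted metrics for an ideal next
# response (m_t) and the metrics computed on the generated response (m̂_t).
import numpy as np

def reward(m_t, m_hat_t):
    """Both arguments are 15-dimensional metric vectors."""
    m_t, m_hat_t = np.asarray(m_t, dtype=float), np.asarray(m_hat_t, dtype=float)
    assert m_t.shape == m_hat_t.shape == (15,)
    return -float(np.mean((m_t - m_hat_t) ** 2))

# Random stand-ins for the DYME prediction and the computed metrics of S_t:
print(reward(np.random.rand(15), np.random.rand(15)))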
3) New User Input Generation. To continue the
conversation, a new user input U_{t+1} is required as a reply to the system response S_t. In our framework,
real human interaction or a user chatbot can both be
used to produce user inputs. For our experiments, we
employ a user chatbot, namely Blenderbot-1B-distill
(Roller et al., 2021), which is a variant of the back-
bone system but with more parameters.
4) Dialog History Update & Repetition. Once we have the new system response S_t and user input U_{t+1}, we update the dialog history as H_{t+1} = H_t ∪ {S_t, U_{t+1}}, where ∪ stands for concatenation.
Then we repeat the entire process.
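The four steps can be wrapped into a single environment step, as in the sketch below, so that an off-the-shelf SAC implementation can interact with the dialog loop through the familiar reset/step interface. The callables policy_embed, backbone, user_bot, dyme_predict, and dyme_compute stand in for ConveRT, the fine-tuned Blenderbot, the Blenderbot-1B-distill user chatbot, and DYME; all names and the turn limit are illustrative assumptions.

# Gym-style environment sketch combining the four steps above into one step();
# an off-policy SAC agent can then be trained against it. All components and
# names are illustrative stand-ins.
import numpy as np

class DialogEnv:
    def __init__(self, policy_embed, backbone, user_bot,
                 dyme_predict, dyme_compute, max_turns=10):
        self.embed, self.backbone, self.user_bot = policy_embed, backbone, user_bot
        self.dyme_predict, self.dyme_compute = dyme_predict, dyme_compute
        self.max_turns = max_turns

    def reset(self):
        self.history = [("usr", self.user_bot.opening_line())]
        self.turn = 0
        return self.embed(self.history)                    # observation for the policy

    def step(self, action):
        qst_prompt, bhv_prompt = action                    # decoded action prompts
        # 1) Action decision & system response generation.
        response = self.backbone.generate(self.history, qst_prompt, bhv_prompt)
        # 2) Reward: negative MSE between predicted and computed DYME metrics.
        m_t = np.asarray(self.dyme_predict(self.history))
        m_hat_t = np.asarray(self.dyme_compute(response))
        reward = -float(np.mean((m_t - m_hat_t) ** 2))
        # 3) New user input from the user chatbot.
        user_reply = self.user_bot.reply(self.history + [("sys", response)])
        # 4) Dialog history update & repetition.
        self.history += [("sys", response), ("usr", user_reply)]
        self.turn += 1
        done = self.turn >= self.max_turns
        return self.embed(self.history), reward, done, {}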
4 EVALUATION METHODS
To explore the effect of the policy network’s guid-
ance, we compare PARL and the baseline (PARL's backbone, i.e., the fine-tuned Blenderbot without the policy network, as introduced in subsection 3.3) in both an
automatic and a human evaluation. The experimental
dataset is the test set of the original EmpatheticDia-
logues dataset, which does not include prompts.
4.1 Automatic Evaluation
For automatic evaluation, we use METEOR (Baner-
jee and Lavie, 2005) and FED (Mehri and Eskenazi,
2020).

Figure 3: Training results of two experiments with ("w/") and without ("w/o") auto-tuning (AT). Panels: (a) episodic return, (b) actor loss, (c) Q function loss, and (d) Q values without AT; (e)–(h) the same quantities with AT. The blue part in Figure 3a indicates the random sampling in the first 1k steps. Q values are represented by values from Qf1 due to double Q learning.

METEOR is a word-overlap metric that cal-
culates the similarity between the generated sequence
and the ground truth sequence on word-level, whereas
FED is a neural metric that can measure fine-grained
dialog qualities at both the turn- and whole dialog-
level. More specifically, FED can assess eighteen
qualities of dialog without relying on a reference re-
sponse and has shown moderate to strong correlation
with human judgment. Example qualities are “di-
verse”, “coherent” and “fluent”. In this experiment,
we consider eleven qualities in view of their relative
importance stated in (Mehri and Eskenazi, 2020).
4.2 Human Evaluation
Seven raters scored 20 randomly sampled conversa-
tions independently from 1 (very low) to 5 (very high)
in terms of empathy, relevance, and fluency. All
raters are university-educated and share a computer-
science-related background. For the purpose of blind
evaluation, we marked the responses of the two mod-
els as response A and response B and randomly shuf-
fled some responses to reduce bias. To represent sub-
jectivity, we computed the inter-rater agreement by Krippendorff's Alpha (Krippendorff, 2011) with the level of measurement set to "interval".
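For reference, the agreement can be computed as in the following sketch, assuming the krippendorff PyPI package; the rating matrix is a made-up example.

# Inter-rater agreement with Krippendorff's alpha at the "interval" level.
import numpy as np
import krippendorff

# One row per rater, one column per rated response (scores on a 1-5 scale).
ratings = np.array([
    [3, 4, 5, 2, 4],
    [3, 3, 5, 2, 5],
    [4, 4, 4, 3, 4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha (interval): {alpha:.3f}")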
5 RESULTS
5.1 Policy Network Training Results
In our experimental training process, we observe that
the actor in the policy network begins exploration at
an early stage. Exploration and exploitation form a trade-off in reinforcement learning: the former avoids getting stuck in a local optimum by trying various actions, while the latter aims to converge to the optimal strategy by spending the limited resources on the few actions that already look best. Figure 3a and Figure 3e show a slight improvement around 2k steps (the first 1k steps are the random data collection phase, which would typically last 5k steps). The
actor loss (Figure 3b) experiences a rapid decline and
then rebounds, so we suspect that the training data
collected in the early stage has already been exploited
and the actor has turned to slow exploration after 2k
steps. The loss of the Q function (Figure 3c) declines steadily, while the Q value (Figure 3d) decreases and stabilizes after peaking at around 2k steps. This suggests that the Q function is overestimated at the beginning and further indicates that the exploitation phase has been completed. In the case of auto-tuning, the actor loss
(Figure 3f) increases instead along with the Q values
(Figure 3h), reflecting that the exploration is carried
out without too much exploitation. Considering this,
we think the training is still at a very early stage and
the agent is doing more exploration at this point.
Note that SAC and other RL algorithms often re-
quire millions of training steps to achieve significant
results (Haarnoja et al., 2018), while resource con-
straints only allowed us to make a preliminary verifi-
cation of the algorithm.
5.2 Results of Automatic Evaluation
From the METEOR scores in Table 1, we can see that
although PARL has a higher score, both scores are
small and the difference is negligible.
For FED, PARL scores higher than the baseline
in engagement, semantic appropriateness, specificity,
understandability, fluency, and likeability, with the
score difference being around 0.01. On the other
hand, the performance of PARL on relevance, correct-
ness, coherence, consistency, and diversity metrics is
lower, with the difference being around 0.1.
Table 1: Results of automatic evaluation. The better score in each row is marked with an asterisk (FED scores were negated for readability).

                           Blenderbot   PARL
METEOR                      0.1605       0.1627*
Engaging                    0.3698       0.3701*
Semantically Appropriate   -0.2425      -0.2354*
Specific                    0.2020       0.2118*
Relevant                   -6.3218*     -6.4173
Correct                    -6.3973*     -6.4928
Understandable             -0.2254      -0.2161*
Fluent                     -0.2454      -0.2350*
Coherent                   -6.2809*     -6.3813
Consistent                 -6.5760*     -6.6770
Diverse                    -6.4907*     -6.5906
Likeable                    0.2269       0.2283*
5.3 Results of Human Evaluation
Table 2: Results of human evaluation.

             Empathy   Relevance   Fluency
Blenderbot   3.114     3.114       3.607
PARL         3.250     3.129       3.236

Table 3: Krippendorff's alpha.

General   Empathy   Relevance   Fluency
0.256     0.290     0.278       0.149
From the results in Table 2, we can see PARL was
rated slightly better than the baseline regarding em-
pathy and relevance. The difference in empathy is
0.136, while the difference in relevance is quite small
at only 0.015. There is a relatively big difference of
0.4 when it comes to fluency, where the baseline was
rated higher. Overall, both models achieved ratings
slightly above the neutral rating of 3 for all metrics.
The agreement results in Table 3 show very low
agreement on each metric. However, the agreement
on empathy and relevance is much higher compared
to the agreement on fluency. This might indicate that
fluency is harder to judge and more subjective.
5.4 Conversation Examples
To show advantages and limitations of PARL, we
cherry-picked two conversation examples generated
during training. The actions output by the policy net-
work are attached to each user utterance in italics.
Conversation A
User: Hi, I’d like to return this radio. (ask de-
tails; congratulate me)
PARL: That’s good to hear. Why do you want
to return it? Is it something wrong with
it?
User: I don’t think so, but I just don’t want
to deal with the hassle of returning it.
(none; comfort me)
PARL: I understand. Perhaps you can give it a
try and if you like it, you can return it.
Conversation B
User: Mom, you know that Andrea and I
sometimes worry about you. (ask de-
tails; comfort me)
PARL: I worry about my mom all the time.
Why do you worry about me? What
happened to you?
User: I don’t know, I guess I’m just worried
that she’s going to get sick and I won’t
be able to take care of her. (none; give
advice)
PARL: You can take good care of your mom.
Don’t worry about the things that worry
you. Everything will be okay.
6 DISCUSSION
Based on the results in section 5, we now discuss
PARL's advantages and current limitations.
6.1 Advantages
PARL has the following advantages:
Controllable and Explainable. Due to explicit
behavior decisions, PARL shows controllability and
explainability. The behavior actions chosen by the
policy network can guide the backbone dialog sys-
tem to generate a corresponding response as shown in
Conversation A and B, where the backbone follows
the guidance of the policy network.
Engaging and Empathetic. Even though the dif-
ferences are small, the results of both automatic and
human evaluation show that PARL is rated as more
engaging and empathetic compared to the baseline.
We believe this is thanks to the explicit engaging be-
haviors such as “comfort” and “congratulate”.
Dynamic Emotion Capturing. Because the pol-
icy network decides behaviors in every dialog turn, it
can capture the user’s changing emotions.
Adaptability. The backbone dialog system can be
exchanged (e.g., with a more powerful model) and the
behavior definition can also be extended.
6.2 Limitations
However, there are still several limitations for PARL.
Non-Comprehensive Behaviors. In this work,
we only define four behavior actions, which is far too few compared to the range of real human behaviors. We think
this is the reason why PARL shows less diversity in
the automatic evaluation.
Lengthy Action Prompts. Some action prompts
are lengthy, such as “comfort” and “congratulate”.
Since the prompts are directly concatenated to the di-
alog history, this could potentially change the user ut-
terance’s original meaning, especially when the dia-
log history is short or the policy network makes mis-
takes. We believe this is why PARL shows less rel-
evance, coherence and consistency in the automatic
evaluation. An example would be Conversation A,
where the user’s emotion may not actually be positive
in the beginning.
Coarse-Grained Action Space. The current 2D
action space is coarse-grained because we put sen-
timent behaviors and advising behavior in the same
dimension. Thus, PARL cannot perform sentiment behaviors and advising behaviors simultaneously.
Besides the framework limitations, the decision
making of the policy network is not perfect due to
limited training time, which might explain the small
difference compared to the baseline. As such, PARL
might yet have to fully realize its potential and the
advantages mentioned above.
Additionally, PARL's performance depends on the
backbone system’s quality. The used model produced
generally good outputs, but struggled with logical
consistency and uncommon user inputs (see Conver-
sation A and B).
Regarding the human evaluation, it must be noted
that due to the small sample size, it is not fully conclu-
sive. Since the manual rating of conversations seems
to be challenging and subjective, as evidenced by the
low inter-rater agreement, a larger-scale evaluation
with more detailed annotation rules should be carried
out once the model has been fully trained.
In addition, we found that DYME has some limitations. For instance, for discrete metrics like "question", DYME predicts floating-point numbers, which leads to a persistent error between these predictions and the integer values (the floor and ceiling) computed from the generated utterance. Also, metrics like "utter-
ance length” may lead to lower rewards despite a high
utterance quality from a human perspective, which
has a continuous impact on a conversation due to the
non-sparse nature of DYME-based rewards. These
two factors make it difficult for the policy network
to achieve the best results.
7 CONCLUSION & FUTURE
WORK
We propose PARL, an open-domain dialog frame-
work that uses natural language prompts as behav-
ior actions to guide a pre-trained dialog system. We
design a reward function using the pre-trained metric
model DYME, with which we train a policy network
to select proper actions according to the dialog con-
text. Despite limited training resources, preliminary
results indicate the potential of the policy network's
guidance to improve dialog systems with RL.
Since our work is only a preliminary attempt to
combine RL with the prompting technique, there are
still many possible improvements: 1) Improving the
reward function: DYME’s limitations as discussed in
subsection 6.2 could be remedied. 2) More diverse
behaviors: For instance, behaviors related to different
emotions (instead of just positive or negative) could
create more diverse dialogs. 3) Improved prompts: As
mentioned in subsection 6.2, shorter prompts could
better preserve the user utterance’s original meaning.
4) Better action space design: As mentioned in sub-
section 6.2, a more fine-grained action space (e.g.
higher dimensional) would enable the agent to per-
form diverse behaviors simultaneously. 5) Dynamic
aborting: Dynamic “done” returns based on metrics
could be applied to stop conversations at appropri-
ate times. 6) Multi-task learning: Instead of a fixed
backbone, policy and dialog system could be trained
jointly. 7) Extending the model: We could, for exam-
ple, add a memory component (Weston et al., 2015)
to increase conversational ability in longer dialogs.
REFERENCES
Adiwardana, D., Luong, M., So, D. R., Hall, J., Fiedel, N.,
Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade,
G., Lu, Y., and Le, Q. V. (2020). Towards a human-
like open-domain chatbot. CoRR, abs/2001.09977.
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic
metric for MT evaluation with improved correlation
with human judgments. In Proceedings of the ACL
Workshop on Intrinsic and Extrinsic Evaluation Mea-
sures for Machine Translation and/or Summarization,
pages 65–72, Ann Arbor, Michigan. Association for
Computational Linguistics.
Gu, J.-C., Li, T., Liu, Q., Ling, Z.-H., Su, Z., Wei, S.,
and Zhu, X. (2020). Speaker-aware BERT for multi-
turn response selection in retrieval-based chatbots.
In Proceedings of the 29th ACM International Con-
ference on Information & Knowledge Management,
pages 2041–2044.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. In
International conference on machine learning, pages
1861–1870. PMLR.
Henderson, M., Casanueva, I., Mrkšić, N., Su, P.-H., Wen, T.-H., and Vulić, I. (2020). ConveRT: Efficient and ac-
curate conversational representations from transform-
ers. In Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 2161–2174, Online.
Association for Computational Linguistics.
Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson,
C., Lapedriza, À., Jones, N., Gu, S., and Picard,
R. W. (2019). Way off-policy batch deep reinforce-
ment learning of implicit human preferences in dialog.
CoRR, abs/1907.00456.
Krippendorff, K. (2011). Computing Krippendorff's alpha-reliability.
Lee, C.-H., Cheng, H., and Ostendorf, M. (2021). Dialogue
state tracking with a language model using schema-
driven prompting. In Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, pages 4937–4949, Online and Punta Cana,
Dominican Republic. Association for Computational
Linguistics.
Li, Y., Su, H., Shen, X., Li, W., Cao, Z., and Niu, S.
(2017). DailyDialog: A manually labelled multi-turn
dialogue dataset. In Proceedings of the Eighth Inter-
national Joint Conference on Natural Language Pro-
cessing (Volume 1: Long Papers), pages 986–995,
Taipei, Taiwan. Asian Federation of Natural Language
Processing.
Mehri, S. and Eskenazi, M. (2020). Unsupervised evalua-
tion of interactive dialog with DialoGPT. In Proceed-
ings of the 21th Annual Meeting of the Special Interest
Group on Discourse and Dialogue, pages 225–235.
Ni, J., Young, T., Pandelea, V., Xue, F., and Cambria, E.
(2022). Recent advances in deep learning based dia-
logue systems: A systematic survey. Artificial Intelli-
gence Review, pages 1–101.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. (2019). Language models are unsuper-
vised multitask learners. OpenAI Blog, 1(8):9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., Liu, P. J., et al. (2020).
Exploring the limits of transfer learning with a unified
text-to-text transformer. Journal of Machine Learning
Research, 21(140):1–67.
Rashkin, H., Smith, E. M., Li, M., and Boureau, Y.-L.
(2019). Towards empathetic open-domain conversa-
tion models: A new benchmark and dataset. In Pro-
ceedings of the 57th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 5370–5381,
Florence, Italy. Association for Computational Lin-
guistics.
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G.,
Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky,
Y., Kay, J., Springenberg, J. T., Eccles, T., Bruce, J.,
Razavi, A., Edwards, A., Heess, N., Chen, Y., Had-
sell, R., Vinyals, O., Bordbar, M., and de Freitas, N.
(2022). A generalist agent. ArXiv, abs/2205.06175.
Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M.,
Liu, Y., Xu, J., Ott, M., Smith, E. M., Boureau, Y.-L.,
and Weston, J. (2021). Recipes for building an open-
domain chatbot. In Proceedings of the 16th Confer-
ence of the European Chapter of the Association for
Computational Linguistics: Main Volume, pages 300–
325, Online. Association for Computational Linguis-
tics.
Saleh, A., Jaques, N., Ghandeharioun, A., Shen, J., and
Picard, R. (2020). Hierarchical reinforcement learn-
ing for open-domain dialog. Proceedings of the AAAI
Conference on Artificial Intelligence, 34(05):8741–
8748.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I.,
Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M.,
Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354–
359.
Unold, F. v., Wintergerst, M., Belzner, L., and Groh, G.
(2021). DYME: A dynamic metric for dialog mod-
eling learned from human conversations. In Interna-
tional Conference on Neural Information Processing,
pages 257–264. Springer.
Weston, J., Chopra, S., and Bordes, A. (2015). Memory
networks. Paper presented at 3rd International Con-
ference on Learning Representations, ICLR 2015, San
Diego, United States.
Wolf, T., Sanh, V., Chaumond, J., and Delangue, C. (2019).
TransferTransfo: A transfer learning approach for
neural network based conversational agents. CoRR,
abs/1901.08149.
Xu, C., Wu, W., and Wu, Y. (2018). Towards explain-
able and controllable open domain dialogue genera-
tion with dialogue acts. ArXiv, abs/1807.07255.
Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett,
C., Gao, X., Gao, J., Liu, J., and Dolan, B. (2020).
DIALOGPT : Large-scale generative pre-training for
conversational response generation. In Proceedings of
the 58th Annual Meeting of the Association for Com-
putational Linguistics: System Demonstrations, pages
270–278, Online. Association for Computational Lin-
guistics.
Zhou, X., Dong, D., Wu, H., Zhao, S., Yu, D., Tian, H., Liu,
X., and Yan, R. (2016). Multi-view response selection
for human-computer conversation. In Proceedings of
the 2016 Conference on Empirical Methods in Natural
Language Processing, pages 372–381.
Zhou, X., Li, L., Dong, D., Liu, Y., Chen, Y., Zhao, W. X.,
Yu, D., and Wu, H. (2018). Multi-turn response se-
lection for chatbots with deep attention matching net-
work. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 1118–1127.