Leveraging Large Language Models for Preference-Based Sequence

Prediction

Michaela Tecson

1 a

, Daphne Chen

2 b

, Michelle Zhao

1 c

, Zackory Erickson

1 d

and Reid Simmons

1 e

Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, U.S.A.

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, U.S.A.

Keywords:

Meal Preparation, Sequence Prediction, User Preferences, Assistive Technology, Cooking Assistance, Large

Language Models, Action Anticipation.

Abstract:

We present a novel approach to leveraging Large Language Models (LLMs) for action prediction in meal

preparation sequences, with a focus on tailoring predictions based on user preferences. We introduce methods

using OpenAI’s GPT-4o model to predict subsequent actions in a sequence by providing different forms of

context such as sequences from other participants or prior sequences of the test participant. Our approach out-

performs baseline methods, including Aggregate Long Short-Term Memory (LSTM) and mixture-of-experts

(MoE) models, by up to 33.8% by leveraging the LLM’s ability to adapt predictions based on minimal prior

context. We highlight the generalizability of the method across different cooking domains by analyzing the

results on two different cooking datasets. This adaptability will be useful for assistive systems aiming to sup-

port older adults, especially those with Mild Cognitive Impairments (MCI), in completing complex, sequential

tasks in ways that align with the user’s preferences. The full prompts used in this work can be found at the

project webpage: sites.google.com/view/preference-based-prediction.

1 INTRODUCTION

Older adults, particularly those with Mild Cognitive

Impairments (MCI), often encounter signiﬁcant chal-

lenges when performing complex, sequential tasks

such as meal preparation (Tuokko et al., 2005; Wher-

ton and Monk, 2010). Those with MCI may strug-

gle with maintaining their independence in their daily

routines, which can drastically undermine their sense

of well-being (Yu et al., 2023). Difﬁculties in cooking

caused by cognitive decline can also severely impact

their ability to meet nutritional needs, which is cru-

cial for maintaining overall health. The nutritional in-

take and dietary habits of individuals with MCI are

especially important to maintain (McGrattan et al.,

2018), yet meal preparation is an instrumental activ-

ity of daily living (IADL) that proves to be a chal-

lenge faced by many older adults with MCI everyday

(Johansson et al., 2015).

https://orcid.org/0009-0004-0611-0400

https://orcid.org/0009-0003-0382-7652

https://orcid.org/0000-0001-7522-2300

https://orcid.org/0000-0002-6760-6213

https://orcid.org/0000-0003-3153-0453

Assistance that takes into account the current

needs and long-standing habits of these individuals is

more likely to be effective and adopted by the target

population (Sanders and Martin-Hammond, 2019).

Additionally, prior work has shown that in-situ feed-

back systems signiﬁcantly improve performance in

individual tasks for those with cognitive impairments

(Funk et al., 2015). Therefore, tailoring assistance

that acts in line with their personal preferences will

improve user acceptance, enhancing their quality of

life while maintaining their independence in daily ac-

tivities (Randers and Mattiasson, 2004).

In this work, we present a novel approach to

preference-based sequence prediction using large lan-

guage models (LLMs). While it may appear that there

are limited preferences to habitual meal preparation

tasks such as preparing a salad or a sandwich, our

work uncovers that there are multiple different pref-

erences for even simple meals, further motivating the

need for preference-based assistance.

We present two unique approaches to preference-

based sequential reasoning that each leverage the

capabilities of LLMs: Independent Context and

Shared Context. Our methods are able to dynam-

Tecson, M., Chen, D., Zhao, M., Erickson, Z. and Simmons, R.

Leveraging Large Language Models for Preference-Based Sequence Prediction.

DOI: 10.5220/0013190200003890

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 17th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2025) - Volume 2, pages 519-532

ISBN: 978-989-758-737-5; ISSN: 2184-433X

519

ically adapt their predictions based on observations

of both the user’s prior sequences and the current

context of actions that are taking place. Using prior

sequences to infer preferences, we can more accu-

rately predict what action a user will take next in a

cooking sequence. We show that both of these LLM

methods outperform baseline methods for sequence

prediction and prior work on preference-based pre-

diction. This approach could assist users, providing

guidance when they make a mistake or are unsure of

the next step, while still respecting their autonomy

and long-established cooking routines. The key con-

tributions of this work are:

• We introduce a novel approach of using LLMs for

preference-based sequence prediction.

• We evaluate our approach across two meal prepa-

ration datasets for multi-sequence preference pre-

diction and assess generalizability.

• We compare and contrast to prior state-of-the-art

methods for preference-based sequence predic-

tion, in which we observe a signiﬁcant improve-

ment over prior work.

2 RELATED WORK

2.1 Learning Preferences for Sequential

Tasks

Prior work has used unsupervised clustering to extract

human preferences from sequential tasks and then ap-

plies those clusters to training specialized models. For

example, when exploring the problem of robot adap-

tion for collaborating with humans in sequential tasks,

prior work learns from human teams and adapts its

policies based on the preferences it learns from the

latent representation (Nikolaidis et al., 2015; Zhao

et al., 2022).

Others have studied the effectiveness of using a

mixture-of-experts approach to preference-based se-

quence prediction (Chen, 2023), similar to the un-

supervised learning described above. This method

uses an autoencoder to obtain the latent representa-

tion of sequences and then clusters those represen-

tations to reveal different strategies. A predictor is

trained on the sequences belonging to each cluster,

which serves as the “expert” that can do prediction

most accurately on a sequence belonging to that strat-

egy. However, these methods do not generalize well

in a real-world implementation because the predictor

training set and test set included sequences from the

same participant. Additionally, the predictors on both

the aggregate LSTM and mixture-of-experts in Chen

et al. (Chen, 2023) cannot incorporate any new strat-

egy information into their models. We show that our

method using LLMs can improve on these by adding

additional forms of context, even from the test partic-

ipant, to better inform the prediction.

Other methods explore learning and inferring hu-

man preferences in task sequencing using a two-stage

clustering approach (Nemlekar et al., 2021). This

method effectively clusters user preferences at both

the action level and event level to enhance prediction

accuracy in assembly tasks. It is applicable to the sce-

nario where a new person executes the same task that

the current model has demonstrations for. In contrast,

our approach leverages the LLM’s ability to infer the

preferences from an unseen participant and quickly

adapt its prediction for future sequences, rather than

having to rely on existing demonstrations for that par-

ticular task.

2.2 Using LLMs for Robotics

Several advancements have taken place in the appli-

cations of LLMs for zero-shot and few-shot reason-

ing for robot task planning problems (Singh et al.,

2022; Arenas et al., 2023; Hu et al., 2024; Lin et al.,

2023; Liang et al., 2023; Padmanabha et al., 2024).

Other works have leveraged the generalization capa-

bilities of LLMs to summarize the task speciﬁcation

to be aligned with a person’s preferences (Wu et al.,

2023; Wang et al., 2023). TidyBot (Wu et al., 2023)

for example, is bringing tidying preferences to robot

grounded actions with just a few training examples.

The summarization is then used to create a set of rules

for that particular person, which dictates where the

items should go. The robot can then execute that set

of rules and carry out the tidying task according to the

preferences of the user.

Similarly, Demo2Code (Wang et al., 2023) uses

demonstrations to summarize the task speciﬁcation

according to the user’s preference and generates per-

sonalized code to carry it out. These demonstrations

come in the form of detailed descriptions of the robot

state and actions and recursively summarizes these

from low-level actions to high-level subtasks. From

there, the algorithm expands the high-level subtasks

to action-level executable task code. However, works

involving code generation from demonstrations all re-

quire signiﬁcant preprocessing of the task description

and the demonstrations into a compact form before

generating executable code. Additionally, these meth-

ods cannot dynamically update the executable code to

account for changes in user strategy over time, but

rather, they must regenerate the code with those new

demonstrations to incorporate those changes. This

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

520

can be problematic because a user may have multi-

ple or changing preferences for preparing a meal, and

the prior work would not capture that. They also gen-

erate code with the intention that the robot execute the

entire sequence on its own, which means the task ex-

ecution code is ﬁxed and lacks the ﬂexibility to adapt

during the task, as it is entirely pre-deﬁned rather than

continually adjusting to new actions. In contrast, our

method continuously considers both prior preferences

and immediate context, integrating the most recent ac-

tions to reason dynamically and select the next action

based on current conditions. We propose a method de-

signed to work with the user that extends beyond static

rule-based execution as in prior work and can adapt to

a new user preference immediately. The method does

not just take prior sequences into account, but also

prior actions that have just occurred at test time. This

is particularly crucial in the use of an assistive cook-

ing system where the sequence of actions can vary

signiﬁcantly based on individual preferences and sit-

uational context.

2.3 Assistive Cooking Systems

Many existing cooking assistants aim to provide as-

sistance to the users step-by-step (Chan et al., 2023;

Kosch et al., 2019; Sato et al., 2014; Hamada et al.,

2005; Bouchard et al., 2020). LLM-enabled conver-

sational assistants can provide contextual instructions

during cooking tasks (Chan et al., 2023; Le et al.,

2023). However, these methods generate their feed-

back or instructions based on a set recipe. They as-

sume there is a ﬁxed order of steps that must be com-

pleted, not taking into account the different prefer-

ences that users may have for preparing their meal.

This includes varying ingredients, different ordering

of key steps, or speciﬁc techniques that a user may

want to do. In the work presented here, we show how

our methods can take into account the ways the same

dish can be prepared so that the assistance can be per-

sonalized based on the user’s cooking strategy. This

is especially important for the use of this technology

in people’s own homes where they’d like to carry out

the task in a way that aligns with their longstanding

habits.

3 METHODS

This section presents three methods of generating pre-

dictions that this paper compares: Our method of

context-based LLMs, aggregate LSTM, and mixture-

of-experts.

Figure 1: Frame captures from both the overhead camera

(left) and the headworn GoPro camera (right) for the salad

dataset (top) and sandwich dataset (bottom).

3.1 Datasets Used

All the methods described were tested on two cooking

datasets (Chen, 2023). We refer to them as the salad

dataset and sandwich dataset throughout this paper.

Both datasets consist of two camera views for each

sequence: top-down views from a ceiling-mounted

camera and egocentric view from a headworn GoPro

camera. See Figure 1 for examples of these views.

There were the same 17 participants in both datasets.

Participants were all healthy adults, ranging from 19-

82 years old (M = 37 years old, SD = 22); 65% Male,

35% Female.

For both datasets, participants were instructed to

make the salad or sandwich as if making a meal for

themselves in their own home (no recipe to follow)

with the ingredients they were given. Each partici-

pant completed the task ﬁve times, with each attempt

referred to as a sequence. Throughout the paper, we

will refer to these attempts as “Seq. n,” where n corre-

sponds to the order in which the task was completed,

ranging from 1 to 5.

The datasets were both annotated at a task-level

granularity, where each step corresponds to a key

stage in the process without delving into the ﬁne-

grained sub-actions of object handling or preparation.

Some examples from the salad dataset are “cut

tomato”, “place tomato into bowl”, “add salt”,

and “mix ingredients”. Some examples from the

sandwich dataset are “place jam onto bread 1”,

“spread jam onto bread 1”, “put bread together”,

and “cut off crusts”.

The full list of actions for both datasets is included

in the Appendix.

Salad Dataset

• There are 18 unique actions in the dataset.

• Maximum sequence length was 27 actions.

Leveraging Large Language Models for Preference-Based Sequence Prediction

521

Users were provided with the following:

• Ingredients: lettuce, tomato, cucumber, cheese,

oil, vinegar, salt, pepper

• Utensils: plate, chopping board, bowl, butter

knife, small chef’s knife, large chef’s knife,

spoons, forks, whisk

Example sequence from the salad dataset:

[start, cut lettuce, place lettuce into

bowl, cut tomato, place tomato into bowl,

cut cucumber, place cucumber into bowl,

add salt, add pepper, serve salad onto

plate, end]

Sandwich Dataset

• There were 18 unique actions in the dataset.

• Maximum sequence length was 24 actions.

Users were provided with the following:

• Ingredients: bread, peanut butter, jam

• Utensils: plate, chopping board, bowl, butter

knife, small chef’s knife, large chef’s knife,

spoons, forks, whisk

Example sequence from the sandwich dataset (PB

refers to peanut butter):

[start, place PB onto bread 1, spread

PB onto bread 1, place PB onto bread 1,

spread PB onto bread 1, place PB onto

bread 2, spread PB onto bread 2, place

PB onto bread 2, spread PB onto bread

2, place PB onto bread 2, spread PB

onto bread 2, place jam onto bread 1,

spread jam onto PB on bread 1, put bread

together, serve sandwich onto plate, end]

3.2 LLMs for Prediction

In typical machine learning approaches, it is compu-

tationally expensive to retrain the prediction models

on every new demonstration as it becomes available.

To generalize preference-based sequence prediction,

we leverage the LLM’s few-shot and zero-shot rea-

soning capabilities. Speciﬁcally, we explore differ-

ent forms of context by providing the LLM with task-

speciﬁc training examples, including sequences from

other participants as well as previously seen examples

from the test set. Prior analysis of the meal prepa-

ration datasets (Chen, 2023) showed the majority of

participants behave consistently across their multiple

attempts. Thus, by gradually introducing the test par-

ticipant’s own prior sequences, the LLM will be able

to use these prior sequence as context for prediction

of unseen data.

The raw dataset annotations use underscores

instead of spaces for use in the other meth-

ods (“serve salad onto plate” for example is

“serve salad onto plate”). Initial experiments used

the raw annotations with underscores as input to the

LLM. However, the ﬁnal method included prepro-

cessing steps for reducing the number of input tokens:

the underscores were removed, and all actions in a se-

quence were combined into a single string separated

by commas, rather than a list of individual strings.

This preprocessing has a negligble effect on accuracy

but had around a 40% decrease in input tokens.

We start by giving the LLM context about the goal

of the task:

“I am going to ask you to predict the next ac-

tion in a salad making sequence.”

To replicate how traditional machine learning

methods would train the model, we initially provided

the sequences for all participants, called context par-

ticipants, except for one held-out test participant. We

deﬁne context participants as any participants that are

not the test participants whose sequences are provided

to the LLM to build context. Note that in our meth-

ods, there are always N-1 context participants, but the

number can be varied in other applications. Provid-

ing all sequences from all the context participants had

an unrealistic amount of input tokens at every predic-

tion step, since API calls do not retain context mem-

ory as in ChatGPT and other LLM chat interfaces.

This would lead to millions of context tokens which

is costly, and also not sustainable as new sequences

might appear and eventually exceed the context to-

ken limit. Additionally, some participants’ behav-

iors may introduce noise rather than helpful patterns,

and such a large number of sequences may introduce

causal confusion (de Haan et al., 2019), reducing pre-

diction accuracy. Since LLMs are known to excel at

zero-shot and few-shot reasoning tasks (Kojima et al.,

2023; Brown et al., 2020), we then explored the im-

pact of providing different and more limited amounts

of prior context at each call.

One approach, termed Shared Context (LLM-

SC) and illustrated in Figure 2, leverages the few-

shot reasoning capabilities of LLMs. In this approach,

we experimented with providing either one or two

randomly selected sequences per context participant

to the LLM (called LLM-SC1 or LLM-SC2, respec-

tively). This ensures that the LLM receives a di-

verse set of examples to generalize the sequential pat-

terns from the dataset. Providing the sequences of

other participants may help to uncover patterns such

as “place (vegetable) into bowl” frequently follow-

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

522

Figure 2: This diagram illustrates all three of the approaches to the LLM methods: Independent Context, Shared Context with

one sequence per context participant, and Shared Context with two sequences per context participant. This process shown in

the diagram is repeated at every action timestep. The API is called three times per action to obtain the Top-3 accuracy. The

previous predictions list is for recording the previous outputs corresponding to ﬁrst, second, and third most likely predictions.

Note that the red box representing the sequences from context participants is only applicable to the Shared Context methods.

For a given test participant, the training data and action set variables remain ﬁxed across all calls. Additionally, on the ﬁrst

prediction of the ﬁrst action of the ﬁrst sequence of the test participant, prev trials, prev actions, and prev pred will all be

empty.

ing “cut (vegetable)” or “end” frequently following

“serve salad onto plate”. Consequently, this method

allows the LLM to better predict unseen sequences

without being distracted by excessive or redundant

data as in providing all the sequences per context par-

ticipant.

The alternative approach, referred to as Indepen-

dent Context (LLM-IC), tests the minimum amount

of context the LLM requires to make accurate predic-

tions by using sequences from only the test participant

as context (see Figure 2). Even if the test participant

doesn’t perform the exact same order of actions on

each of their ﬁve sequences, the general action pat-

terns will still likely hold and be useful for the LLM

to reference. For example, if a participant likes to pre-

pare the vegetables, then cheese, then dressing last,

the LLM would predict cheese-related actions before

dressing actions if it had some prior context about that

participant’s preference towards doing that. The ratio-

nale behind this approach was that by providing only

a minimal amount of context speciﬁc to the test par-

ticipant, the model’s predictions would focus more on

the unique patterns of that participant. This would

reduce the inﬂuence of other participants’ sequences,

which could introduce noise.

In both methods, we provided the action bank for

the LLM to choose from, since the ﬁrst test sequence

has no other sequences to reference on what actions

are allowed. Another key input is the entire list of pre-

vious actions (from a

to a

) that had occurred in the

test sequence prior to the one being predicted (a

t+1

Earlier iterations of the prompt provided just the sin-

gle previous action (a

), but this method showed poor

results, due to the lack of full context of what had pre-

viously occurred in the sequence.

Our analysis of different prompts showed that the

LLM would sometimes output actions outside of the

action set, unnecessarily wordy responses, or nonsen-

sical predictions given the context, even with the tem-

perature set to 0. As a result, our ﬁnal prompts also

made speciﬁcations about the format of the output:

“Choose from the provided actions list with

no other extraneous words or phrases. Your

response should be a maximum of W words.”

where W was the number of words in the longest an-

notation in the dataset.

Lastly, classiﬁcation methods typically consider

a Top-N accuracy when evaluating learning perfor-

mance (Deshpande and Karypis, 2004; Tang and

Wang, 2018). This is particularly relevant for our ap-

plication to meal preparation, since more than just the

top prediction is valuable. In the case of the salad-

making task for example, the top prediction after

“start, cut tomato, place tomato into bowl” could be

“cut cucumber” when the next true action was “cut

lettuce”. In this case, “cut cucumber” would have

still been a valuable suggestion because it aligns with

sequential logic: the prediction continues the veg-

Leveraging Large Language Models for Preference-Based Sequence Prediction

523

etable preparation phase, the test participant hasn’t

cut the cucumber yet, and the prediction brings them

closer to task completion. Thus, Top-N accuracy cap-

tures the variation in possibilities for what the next

step might be, which is differentiated by preferences.

We query the API three separate times per action

to obtain the top three predictions. The separate calls

while providing the most likely, second most likely,

etc. returned from previous calls acts as a chain-of-

thought process which has been shown to improve

performance on reasoning tasks (Wei et al., 2023) and

indeed resulted in better accuracy over querying the

LLM a single time for the top three most likely pre-

diction all at once.

We iterate over all N participants in the dataset

leaving one-subject-held-out. We use the ﬁve se-

quences from the held-out participant as test se-

quences, testing from Seq. 1 to Seq. 5. For each ac-

tion (a) within the i-th sequence, the LLM is queried

at every timestep t three times to predict the Top-3

most likely predictions of the next action (a

t+1

) given:

• Previous Actions. All previous actions a

to a

• Action Bank. The action bank containing all pos-

sible actions for the LLM to choose from.

• Previously Seen Sequences. Previously seen se-

quences of the current test participant (Seqs. 1 to

i -1).

• Other Participants’ Sequences. Applies only

for Shared Context approaches. 1 or 2 sequences

per context participant (excluding the test partic-

ipant) are randomly chosen. For a dataset with

N total participants, this results in either N − 1

(LLM-SC1) or 2(N − 1) (LLM-SC2) context par-

ticipants’ sequences, depending on the shared

context approach.

• Previous Predictions. Applies for the second and

third calls to the LLM to predict the second and

third most likely predictions to obtain the Top-3

results

Refer to the project webpage

for the full prompts

used in each of these methods.

In summary, here are the three versions to this ap-

proach for using LLMs for action prediction:

• Independent Context (LLM-IC) Each partici-

pant is treated independently without considering

sequences from context participants. The API is

called to predict the next action based solely on

the actions within the current sequence and if ap-

plicable, the test participant’s previous sequences.

sites.google.com/view/preference-based-prediction

• Shared Context (LLM-SC1, 1 Sequence per

Context Participant) One sequence from each of

the context participants is included in the context.

These sequences are randomly selected per partic-

ipant, providing the model with additional infor-

mation about potential action patterns from differ-

ent sources.

• Shared Context (LLM-SC2, 2 Sequences per

Context Participant) This approach extends the

shared context by including two sequences from

each other participant, also randomly selected.

Refer to Figure 2 for an illustration of these LLM

methods.

3.3 Baseline Methods

3.3.1 Aggregate LSTM

As a benchmark for the LLM method, we evaluated

the performance of an Aggregate Long Short-Term

Memory (Agg. LSTM) network. We mapped both

datasets’ textual annotations to numeric values rep-

resenting each action and used a sliding-window ap-

proach with a window size of 5 actions. For each win-

dow, a state vector was created, incremented at the in-

dices corresponding to the actions, normalized, and

appended to the feature vector (moving window x).

The next action following the window was recorded

as the label (moving window y). These feature-label

pairs were then aggregated into a training set.

To determine the optimal parameters for the

LSTM, a grid search was performed on the salad

dataset, testing different values of the LSTM hidden

size, number of LSTM layers, and learning rate. The

values tested for each of these hyperparameters and

corresponding accuracy results are shown in Table 1.

Based on the results from the grid search, we used the

following parameters for the LSTM architecture:

• LSTM hidden size = 16

• Number of LSTM layers = 1

• Four fully-connected layers: Sizes 4× 128, 128×

256, 256 × 64, and 64 × num classes (18 for both

of our datasets)

• Activation function = ReLU

• Loss function = Cross-Entropy

• Optimizer = Adam

• Number of epochs = 5000

• Learning rate = 0.001

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

524

Table 1: Grid Search Results for LSTM Hyperparameters

on the Salad Dataset (Exact Accuracy Only).

Hidden Size Num Layers LR Exact Accuracy

4 1 0.0001 49.0%

4 1 0.001 51.4%

4 1 0.01 51.1%

4 2 0.0001 49.5%

4 2 0.001 52.2%

4 2 0.01 51.2%

8 1 0.0001 52.3%

8 1 0.001 56.8%

8 1 0.01 56.5%

8 2 0.0001 54.0%

8 2 0.001 53.2%

8 2 0.01 53.2%

16 1 0.0001 54.5%

16 1 0.001 57.0%

16 1 0.01 55.5%

16 2 0.0001 53.4%

16 2 0.001 55.0%

16 2 0.01 55.1%

3.3.2 Mixture-of-Experts

Autoencoder to Extract Preferences. The

mixture-of-experts (MoE) model uses an autoencoder

to identify the different strategies in the dataset. To

prepare the sequence data for the autoencoder, all se-

quences were padded with end markers to match the

length of the longest sequence, ensuring consistent

input sizes. Each sequence was then converted into

a tensor of dimensions 1 × 1 × (A × L), where A is

the number of possible actions and L is the longest

sequence in the dataset.

The autoencoder was trained to extract a 2D la-

tent space representation of the sequences using the

Mean Squared Error loss function, Adam optimizer,

a learning rate of 0.001, and 100 epochs. K-Means

Figure 3: Overview of the Mixture of Experts (MoE) ap-

proach compared to the Aggregate LSTM (Chen, 2023).

An autoencoder compresses trajectories (τ

) into a 2D latent

space (z

), followed by clustering to identify distinct strate-

gies. An expert model is trained for each cluster, while the

aggregate LSTM is trained on the entire dataset without dif-

ferentiation.

clustering (K = 5) was applied to the latent space,

with the number of clusters determined by silhouette

score analysis. These clusters reﬂect distinct strate-

gies participants used to complete tasks. For example,

in the salad dataset, one cluster represents sequences

where vegetables and dressing are prepared in paral-

lel, while another shows a sequence of preparing veg-

etables ﬁrst, then cheese, then dressing. Similarly, in

the sandwich dataset, one cluster represents spread-

ing peanut butter on both slices before jelly, while an-

other corresponds to spreading peanut butter on one

slice and jelly on the other.

Training the Experts for Prediction. Each se-

quence in the training data was assigned to a cluster

based on its strategy, and ﬁve separate LSTMs were

trained using the same process and hyperparameters

as the aggregate LSTM, with each LSTM focusing

only on sequences within its corresponding cluster.

This allows each LSTM to specialize in predicting se-

quences of a speciﬁc strategy. To ensure a fair com-

parison with the aggregate LSTM, which is trained on

all sequences except for the test participant, we gen-

erated synthetic data to augment the smaller training

sets of the expert LSTMs. The synthetic data was cre-

ated by calculating a probability matrix of action tran-

sitions within each cluster and generating sequences

based on the most likely actions (Chen, 2023). While

these synthetic sequences don’t include repeated ac-

tions, they still capture the core sequential patterns of

each cluster, helping the expert LSTMs to match the

data scale of the aggregate LSTM.

To effectively leverage the expert predictors, we

ﬁrst identify which cluster an unknown participant

belongs to by maintaining a running belief vector.

This belief vector is initialized with equal probabil-

ities across all 5 clusters. For each action, the outputs

from all 5 expert predictors are compared to the actual

action taken, and the belief vector is updated using a

Bayesian update rule.

During evaluation, the expert predictor corre-

sponding to the highest belief is used to predict the

next action. If there’s a tie, the lowest-numbered strat-

egy is selected. The Bayesian update allows the be-

lief vector to adapt over time, enabling the model to

adjust if a participant shifts strategies during the task.

This continuous belief update process ensures that the

model remains responsive to changing patterns in the

participant’s actions.

Leveraging Large Language Models for Preference-Based Sequence Prediction

525

4 RESULTS

The evaluation process involved leave-one-subject-

out, testing on the 5 sequences of the held-out subject.

The performance of each method was averaged over

all test participants. All of the models predict the next

action, one action at a time.

Two accuracy metrics are used:

• Top-1 Accuracy. The number of times the most

likely predicted action matches the actual next ac-

tion, divided by the total number of actions.

• Top-3 Accuracy. The number of times the correct

action is within the top three predicted actions, di-

vided by the total number of actions.

We used OpenAI’s GPT-4o model (OpenAI,

2024) for running all of the LLM experiments.

Top-1 Accuracy Results

Table 2: Salad Dataset Top-1 Accuracy of Different Meth-

ods Across Sequences.

Model Seq. 1 Seq. 3 Seq. 5

Agg. LSTM 54.5% ± 0.1% 55.3% ± 1.6% 52.3% ± 2.0%

MoE 56.9% ± 2.6% 53.7% ± 2.5% 54.2% ± 2.3%

LLM-IC 53.9% ± 3.1% 83.6% ± 1.1% 86.1% ± 0.5%

LLM-SC1 64.2% ± 1.3% 76.0% ± 3.3% 79.9% ± 2.8%

LLM-SC2 64.1% ± 1.7% 77.0% ± 0.8% 82.4% ± 2.4%

Table 3: Sandwich Dataset Top-1 Accuracy of Different

Methods Across Sequences.

Model Seq. 1 Seq. 3 Seq. 5

Agg. LSTM 69.1% ± 0.6% 66.0% ± 1.3% 56.8% ± 1.6%

MoE 68.7% ± 3.0% 70.3% ± 3.8% 67.6% ± 1.5%

LLM-IC 51.1% ± 5.1% 81.0% ± 2.9% 82.1% ± 0.7%

LLM-SC1 61.3% ± 2.5% 75.6% ± 1.4% 74.8% ± 0.4%

LLM-SC2 60.6% ± 3.2% 76.4% ± 5.0% 74.8% ± 1.9%

Recall that Seq. n refers to the n-th sequence that the

participant performed in the study. The order of se-

quences is maintained during evaluation, with Seq. 1

tested ﬁrst and Seq. 5 tested last. Seq. n has con-

text on Seq. 1 to Seq. (n − 1). In Tables 2 and 3, we

observe that LLM-IC consistently outperforms other

approaches in the later sequences (Seq. 3 and Seq. 5)

for both datasets in the Top-1 accuracy metric. In the

salad dataset, LLM-IC achieves a Top-1 accuracy of

86.1% on Seq. 5, outperforming the aggregate LSTM

by 33.8% and the MoE method by 31.9% (Table 2).

LLM-IC also surpasses other LLM-based meth-

ods in later sequences. For instance, in Seq. 5 of

the salad dataset (Table 2), LLM-IC achieves 86.1%,

which is 6.2% and 3.7% higher than LLM-SC1 and

LLM-SC2, respectively. A similar trend is seen in the

sandwich dataset (Table 3), where LLM-IC reaches

82.1% in Seq. 5, outperforming both LLM-SC1 and

LLM-SC2 by 7.3%. Additionally, LLM-IC outper-

forms the aggregate LSTM by 25.3% and the MoE by

14.5% in Seq. 5 (Table 3), highlighting the strength

of this method over the baseline methods in later se-

quences.

In the early sequences (Seq. 1) however, the LLM

Shared Context (LLM-SC1 and LLM-SC2) methods

outperform other models in the salad dataset. LLM-

SC1 achieves 64.2% and LLM-SC2 achieves 64.1%,

both outperforming all other methods in Seq. 1 by

7-10% (Table 2). However, in the sandwich dataset

(Table 3), the aggregate LSTM achieves the highest

accuracy on Seq. 1 of 69.1%, with MoE following

closely at 68.7%. The baseline methods continue with

a similar level of accuracy in later sequences but are

surpassed by the LLM methods. Also, it is important

to point out the strength of the baseline methods in

Seq. 1 can be expected, since the baseline methods

have been trained on n − 1 participants’ training data

(80 sequences) rather than the LLM methods which

have only been given 0, 16, or 32 prior sequences for

LLM-IC, LLM-SC1, and LLM-SC2, respectively by

that point.

Top-3 Accuracy Results

Table 4: Salad Dataset Top-3 Accuracy of Different Meth-

ods Across Sequences.

Model Seq. 1 Seq. 3 Seq. 5

Agg. LSTM 75.1% ± 1.5% 75.4% ± 2.7% 74.1% ± 2.2%

MoE 82.1% ± 2.9% 76.8% ± 0.3% 76.6% ± 2.8%

LLM-IC 74.6% ± 0.8% 91.4% ± 0.9% 95.0% ± 0.9%

LLM-SC1 80.2% ± 1.5% 82.3% ± 2.2% 85.9% ± 3.4%

LLM-SC2 80.5% ± 0.4% 83.9% ± 1.2% 88.5% ± 1.6%

Table 5: Sandwich Dataset Top-3 Accuracy of Different

Methods Across Sequences.

Model Seq. 1 Seq. 3 Seq. 5

Agg. LSTM 90.6% ± 1.6% 92.0% ± 1.5% 85.1% ± 0.7%

MoE 91.2% ± 1.8% 90.1% ± 0.9% 86.8% ± 2.0%

LLM-IC 69.1% ± 4.4% 90.5% ± 3.3% 89.0% ± 1.8%

LLM-SC1 75.5% ± 4.6% 86.3% ± 3.4% 85.4% ± 0.6%

LLM-SC2 74.5% ± 1.9% 86.3% ± 2.1% 86.0% ± 0.6%

In the salad dataset (Table 4), the LLM-IC method

demonstrates strong performance in Seq. 3 and 5,

achieving Top-3 accuracies of 91.4% and 95.0%, re-

spectively. Notably, the gap between Top-1 (Table 2,

3) and Top-3 accuracy (Table 4, 5) for LLM-IC is less

than 10% in Seq. 3 and Seq. 5 for both datasets, in-

dicating that the correct action is often ranked as the

most likely prediction. LLM-SC1 and LLM-SC2 also

show smaller gaps between Top-1 and Top-3 accuracy

in the salad dataset, with differences of around 6% in

Seq. 3 and 5 (Table 2, 4), further supporting the preci-

sion of LLM-based predictions. In contrast, the base-

line methods exhibit larger discrepancies, with dif-

ferences between Top-1 and Top-3 accuracy ranging

from 20-25%.

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

526

Moreover, in the sandwich dataset, the MoE

method starts with the highest Top-3 accuracy in Seq.

1 at 91.2%, but its performance declines in later se-

quences, dropping to 86.8% by Seq. 5. In con-

trast, LLM-IC begins with a lower Top-3 accuracy of

69.1% in Seq. 1 but surpasses MoE by Seq. 3, reach-

ing 90.5%, and maintains this advantage in Seq. 5.

This declining accuracy between Seq. 3 and 5 oc-

curs for all methods in the sandwich dataset (Table

5). However the decrease in accuracy of the baseline

methods between Seq. 3 and Seq.5 is 6.9% and 3.3%

for aggregate LSTM and MoE, respectively, while the

difference is only 1.5% for LLM-IC, 0.9% for LLM-

SC1 and 0.3% for LLM-SC2.

5 DISCUSSION

Accuracy on Later Sequences. The LLM Inde-

pendent Context (LLM-IC) method demonstrated up

to 33.8% greater accuracy than other methods when

testing on Seq. 5. This notable improvement high-

lights the LLM’s ability to quickly adapt its prediction

model to individual participant preferences with min-

imal data. The model excels in environments where

participants exhibit consistent behavior patterns, even

in cases involving outlier participants. The LLM-IC

method performs well on later sequences compared

to Seq. 1 because it only uses the current test partici-

pant’s previous sequences for context.

The strength of the LLM methods can be observed

particularly in the jump in accuracy from Seq. 1 to

Seq. 3. For example, in the sandwich dataset, the

baseline methods outperformed the LLM methods on

Seq. 1 (Table 3). However, on Seq. 3, the LLM meth-

ods surpass baseline methods by 5.3%-15.0%. The

improved performance of the LLM methods over the

aggregate LSTM and MoE methods can be attributed

to the fact that the latter two methods cannot change

their prediction model at test time, whereas the LLM

method can use the test participant’s prior sequences

to inform the future predictions. While this isn’t the

same as updating the model parameters, it is able to

incorporate new information on the test participant’s

strategy, while the aggregate LSTM and MoE meth-

ods can only rely on the strategies that already existed

in the training set. This can be seen in the results from

both datasets, where both the aggregate LSTM and

MoE methods have a static or even downward trend

between Seq. 1 to Seq. 5, while all three of the LLM

methods have an overall upward trend from Seq. 1.

In Seq. 3 and Seq. 5 for both datasets (Table

2 and 3), the LLM-IC method consistently outper-

forms all other approaches in Top-1 accuracy, includ-

ing the other LLM-based methods. This superior per-

formance is likely due to LLM-IC’s ability to leverage

the increasing amount of information about the test

participant in later sequences, which allows it to make

more informed predictions. In contrast, although the

LLM Shared Context (LLM-SC) methods also gain

access to more of the test participant’s sequences over

time, the addition of sequences from other partici-

pants acts as noise. This noise makes the predictions

less focused on the test participant’s speciﬁc prefer-

ences, reducing accuracy.

In the sandwich dataset, there was a noticeable

decrease in Top-3 accuracy across all methods be-

tween Seq. 3 and Seq. 5 (Table 5), indicating

that participants began to deviate from the strate-

gies they followed in earlier trials. Despite this de-

crease, providing context from prior trials consis-

tently outperformed Seq. 1 results for all of the

LLM methods, but not for the baseline methods. No-

tably, LLM-IC demonstrated signiﬁcantly better per-

formance compared to other methods on Seq. 5 (Ta-

ble 3). While baseline methods showed a consistent

downward trend in accuracy from Seq. 1 to Seq. 5 in

both datasets (Table 4, 5) —likely due to their inabil-

ity to adapt to participants’ evolving strategies—the

LLM methods exhibited an upward trend, highlight-

ing their superior adaptability to changing patterns.

Performance on Unseen Data. When testing on

completely unseen data (Seq. 1), the LLM Shared

Context (LLM-SC1 and LLM-SC2) methods outper-

formed all other methods in Top-1 accuracy in the

salad dataset. This indicates that the salad making

task has more variations in strategy which the aggre-

gate LSTM cannot capture with the single model, but

existing salad-making priors in the LLM can distin-

guish those.

On the other hand, it seems that the aggregate

LSTM and MoE were able to capture the patterns for

the sandwich task quite effectively (Table 3, 5), likely

because making a peanut butter and jelly sandwich

has fewer possible strategies and a smaller action

space. The better performance of the baseline meth-

ods over the LLM methods in the sandwich dataset on

Seq. 1 (Table 3), indicates that these methods could

be suitable for simpler meal preparation tasks with a

small action space, limited number of possible strate-

gies, and when there’s little prior information avail-

able about the participants’ previous sequences. Thus,

our LLM methods are ideal candidates for general use

such as assisting with a variety of recipes, more com-

plicated meal preparation tasks involving more varia-

tion in preferences, or when there is prior information

available about the user’s previous preferences.

Leveraging Large Language Models for Preference-Based Sequence Prediction

527

By directly leveraging the sequences from other

participants in LLM-SC1 and LLM-SC2, the model

is able to more effectively learn the general patterns

as well strategies that exist for the task. The shared

context allows the LLM to generalize better to new

participants and unseen data. We can see this by com-

paring the LLM-SC1 and LLM-SC2 average accu-

racy on Seq. 1 with that of the LLM-IC. LLM-SC1

and LLM-SC2 outperforms LLM-IC signiﬁcantly on

Seq. 1 for both datasets. Thus, the context pro-

vided by other participants proves to be useful when

there are no prior sequences available from the cur-

rent test participant. The tradeoff in prediction ac-

curacy between the LLM-IC method and LLM-SC

methods over the trials shows that there may be some

room for overall improvement by leveraging a mix of

both strategies: providing sequences from other par-

ticipants when prior sequences of the test participant

are unavailable, then removing the other participant

sequences as context once the test participant’s prior

sequence can be referenced.

Overall, the LLM methods exhibit signiﬁcant im-

provements in prediction accuracy over the baseline

methods, as shown in the Seq. 3 and Seq. 5 accu-

racies. LLM-IC shows adaptation to personal pref-

erences with minimal data, while LLM-SC excels in

predicting actions in completely unseen data. These

ﬁndings underscore the effectiveness of LLMs in en-

hancing action prediction through prior information

about their preferences towards a certain strategy.

Figure 4: Comparison of average overall Top-1 to Top-3

accuracy for all ﬁve different methods for the Salad Dataset.

Top-N Accuracy Comparison. Figures 4 and 5

compares the Top-N results. For both datasets, the

LLM-based methods (LLM-IC, LLM-SC1, LLM-

SC2) show a smaller difference between the Top-1,

Top-2, and Top-3 accuracy compared to the baseline

methods (aggregate LSTM, MoE). In particular, the

accuracy gains from Top-1 to Top-2 and Top-2 to Top-

3 are smaller, indicating that the LLM methods are

predicting the correct action as the most likely predic-

tion more often than the baseline methods. The base-

Figure 5: Comparison of average overall Top-1 to Top-3

accuracy for all ﬁve different methods for the Sandwich

Dataset.

line methods in both dataset have a large gap between

Top-1 and Top-2 prediction accuracy and a smaller

gap between Top-2 and Top-3, indicating that these

are predicting the correct action mostly as the ﬁrst and

second most likely predictions.

This comparison of Top-N accuracies demon-

strates the tradeoff between computational efﬁciency

and different accuracy metrics. In the sandwich

dataset (Figure 5), the Top-3 accuracies across all

methods is similar. Thus, from a deployment stand-

point, even the baseline methods can be considered if

the Top-3 accuracy is the chosen metric for the spe-

ciﬁc sandwich-making task. However, the Top-3 re-

sults from the salad dataset are less consistent across

methods. The LLM methods outperform the baseline

methods for all three Top-Ns (Figure 4), indicating

that there must be different considerations made for

accuracy metrics and methods used in deployment de-

pending on the dataset.

Generalization to Other Cooking Domains. The

similarity in overall trends from both the salad dataset

and sandwich dataset indicate that the LLM methods

can generalize to multiple cooking domains. Making

a salad or a peanut butter and jelly sandwich are rel-

atively common meal preparation tasks and are likely

described in numerous online recipes, tutorials, and

articles. This makes it likely that the LLM has been

exposed to a diverse set of examples of ways to make

the same dish. As a result, the model has developed

robust priors that we can leverage effectively to gener-

alize sequence prediction for a variety of meal prepa-

ration tasks with different preferences.

6 FUTURE WORK

One interesting next direction would be an LLM

Adaptive Context approach, combining the LLM-IC

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

528

method and LLM-SC methods to further improve pre-

diction performance. We could use the LLM-SC

method on Seq. 1-2 then switch to LLM-IC on the

remaining sequences. By leveraging the ability of the

LLM-SC method to perform well on completely un-

seen data (Seq. 1) and the strength of the LLM-IC on

later sequences, we can develop an even more robust

prediction system that can quickly adapt to a user’s

preferences towards making a meal.

Another way to strengthen the performance of

LLM methods during Seq. 1 testing (when no prior

sequences are available) is to use the aggregate LSTM

to generate a reference sequence. This generated se-

quence can then be provided to the LLM methods as

context, serving a similar role to prior participant data

in cases where such data is not yet available. The

strength of the aggregate LSTM method on the sand-

wich dataset in particular makes this a promising next

step to improve the LLM method performance with-

out prior participant data.

Future work will consist of using these prediction

methods that we presented in this paper to contribute

to a robust assistive system to help older adults with

meal preparation. We have seen that our prediction

methods are improved when informed by prior con-

text about a user’s preferences, so this will be lever-

aged to provide meaningful assistance that aligns with

how they want to complete the task and how they

would like to be assisted. The assistive system will

provide feedback in the form of verbal cues to pro-

vide useful feedback while still maintaining the user’s

conﬁdence and autonomy.

7 CONCLUSION

In this work, we presented a novel approach to lever-

aging Large Language Models (LLMs) for action pre-

diction in cooking sequences, with a focus on tailor-

ing predictions based on user preferences. The pri-

mary contributions of this work are as follows: First,

we introduced the use of LLMs for personalized ac-

tion prediction, incorporating OpenAI’s GPT-4 model

to demonstrate how LLMs can effectively predict the

next action in a sequence by using the user’s previous

actions as context. This method outperformed base-

line models, such as aggregate LSTM and mixture-of-

experts models, due to the LLMs’ ability to adapt with

minimal prior context. We also compared various

contextual approaches, evaluating three distinct meth-

ods—Independent Context (LLM-IC), Shared Con-

text with one sequence per context participant (LLM-

SC1), and Shared Context with two sequences per

context participant (LLM-SC2). This comparison re-

vealed the strengths of each approach, with the Shared

Context methods particularly excelling in predictions

for unseen participants. Additionally, we demon-

strated that the ﬁve methods tested, including the

LLM approaches, could be generalized to other cook-

ing domains, such as the salad and sandwich datasets.

The LLM methods were especially effective due to

the vast amount of cooking-related information avail-

able in the training data and the robust reasoning ca-

pabilities of the models. Finally, our results showed

signiﬁcant improvement over baseline methods, with

LLM methods—especially the Independent Context

approach—achieving up to 33.8% higher accuracy

in later sequences, highlighting the effectiveness of

LLMs in quickly and effectively adapting to individ-

ual preferences, even with limited data.

This work shows promising potential for helping

to develop assistive technologies for those who strug-

gle with meal preparation. By integrating our ap-

proach, such technologies can offer meaningful, per-

sonalized assistance. This can empower older adults

to regain independence and conﬁdence in this essen-

tial daily task, ultimately enhancing both their well-

being and quality of life.

ACKNOWLEDGEMENTS

This material is based upon work supported by

the National Science Foundation under Grant No.

2112633. We would like to thank Divya Gupta for

her work on annotating the sandwich dataset.

REFERENCES

Arenas, M. G., Xiao, T., Singh, S., Jain, V., Ren, A. Z.,

Vuong, Q., Varley, J., Herzog, A., Leal, I., Kirmani,

S., Sadigh, D., Sindhwani, V., Rao, K., Liang, J.,

and Zeng, A. (2023). How to prompt your robot: A

promptbook for manipulation skills with code as poli-

cies. In 2nd Workshop on Language and Robot Learn-

ing: Language as Grounding.

Bouchard, B., Bouchard, K., and Bouzouane, A. (2020).

A smart cooking device for assisting cognitively im-

paired users. Journal of Reliable Intelligent Environ-

ments, 6(2):107–125.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.,

Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,

Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,

G., Henighan, T., Child, R., Ramesh, A., Ziegler,

D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler,

E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,

C., McCandlish, S., Radford, A., Sutskever, I., and

Amodei, D. (2020). Language models are few-shot

learners.

Leveraging Large Language Models for Preference-Based Sequence Prediction

529

Chan, S., Li, J., Yao, B., Mahmood, A., Huang, C.-M., Jimi-

son, H., Mynatt, E. D., and Wang, D. (2023). ”mango

mango, how to let the lettuce dry without a spinner?”:

Exploring user perceptions of using an llm-based con-

versational assistant toward cooking partner.

Chen, D. (2023). Learning task preferences from real-world

data. Master’s thesis, Carnegie Mellon University,

Pittsburgh, PA.

de Haan, P., Jayaraman, D., and Levine, S. (2019). Causal

confusion in imitation learning.

Deshpande, M. and Karypis, G. (2004). Item-based top-n

recommendation algorithms. ACM Trans. Inf. Syst.,

22(1):143–177.

Funk, M., Mayer, S., and Schmidt, A. (2015). Using in-situ

projection to support cognitively impaired workers at

the workplace. In Proceedings of the 17th Interna-

tional ACM SIGACCESS Conference on Computers &

Accessibility, ASSETS ’15, page 185–192, New York,

NY, USA. Association for Computing Machinery.

Hamada, R., Okabe, J., Ide, I., Satoh, S., Sakai, S., and

Tanaka, H. (2005). Cooking navi: assistant for daily

cooking in kitchen. In Proceedings of the 13th Annual

ACM International Conference on Multimedia, MUL-

TIMEDIA ’05, page 371–374, New York, NY, USA.

Association for Computing Machinery.

Hu, Z., Lucchetti, F., Schlesinger, C., Saxena, Y., Freeman,

A., Modak, S., Guha, A., and Biswas, J. (2024). De-

ploying and evaluating llms to program service mo-

bile robots. IEEE Robotics and Automation Letters,

9(3):2853–2860.

Johansson, M. M., Marcusson, J., and Wressle, E. (2015).

Cognitive impairment and its consequences in every-

day life: experiences of people with mild cognitive

impairment or mild dementia and their relatives. In-

ternational Psychogeriatrics, 27(6):949–958.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa,

Y. (2023). Large language models are zero-shot rea-

soners.

Kosch, T., Wennrich, K., Topp, D., Muntzinger, M., and

Schmidt, A. (2019). The digital cooking coach: using

visual and auditory in-situ instructions to assist cog-

nitively impaired during cooking. In Proceedings of

the 12th ACM International Conference on PErvasive

Technologies Related to Assistive Environments, PE-

TRA ’19, page 156–163, New York, NY, USA. Asso-

ciation for Computing Machinery.

Le, D. M., Guo, R., Xu, W., and Ritter, A. (2023). Improved

instruction ordering in recipe-grounded conversation.

Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter,

B., Florence, P., and Zeng, A. (2023). Code as poli-

cies: Language model programs for embodied control.

Lin, K., Agia, C., Migimatsu, T., Pavone, M., and Bohg,

J. (2023). Text2motion: from natural language in-

structions to feasible plans. Autonomous Robots,

47(8):1345–1365.

McGrattan, A. M., McEvoy, C. T., McGuinness, B.,

McKinley, M. C., and Woodside, J. V. (2018). Ef-

fect of dietary interventions in mild cognitive impair-

ment: a systematic review. British Journal of Nutri-

tion, 120(12):1388–1405.

Nemlekar, H., Modi, J., Gupta, S. K., and Nikolaidis, S.

(2021). Two-stage clustering of human preferences

for action prediction in assembly tasks. In 2021 IEEE

International Conference on Robotics and Automation

(ICRA), pages 3487–3494.

Nikolaidis, S., Ramakrishnan, R., Gu, K., and Shah, J.

(2015). Efﬁcient model learning from joint-action

demonstrations for human-robot collaborative tasks.

In Proceedings of the Tenth Annual ACM/IEEE In-

ternational Conference on Human-Robot Interaction,

HRI ’15, page 189–196, New York, NY, USA. Asso-

ciation for Computing Machinery.

OpenAI (2024). Gpt-4 technical report.

Padmanabha, A., Yuan, J., Gupta, J., Karachiwalla, Z.,

Majidi, C., Admoni, H., and Erickson, Z. (2024).

Voicepilot: Harnessing llms as speech interfaces

for physically assistive robots. arXiv preprint

arXiv:2404.04066.

Randers, I. and Mattiasson, A.-C. (2004). Autonomy

and integrity: upholding older adult patients’ dignity.

Journal of Advanced Nursing, 45(1):63–71.

Sanders, J. and Martin-Hammond, A. (2019). Exploring

autonomy in the design of an intelligent health assis-

tant for older adults. In Companion Proceedings of the

24th International Conference on Intelligent User In-

terfaces, IUI ’19 Companion, page 95–96, New York,

NY, USA. Association for Computing Machinery.

Sato, A., Watanabe, K., and Rekimoto, J. (2014). Mimi-

cook: a cooking assistant system with situated guid-

ance. TEI ’14, page 121–124, New York, NY, USA.

Association for Computing Machinery.

Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D.,

Tremblay, J., Fox, D., Thomason, J., and Garg, A.

(2022). Progprompt: Generating situated robot task

plans using large language models.

Tang, J. and Wang, K. (2018). Personalized top-n sequen-

tial recommendation via convolutional sequence em-

bedding. In Proceedings of the Eleventh ACM Inter-

national Conference on Web Search and Data Mining,

WSDM ’18, page 565–573, New York, NY, USA. As-

sociation for Computing Machinery.

Tuokko, H., Morris, C., and Ebert, P. (2005). Mild cognitive

impairment and everyday functioning in older adults.

Neurocase, 11(1):40–47. PMID: 15804923.

Wang, H., Gonzalez-Pumariega, G., Sharma, Y., and

Choudhury, S. (2023). Demo2code: From summariz-

ing demonstrations to synthesizing code via extended

chain-of-thought.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B.,

Xia, F., Chi, E., Le, Q., and Zhou, D. (2023). Chain-

of-thought prompting elicits reasoning in large lan-

guage models.

Wherton, J. P. and Monk, A. F. (2010). Problems people

with dementia have with kitchen tasks: The challenge

for pervasive computing. Interacting with Computers,

22(4):253–266. Supportive Interaction: Computer In-

terventions for Mental Health.

Wu, J., Antonova, R., Kan, A., Lepert, M., Zeng, A., Song,

S., Bohg, J., Rusinkiewicz, S., and Funkhouser, T.

(2023). Tidybot: Personalized robot assistance with

large language models. Autonomous Robots.

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

530

Yu, R., Lai, D., Leung, G., Tong, C., Yuen, S., and Woo,

J. (2023). A dyadic cooking-based intervention for

improving subjective health and well-being of older

adults with subjective cognitive decline and their care-

givers: A randomized controlled trial. The Journal of

nutrition, health and aging, 27(10):824–832.

Zhao, M., Simmons, R., and Admoni, H. (2022). Coordina-

tion with humans via strategy matching.

APPENDIX

Action Sets

Salad Action Set. [start, cut lettuce, place

lettuce into bowl, cut tomato, place

tomato into bowl, cut cucumber, place

cucumber into bowl, add salt, add pepper,

serve salad onto plate, end, cut cheese,

place cheese into bowl, add oil, add

vinegar, mix ingredients, mix dressing,

add dressing]

Sandwich Action Set. [start, place peanut

butter onto bread 1, spread peanut butter

onto bread 1, place peanut butter onto

bread 2, spread peanut butter onto bread

2, place jam onto bread 2, spread jam

onto peanut butter on bread 2, put bread

together, serve sandwich onto plate,

end, place jam onto bread 1, spread jam

onto peanut butter on bread 1, spread jam

onto bread 2, spread jam onto bread 1,

cut sandwich, spread jam next to peanut

butter on bread 1, cut off crusts, spread

peanut butter onto jam on bread 1]

LLM Prompts

These were the prompts used in the Independent Con-

text and Shared Context LLM methods. Temperature

was set to 0.

Independent Context

In this setup, the model is called three times to

generate the most likely, second most likely, and

third most likely predictions. The output of each

call is used to guide the next. The second prediction

uses the output of the most likely prediction, and the

third prediction uses the outputs of both the most and

second most likely predictions.

Asking for Predictions with Previously Seen Trials.

“I am going to ask you to predict the next action in a

salad-making sequence. For an unseen test partici-

pant, I will provide some of their previous trials for

reference, as well as the previously seen actions from

the current trial. Please output the most likely pre-

diction of the next action which would immediately

follow the last action in the provided list of previous

actions in the current sequence. Choose from the pro-

vided actions list with no other extraneous words or

phrases. Your response should be a maximum of 4

words.”

Asking for Predictions Without Previously Seen

Trials (Testing on Seq. 1). “I am going to ask you to

predict the next action in a salad-making sequence.

You have not seen any of this participant’s previous

trials. I will provide the previously seen actions from

the current trial. Please output the most likely pre-

diction of the next action which would immediately

follow the last action in the provided list of previous

actions in the current sequence. Choose from the

provided actions list with no other extraneous words

or phrases. Your response should be a maximum of 4

words.”

For the second and third most likely predictions,

follow the same process as described above:

• Second prediction: “If the most likely prediction

is pred list, please output the second most likely

prediction.”

• Third prediction: “If the most likely predic-

tion is pred list[0] and the second most likely is

pred list[1], please output the third most likely

prediction.”

Shared Context

In this case, the model is again called three times

(most likely, second most likely, and third most likely

predictions), using the output from the previous call

to inform subsequent predictions. Additionally, the

model is provided with context from other partici-

pants’ trials.

Asking for Predictions with Previously Seen

Trials. “I am going to ask you to predict the next

action in a salad-making sequence. For an unseen

test participant, I will provide some of their previous

trials, as well as the entire trials of other participants

for reference. I will also provide the previously seen

actions from the current trial. Please output the

most likely prediction of the next action which would

immediately follow the last action in the provided

list of previous actions in the current sequence.

Leveraging Large Language Models for Preference-Based Sequence Prediction

531

Choose from the provided actions list with no other

extraneous words or phrases. Your response should

be a maximum of 4 words.”

Asking for Predictions Without Previously Seen

Trials. “I am going to ask you to predict the next ac-

tion in a salad-making sequence. You have not seen

any of this participant’s previous trials. I will provide

the previously seen actions from the current trial as

well as the entire trials of other participants for ref-

erence. Please output the most likely prediction of the

next action which would immediately follow the last

action in the provided list of previous actions in the

current sequence. Choose from the provided actions

list with no other extraneous words or phrases. Your

response should be a maximum of 4 words.”

For the second and third most likely predictions,

follow the same process as described in the Indepen-

dent Context:

• Second prediction: “If the most likely prediction

is pred list, please output the second most likely

prediction.”

• Third prediction: “If the most likely predic-

tion is pred list[0] and the second most likely is

pred list[1], please output the third most likely

prediction.”

Project Webpage

The prompts used in this work are also avail-

able on the project webpage: sites.google.com/view/

preference-based-prediction

ICAART 2025 - 17th International Conference on Agents and Artiﬁcial Intelligence

532