Experimenting with Planning and Reasoning in Ad Hoc Teamwork

Environments with Large Language Models

Polyana Bezerra da Costa

1 a

, Pedro Thiago Cutrim dos Santos

1 b

Jos

e Matheus Carvalho Boaro

1 c

, Daniel de Sousa Moraes

1 d

, J

ulio Cesar Duarte

2 e

and Sergio Colcher

1 f

Pontif

ıcia Universidade Cat

olica do Rio de Janeiro, Rio de Janeiro, Brazil

Instituto Militar de Engenharia, Rio de Janeiro, Brazil

Keywords:

Ad Hoc Teamwork, Large Language Models, Prompt Engineering, Autonomous Agents.

Abstract:

Ad Hoc Teamwork environments are dynamic spaces where agents engage in activities, make decisions, and

collaborate with teammates without prior coordination or complete knowledge of tasks. To effectively operate

in such an environment, an ad hoc agent must be equipped with robust reasoning and planning mechanisms.

Since Large Language Models (LLMs) are known for their generalization abilities, this study showcases their

application in ad hoc scenarios. By modeling the robot’s actions using LangChain Tools, building a semantic

map, and capturing human communication interactions, we tested the LLM reasoning capabilities in three

simulated scenarios involving humans and a robot. In each case, after providing contextual information, we

build a meta-prompt with the question: ‘How can the Robot help?’. By conducting these tests, this study

highlighted the LLM’s ability to infer tasks and craft action plans even in the absence of explicit verbal com-

mands.

1 INTRODUCTION

As the use of robots in our daily activities becomes

more common, more versatile, and more general they

must be able to engage in a wide range of activities

that we expect them to assist us with. While there are

still places for task-speciﬁc assistants, current tech-

nologies are ready to leverage a new generation of

robots that can collaborate with humans without static

or pre-deﬁned rules or even pre-established roles.

In a so-called Ad Hoc Teamwork (AHT) type of

cooperation (Ravula, 2019), environments can be re-

garded as spaces where agents (that may include

robots and humans) are asked to make decisions, en-

gage in activities, and cooperate without prior knowl-

edge about the surrounding world, their teammates,

or even the task they are addressing (Ribeiro et al.,

2023). To adequately cooperate in such an envi-

https://orcid.org/0009-0005-8691-8107

https://orcid.org/0009-0002-8038-8713

https://orcid.org/0000-0002-4456-9050

https://orcid.org/0000-0003-2995-3115

https://orcid.org/0000-0001-6656-1247

https://orcid.org/0000-0002-3476-8718

ronment, participants, called ad hoc agents, must be

highly adaptable while identifying the task at hand

and collaborating with other agents to achieve a com-

mon goal without pre-deﬁned protocols (Mirsky et al.,

2022).

The versatility and adaptability necessary in AHT

environments naturally require new mechanisms to

enhance a robot’s reasoning and planning abilities.

Large Language Models (LLMs) have been used in

many scenarios where generalizations and adapta-

tions are necessary. Given the wide range of data

these models are usually trained on, LLMs embed

such a vast knowledge that enables them to achieve

high generalization capabilities and perform tasks in

diverse contexts, including acting as reasoning and

planning agents (Zhao et al., 2023). But, while LLMs

used for planning and reasoning have, up to now,

mainly relied on prompts that already specify the task

that should be executed (Zheng et al., 2023; Xu et al.,

2023; Huang et al., 2023), this can not be the case in

AHT scenarios.

In this work, we investigate the embedding of

LLMs into ad hoc agents to enhance their reasoning

capabilities and cooperation with human teammates

on the ﬂy. We designed a mobile robot and conducted

438

Costa, P., Santos, P., Boaro, J., Moraes, D., Duarte, J. and Colcher, S.

Experimenting with Planning and Reasoning in Ad Hoc Teamwork Environments with Large Language Models.

DOI: 10.5220/0012472600003636

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2024) - Volume 1, pages 438-445

ISBN: 978-989-758-680-4; ISSN: 2184-433X

tests of the LLM’s reasoning and planning ability in

three simulated scenarios involving human interac-

tions. In each scenario, the absence of an explicit task

and direct human commands required the robot to ac-

quire a contextualization model of its surroundings,

including objects, their places, people, and what they

are currently saying. In our proposal, we submit the

same prompt template (which we call a meta-prompt)

to the LLM in every scenario, asking the robot to de-

termine which action it would take given each spe-

ciﬁc situation and context. Although the meta-prompt

structure remains the same, the LLM’s responses pro-

duced speciﬁc action plans corresponding to a set of

possible actions tailored for each proper scenario.

More speciﬁcally, we use LangChain Agents to

model the robot’s actions (Briggs and Ingham, 2023).

With these agents, the LLM has access to Tools,

which (in our case) are pieces of code that encapsulate

the robot’s basic capabilities, such as moving, grab-

bing/releasing something, and speaking. To build the

context of the environment in a certain timeframe, we

construct a semantic map with the objects and agents

(human and robotic) present in that environment and

their positions. Additionally, we transcribe what the

human agents are saying. We pass all this info in a

meta-prompt, always asking the same question: “How

can the robot help?”. Figure 1 shows an overview of

our approach.

With this work, we aim to evaluate the suitability

of LLMs as planning tools for ad hoc environments

and how such environments could beneﬁt from their

remarkable ability to reason in different contexts and

the large amount of knowledge embedded in them.

Since our purpose was not to model or evaluate the

techniques of acquiring context data, all the environ-

mental information was manually described for each

scenario (instead of being captured by microphones,

cameras, or other sensors/devices).

The remainder of the paper is organized as fol-

lows. Section 2 presents an overview of recent works

in Ad Hoc Teamwork and LLMs for Robotics. Sec-

tion 3 details our main method of investigation, while

Section 4 contains some of our results. Finally, Sec-

tion 5 concludes the paper with our ﬁndings and ideas

for future works.

2 RELATED WORK

This section is grouped into two categories. In the

ﬁrst subsection, we provide an overview of the lat-

est research in Ad Hoc Teamwork. Conversely, in

the second one, we detail the application of LLMs in

robotics.

2.1 Ad Hoc Teamwork

To manage the uncertainty and high ability to adapt

that ad hoc environments require from their agents,

most work in AHT deﬁnes some information before-

hand. Some works model a set of possible tasks,

while others model the type of possible teammates,

others store previous interactions to compare with the

current one, etc.

The PLASTIC Model (Barrett et al., 2017), one of

the classical approaches for ad hoc teamwork, models

both the environment’s dynamics and the teammate’s

behaviors. PLASTIC has a library with behaviors of

previous teammates and policies of actions for the ad

hoc agent based on these behaviors. This model as-

sumes that there is some similarity between the way

the current teammate acts in comparison with the pre-

vious one. However, this might not be true in every

context. So, at the runtime, the ad hoc agent com-

pares the actions of a teammate to the ones existing in

its library to select an appropriate policy.

Another approach is to model all possible tasks

in a context. Although the ad hoc agent might not

know the task performed at the moment, in these

works, it normally possesses a library of all possible

tasks (Ribeiro et al., 2021). In (Melo and Sardinha,

2016), the authors deﬁne the set of possible tasks as

matrices of actions, and the AHT problem is treated

as a partially observable Markov decision problem.

In this context, the agent must observe the behavior

of the teammates to identify the task. The disadvan-

tages of this approach are having to characterize each

task in advance and expecting the teammates to be-

have accordingly and perform only tasks speciﬁed in

the library.

In our work, differently from the others, we want

to eliminate the need to explicitly model all possible

tasks and teammate’s behavior. Instead, we let the

LLM autonomously analyze the context to discover

the task at hand and plan how the ad hoc agent can

collaborate based on its own capabilities, allowing the

agent to dynamically respond to evolving scenarios

without the knowledge of predeﬁned tasks or behavior

speciﬁcations.

2.2 LLMs in Robotics

With the recent breakthroughs in the LLMs ﬁeld, they

have been applied across diverse contexts, includ-

ing robotics. Even before the introduction of LLMs,

works with initial language models such as BERT

already catered to robotic-related problems (Devlin

et al., 2019). In this scenario, FILM (Min et al., 2021)

presented a modular method based on BERT that pro-

Experimenting with Planning and Reasoning in Ad Hoc Teamwork Environments with Large Language Models

439

semantic_map = {

"table": position,

"couch": position,

"Person_1": position,

"Robot": position,

. . .

"water bottle": position

}

World

Representation

semantic_map = {

"table": position,

"couch": position,

"Person_1": position,

"Robot": position,

. . .

"water bottle": position

}

World

Representation

Robot Capabilities

+ speak

+ move_to

+ grab

+ release

+ stop

robot_info = {

"status": "idle",

"position": position,

"arm_state": "empty"

}

Robot's State

robot_info = {

"status": "idle",

"position": position,

"arm_state": "empty"

}

Robot's State

"Person_1 said: 'I am thirsty'.

How can Robot help?"

Prompt

"Person_1 said: 'I am thirsty'.

How can Robot help?"

Prompt

Action

Planning

Figure 1: Overview of our method.

cesses language instructions and organizes visual in-

put into a semantic map to fulﬁll a given task, out-

putting a series of navigation commands and actions.

Similarly, in LangSuitE (Zheng et al., 2023), the

authors built a semantic mapping module to capture

what is happening in the environment. LangSuitE

speciﬁes the task to be executed, characterizes the en-

vironment, and deﬁnes the robot’s actuation all in the

same prompt. They also included a task example and

how it should be performed in the prompt for few-

shot learning. The authors used OpenAI’s GPT 3.5

for controlling and planning the robot’s actions and

for communication between agents.

Recently, more robust models for solving robotic

problems were introduced, such as RT-2 (Brohan

et al., 2023), a visual-language-action model. RT-2

is an end-to-end model based on PALM-2 that maps

robot observations to actions using textual and vi-

sual data. This uniﬁed model combined the problem-

solving and reasoning abilities of LLMs with the

visual grounding/understanding abilities of visual-

language models to help robots perform several tasks

in real-world environments. RT-2 is able to receive

a command in text, extract information from image

data, reason over all this information, and output gen-

eralized instructions for robotic control, such as vec-

tors for positional and rotational change.

Along with planning the robot’s action, some

works use LLM to generate actual code to execute

those actions. RoboTool (Xu et al., 2023) is a system

geared towards robotic arms and mobile robots that

uses LLMs to map commands in natural language to

executable code. Each single prompt encapsulates a

description of the robot in itself (mobile or robotic

arm), objects in an environment and their position,

https://platform.openai.com/docs/models/gpt-3-5

as well as the robot’s task and constraints that the

robot must follow to fulﬁll it. Based on this prompt,

RoboTool outputs a Python code that is responsible

for controlling the robot’s actions.

In addition, Instruct2Act (Huang et al., 2023) also

uses an LLM to generate Python code for robotic ma-

nipulation, distinguishing itself by receiving multi-

modal instructions. The system is composed of a per-

ception module that uses foundation models to locate

objects in a scene, classify and segment them, and an

action module that interprets a given command along

with the perception information; Then, it generates

actions that are transformed into executable code.

Furthermore, (Vemprala et al., 2023) applied Ope-

nAi’s ChatGPT

to a pipeline for solving robotic

tasks, also outputting Python code. At ﬁrst, the au-

thors deﬁne a basic Python API with the robot’s ac-

tions, the function’s names, and their parameters, but

no actual code within these functions. Then, in the

prompt, the authors provide a clear description of the

task and the functions deﬁned in the API (and their

descriptions). ChatGPT is then asked to output a

Python code that can fulﬁll the speciﬁed task. The

authors tested this pipeline in ten different scenarios

that involved logical, geometrical, and mathematical

reasoning, navigation, and manipulation, among oth-

ers. With extensive tests, the authors showcased the

potential of ChatGPT for robotic tasks and how it can

be able to reason and plan in the most different con-

texts.

Finally, in a work that is the most similar to

ours, (Xiong et al., 2023) integrated texts, sensor data,

prompts, LangChain’s Agents, and LLMs to model

the context that the robot is inserted and reason over

all this information to plan the robots’ actions. The

https://openai.com/blog/chatgpt/

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

440

authors deﬁne categories of requests and examples of

question-answering for each category. At ﬁrst, the

LMM has to classify the user’s request based on pre-

viously deﬁned categories using the right prompt tem-

plate. With the ﬁnal prompt, the LLM outputs a se-

quence of actions in natural language that fulﬁll the

user’s request.

Similarly to the works cited above, we also use

LLMs (GPT-4) to reason over various information

and generate actionable plans for robots. However,

different from these works, in ad hoc scenarios, the

task to be addressed is not always deﬁned, and there

may be no direct command or request from the user.

In such cases, the LLM has to infer the task at hand

and formulate a collaborative action plan for how the

robot can fulﬁll the team’s goal. With our work, we

want to build a system that works for ad hoc scenarios,

addressing tasks without predeﬁned speciﬁcations. In

addition, the way that we model our prompts saves

several tokens. Section 3 explains our method in de-

tail and shows three examples that illustrate the po-

tential of our method for AHT scenarios.

3 PLANNING WITH LLMs

This section presents a detailed description of each

part of our methodology. We start by deﬁning what

the robot (ad hoc agent) can do by coding its actions

in Python. Subsequently, we characterize the environ-

ment in which the robot is inserted, identifying the ob-

jects, teammates, and their positions. We also capture

what the teammates might be saying. We then use all

this contextual information to build a generic prompt

and submit it to GPT-4 to get an action plan.

Since we are not focused on the techniques for

acquiring contextual data, the visual and audio data

mentioned in the following sections were not captured

by cameras and microphones; these scenarios have

been manually described.

3.1 Modeling the Robot’s Capabilities

with LangChain Tools

We start by deﬁning the robot’s capabilities – the ac-

tions it can perform. To do so, we used LangChain’s

Agents, a functionality that allows the creation of cus-

tom tools for an LLM Agent in Python.

Tools are functions or methods that receive their

input from an LLM, and their output is sent back to

the LLM (Briggs and Ingham, 2023). Anything can

be done inside these functions, but their return should

always be a prompt sent back to the LLM, so it should

contain valuable information to the LLM.

In our context, each tool is a function that rep-

resents a robot’s capability, along with an extra tool

that handles a map of the environment. We deﬁned

a mobile robot that can move in spaces, grab or re-

lease objects, and speak utterances. So, we coded six

tools and started by deﬁning the inputs of our meth-

ods and the robot’s capabilities, as shown in Listing 1

and Listing 2. We also deﬁned a dictionary that con-

tains the robot’s position, its state (idle or not), and

the state of the robot’s arm (empty or not).

from pydantic.v1 import BaseModel, Field

class ObjectPositionInput(BaseModel):

object_name: str = Field(description="Name

of the object.")

class Text2SpeakInput(BaseModel):

text_input: str = Field(description="Text

that robot need to speak.")

class MapPositionInput(BaseModel):

n: int = Field(description="Desired

position described by an integer.")

Listing 1: Deﬁnition of the methods’ inputs.

def speak(text_input: str):

’’’

Code logic to make the Robot speak

’’’

return "Robot spoke: ’{}’.".format(

text_input)

def get_object_position(object_name: str,

map_data: dict = getMapData()) -> str:

k = match_key(object_name.lower(),

map_data)

return "Object ’{}’ is at position ’{}’."

.format(k, str(map_data[k]))

def move_to(n: int) -> str:

’’’

Code logic to make the Robot Move

’’’

robot_info[’status’] = ’moving’

robot_info[’position’] = n

robot_info[’status’] = ’idle’

if robot_info[’arm’] != ’empty’:

map_data[robot_info[’arm’]] = n

return "Robot at position {}".format(

robot_info[’position’])

def grab(object_name: str) -> str:

’’’

Code logic to make the robot

grab the object.

’’’

robot_info[’arm’] = k

Experimenting with Planning and Reasoning in Ad Hoc Teamwork Environments with Large Language Models

441

return "Robot grabbed object ’{}’.".

format(k)

def release(object_name: str) -> str:

’’’

Code logic to make the robot

release the object.

’’’

aux = robot_info[’arm’]

robot_info[’arm’] = ’empty’

return "Robot released object ’{}’ at

location {}".format(aux, robot_info[’

position’])

def stop(action: str) -> str:

’’’

Code logic to stop the robot.

’’’

robot_info[’status’] = ’idle’

return ’There is nothing the Robot can do

in this situation, status switched to

idle.’

Listing 2: Deﬁnition of the robot s capabilities.

3.2 Context Modeling

To provide contextual awareness to our robot, we rep-

resented the world that the robot is inserted in a se-

mantic map, similarly to (Min et al., 2021; Zheng

et al., 2023). The semantic map is a dictionary with

the name and position (id) of every object and agent

present in the scene. Since we were simulating AHT

environments, we manually deﬁned the objects and

agents’ positions. An example of the semantic map

used in our experiments is shown in Listing 3.

map_data = {

"Person_1": 1,

"Person_2": 2,

"Robot": 3,

"water bottle": 4,

"couch": 5,

"chair": 6,

"glass": 7,

"table": 8,

"computer": 8,

"cellphone": 9

}

Listing 3: Deﬁnition of a semantic map.

For context modeling, we also include what the

agents say.

3.3 Prompt Engineering

After modeling the robot’s capabilities and environ-

ment, we need to craft the prompt that will be sent to

the LLM Agent. The LLM we chose for our agent

was GPT-4 (OpenAI, 2023) due to its impressive per-

formance in several benchmarks.

We deﬁned an agent for zero-shot tasks, meaning

it does not have previous examples to guide its re-

sponse. With zero-shot agents, forcing the agent to

follow instructions is easier.

Based on how we modeled the environment, the

semantic map is not included in the prompt, which

saves several tokens. With this, every time the LLM

Agent wishes to know the position of a particular ob-

ject or teammate, it needs to call the responsible tool

(getObjectPosition) for the semantic map.

The LLM Agent chooses which tools to use based

solely on their descriptions. The ﬁnal prompt contains

the Python objects for each tool, with their names, de-

scriptions, and corresponding function names. The

internal code of each tool is not included in the

prompt, which also saves tokens. This code is shown

in Listing 4.

from langchain import OpenAI

from langchain.agents import

initialize_agent, Tool

from langchain.agents import AgentType

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4",

openai_api_key=OPENAI_KEY)

tools = [

Tool(

name="Speak", func=speak, args_schema=

Text2SpeakInput,

description="Make the robot speak the

input text."

Tool(

name="Get Object Position", func=

get_object_position,

args_schema=ObjectPositionInput,

description="Get a position of a

specific object by giving its name.

Tool(

name="Move", func=move_to, args_schema=

MapPositionInput,

description="Move to a specific

position on the map. The input is

an integer that defines the

position (id)."

Tool(

name="Grab", func=grab, args_schema=

ObjectPositionInput,

description="Grab an object that is

placed in a specific position on

the map. The input is the object

name to be grabbed."

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

442

Tool(

name="Release", func=release,

args_schema=ObjectPositionInput,

description="Release something that is

grabbed at the current robot

location. The input is the object

name to be released."

Tool(

name="Stop", func=stop,

description=’When there is nothing an

agent can do in a situation or when

the agent has finished a task, use

this function.’,

)

]

agent = initialize_agent(

tools, llm, agent=AgentType.

ZERO_SHOT_REACT_DESCRIPTION, verbose=

True)

Listing 4: Deﬁnition of a Langchain Agent.

After creating the agent, we can now run it by

passing a prompt. The ﬁnal prompt contains only

what the teammates might have said and a generic

question asking what the robot could do in that con-

text, as illustrated by Listing 5.

agent.run("Person_1 said: ’I am thirsty.’.

How can Robot help?")

Listing 5: Submitting the ﬁnal prompt to the agent.

After reasoning and discovering what is the task

to be addressed, the agent may need to access the se-

mantic map to get the position of a speciﬁc object.

However, the LLM’s output may not always precisely

match a term on the map. In one of our examples, the

LLM agent decided that it needed to move to a ’water

dispenser’, which does not exist on the semantic map,

but ’water bottle’ does. To overcome this, we deﬁned

a term similarity function that always compares the

LLM’s output to each term on the map and returns

the most similar one.

4 EXPERIMENTS

This section describes our design and experiments in

three scenarios that follow AHT environments’ re-

quirements. The ad hoc agent (our mobile robot) has

to infer the task on the ﬂy and collaborate with hu-

man teammates to fulﬁll a common goal. There is no

previous agreement on how they will cooperate, and

the agent gathers information about its surroundings

to help in its reasoning process by using a semantic

map and speech recognition data. All three examples

share the same semantic map depicted on Listing 3.

Figure 2 shows the prompt and action plan exe-

cuted by the robot for the ﬁrst example. In this ex-

ample, the person asks if anyone has seen his phone.

The LLM correctly inferred the task. First, the robot

tries to ﬁnd the cell phone in the environment. After

ﬁnding it, the robot moves to the object’s position and

grabs it. Finally, the action plan consists of leaving

the cell phone in one place and communicating where

it is. It is worth mentioning that, while all actions

occur, the robot engages in continuous communica-

tion with the person, which can help the understand-

ing and increase the person’s conﬁdence in the robot’s

actions.

Prompt:

"Person_1 said: 'Has anyone seen my phone??'.

How can Robot help?"

LLM Agent Response:

> Entering new AgentExecutor chain...

I need to find the phone for Person_1.

Action: Get Object Position

Action Input: Phone

Observation: Object 'cellphone' is at position '9'.

Thought: I know where the phone is. I need to move to that

position.

Action: Move

Action Input: 9

Observation: Robot at position 9

Thought: I am now at the position of the phone. I need to

grab it.

Action: Grab

Action Input: Phone

Observation: Robot grabbed object 'cellphone'.

Thought: I have the phone. I need to return it to Person_1.

Action: Speak

Action Input: I have found your phone, Person_1.

Observation: Robot spoke: 'I have found your phone,

Person_1.'.

Thought: I have informed Person_1 that I found their phone.

Now I need to release the phone.

Action: Release

Action Input: Phone

Observation: Robot released object 'cellphone' at location 9

Thought: I have released the phone. Now I need to inform

Person_1 where I left it.

Action: Speak

Action Input: I have left your phone at position 9.

Observation: Robot spoke: 'I have left your phone at position 9.'.

Thought: I have completed the task.

Action: Stop

Action Input:

Observation: There is nothing the Robot can do in this situation,

status switched to idle.

Thought: I now know the final answer

Final Answer: The robot helped by finding the phone and

informing Person_1 about its location.

> Finished chain

Figure 2: Prompt for task 1 and the resulting reasoning pro-

cess and action planning.

The second example is shown in Figure 3. Simi-

larly to example 1, a person asks about the location of

an object in the environment, in this case, the couch.

With the help of the LLM, the robot correctly infers

that the current task is to ﬁnd the couch’s position.

Differently from example 1, in which the robot got

the phone’s position but also grabbed it, the robot

cannot grasp it due to the couch’s size and weight.

Even though this restriction is not explicitly stated in

Experimenting with Planning and Reasoning in Ad Hoc Teamwork Environments with Large Language Models

443

the deﬁnition of the robot’s capabilities, the LLM was

able to infer that the ’grab’ function was unnecessary.

In this instance, the LLM’s strategy involved lo-

cating the position of the couch and communicating

this information to the person.

Prompt:

"Person_2 said: "Where is the couch?". How can Robot

help?"

LLM Agent Response:

> Entering new AgentExecutor chain...

I need to find the position of the couch.

Action: Get Object Position

Action Input: Couch

Observation: Object 'couch' is at position '5'.

Thought: I now know the location of the couch.

Action: Speak

Action Input: The couch is at position 5.

Observation: Robot spoke: 'The couch is at position 5.'.

Thought: I have successfully informed Person_2 about the

location of the couch.

Action: Stop

Action Input:

Observation: There is nothing the Robot can do in this

situation, status switched to idle.

Thought: I now know the final answer

Final Answer: The robot helped by informing Person_2 that

the couch is at position 5.

> Finished chain.

Figure 3: Prompt for task 2 and the resulting reasoning pro-

cess and action planning.

The last example is illustrated in Figure 4, where

Person 1 mentions that he is thirsty, and after a de-

tailed reasoning process, the ad hoc agent (robot) con-

cludes that it can cooperate by getting a glass of wa-

ter to Person 1. With the help of an LLM, the robot

inferred the task and generated an action plan based

on its capabilities. Even when the robot encountered

an action beyond its predeﬁned capabilities (ﬁll the

glass), the agent could adapt by invoking an alterna-

tive action (speak). Although this shows the remark-

able adaptability of LLMs, by dynamically adjust-

ing and employing available functionalities to address

unanticipated tasks, it is also a point for improvement

in our method since just saying ”Filling the glass ...”

does not equate to physically executing the task.

5 CONCLUSIONS

This paper presented a method for reasoning and

planning in ad hoc teamwork (AHT) scenarios with

Large Language Models (LLMs). Due to the ﬂu-

idity of such environments, we wanted to equip ad

hoc agents with mechanisms to allow them to adapt

easily to different contexts and tasks. Since LLMs

have been showing remarkable generalization abili-

ties, we investigated their potential in AHT scenarios

with robots and human teammates.

Prompt:

"Person_1 said: 'I am thirsty.'. How can Robot

help?"

> Entering new AgentExecutor chain...

The robot can help by getting a glass of water for Person_1. First,

I need to find out where the glass is.

Action: Get Object Position

Action Input: Glass

Observation: Object 'glass' is at position '7'.

Thought: Now that I know where the glass is, I need to move to

that position to grab it.

Action: Move

Action Input: 7

Observation: Robot at position 7

Thought: Now that I'm at the position of the glass, I can grab it.

Action: Grab

Action Input: Glass

Observation: Robot grabbed object 'glass'.

Thought: Now that I have the glass, I need to find out where the

water dispenser is to fill the glass.

Action: Get Object Position

Action Input: Water Dispenser

Observation: Object 'water bottle' is at position '4'.

Thought: Now that I know where the water dispenser is, I need

to move to that position to fill the glass.

Action: Move

Action Input: 4

Observation: Robot at position 4

Thought: Now that I'm at the position of the water dispenser, I

can fill the glass.

Action: Speak

Action Input: Filling the glass with water.

Observation: Robot spoke: 'Filling the glass with water.'.

Thought: Now that I have filled the glass with water, I need to

find out where Person_1 is to give them the glass.

Action: Get Object Position

Action Input: Person_1

Observation: Object 'person_1' is at position '1'.

Thought: Now that I know where Person_1 is, I need to move to

that position to give them the glass.

Action: Move

Action Input: 1

Observation: Robot at position 1

Thought: Now that I'm at the position of Person_1, I can give

them the glass of water.

Action: Release

Action Input: Glass

Observation: Robot released object 'glass' at location 1

Thought: I have given the glass of water to Person_1. My task is

complete.

Action: Stop

Action Input:

Observation: There is nothing the Robot can do in this situation,

status switched to idle.

Thought: I now know the final answer

Final Answer: The robot helped by getting a glass of water for

Person_1.

> Finished chain.

LLM Agent Response:

Figure 4: Prompt for task 3 and the resulting reasoning pro-

cess and action planning.

In our experiments, we used LangChain Tools to

deﬁne the robot’s capabilities and modeled the envi-

ronment in which the robot is inserted for contextual

awareness. We showcased three scenarios in which

this information is combined in a generic prompt ask-

ing how the agent can help in that context, submitting

it to GPT-4.

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

444

For every example, the LLM could reason and in-

fer the task to be performed, deﬁning a plan of action

to achieve it. This showed the potential of LLMs as

reasoning and planning mechanisms in ad hoc envi-

ronments.

For future works, we intend to adjust the prompt to

prevent the LLM from ’cheating,’ which is adjusting

the functionalities to address unanticipated tasks that

are not entirely equivalent. We also want to include

more complex scenarios and more examples, espe-

cially the ones in real-world settings, receiving data

from different sensors (cameras and microphones)

and connecting the tools to the robot operating sys-

tem so that it can actually perform the actions.

ACKNOWLEDGEMENTS

This work was partially funded by the National Coun-

cil for Scientiﬁc and Technological Development

(CNPQ), under grant number 141809/2020-5. In ad-

dition, this material is based upon work supported

by the Air Force Ofﬁce of Scientiﬁc Research under

award number FA9550-22-1-0475.

REFERENCES

Barrett, S., Rosenfeld, A., Kraus, S., and Stone, P. (2017).

Making friends on the ﬂy: Cooperating with new

teammates. Artiﬁcial Intelligence, 242:132–171.

Briggs, J. and Ingham, F. (2023). Langchain: Introduction

and getting started. Pinecone.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen,

X., Choromanski, K., Ding, T., Driess, D., Dubey,

A., Finn, C., et al. (2023). Rt-2: Vision-language-

action models transfer web knowledge to robotic con-

trol. arXiv preprint arXiv:2307.15818.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.

(2019). Bert: Pre-training of deep bidirectional trans-

formers for language understanding.

Huang, S., Jiang, Z., Dong, H., Qiao, Y., Gao, P., and Li, H.

(2023). Instruct2act: Mapping multi-modality instruc-

tions to robotic actions with large language model.

arXiv preprint arXiv:2305.11176.

Melo, F. S. and Sardinha, A. (2016). Ad hoc teamwork

by learning teammates’ task. Autonomous Agents and

Multi-Agent Systems, 30:175–219.

Min, S. Y., Chaplot, D. S., Ravikumar, P., Bisk, Y., and

Salakhutdinov, R. (2021). Film: Following instruc-

tions in language with modular methods. arXiv

preprint arXiv:2110.07342.

Mirsky, R., Carlucho, I., Rahman, A., Fosong, E., Macke,

W., Sridharan, M., Stone, P., and Albrecht, S. V.

(2022). A survey of ad hoc teamwork research. In

European Conference on Multi-Agent Systems (EU-

MAS).

OpenAI (2023). Gpt-4 technical report.

Ravula, M. C. R. (2019). Ad-hoc teamwork with behavior-

switching agents. PhD thesis, University of Texas.

Ribeiro, J. G., Faria, M., Sardinha, A., and Melo, F. S.

(2021). Helping people on the ﬂy: Ad hoc team-

work for human-robot teams. In Progress in Artiﬁcial

Intelligence: 20th EPIA Conference on Artiﬁcial In-

telligence, EPIA 2021, Virtual Event, September 7–9,

2021, Proceedings 20, pages 635–647. Springer.

Ribeiro, J. G., Rodrigues, G., Sardinha, A., and Melo,

F. S. (2023). Teamster: Model-based reinforcement

learning for ad hoc teamwork. Artiﬁcial Intelligence,

324:104013.

Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A.

(2023). Chatgpt for robotics: Design principles and

model abilities. Microsoft Auton. Syst. Robot. Res,

2:20.

Xiong, H., Bian, J., Yang, S., Zhang, X., Kong, L., and

Zhang, D. (2023). Natural language based context

modeling and reasoning with llms: A tutorial. arXiv

preprint arXiv:2309.15074.

Xu, M., Huang, P., Yu, W., Liu, S., Zhang, X., Niu, Y.,

Zhang, T., Xia, F., Tan, J., and Zhao, D. (2023). Cre-

ative robot tool use with large language models.

Zhao, P., Jin, Z., and Cheng, N. (2023). An in-depth survey

of large language model-based artiﬁcial intelligence

agents. arXiv preprint arXiv:2309.14365.

Zheng, Z., Wang, M., and Zixia Jia, B. T. (2023). Langsuit

e: Controlling, planning, and interacting with large

language models in embodied text environments.

Experimenting with Planning and Reasoning in Ad Hoc Teamwork Environments with Large Language Models

445