Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018b). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683.
Bai, S., Wang, J., Chen, F., and Englot, B. (2016). Information-theoretic exploration with Bayesian optimization. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1816–1822. IEEE.
Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A., and Wijmans, E. (2020). ObjectNav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171.
Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., and Leonard, J. J. (2016). Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics, 32(6):1309–1332.
Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., and Zhang, Y. (2017). Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158.
Chaplot, D. S., Gandhi, D., Gupta, S., Gupta, A., and Salakhutdinov, R. (2020a). Learning to explore using active neural SLAM. arXiv preprint arXiv:2004.05155.
Chaplot, D. S., Gandhi, D. P., Gupta, A., and Salakhut-
dinov, R. R. (2020b). Object goal navigation using
goal-oriented semantic exploration. Advances in Neu-
ral Information Processing Systems, 33:4247–4258.
Chen, T., Gupta, S., and Gupta, A. (2019). Learning
exploration policies for navigation. arXiv preprint
arXiv:1903.01959.
Chentanez, N., Barto, A., and Singh, S. (2004). Intrinsically
motivated reinforcement learning. Advances in neural
information processing systems, 17.
Gandhi, D., Pinto, L., and Gupta, A. (2017). Learning to fly by crashing. In IROS.
Fu, J., Co-Reyes, J., and Levine, S. (2017). EX2: Exploration with exemplar models for deep reinforcement learning. Advances in neural information processing systems, 30.
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., and Ma-
lik, J. (2017). Cognitive mapping and planning for
visual navigation. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2616–2625.
Hartley, R. and Zisserman, A. (2003). Multiple view geom-
etry in computer vision. Cambridge university press.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Henriques, J. F. and Vedaldi, A. (2018). MapNet: An allocentric spatial memory for mapping environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8476–8484.
Kollar, T. and Roy, N. (2008). Trajectory optimization using
reinforcement learning for map exploration. The In-
ternational Journal of Robotics Research, 27(2):175–
196.
Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M., Ehsani, K., Gordon, D., Zhu, Y., et al. (2017). AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.
LaValle, S. M. (2006). Planning algorithms. Cambridge
university press.
Lim, V., Rooksby, M., and Cross, E. S. (2021). Social robots
on a global stage: establishing a role for culture dur-
ing human–robot interaction. International Journal of
Social Robotics, 13(6):1307–1333.
Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft COCO: common objects in context. In
Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars,
T., editors, Computer Vision - ECCV 2014 - 13th
European Conference, Zurich, Switzerland, Septem-
ber 6-12, 2014, Proceedings, Part V, volume 8693 of
Lecture Notes in Computer Science, pages 740–755.
Springer.
Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y.
(2012). Exploration in model-based reinforcement
learning by empirically estimating learning progress.
Advances in neural information processing systems,
25.
Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard,
A. J., Banino, A., Denil, M., Goroshin, R., Sifre,
L., Kavukcuoglu, K., et al. (2016). Learning to
navigate in complex environments. arXiv preprint
arXiv:1611.03673.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017).
Curiosity-driven exploration by self-supervised pre-
diction. In International conference on machine learn-
ing, pages 2778–2787. PMLR.
Ramakrishnan, S. K., Chaplot, D. S., Al-Halah, Z., Malik, J., and Grauman, K. (2022). PONI: Potential functions for ObjectGoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900.
Ramakrishnan, S. K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., Undersander, E., Galuba, W., Westbury, A., Chang, A. X., et al. (2021). Habitat-Matterport 3D Dataset (HM3D): 1000 large-scale 3D environments for embodied AI. arXiv preprint arXiv:2109.08238.
Sadeghi, F. and Levine, S. (2016). CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
Savinov, N., Dosovitskiy, A., and Koltun, V. (2018). Semi-
parametric topological memory for navigation. arXiv
preprint arXiv:1803.00653.
Finding and Navigating to Humans in Complex Environments for Assistive Tasks