Hands-Free VR

Jorge Askur Vazquez Fernandez¹,², Jae Joong Lee¹, Santiago Andrés Serrano Vacca², Alejandra Magana³, Radim Peša⁴, Bedrich Benes¹ and Voicu Popescu¹

¹Department of Computer Science, Purdue University, West Lafayette, U.S.A.
²School of Engineering and Science, Tecnológico de Monterrey, Monterrey, Mexico
³Department of Computer and Information Technology, Purdue University, West Lafayette, U.S.A.
⁴Department of Informatics and Computers, Ostravska Universita, Ostrava, Czech Republic

*These authors contributed equally to this work.
Keywords:
Virtual Reality, Large Language Model, Retrieval-Augmented Generation, Speech-to-Text, Deep Learning.
Abstract:
We introduce Hands-Free VR, a voice-based natural-language interface for VR that allows interaction using only the user's voice, without additional hardware. The user's voice command is converted into text using a fine-tuned speech-to-text deep-learning model. Then, the text is mapped to an executable VR command using an LLM, which is
robust to natural language diversity. Hands-Free VR was evaluated in a within-subjects study (N = 22) where
participants arranged objects using either a conventional VR interface or Hands-Free VR. The results confirm
that Hands-Free VR is: (1) significantly more efficient than conventional VR interfaces in task completion time
and user motion metrics; (2) highly rated for ease of use, intuitiveness, ergonomics, reliability, and desirability;
(3) robust to English accents (20 participants were non-native speakers) and phonetic similarity, accurately transcribing 96.71% of voice commands, and (4) robust to natural language diversity, mapping 97.83% of transcriptions to executable commands.
1 INTRODUCTION
Virtual reality (VR) provides users with powerful immersive visualizations of complex virtual environments
(VEs). One of VR’s great strengths is its natural in-
terface for specifying the desired view based on track-
ing the user’s head position and orientation. However,
conventional VR interfaces are not equally effective
when allowing users to configure, search, or mod-
ify the VE. Canonical tasks such as object creation,
search, and selection, which are building blocks of
complex interactions, often require repeated and te-
dious attempts to invoke, activate, tune, and undo op-
erations through interface constructs that can be un-
familiar and unintuitive to the user. These challenges
compound when applying the same command to mul-
tiple objects, such as isolating objects of a certain type
or arranging objects in a specific configuration.
Conventional VR interfaces often adapt non-
immersive controls to virtual environments, resulting
in inefficiencies. For example, the familiar mouse
becomes a 3D laser pointer that must be aimed pre-
cisely in mid-air without the haptic feedback provided
by the physical stability of a desk, making fine ob-
ject selection difficult. Additionally, while modern
VR headsets feature built-in tracking that removes lo-
cation constraints, navigating large virtual spaces in
smaller physical areas remains challenging. It can re-
quire complex, disorienting solutions like redirection.
Although VR can replicate non-immersive controls, such as a 2D display-plus-mouse interface or a deeply nested floating menu, these options break immersion and hinder the workflow.
VR-specific interfaces (Section 2) improve efficiency
but lose the familiarity of traditional controls.
Recent large language models (LLMs) handle
complex language so effectively that their users can
say what they want, and the LLM translates their
words into executable code. This voice-to-code ap-
proach reduces inefficiency and frustration in VR con-
trols, and as LLM-based code generation improves,
these benefits become increasingly accessible.
This paper introduces Hands-Free VR, a voice-based VR interface. The user issues a natural-language voice command, which a speech-to-text model robust to diverse English accents converts to text. An LLM then maps the text to a unique executable VR command. Both models run on a workstation connected
wirelessly to the headset. Compared to conventional
VR controls that force users to walk around to select
and arrange objects repeatedly, Hands-Free VR en-
ables isolating and arranging them in fewer steps via voice commands (see Figure 1 and the video).
Figure 1: Conventional interface versus our Hands-Free VR. The task is to find all cylinders in a pile of objects and place them
in a circle. With the conventional interface, the user has to walk to grab each cylinder and place it, needing 53s to complete
the task. With our voice-based interface, the user first selects all the cylinders and then places them in a circle, using natural
language spoken commands, completing the task in 13s.
We evaluated Hands-Free VR in a controlled
within-subjects study with 22 participants, approved
by our Institutional Review Board (IRB). Participants
completed object-finding and arrangement tasks us-
ing either a conventional VR interface (grab, carry,
drop) or Hands-Free VR. With 20 participants be-
ing non-native English speakers, Hands-Free VR still
achieved high accuracy: it correctly transcribed voice
commands 96.71% of the time and converted them
into executable VR commands 97.83% of the time,
demonstrating robustness to accents, phonetic simi-
larities, and natural language diversity.
We do not advocate using voice commands exclu-
sively in VR. In applications where physical action is
essential, voice should not bypass core interactions.
However, we do believe voice can free users from te-
dious, repetitive, and non-essential tasks that hinder
the application’s main purpose. We claim the follow-
ing contributions:
1. Hands-Free VR is a voice-based VR interface that
is efficient and robust to diverse accents, phonetic
similarities, and language variations.
2. A speech-to-text deep learning model that is robust to diverse English accents.
3. A Large Language Model with Retrieval-Augmented Generation that maps natural language to custom VR commands.
2 RELATED WORK
We review prior work on conventional VR interfaces
that do not rely on the user’s voice and voice-based
VR user interfaces.
Conventional VR User Interfaces. Interactions in
VR have unique challenges for many user interface
tasks (Mine, 1995), and we limit the discussion to se-
lection and text entry: tasks relevant to our study.
Selection is a challenging VR interaction task that
has been extensively studied (Argelaguet and Andu-
jar, 2013). Users select targets using a virtual ray (An-
dujar and Argelaguet, 2007; Steinicke et al., 2006) or
directly with their hands (Ware, 1990; Han and Wan,
2010). A key challenge is the 3D visibility discrep-
ancy between the user’s eyes and hand (Argelaguet
et al., 2008), where some visible targets may be un-
reachable from a natural hand position. In cluttered
scenes, selection volumes instead of rays (Forsberg
et al., 1996; Pierce et al., 1997) help by reducing the
precision required to select small or occluded targets.
Our voice-based interface complements VR selection
methods, allowing users to select objects based on
known features, regardless of size or occlusion.
Text entry is a notoriously difficult problem in VR,
as typing on mid-air virtual keyboards is slow, inaccurate, and tir-
ing. Innovative solutions have been proposed to ad-
dress this issue, such as the implementation of vir-
tual QWERTY keyboard layouts controlled using fin-
ger and thumb gestures (Fashimpaur et al., 2020), or
such as the attachment of customized keycaps to the
VR headset (Hutama et al., 2021). Our deep learning
speech-to-text solution offers intuitive operation and
robustness to various accents and has the potential to
support text entry by dictation.
Voice-Based VR User Interfaces. Voice assistants
enhance VR applications like retail (Morotti et al.,
2020), offering greater efficiency and user preference
over traditional GUIs (Buchta et al., 2022c; Buchta
et al., 2022b). Our work uses LLM advancements to
increase user freedom and interface robustness.
Navigation in VR benefits from voice interfaces,
enabling users to teleport to distant virtual locations
without physical movement (Hombeck et al., 2023;
Calandra et al., 2022). Natural Language Understand-
ing (Zhao et al., 2020) simplifies interaction by re-
moving the need for complex commands. Our work
builds on this, fine-tuning a state-of-the-art speech-
to-text model (Radford et al., 2023) with 13 English
accents and demonstrating our voice-based interface
for selection, posing, and navigation tasks.
Conversation highlights the power of voice-based
interfaces, as reviewed in (Göbl et al., 2021).
Recent LLMs like GPT-3 (Brown et al., 2020),
PaLM (Chowdhery et al., 2022), LLaMA (Touvron
et al., 2023), and ChatGPT (OpenAI, 2022) enable
robust language understanding. Our work utilizes
LLMs to map user commands to VR actions, focus-
ing on robustness over conversation, and generates di-
verse language variants of VR commands offline.
Object manipulation in VR via voice has been
introduced in CAD (Chu et al., 1997) and distant ob-
ject interaction (Whitlock et al., 2018). Our approach
removes the need to memorize commands, allowing
users to describe intentions in their own words using an LLM, though interpretation is not instantaneous.
The interface is crucial to the user’s VR expe-
rience. Voice interfaces free the user’s hands for
tasks (Monteiro et al., 2021) and are often preferred
over GUIs, which feel tedious (Buchta et al., 2022a).
Hands-Free VR enables intuitive interaction without
memorizing or practicing predefined commands.
3 OVERVIEW
The pipeline of Hands-Free VR is given in Figure 2.
The user speaks a command, the VR headset captures their voice, and an edge server receives the audio data wirelessly and converts it to text with a deep learning model that is robust to diverse English accents and to phonetically similar words. Next, a large language model (LLM) maps the transcribed text to an executable VR command using Retrieval-Augmented Generation (RAG) (Lewis et al., 2021), as described in Section 4.
The command is sent back to the VR headset and then
is executed in the application.
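The paper does not specify the networking layer; the sketch below assumes a simple HTTP endpoint on the edge server, with the route name, port, and the two helper stubs as hypothetical placeholders for the models described in Section 4.

```python
# Minimal sketch of the Figure 2 round trip, assuming an HTTP endpoint on the
# edge server; the route name, port, and the two helper stubs are hypothetical,
# since the paper does not specify the networking protocol.
from flask import Flask, request, jsonify

app = Flask(__name__)

def speech_to_text(audio_bytes: bytes) -> str:
    """Placeholder for the fine-tuned Whisper model of Section 4.2(a)."""
    raise NotImplementedError

def text_to_command(text: str) -> str:
    """Placeholder for the RAG-based text-to-command conversion of Section 4.2(b)."""
    raise NotImplementedError

@app.route("/voice-command", methods=["POST"])
def voice_command():
    audio_bytes = request.data                 # raw microphone capture sent by the headset
    text = speech_to_text(audio_bytes)         # e.g., "select all red boxes"
    command = text_to_command(text)            # e.g., "select(cube, red)"
    return jsonify({"command": command})       # the headset executes this command locally

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```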
Figure 2: Hands-Free VR pipeline. [Diagram: the VR headset (microphone capture, rendering, command execution, state update) exchanges audio and commands with the edge server over a networking interface; the edge server performs speech-to-text conversion (accent robustness, context tuning) and text-to-command conversion (LLM synonym enumeration and parsing).]
4 ROBUST SPEECH TO
COMMAND CONVERSION
Understanding user intentions is important for a ro-
bust voice-based VR interface. This requires accurate
handling of accents, phonetic similarities, and map-
ping free-form text into executable commands. Here,
we detail our approach to ensuring speech-to-text and
text-to-command robustness.
4.1 Data Synthesis for Robustness
Enhancement
In the first offline step, our approach synthesizes data
to support the robustness of speech-to-text and text-
to-command. The data synthesis pipeline (Figure 3)
proceeds in reverse order compared to the run-time
order: data synthesis starts from the syntax and lexi-
con (1) and generates a set of natural language command variants (5) and a set of accent-diversified audio files (7).
Figure 3: Text (yellow) and audio (blue) data synthesis to support robust speech-to-text and text-to-command conversion. The data is used as shown in Fig. 4. [Diagram stages: (1) VR command lexicon and syntax → (2) exhaustive instantiation → (3) executable commands (text) → (4) extensive NL instantiation → (5) verbal commands (text) → (6) text to accented speech → (7) diversified verbal commands (audio).]
The VR application’s syntax and lexicon are
used to generate all possible executable commands
by assigning values to syntactic elements like
verbs, objects, and attributes (e.g., select(Cube), se-
lect(Pyramid, yellow), arrange(row)). Each com-
mand is input into ChatGPT (OpenAI, 2022), which
generates tens of natural language variants. These
verbal commands are converted to speech using Ama-
zon Polly (Amazon, 2016), producing tens of thousands of audio files with diverse expressions and 13 English accents.
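A sketch of these two synthesis steps, with the prompt wording, OpenAI model name, and Polly voice list as illustrative assumptions rather than the paper's exact choices:

```python
# Hypothetical sketch of steps (4) and (6) in Figure 3: an LLM paraphrases each
# executable command into natural-language variants, and Amazon Polly renders
# each variant with several English voices (accents).
import boto3
from openai import OpenAI

client = OpenAI()              # assumes OPENAI_API_KEY is set
polly = boto3.client("polly")  # assumes AWS credentials are configured

def paraphrase(executable: str, description: str, n: int = 30) -> list[str]:
    """Ask the LLM for n spoken phrasings a VR user might use for this command."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Give {n} short spoken phrasings a VR user might say "
                              f"to mean '{description}' (command {executable}). One per line."}],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def synthesize(text: str, voice: str, path: str) -> None:
    """Render one variant to speech with a given Polly voice (accent)."""
    audio = polly.synthesize_speech(Text=text, VoiceId=voice, OutputFormat="mp3")
    with open(path, "wb") as f:
        f.write(audio["AudioStream"].read())

for i, variant in enumerate(paraphrase("select(cube, red)", "select all red cubes")):
    for voice in ["Joanna", "Amy", "Raveena", "Nicole"]:  # US, British, Indian, Australian English
        synthesize(variant, voice, f"select_cube_red_{i}_{voice}.mp3")
```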
Figure 4: Speech-to-text and text-to-command robustness through fine-tuning and through Retrieval Augmented Generation (RAG) using the command text and audio data generated as shown in Fig. 3. [Diagram: offline, the verbal commands (5) and accent-diversified audio (7) fine-tune the speech-to-text model, and the executable commands (3) and verbal commands (5) populate the Instructor embedding index; online, speech is transcribed and mapped to a command via RAG over the index.]
The selection, relocation, and arrangement tasks
of our user study (Section 6) are covered by a lexi-
con with 5 verbs, 5 objects, and 12 attributes (9 col-
ors and 3 types of alignment), and by a syntax with
sentences with just a verb, with a verb and an ob-
ject, and with a verb, an object, and an attribute.
Syntax and lexicon instantiation produced 66 exe-
cutable commands. The 66 executable commands
seeded 2,253 natural language variants, for an av-
erage of 34.14±10.52 variants per executable com-
mand. The minimum, median, and maximum vari-
ants per executable command are 4, 40, and 66, re-
spectively. The 2,253 variants were converted from
text to accented speech, resulting in 29,237 audio
files. We refer the reader to the supplemental material
FineTuningData.zip, which maps executable com-
mands to natural language variants and audio files.
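As a sketch of the instantiation step, the snippet below enumerates commands from an assumed lexicon; the objects and alignments follow the task description, while the verbs and the nine color names are guesses, so the resulting count need not match the paper's 66 commands.

```python
# Sketch of the exhaustive syntax/lexicon instantiation (steps 1-3 in Figure 3).
# The verbs and color names are illustrative assumptions.
from itertools import product

OBJECTS = ["cube", "sphere", "hemisphere", "pyramid", "cylinder"]
COLORS = ["red", "yellow", "blue", "green", "orange", "purple", "white", "black", "gray"]
ALIGNMENTS = ["row", "matrix", "circle"]

commands = ["deselect()", "drop()"]                                      # verb only
commands += [f"select({o})" for o in OBJECTS]                            # verb + object
commands += [f"select({o}, {c})" for o, c in product(OBJECTS, COLORS)]   # verb + object + attribute
commands += [f"arrange({a})" for a in ALIGNMENTS]                        # verb + attribute
print(len(commands), "executable commands")
```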
4.2 Robustness Enhancement
The synthesized data is used to improve (a) the
speech-to-text and (b) the text-to-command conver-
sion as shown in Figure 4.
(a) The English command in text form (in-
put 5 in Figure 3) and the diversified English com-
mands in audio form (7 in Figure 3) are used
to fine-tune the Whisper (Radford et al., 2023)
speech-to-text deep learning model with 5,000 iterations on our audio files; we fine-tune rather than train from scratch to leverage the model's pre-trained knowledge. (b) The (English command, executable com-
mand) pairs are used for Retrieval Augmented Gen-
eration (RAG) (Lewis et al., 2021) with a pre-trained
LLM (Chung et al., 2022) and embedding (Su et al.,
2023). The embedding space is trained on instruction tasks, so we put our data in an instruction-like text format to use the embedding space (Su et al., 2023), i.e., “special command of {English command} is {executable command}”.
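A sketch of how such an index could be built with the Instructor embedder and Chroma through LangChain (the tools listed in Section 5); the specific Instructor checkpoint and the example pairs are assumptions.

```python
# Sketch of building the RAG index from (English command, executable command)
# pairs using the instruction-like format described in the text. The Instructor
# checkpoint name is an assumption.
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

pairs = [
    ("select all the red boxes", "select(cube, red)"),
    ("grab every cylinder",      "select(cylinder)"),
    # ... one entry per natural-language variant generated offline
]

docs = [f"special command of {english} is {executable}" for english, executable in pairs]
embedder = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
index = Chroma.from_texts(
    texts=docs,
    embedding=embedder,
    metadatas=[{"command": executable} for _, executable in pairs],
    persist_directory="vr_command_index",
)
```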
4.3 Speech-to-Command Conversion
At run-time, the user gives a verbal command such
as “select all red boxes”. The command is con-
verted to a text command by the speech-to-text deep
learning model, with robustness to the user’s spo-
ken English accent and to words with similar sounds,
e.g., preferring “boxes” to “foxes”. The text com-
mand is interpreted by the LLM, which locates within
its latent space the executable VR command that
best aligns with the intended action. The LLM is
queried with the text-based prompt “what is the special command of {text command}?”. In this example, the LLM replies with the executable command, i.e., “select(cube, red)”. Finally, the VR
command is executed.
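A minimal retrieval sketch continuing the index built above; for brevity it reads the mapping from the retrieved document's metadata, whereas the full pipeline also passes the retrieved context to the LLM (Chung et al., 2022) through LangChain to generate the answer.

```python
# Run-time lookup sketch, continuing the previous example: embed the query in the
# same instruction format and retrieve the nearest stored pair; the stored
# metadata holds the executable command.
query = "what is the special command of select all red boxes?"
hit = index.similarity_search(query, k=1)[0]     # nearest stored instruction-format pair
executable = hit.metadata["command"]             # e.g., "select(cube, red)"
```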
5 IMPLEMENTATION
The edge server (Figure 2) has an Intel i9-12900k
4.8GHz CPU, 128GB RAM, and NVIDIA RTX3090
GPU using Python 3.9 and PyTorch 2.0.1. The client
is on a Meta Quest 3 (Meta, 2013) VR headset. The
VR application is in Unity 3D (Li et al., 2018), ver-
sion 2021.3.8f1. The server and client were connected
over Wi-Fi 6E.
The speech-to-text model is fine-tuned based on
Whisper (Radford et al., 2023) with 29,237 audio
files covering 13 different English accents. We use mixed-precision floating point and the AdamW optimizer (Loshchilov and Hutter, 2017), with a batch size of 32 and a 10⁻⁵ learning rate. The word error rate (WER), the standard speech-to-text metric, is used for evaluation.
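A sketch of this configuration, assuming the Hugging Face Whisper implementation and an arbitrary Whisper checkpoint size, neither of which the paper specifies; dataset preparation and collation are omitted.

```python
# Sketch of the fine-tuning setup described above (toolkit and model size are
# assumptions), evaluated with the word error rate.
import evaluate
from transformers import (Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          WhisperForConditionalGeneration, WhisperProcessor)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
wer = evaluate.load("wer")  # word error rate, the evaluation metric used in the paper

args = Seq2SeqTrainingArguments(
    output_dir="whisper-vr-commands",
    per_device_train_batch_size=32,   # batch size 32
    learning_rate=1e-5,               # 10^-5 learning rate
    max_steps=5000,                   # 5,000 fine-tuning iterations
    fp16=True,                        # mixed-precision floating point
    optim="adamw_torch",              # AdamW optimizer
    predict_with_generate=True,       # decode predictions so WER can be computed
)

def compute_metrics(pred):
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id  # undo loss masking
    refs = processor.batch_decode(label_ids, skip_special_tokens=True)
    hyps = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    return {"wer": wer.compute(predictions=hyps, references=refs)}

# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=...,
#                          data_collator=..., compute_metrics=compute_metrics)
# trainer.train()
```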
The Hands-Free VR text-to-command model combines an LLM (Chung et al., 2022) with the Instructor (Su et al., 2023) embedding space, mapping a multi-word English command to a single executable command, and uses RAG (Lewis et al., 2021) for its cost-effectiveness and customization. We formulate the data for prompt augmentation by combining possible user commands and their corresponding executable commands as “special command of {text} is {command}” in a vector database (chromadb, 2022).
The phrase “special command” is a unique identifier for our specific task, chosen to avoid any overlap with data from the original training dataset. Once the user's voice command is converted to text, we query the embedding as “What is the special command of {text}?” and retrieve from the LLM the most relevant answer, i.e., the executable command, using LangChain (LangChain, 2022). We apply 4-bit quantization (Tim Dettmers,
2022) to the LLM to fit the 24GB VRAM GPU.
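A sketch of loading a 4-bit quantized Flan-T5 checkpoint with bitsandbytes; the XXL size is an assumption, as the paper cites the Flan-T5 family without naming a size.

```python
# Sketch of loading the text-to-command LLM in 4-bit precision so it fits within
# 24 GB of VRAM; the checkpoint name is an assumption.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    quantization_config=quant,
    device_map="auto",    # place the 4-bit weights on the available GPU
)
```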
6 USER STUDY
We conducted an IRB-approved user study comparing
Hands-Free VR to a conventional VR interface.
6.1 Methods
Participants. We recruited N = 22 participants from our university. The study included 13 participants aged 18-25, 8 aged 26-30, and 1 over 30. Four identified as women, 17 as men, and 1 chose an alternative option. No participants identified as “Beginner” in English; 4 were “Intermediate,” 16 “Advanced,” and 2 “Native Speakers.” One listed English as their native language, 5 Mandarin, 4 Hindi, and 12 chose “Other”.
Study Design. We opted for a within-subjects study
design, with each participant performing the tasks in
each condition. The design brings the advantage of
greater statistical power for fewer participants. The
learning effects are minor, as the two interfaces are
substantially different from each other. The N =
22 participants are sufficient to detect effects of a
large/very large size (i.e., Cohen’s d = 1.0) with 0.90
power at a significance level α = 0.05.
Tasks. Participants performed two tasks: In Task
1 (T1), they moved a subset of 96 virtual objects
(spheres, hemispheres, pyramids, cubes, and cylin-
ders) from a pile to a nearby box (1 m away). Objects
were 10 cm tall, randomly colored, and the box mea-
sured 100 cm × 100 cm × 100 cm. In Task 2 (T2),
they arranged objects from the pile into a row, matrix,
or circular pattern, guided by red crosses on the floor,
2 m from the pile.
Conditions. In the control condition (CC), partici-
pants used a conventional VR interface to manipulate
objects with handheld controllers (Figure 5). They
grabbed objects by pressing and holding the trigger
button, releasing it to drop the object. Gravity and
collision were simulated, so objects released above the T1 box fell into it, and participants could carry two objects at once, one in each hand (Figure 5).
In the experimental condition (EC), participants used
the Hands-Free VR interface (Figure 6), pressing the trig-
ger button to issue voice commands and releasing it to
confirm. In both conditions, participants selected ob-
jects before placing them in the box (T1) or on the
floor (T2).
Data Collection. We collected data to assess Hands-
Free VR’s robustness and compare it to the conven-
tional VR interface using objective and subjective
metrics. Objective metrics included task completion
time (seconds), cumulative viewpoint translation (me-
ters), view direction rotation (degrees), and controller
translations. Subjective data came from a user prefer-
ence questionnaire with five questions:
Q1 The interface is tedious. It requires a lot of work.
Q2 The interface is intuitive. I quickly figured out
how to use it.
Q3 The interface requires a lot of physical effort.
Q4 The interface is unreliable, it often does the wrong
thing.
Q5 I would love to have a similar interface on my
computer.
Responses were on a five-point Likert scale, scored from 1 to 5. For negative questions (Q1, Q3, Q4), scores were reversed (x replaced with 6 − x) so higher scores always indicated better outcomes.
Research Hypotheses. We hypothesized that Hands-Free VR is robust, more efficient than a conventional VR interface, and preferred by users.
RH1: Hands-Free VR is robust, with an overall spoken command success rate s_STC of over 90%, i.e., s_STC > 0.90.
RH2: Participants complete both tasks faster, with
less viewpoint translation, with less view direc-
tion rotation, and with less hand translation when
using the Hands-Free VR interface than when us-
ing the conventional VR interface.
RH3: Participants find that the Hands-Free VR in-
terface is less tedious and requires less physical
effort than the conventional VR interface and that
Hands-Free VR is intuitive and reliable.
Procedure. Participants completed a demographics questionnaire, practiced each task in both conditions, and then performed three trials per task in both conditions, in a counterbalanced order. They completed a preference questionnaire via VR headset after each condition. The 30-minute experiment concluded with participants receiving a USD 20 gift card.
Figure 5: Control condition, task 2: The left column shows
the participant’s VR view as they collect two yellow cylin-
ders (top) and place them in a circular pattern marked by
red crosses (bottom).
Table 1: Descriptive and inference statistics for the five objective metrics, considering the two tasks separately (T1 and T2),
and together (T12), and for the conventional (CC) and Hands-Free VR (EC) conditions. In all instances, EC has a significant
efficiency advantage over CC (p < 0.05).
Metric             Time [s]                 Viewpoint Translation [m]    View Rotation [deg]          Left Hand Translation [m]    Right Hand Translation [m]
Task               T1      T2      T12      T1      T2      T12          T1      T2      T12          T1      T2      T12          T1      T2      T12
Mean        CC     55.44   64.47   59.95    18.56   33.67   26.12        1,385   1,740   1,562        12.75   14.89   13.82        12.67   13.60   13.14
            EC     23.76   22.16   22.96    0.91    1.04    0.97         224     294     259          1.16    1.44    1.30         0.99    1.07    1.03
Std. dev.   CC     27.56   14.63   16.93    6.07    9.57    6.57         584     399     401          4.86    3.83    3.67         4.81    3.49    3.20
            EC     14.69   14.90   11.79    0.78    0.82    0.64         153     227     156          0.87    1.29    0.84         0.84    0.79    0.69
Wilcoxon    Z      -3.95   -4.01   -4.01    -4.01   -4.01   -4.01        -4.01   -4.01   -4.01        -4.01   -4.01   -4.01        -4.01   -4.01   -4.01
            p      0.00008 <0.00001 <0.00001 <0.00001 <0.00001 <0.00001  <0.00001 <0.00001 <0.00001   <0.00001 <0.00001 <0.00001   <0.00001 <0.00001 <0.00001
Cohen's d          1.43    2.87    2.54     4.08    4.80    5.39         2.72    4.46    4.28         3.32    4.71    4.70         3.38    4.95    5.23
Figure 6: Experimental condition, task 2. The user says
“Grab all the cylinders”, which are lifted from the pile of
objects (top), and then the user says “Put them in the circle”
for arranging the cylinders on the floor (bottom).
Figure 7: Viewpoint translation comparison.
Data Analysis. We analyzed the data using de-
scriptive (tables, box plots) and inferential statis-
tics with appropriate tests with SPSS (IBM Corp.,
2022). Normality was checked via the Shapiro-Wilk
test (Shapiro and Wilk, 1965). Depending on normal-
ity, we used either the dependent t-test or the non-parametric Wilcoxon signed-rank test (Wilcoxon, 1992), suitable for our paired sample design.
Figure 8: Task completion time between conditions.
The
Wilcoxon test handled continuous (objective metrics)
and ordinal data (Likert scale responses). We set sig-
nificance at α = 0.05 and calculated effect sizes using
Cohen’s d (Cohen, 2013) to assess statistical power.
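The analysis could be reproduced outside SPSS roughly as sketched below; the arrays are synthetic placeholders, not the study data, and the Cohen's d formula shown is one common paired-sample variant.

```python
# Illustrative paired-sample analysis with SciPy: normality check, then the
# dependent t-test or Wilcoxon signed-rank test, plus an effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
cc = rng.normal(60, 15, 22)   # placeholder: one metric per participant, conventional interface
ec = rng.normal(23, 12, 22)   # placeholder: same metric, same participants, Hands-Free VR

diff = cc - ec
if stats.shapiro(diff).pvalue > 0.05:        # Shapiro-Wilk normality check
    result = stats.ttest_rel(cc, ec)         # dependent t-test
else:
    result = stats.wilcoxon(cc, ec)          # Wilcoxon signed-rank test

cohens_d = diff.mean() / diff.std(ddof=1)    # effect size of the paired differences
print(result, cohens_d)
```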
6.2 Results and Discussion
Hands-Free VR Robustness: We measured the ro-
bustness of Hands-Free VR over all 359 verbal com-
mands issued by our study participants. The speech-to-text conversion success rate was s_STT = 96.71 ± 0.05%, the text-to-command success rate was s_TTC = 97.83 ± 0.07%, and the overall verbal command success rate was s_STC = s_STT × s_TTC = 94.61%. This supports RH1,
i.e., Hands-Free VR is robust, including with the spo-
ken English accents of our participants, 20 of whom
were not native English speakers.
Hands-Free VR vs. Conventional Interface. The interface efficiency measurements according to the five objective metrics are presented numerically in Table 1, and graphically, using box plots, in Figures 7 to 11.
Figure 9: View direction rotation between conditions.
Figure 10: Right-hand translation between conditions.
In the box plots, the black line represents the median, the cross the mean, the bar spans the interquartile range (q_1 to q_3), whiskers extend 1.5 × (q_3 − q_1) beyond the quartiles, and dots mark outliers outside
this range. Hands-Free VR enabled faster task com-
pletion with less walking, head rotation, and hand
movement. The advantage was significant for T1 and
even greater for T2, as the conventional VR inter-
face required precise object placement for T2, while
Hands-Free VR made both tasks equally simple with
verbal commands.
With Hands-Free VR, each task required two ver-
bal commands: one for selection and one for place-
ment. Commands were executed in 1.51 seconds on average: 0.99 ± 0.009 s for speech-to-text and 0.51 ± 0.017 s for text-to-command conversion, much faster than Quest's built-in STT (2.29 s per API call). Task completion time (e.g., 22 s for T2) was mostly spent reading task descriptions, and users improved by 15% over each iteration of the task. In con-
trast, the conventional VR interface showed a linear
dependency between completion time and the num-
ber of objects manipulated, despite using both hands
simultaneously. It also showed a better rate of im-
provement than the verbal interface, at 20%. This can
be explained by users' frustration with the conventional interface, leading them to rush the task. Meanwhile, Hands-Free VR's ability to handle multiple objects in parallel gives it a growing advantage as the object count increases.
Figure 11: Left-hand translation between conditions.
Figure 12: User preference questionnaire.
The data does not have a normal distribution,
so we compared the means using Wilcoxon’s signed
rank test. The comparison confirms the statistical sig-
nificance of the efficiency advantage of Hands-Free
VR over the conventional VR user interface for each
task, for both tasks combined, and for each of the five
objective metrics. This provides support for RH2, i.e.,
Hands-Free VR has a significant efficiency advantage
over the conventional VR user interface. The effect
sizes measured using Cohen’s d are all larger than
1.0, confirming that the study’s N = 22 participants
are sufficient for a power of 0.90.
The user preference questionnaire results are pre-
sented in Table 2 (numerical) and Figure 12 (box
plots). Negatively phrased questions were reversed (6 − x), so higher scores indicate better outcomes. The largest and only significant difference was for Q1, where participants found the conventional VR interface more “tedious” and requiring “more work” than
Hands-Free VR. For Q3, participants strongly dis-
agreed that Hands-Free VR required significant phys-
ical effort (median 5), while the conventional inter-
face had a median of 3, though this difference was
not significant. Despite this, participants often treated
the conventional interface as a physical exercise chal-
lenge. Questions 2, 4, and 5 had high, similar me-
dian scores for both conditions (5, 5, and 4, respec-
tively). Importantly, Hands-Free VR scored the max-
imum median for both intuitiveness (Q2) and reliabil-
ity (Q4). We conclude that RH3 is supported, except that the perceived reduction in physical effort with Hands-Free VR was not significant.
Table 2: User preference questionnaire statistics.
Question          Q1       Q2      Q3      Q4      Q5
Median      CC    2        5       3       5       4
            EC    5        5       5       5       4
Mean        CC    2.84     4.60    3.4     4.44    3.48
            EC    4.44     4.04    4.04    4.36    3.92
Std. dev.   CC    1.68     0.87    1.47    0.87    1.39
            EC    0.82     1.46    1.62    1.11    1.26
Wilcoxon    Z     -3.57    -1.64   -1.49   -0.35   -1.53
            p     0.00036  0.10    0.134   1       0.123
Discussion. Hands-Free VR reliably interprets spo-
ken commands, even with pauses or incomplete sen-
tences, using a trigger button and VR-optimized mod-
els. Tested with mostly non-native English speakers,
it excelled in accent robustness, outperforming Quest
3’s speech-to-text, which had a Word Error Rate of
43% in our tests, by supporting diverse accents from
day one. Most of our 22 participants were new to VR,
probably contributing to their preference for physical
interaction and viewing voice commands as efficient
but less engaging.
In work settings, users may prefer voice-based in-
terfaces for complex tasks such as handling multi-
ple objects, precise configurations, or information re-
trieval, while traditional controls suit simpler tasks.
For applications focused on physical exercise or skill
building, such as engine assembly, core actions must
remain physical. Voice commands can be supple-
mented by identifying parts, showing assembly order,
or setting conditions, preserving embodied learning
while reducing menu navigation and cognitive load.
7 CONCLUSIONS, LIMITATIONS,
AND FUTURE WORK
We introduced Hands-Free VR, a voice-based VR
interface that converts speech into executable com-
mands. Fine-tuned for phonetic similarity, diverse ac-
cents, and natural language variation, Hands-Free VR
achieved a 94.61% command understanding rate, enabling
efficient execution of complex tasks and offering sig-
nificant advantages over traditional VR controls.
7.1 Limitations
One limitation is our reliance on a powerful external
workstation as an edge server. Although standalone
VR headsets have advanced considerably, running
speech-to-text and large language models directly on
such devices is still not feasible. Our tests showed
that speech-to-text took about 3 seconds and text-
to-command took over 21 seconds on a laptop-class
GPU (Nvidia RTX 3070), which is too slow for most
VR applications. Until new technology emerges, ro-
bust voice-based VR interfaces will likely depend on
a distributed server-client setup.
A related limitation is the slow response time, even with
a powerful edge server, which can be problematic
for time-sensitive commands such as stopping a dy-
namic VE. Rewinding to when the user started speak-
ing could help. Future work could speed up infer-
ence using quantization or simpler LLMs, and com-
bining speech-to-text and text-to-command into one
model could streamline processing. Another limita-
tion is that complex commands must currently be split
into simpler steps, e.g., “Select blue cubes” and “Put
them into the box” instead of “Select blue cubes and
move them into the box”. Furthermore, the system
had a limited number of discrete parameters per command, which means that continuous commands such as “move x distance” were not supported by our system, because the LLM's tendency to hallucinate would compromise the system's robustness.
7.2 Future Work
Hands-Free VR can be expanded simply by adding
new verbs, objects, and attributes to its syntax and lex-
icon, then re-running data synthesis and fine-tuning.
The VR application would need the corresponding
execution capabilities. Future work aims to remove
the programmer from this process, letting users define
and extend the interface by demonstrating desired ac-
tions. The key advantage, that users provide only one English formulation while Hands-Free VR generates multiple variants, should be preserved.
A more ambitious goal is to eliminate the fixed set
of commands. Instead of mapping speech to prede-
fined commands, Hands-Free VR could generate VR
interface source code on the fly. This would support
an unbounded range of commands, although ensuring
robustness remains challenging.
Another direction is to determine which tasks are
best served by voice commands. Although our ex-
periments highlight clear advantages for certain ac-
tions, understanding which commands are general
and which are domain-specific would simplify de-
signing future VR interfaces.
Moreover, this type of interface could be applied
to systems in areas such as robotics and healthcare,
where it would allow users with mobility limitations
to interact with an agent or system.
An interesting direction would be to study the efficiency and user preference of a hybrid system that combines the best of both worlds: the efficiency of voice commands with the immersion and interactivity of conventional VR interfaces.
Hands-Free VR improves VR efficiency by min-
imizing head movements, reducing rotations from
five full turns to less than one, potentially mitigat-
ing cyber-sickness, a benefit worth examining in future studies. It also
supports diverse accents but should be tested with
broader participant variability. Voice input adds a
valuable channel, simplifying tedious tasks and their
specification.
ACKNOWLEDGEMENTS
This work was supported in part by the National Sci-
ence Foundation grants 2417510 and 2309564.
REFERENCES
Amazon (2016). Amazon AWS Polly.
Andujar, C. and Argelaguet, F. (2007). Anisomorphic
ray-casting manipulation for interacting with 2d guis.
Computers & Graphics, 31(1):15–25.
Argelaguet, F. and Andujar, C. (2013). A survey of 3d
object selection techniques for virtual environments.
Computers & Graphics, 37(3):121–136.
Argelaguet, F., Andujar, C., and Trueba, R. (2008). Over-
coming eye-hand visibility mismatch in 3d pointing
selection. In Proceedings of the 2008 ACM sympo-
sium on Virtual reality software and technology, pages
43–46.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901.
Buchta, K., Wójcik, P., Nakonieczny, K., Janicka, J.,
Gałuszka, D., Sterna, R., and Igras-Cybulska, M.
(2022a). Microtransactions in vr. a qualitative com-
parison between voice user interface and graphical
user interface. In 2022 15th International Conference
on Human System Interaction (HSI), pages 1–5.
Buchta, K., Wójcik, P., Nakonieczny, K., Janicka, J.,
Gałuszka, D., Sterna, R., and Igras-Cybulska, M.
(2022b). Modeling and optimizing the voice assistant
behavior in virtual reality. In 2022 IEEE International
Symposium on Mixed and Augmented Reality Adjunct
(ISMAR-Adjunct), pages 397–402.
Buchta, K., Wójcik, P., Pelc, M., Górowska, A., Mota, D.,
Boichenko, K., Nakonieczny, K., Wrona, K., Szym-
czyk, M., Czuchnowski, T., Janicka, J., Gałuszka, D.,
Sterna, R., and Igras-Cybulska, M. (2022c). Nux ive -
a research tool for comparing voice user interface and
graphical user interface in vr. In 2022 IEEE Confer-
ence on Virtual Reality and 3D User Interfaces Ab-
stracts and Workshops (VRW), pages 982–983.
Calandra, D., Pratticò, F. G., and Lamberti, F. (2022). Com-
parison of hands-free speech-based navigation tech-
niques for virtual reality training. In 2022 IEEE 21st
Mediterranean Electrotechnical Conference (MELE-
CON), pages 85–90.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra,
G., Roberts, A., Barham, P., Chung, H. W., Sut-
ton, C., Gehrmann, S., et al. (2022). Palm: Scal-
ing language modeling with pathways. arXiv preprint
arXiv:2204.02311.
chromadb (2022). Trychroma. version 0.4.12.
Chu, C.-C., Dani, T., and Gadh, R. (1997). Multimodal in-
terface for a virtual reality based computer aided de-
sign system. In Proceedings of International Confer-
ence on Robotics and Automation, volume 2, pages
1329–1334 vol.2.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fe-
dus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S.,
Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X.,
Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson,
K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao,
V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H.,
Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V.,
and Wei, J. (2022). Scaling instruction-finetuned lan-
guage models.
Cohen, J. (2013). Statistical power analysis for the behav-
ioral sciences. Academic press.
Fashimpaur, J., Kin, K., and Longest, M. (2020). Pinchtype:
Text entry for virtual and augmented reality using
comfortable thumb to fingertip pinches. In Extended
abstracts of the 2020 CHI conference on human fac-
tors in computing systems, pages 1–7.
Forsberg, A., Herndon, K., and Zeleznik, R. (1996). Aper-
ture based selection for immersive virtual environ-
ments. In Proceedings of the 9th annual ACM sym-
posium on User interface software and technology,
pages 95–96.
Göbl, B., Kriglstein, S., and Hlavacs, H. (2021). Conver-
sational interfaces in serious games: Identifying po-
tentials and future research directions based on a sys-
tematic literature review. In Proceedings of the 13th
International Conference on Computer Supported Ed-
ucation - Volume 1: CSEDU, pages 108–115. IN-
STICC, SciTePress.
Han, X. and Wan, H. (2010). A framework for virtual hand
haptic interaction. Transactions on edutainment IV,
pages 229–240.
Hombeck, J., Voigt, H., Heggemann, T., Datta, R. R., and
Lawonn, K. (2023). Tell me where to go voice-
controlled hands-free locomotion for virtual reality
systems. In 2023 IEEE Conference Virtual Reality and
3D User Interfaces (VR), pages 123–134.
Hutama, W., Harashima, H., Ishikawa, H., and Manabe, H.
(2021). Hmk: Head-mounted-keyboard for text input
in virtual or augmented reality. In Adjunct Proceed-
ings of the 34th Annual ACM Symposium on User In-
terface Software and Technology, pages 115–117.
IBM Corp. (2022). IBM SPSS Statistics [Computer soft-
ware]. Version 27.0.
LangChain (2022). Langchain. version 0.0.299.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin,
V., Goyal, N., Küttler, H., Lewis, M., tau Yih, W., Rocktäschel, T., Riedel, S., and Kiela,
D. (2021). Retrieval-augmented generation for
knowledge-intensive nlp tasks.
Li, Z., Zhang, S., Anwar, M. S., and Wang, J. (2018). Appli-
cability analysis on three interaction paradigms in im-
mersive vr environment. In 2018 International Con-
ference on Virtual Reality and Visualization (ICVRV),
pages 82–85.
Loshchilov, I. and Hutter, F. (2017). Fixing weight decay
regularization in adam. CoRR, abs/1711.05101.
Meta (2013). Quest 3: New Mixed Reality Headset. https://www.meta.com/quest/products/quest-3/. Accessed: 2023-10-18.
Mine, M. R. (1995). Virtual environment interaction tech-
niques. UNC Chapel Hill CS Dept.
Monteiro, P., Gonçalves, G., Coelho, H., Melo, M., and
Bessa, M. (2021). Hands-free interaction in im-
mersive virtual reality: A systematic review. IEEE
Transactions on Visualization and Computer Graph-
ics, 27(5):2702–2713.
Morotti, E., Donatiello, L., and Marfia, G. (2020). Fos-
tering fashion retail experiences through virtual real-
ity and voice assistants. In 2020 IEEE Conference on
Virtual Reality and 3D User Interfaces Abstracts and
Workshops (VRW), pages 338–342.
OpenAI (2022). Chat GPT.
Pierce, J. S., Forsberg, A. S., Conway, M. J., Hong, S.,
Zeleznik, R. C., and Mine, M. R. (1997). Image plane
interaction techniques in 3d immersive environments.
In Proceedings of the 1997 symposium on Interactive
3D graphics, pages 39–ff.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey,
C., and Sutskever, I. (2023). Robust speech recog-
nition via large-scale weak supervision. In Inter-
national Conference on Machine Learning, pages
28492–28518. PMLR.
Shapiro, S. S. and Wilk, M. B. (1965). An analysis
of variance test for normality (complete samples).
Biometrika, 52(3/4):591–611.
Steinicke, F., Ropinski, T., and Hinrichs, K. (2006). Object
selection in virtual environments using an improved
virtual pointer metaphor. In Computer Vision and
Graphics: International Conference, ICCVG 2004,
Warsaw, Poland, September 2004, Proceedings, pages
320–326. Springer.
Su, H., Shi, W., Kasai, J., Wang, Y., Hu, Y., Ostendorf,
M., tau Yih, W., Smith, N. A., Zettlemoyer, L., and
Yu, T. (2023). One embedder, any task: Instruction-
finetuned text embeddings.
Tim Dettmers (2022). Bitsandbytes. version 0.41.1.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro,
E., Azhar, F., et al. (2023). Llama: Open and ef-
ficient foundation language models. arXiv preprint
arXiv:2302.13971.
Ware, C. (1990). Using hand position for virtual object
placement. The Visual Computer, 6:245–253.
Whitlock, M., Harnner, E., Brubaker, J. R., Kane, S., and
Szafir, D. A. (2018). Interacting with distant objects
in augmented reality. In 2018 IEEE Conference on
Virtual Reality and 3D User Interfaces (VR), pages
41–48.
Wilcoxon, F. (1992). Individual comparisons by ranking
methods. Springer.
Zhao, J., Parry, C. J., dos Anjos, R., Anslow, C., and Rhee,
T. (2020). Voice interaction for augmented reality nav-
igation interfaces with natural language understand-
ing. In 2020 35th International Conference on Image
and Vision Computing New Zealand (IVCNZ), pages
1–6.