VR Mover - LLM-based VR Object Manipulation


VR Mover trailer. VR Mover aggregates user-centric information (what the user is saying, seeing, and pointing at) and generates atomic API calls (e.g. `CREATE`, `MOVE`, `LOOKAT`) to assist with object placement.

Abstract

In our daily lives, we naturally convey instructions for spatially manipulating objects using words and gestures. Transposing this form of interaction into virtual reality (VR) object manipulation can be beneficial. We propose VR Mover, an LLM-empowered multimodal interface that can understand and interpret the user's vocal instructions combined with gestures to support object manipulation. By simply pointing and speaking, the user can command the LLM to manipulate objects without structured input.

Our user study shows that, compared to classic interfaces, VR Mover improves usability, overall user experience, and performance on multi-object manipulation, while also reducing workload and arm fatigue. Users prefer the proposed natural interface for broad movements and may switch to gizmos or virtual hands as a complement for finer adjustments. We believe these findings contribute design implications for future LLM-based object manipulation interfaces and highlight the potential for more intuitive and efficient user interactions in VR environments.

Video

Interacting with VR Mover

Just like instructing a real-life mover, the user combines natural speech with simple gestures — no structured commands, no menus, no grammar to memorise.

Pointing

Point at an object and then at its destination: "Move this [point] to here [point]". Since VR Mover knows what the user is seeing, objects can also be referred to by speech alone. Complex instructions are handled in a single utterance — e.g. moving three chairs to three different spots while making them face the table.

Lining

Drawing a line expresses direction and magnitude — "Move the object that way [line]" — or indicates where a row of objects should roughly go: "I want 4 pictures along the wall here [line]".
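To make the geometry behind a request like "4 pictures along the wall here [line]" concrete, the sketch below spaces a number of positions evenly along a drawn line segment. This is only an illustration: in VR Mover the placement is decided by the LLM, and the function name `points_along_line` and the endpoint-inclusive spacing are assumptions made for this example.

```python
def points_along_line(start, end, count):
    """Return `count` evenly spaced 3D points from `start` to `end` (inclusive)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        # A single object goes to the middle of the line.
        return [tuple((s + e) / 2 for s, e in zip(start, end))]
    return [
        tuple(s + (e - s) * i / (count - 1) for s, e in zip(start, end))
        for i in range(count)
    ]


# A line drawn along a wall at a height of 1.5 m, 3 m long.
picture_positions = points_along_line((0.0, 1.5, 0.0), (3.0, 1.5, 0.0), 4)
for position in picture_positions:
    print(position)
```

Because the line only indicates where the row should roughly go, a real placement could just as well inset the first and last object from the endpoints.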

Intelligent Responses

Powered by an LLM, VR Mover has spatial common sense (chairs should face the table), is context-aware ("move it back"), and adapts to requests like undo — even though there is no undo function.

User-centric

Understands instructions from the user's perspective — "move the chair away from me".

Asynchronous Multi-object

Refer to several objects at once and apply a different manipulation to each of them.

Compound Instruction

Stack requests together — "move the chair to here and then make it face the window".

Coarse-to-fine

VR Mover handles broad placement; gizmos and virtual hand remain available for fine-tuning.

Method

VR Mover addresses the challenge of real-time LLM-based object manipulation by decomposing complex tasks into atomized functions. The system consists of four key components:

System overview of VR Mover with its four major modules: scene modeling, user-centric augmentation, LLM processing, and scene update. GPT-4o serves as the core LLM.

Scene Modeling

Converts 3D spatial information into text-based JSON format using oriented bounding boxes (OBBs) to represent object positions, rotations, and dimensions. Objects are categorized into environmental (static) and manipulatable (dynamic) elements, with metadata including object names and descriptions. This structured representation enables the LLM to understand spatial relationships and object properties efficiently.
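A minimal sketch of what such a serialization could look like, assuming a simple schema. The field names (`center`, `rotation`, `size`), the IDs, and the top-level keys are hypothetical and are not taken from the VR Mover codebase.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class OBB:
    center: tuple    # (x, y, z) position, assumed to be in metres
    rotation: tuple  # Euler angles, assumed to be in degrees
    size: tuple      # full extents along the box's local axes


@dataclass
class SceneObject:
    id: str
    name: str
    description: str
    obb: OBB


def scene_to_json(environmental, manipulatable):
    """Serialize static and dynamic objects into one JSON string for the prompt."""
    doc = {
        "environmental": [asdict(o) for o in environmental],
        "manipulatable": [asdict(o) for o in manipulatable],
    }
    return json.dumps(doc, indent=2)


floor = SceneObject("env_0", "floor", "wooden floor",
                    OBB((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (5.0, 0.1, 5.0)))
chair = SceneObject("chair_1", "chair", "a small dining chair",
                    OBB((1.0, 0.0, 2.0), (0.0, 90.0, 0.0), (0.5, 0.9, 0.5)))
scene_json = scene_to_json([floor], [chair])
print(scene_json)
```

Keeping environmental and manipulatable objects under separate keys mirrors the static/dynamic split described above, so the model can tell which objects it is allowed to change.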

JSON expression of manipulatable prefabs, manipulatable objects, and environmental objects

User-Centric Augmentation

Processes multimodal inputs including speech recognition (via Azure's cloud service), focus frames (groups of continuous viewports during speech), and gestural cues (pointing and lining gestures). A text-based time serialization scheme efficiently injects gestural cues into speech transcripts using ID tags, enabling the LLM to understand temporal relationships between speech and actions.
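The time serialization idea can be sketched as a merge of two timestamped streams. The example assumes word-level timestamps are available from the speech recognizer and uses a hypothetical tag format (`[P1]`, `[P2]`); the actual tags and data structures used by VR Mover may differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds since the utterance began


@dataclass
class Gesture:
    tag: str      # hypothetical ID tag, e.g. "P1" for the first pointing gesture
    time: float   # seconds since the utterance began


def serialize(words, gestures):
    """Interleave gesture ID tags into the transcript in time order."""
    events = [(w.start, 0, w.text) for w in words]
    events += [(g.time, 1, f"[{g.tag}]") for g in gestures]
    # On equal timestamps the word comes first, so a tag follows the word it accompanies.
    events.sort(key=lambda e: (e[0], e[1]))
    return " ".join(token for _, _, token in events)


words = [Word("Move", 0.0), Word("this", 0.3), Word("to", 0.9), Word("here", 1.1)]
gestures = [Gesture("P1", 0.5), Gesture("P2", 1.3)]
transcript = serialize(words, gestures)
print(transcript)  # Move this [P1] to here [P2]
```

The details of each gesture (for example the pointed-at position) would be supplied separately under the same ID, so the transcript itself stays short.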

LLM Processing

Uses GPT-4o to generate atomic API calls (CREATE, MOVE, SCALE, DELETE, etc.) instead of complex code scripts or JSON. This approach achieves an average response time of 2.29 seconds, significantly faster than previous LLM-based VR systems. The atomized function approach enables real-time manipulation while maintaining accuracy.
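One reason atomic calls are cheap to handle is that they can be parsed line by line with almost no machinery. The sketch below assumes one call per line in the form `NAME(arg, ...)`; this syntax and the argument lists are illustrative assumptions, and only the function names come from the description above.

```python
import re

# Function names mentioned on this page; the real set may be larger.
ALLOWED = {"CREATE", "MOVE", "SCALE", "DELETE", "LOOKAT"}
CALL_RE = re.compile(r"^\s*([A-Z]+)\((.*)\)\s*$")


def parse_calls(llm_output):
    """Split LLM output into (name, args) tuples and collect lines that do not parse."""
    calls, errors = [], []
    for line in llm_output.splitlines():
        if not line.strip():
            continue
        match = CALL_RE.match(line)
        if match is None or match.group(1) not in ALLOWED:
            errors.append(line)
            continue
        args = [a.strip() for a in match.group(2).split(",") if a.strip()]
        calls.append((match.group(1), args))
    return calls, errors


output = "MOVE(chair_1, 1.0, 0.0, 2.5)\nLOOKAT(chair_1, table_0)\nFLY(chair_1)"
calls, errors = parse_calls(output)
print(calls)
print(errors)
```

Unknown or malformed lines are set aside rather than executed, which is one simple way an API error rate like the one reported under Performance could be counted.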

Scene Update

Parses and executes API calls asynchronously, updating the virtual environment in real-time. The module processes incoming function calls through a buffer system, allowing for frame-by-frame updates without requiring recompilation. This ensures smooth, responsive object manipulation while maintaining system stability.
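A buffered, frame-by-frame update loop could look roughly like the following. It is a single-threaded sketch: the class name `SceneUpdater`, the per-frame budget `calls_per_frame`, and the `MOVE`/`DELETE` argument layouts are assumptions for illustration, and a real engine would call `on_frame` from its own update loop.

```python
from collections import deque


class SceneUpdater:
    def __init__(self, scene, calls_per_frame=2):
        self.scene = scene              # maps object id -> (x, y, z) position
        self.buffer = deque()           # parsed calls waiting to be executed
        self.calls_per_frame = calls_per_frame

    def enqueue(self, name, args):
        """Called whenever a parsed API call arrives from the LLM."""
        self.buffer.append((name, args))

    def on_frame(self):
        """Execute at most `calls_per_frame` buffered calls, oldest first."""
        for _ in range(min(self.calls_per_frame, len(self.buffer))):
            name, args = self.buffer.popleft()
            self._execute(name, args)

    def _execute(self, name, args):
        if name == "MOVE":
            obj_id, x, y, z = args
            self.scene[obj_id] = (float(x), float(y), float(z))
        elif name == "DELETE":
            self.scene.pop(args[0], None)


updater = SceneUpdater({"chair_1": (0.0, 0.0, 0.0), "lamp_1": (0.0, 0.0, 0.0)})
updater.enqueue("MOVE", ["chair_1", "1.0", "0.0", "2.5"])
updater.enqueue("MOVE", ["lamp_1", "2.0", "0.0", "0.0"])
updater.enqueue("DELETE", ["chair_1"])
updater.on_frame()   # first frame: both MOVE calls run
updater.on_frame()   # second frame: the DELETE call runs
print(updater.scene)
```

Capping the work done per frame keeps rendering responsive even when a long response arrives at once.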

Performance

Evaluation across multiple LLMs shows robust behavior for object manipulation tasks: GPT-4o produced no API errors in the user study, while offline evaluation with Llama3.1 models showed an API error rate below 2%.

VR Mover's (a) response time, (b) error rate, and (c) placement examples across GPT-4o, Llama3.1-405B, and Llama3.1-70B.

Results

Our user study with 24 participants compared VR Mover against Control (gizmos + virtual hand) and Voice Command (rule-based) interfaces across single and multi-object manipulation tasks.

Multi-object Performance: 2.4× faster than Control; 1.7× faster than Voice Command

Reduced Fatigue: 47% less arm fatigue than Control; 28% less than Voice Command

Lower Workload: 35% reduction in NASA-TLX vs Control; 18% reduction vs Voice Command

User Preference: 63% first choice for Task 1; 71% first choice for Task 2

Task Performance Details

Task 1A: Single Mid-air

Single Object

Similar performance to Control interface, significantly faster than rule-based Voice Command

Task 1B: Multi-object

Multiple Objects

Coarse: 29.9s vs 72.1s (Control), 50.5s (Voice)

Fine: 52.6s vs 78.5s (Control), 70.0s (Voice)

Hand Movement

Task 1B

3.6m vs 7.5m (Control), 5.1m (Voice)
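As a quick sanity check, the relative improvements can be recomputed from the Task 1B numbers listed above. The snippet is plain arithmetic on the values reported on this page and adds no new data; the helper names are made up for this example.

```python
# Task 1B numbers as reported on this page (seconds and metres).
coarse_s = {"VR Mover": 29.9, "Control": 72.1, "Voice Command": 50.5}
fine_s = {"VR Mover": 52.6, "Control": 78.5, "Voice Command": 70.0}
hand_m = {"VR Mover": 3.6, "Control": 7.5, "Voice Command": 5.1}


def speedup(values, baseline):
    """How many times faster VR Mover is than the baseline."""
    return values[baseline] / values["VR Mover"]


def reduction_pct(values, baseline):
    """Percentage by which VR Mover lowers the measure relative to the baseline."""
    return 100.0 * (1.0 - values["VR Mover"] / values[baseline])


for baseline in ("Control", "Voice Command"):
    print(
        baseline,
        f"coarse {speedup(coarse_s, baseline):.1f}x,",
        f"fine {speedup(fine_s, baseline):.1f}x,",
        f"hand movement -{reduction_pct(hand_m, baseline):.0f}%",
    )
```

The coarse-time ratios round to 2.4× and 1.7×, which matches the headline multi-object figures, and the hand movement distance is roughly half that of Control.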

User Experience

Overall

SUS: 3.90 vs 3.17 (Control)

UEQ-S: 5.73 vs 3.92 (Control)

User Study Charts

Task 1A/1B coarse and fine manipulation times, and hand movement distance

Borg C10 (arm fatigue), SUS (usability), and Presence (PQ)

NASA-TLX workload scores and subscales

UEQ-S (overall, pragmatic, hedonic) and preference ranking for Task 1 and Task 2

User Study

24 participants (ages 18–35) tried all three interfaces in randomized order on a Meta Quest 3, completing performance and creative tasks followed by questionnaires and an interview (~2 hours per session).

Compared Interfaces

The gizmos, virtual hand, and rule-based voice command interfaces.

**Control**: gizmos + virtual hand, the classic approach.

**Voice Command**: an LLM-removed, rule-based variant limited to predefined commands like *"move this here"*.

**VR Mover**: our LLM-based interface.

All three include gizmos and virtual hand for fine-tuning.

Tasks

Task 1A environment: a chair and a mid-air target

Task 1A — Single Mid-air

Move one object to a semi-transparent mid-air target.

Task 1B environment: multiple objects and ground targets

Task 1B — Multi-object

Move several objects to their respective ground targets.

Task 2 — Sandbox Room

Freely furnish an empty room for 7 minutes using a prefab menu.

Task 2 — Reference

A mini-room shown in VR serves as the soft goal to replicate.

Measures

Coarse / fine manipulation time; hand movement distance; arm fatigue (Borg C10); workload (NASA-TLX); usability (SUS); user experience (UEQ-S); presence (PQ); preference ranking.

Design Recommendations

From our implementation and study findings, we derived six design implications for future LLM-based VR interfaces.

Leverage both LLM and classic interfaces

Mix the LLM interface with classic controls for fine-tuning, with seamless switching between the two.

Prioritise coarse placement

Users care most about getting objects roughly where they want them — optimise coarse control first.

Focus on asynchronous multi-object interaction

The LLM's main strength is applying different manipulations to multiple objects in one request.

Develop an optimal set of atomic functions

Atomized functions cut errors and response time dramatically compared to code or JSON generation.

Interaction discovery and tuning

New interactions can be "taught" to the LLM with a single prompt exemplar — no reprogramming needed.
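The idea of teaching an interaction through one exemplar can be sketched as simple prompt assembly. Everything here is hypothetical: the base prompt text, the exemplar wording, and the `CREATE` argument layout are placeholders for illustration and are not the prompts used in VR Mover.

```python
BASE_PROMPT = "You control a VR scene. Reply only with API calls, one per line."


def build_prompt(base, exemplars):
    """Append each (user_text, api_calls) exemplar to the base system prompt."""
    parts = [base]
    for user_text, api_calls in exemplars:
        example = "Example\nUser: " + user_text + "\nAssistant:\n" + "\n".join(api_calls)
        parts.append(example)
    return "\n\n".join(parts)


# One exemplar showing how a lining gesture tag [L1] maps to a row of objects.
lining_exemplar = (
    "I want 2 pictures along the wall here [L1]",
    ["CREATE(picture, 0.0, 1.5, 0.0)", "CREATE(picture, 2.0, 1.5, 0.0)"],
)
prompt = build_prompt(BASE_PROMPT, [lining_exemplar])
print(prompt)
```

Adding or tuning an interaction then amounts to editing the exemplar list, with no change to the parsing or scene update code.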

Mimic real-world behaviour

Replicating how people naturally convey spatial instructions reduces the learning curve in VR.

Explorative interactions built on these implications: (a) asynchronous multi-object undo, (b) repeated transform, and (c) area-drawing, plus filtered selection, e.g. *"I only want the yellow flowers in this area to be larger."*

BibTeX

@inproceedings{wang2025vrmover,
  author    = {Wang, Xiangzhi Eric and Sin, Zackary P. T. and Jia, Ye and Archer, Dan and Fong, Wynonna H. Y. and Li, Qing and Li, Chen},
  title     = {Can You Move These Over There? Exploring an LLM-based VR Mover to Support Natural Multi-object Manipulation},
  booktitle = {Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  series    = {UIST '25},
  articleno = {185},
  numpages  = {18},
  year      = {2025},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  isbn      = {9798400720376},
  doi       = {10.1145/3746059.3747673},
  url       = {https://doi.org/10.1145/3746059.3747673},
  keywords  = {LLM, Object Manipulation, VR Mover, Natural User Interface}
}