VR Mover Trailer

Abstract

In our daily lives, we naturally convey instructions for spatially manipulating objects using words and gestures. Transposing this form of interaction to object manipulation in virtual reality (VR) can be beneficial. We propose VR Mover, an LLM-empowered multimodal interface that interprets the user's vocal instructions, combined with gestures, to support object manipulation. By simply pointing and speaking, the user can command the LLM to manipulate objects without structured input.

Our user study demonstrates that, compared to classic interfaces, VR Mover improves usability, overall user experience, and performance on multi-object manipulation, while also reducing workload and arm fatigue. Users prefer the proposed natural interface for broad movements and may switch to gizmos or virtual hands as a complement for finer adjustments. We believe these findings offer design implications for future LLM-based object manipulation interfaces, highlighting the potential for more intuitive and efficient user interaction in VR environments.

Video

Method

VR Mover addresses the challenge of real-time LLM-based object manipulation by decomposing complex tasks into atomized functions. The system consists of four key components:

Scene Modeling

Converts 3D spatial information into a text-based JSON format, using oriented bounding boxes (OBBs) to represent object positions, rotations, and dimensions. Objects are categorized into environmental (static) and manipulatable (dynamic) elements, with metadata including object names and descriptions. This structured representation enables the LLM to understand spatial relationships and object properties efficiently.
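
To illustrate, a minimal Python sketch of such a scene serialization might look as follows. The field names and schema here are assumptions for illustration, not the system's actual format.

# Hypothetical sketch of serializing a scene into compact JSON text for an LLM prompt.
# Field names ("position", "rotation", "size", ...) are illustrative only.
import json
from dataclasses import dataclass, asdict

@dataclass
class SceneObject:
    id: int
    name: str          # short label, e.g. "sofa"
    description: str   # free-text metadata for the LLM
    category: str      # "environmental" (static) or "manipulatable" (dynamic)
    position: tuple    # OBB center (x, y, z) in meters
    rotation: tuple    # OBB orientation as Euler angles in degrees
    size: tuple        # OBB extents (width, height, depth)

def serialize_scene(objects: list) -> str:
    """Flatten the scene into compact JSON text."""
    return json.dumps([asdict(o) for o in objects], separators=(",", ":"))

scene = [
    SceneObject(0, "floor", "wooden floor", "environmental",
                (0, 0, 0), (0, 0, 0), (10, 0.1, 10)),
    SceneObject(1, "sofa", "a blue two-seat sofa", "manipulatable",
                (1.2, 0.4, -2.0), (0, 90, 0), (1.8, 0.8, 0.9)),
]
print(serialize_scene(scene))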

User-Centric Augmentation

Processes multimodal inputs including speech recognition (via Azure's cloud service), focus frames (groups of continuous viewports during speech), and gestural cues (pointing and lining gestures). A text-based time serialization scheme efficiently injects gestural cues into speech transcripts using ID tags, enabling the LLM to understand temporal relationships between speech and actions.
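
As a rough illustration of the time-serialization idea, the sketch below interleaves recognized words and gestural cues by timestamp. The tag format and helper names are hypothetical, not the paper's exact scheme.

# Illustrative sketch: merge gestural cues into the speech transcript as ID tags
# at the moment they occurred, so the LLM can relate speech and gestures in time.
def inject_gestures(words, gestures):
    """
    words:    list of (timestamp_sec, word) from speech recognition
    gestures: list of (timestamp_sec, cue) such as pointed-at object IDs
    Returns one transcript string with gesture tags interleaved in time order.
    """
    events = [(t, w) for t, w in words] + [(t, f"<{g}>") for t, g in gestures]
    events.sort(key=lambda e: e[0])
    return " ".join(token for _, token in events)

words = [(0.0, "move"), (0.4, "this"), (0.9, "over"), (1.3, "there")]
gestures = [(0.5, "POINT:id=3"), (1.4, "POINT:x=2.1,y=0,z=-1.5")]
print(inject_gestures(words, gestures))
# -> "move this <POINT:id=3> over there <POINT:x=2.1,y=0,z=-1.5>"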

LLM Processing

Uses GPT-4o to generate atomic API calls (CREATE, MOVE, SCALE, DELETE, etc.) instead of complex code scripts or JSON. This approach achieves an average response time of 2.29 seconds, significantly faster than previous LLM-based VR systems. The atomized function approach enables real-time manipulation while maintaining accuracy.
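
For illustration, a simple parser for such atomic calls could look like the sketch below. The call syntax shown (e.g. MOVE(id=3, x=2.1, ...)) is an assumption rather than the system's exact grammar.

# Minimal sketch of parsing atomic API calls returned by the LLM into
# (name, kwargs) pairs that the scene-update module can execute.
import re

CALL_RE = re.compile(r"(?P<name>[A-Z]+)\((?P<args>[^)]*)\)")

def parse_calls(llm_output: str):
    """Extract calls such as ("MOVE", {"id": 3, "x": 2.1, "y": 0.0, "z": -1.5})."""
    calls = []
    for m in CALL_RE.finditer(llm_output):
        kwargs = {}
        for part in filter(None, (p.strip() for p in m.group("args").split(","))):
            key, value = part.split("=", 1)
            try:
                kwargs[key.strip()] = float(value) if "." in value else int(value)
            except ValueError:
                kwargs[key.strip()] = value.strip().strip('"')
        calls.append((m.group("name"), kwargs))
    return calls

print(parse_calls('MOVE(id=3, x=2.1, y=0.0, z=-1.5) SCALE(id=3, factor=1.5)'))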

Scene Update

Parses and executes API calls asynchronously, updating the virtual environment in real-time. The module processes incoming function calls through a buffer system, allowing for frame-by-frame updates without requiring recompilation. This ensures smooth, responsive object manipulation while maintaining system stability.
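
A simplified sketch of such a buffered, per-frame execution loop is shown below. In the real system this logic would sit in the engine's update callback; the names and the per-frame budget are illustrative assumptions.

# Sketch of a command buffer filled asynchronously by the LLM module and
# drained a bounded number of calls per rendered frame, so long command
# lists never stall rendering.
import queue

command_buffer = queue.Queue()  # filled asynchronously as calls are parsed

def enqueue(call):
    """Called from the LLM-processing thread as soon as a call is parsed."""
    command_buffer.put(call)

def execute(name, kwargs):
    print(f"executing {name} with {kwargs}")  # placeholder for real scene edits

def on_frame(max_calls_per_frame: int = 4):
    """Called once per frame; executes at most max_calls_per_frame calls."""
    for _ in range(max_calls_per_frame):
        try:
            name, kwargs = command_buffer.get_nowait()
        except queue.Empty:
            return
        execute(name, kwargs)

enqueue(("MOVE", {"id": 3, "x": 2.1, "y": 0.0, "z": -1.5}))
on_frame()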

Performance

Evaluation across multiple LLM models (GPT-4o, Llama3.1-405B, Llama3.1-70B) shows consistent results with error rates below 2%, demonstrating the system's robustness and reproducibility for object manipulation tasks.

Results

User Study Results

Coming Soon...

Performance Metrics

Coming Soon...

BibTeX

@inproceedings{vrmover2025,
  title={Can You Move These Over There? Exploring an LLM-based VR Mover to Support Natural Multi-object Manipulation},
  author={Wang, Xiangzhi Eric and Sin, Zackary P. T. and Jia, Ye and Archer, Daniel and Fong, Wynonna H. Y. and Li, Qing and Li, Chen},
  booktitle={Proceedings of the ACM Symposium on User Interface Software and Technology},
  year={2025},
  publisher={ACM},
  doi={10.1145/XXXXXXX.XXXXXXX},
  url={https://doi.org/10.1145/XXXXXXX.XXXXXXX}
}