The original VR Mover project runs on Unity with Oculus VR headsets, which makes it hard to duplicate, run, and maintain: it requires a full Unity setup, VR hardware, and several paid service subscriptions. We therefore reproduced the key functionality (LLM streaming, API-call extraction, conversation context, operating rounds, speech-to-text, and timing) in a single, well-documented JavaScript file with zero dependencies. This makes the method easy to reproduce and extend for other use cases. Because everything lives in one self-contained file, you can also hand it to an AI coding agent and have it ported to another programming language with little effort.

VR Mover relies on the exact timing of each recognized word to align spoken commands
with pointing gestures (e.g. matching "put it here" with the moment the user pointed).
The browser's free Web Speech API used in this reproduction does not expose per-word
timestamps, so word timing here is best-effort only. The original project uses the
Azure Speech SDK, which provides precise word-level timing (an Offset and a Duration
for each word); refer to the Azure documentation for detailed usage in your language
of choice. Most other STT options offer the same capability (Google Cloud Speech,
AWS Transcribe, Whisper with timestamps, etc.), so the approach ports directly.
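
As a rough illustration of the alignment idea, the sketch below matches deictic words such as "here" to the pointing sample closest in time. It is not the original VR Mover implementation. It assumes word entries shaped like Azure's detailed results (`Word`, `Offset`, `Duration`, with times in 100-nanosecond ticks) and a hypothetical list of pointing samples of the form `{ timeMs, target }`. The function names `toWordSpans` and `alignPointing` and the deictic word list are invented for this example.

```javascript
// Azure reports Offset/Duration in 100-nanosecond ticks (10,000 ticks per ms).
const TICKS_PER_MS = 10000;

// Convert words shaped like { Word, Offset, Duration } into millisecond spans.
// utteranceStartMs (assumed) places the utterance on the same clock as the gestures.
function toWordSpans(words, utteranceStartMs = 0) {
  return words.map((w) => ({
    text: w.Word,
    startMs: utteranceStartMs + w.Offset / TICKS_PER_MS,
    endMs: utteranceStartMs + (w.Offset + w.Duration) / TICKS_PER_MS,
  }));
}

// For each deictic word, pick the pointing sample nearest to the word's midpoint.
// pointingSamples is a hypothetical array of { timeMs, target } objects.
function alignPointing(
  spans,
  pointingSamples,
  deicticWords = ["here", "there", "this", "that"]
) {
  const matches = [];
  for (const span of spans) {
    if (!deicticWords.includes(span.text.toLowerCase())) continue;
    const midMs = (span.startMs + span.endMs) / 2;
    let best = null;
    for (const sample of pointingSamples) {
      if (
        best === null ||
        Math.abs(sample.timeMs - midMs) < Math.abs(best.timeMs - midMs)
      ) {
        best = sample;
      }
    }
    if (best !== null) {
      matches.push({ word: span.text, wordMidMs: midMs, target: best.target });
    }
  }
  return matches;
}

// Usage with made-up data for the utterance "put it here".
const exampleSpans = toWordSpans([
  { Word: "put", Offset: 5000000, Duration: 2000000 },
  { Word: "it", Offset: 7000000, Duration: 1000000 },
  { Word: "here", Offset: 9000000, Duration: 3000000 },
]);
const examplePointing = [
  { timeMs: 400, target: "table" },
  { timeMs: 1000, target: "shelf" },
  { timeMs: 1600, target: "floor" },
];
console.log(alignPointing(exampleSpans, examplePointing));
```

With the Web Speech API there are no per-word offsets to feed into `toWordSpans`, so a fallback would have to estimate word times (for example from when interim results arrive), which is why timing in this reproduction is only best-effort.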