Ramen VR’s New Frontier

An experimental AI voice prompt that edits a Unity scene

Hey Ramen Lovers, we wanted to give you a look into what we’re working on these days. While Zenith may no longer be our primary focus, that doesn’t mean we’re slowing down. In fact, we’re working on something even more exciting: our next game and the tools that will shape the future of VR game development. We’re still dedicated to creating immersive virtual worlds, and even more importantly, we’re building tools that open the door for others to join us in that endeavor. We believe AI isn’t just another buzzword; it’s a force multiplier. Done right, it could be the key to unlocking creative potential, allowing you to bring your dreams and ideas to life faster than ever before. Imagine reducing the tedious aspects of game development and focusing more on the creativity that makes games truly special.

Why AI and XR?

First off, why XR? XR, short for Extended Reality, is an umbrella term that encompasses Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR). We’ve always been passionate about VR (it’s in our name), but now we’ve decided to extend our focus to all of XR. We believe that XR allows for more realistic interactions and worlds, integrating fantastical elements into the world around you. We also believe that XR headsets (not VR-only ones) will be the first devices to replace desktop computing, and that when it happens it won’t just be for games. In addition to the time we’ve spent in VR, we’ve spent hundreds of hours in XR over the past few years, working, playing, and everything in between, and we truly believe that having the flexibility to do all types of extended reality is the future.

So now let’s talk about AI in the context of VR and XR. We believe AI can be a tool that helps people realize their dreams and ideas faster. The potential is enormous if we apply it the right way. AI can help reduce the challenging, tedious, and/or repetitive aspects of game design and development, leaving more room for humans to do the creative, fun, fulfilling aspects. We believe that good ideas can come from anywhere, so enabling people to express their creativity through games is incredibly important.

To that end, we’ve been working on demos to learn about different technologies that may help us get to that future even faster.

The Blue Wolf

In this experiment, we asked, “What if you could just ask the editor to place objects in a scene for you?” No endless scrolling through the project, no memorizing asset names or locations, just a simple, intuitive interaction.

Cool, right? But let’s be honest, this feels like a really fancy, involved version of speech-to-text. It’s a start, but it’s not quite there yet. Sure, it places the object where you want it, but the mental load is still there. You’re still stuck remembering every single model in the game and selecting a specific one each time. This is where we realized we needed to get more intelligence involved.

The Power of Vibes

Now, let’s take it up a notch.

And just like that, we’re starting to have a real conversation with the editor! Now, you can request objects based on categories like trees, houses, or cars—or even by vibes, like “something cute.” This level of flexibility is perfect for those of us who want creative control without always getting bogged down in the nitty-gritty. Suddenly, the editor isn’t just placing objects, it’s interpreting your intent. You’re no longer required to deal in specifics, and you have the flexibility to focus on the bigger picture. This is the future we’re building towards: tools that understand what you want, even when you’re not sure how to articulate it.

This flexibility in granularity might also lend itself well to younger users or new game developers who can get overwhelmed by the amount of detail that goes into setting up a scene. A designer might want to spend their time placing specific individual monsters, but then just say “add a few more trees over here” for the rest.

Behind the Curtain

The demo project is composed of two halves: the Unity client and a Python server that runs locally. At a high level, the process can be described in three parts:

  1. Transcribe the user’s voice to text

  2. Use an LLM to convert this text into a YAML scene description

  3. Resolve any unknown/ambiguous terms in the YAML into specific, known GameObjects
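
Before diving into each step, here is a minimal sketch (in Python, since the backend is a Python server) that collects the WebSocket event names used throughout this walkthrough in one place. The names all come from the description below; grouping them into an Enum is just a convention for the sketch, not necessarily how the project organizes them.

from enum import Enum

class Event(str, Enum):
    VOICE_TRANSCRIBE = "voice/transcribe"        # client -> server: concatenated audio
    VOICE_TRANSCRIPTION = "voice/transcription"  # server -> client: transcribed text
    TEXT_CHUNK = "text/chunk"                    # server -> client: streamed LLM output
    TEXT_END = "text/end"                        # server -> client: stream finished
    TEXT_FULL = "text/full"                      # server -> client: full consolidated output
    CHANGE_COMPONENTS = "change-components"      # client -> server: YAML to modify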

1. Like and Transcribe (Transcribe the user’s voice to text)

When a user presses and holds down the Space bar in the Unity client, the client begins recording audio input from the user's microphone. During this time, binary waveform chunks of the user's speech are continuously accumulated. These chunks are small, segmented pieces of the audio that are temporarily stored while the user speaks.

As soon as the user releases the Space bar, the recording process stops. All the collected binary waveform chunks are concatenated together to form a complete audio segment representing the user’s speech input. This concatenated audio segment is then transmitted to the backend through a WebSocket call named voice/transcribe, which handles the voice data and kicks off the transcription process.

On the backend, the system uses a self-hosted version of OpenAI’s whisper-large-v3 model to perform the transcription. Whisper is an advanced deep learning model capable of converting spoken language into written text with high accuracy. The backend processes the incoming audio, transcribes it into text, and sends the transcription results back to the client using another WebSocket event called voice/transcription.
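
As a rough illustration, here is a minimal sketch of what that backend handler could look like, assuming the concatenated audio arrives as 16-bit mono PCM at 16 kHz and the model is loaded through the Hugging Face transformers pipeline. The event names match the ones above; the handler itself is illustrative, not our production server code.

import numpy as np
from transformers import pipeline

# Self-hosted Whisper, loaded once when the server starts.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def handle_voice_transcribe(audio_bytes: bytes) -> dict:
    # Convert the concatenated 16-bit PCM samples into the float32 waveform Whisper expects.
    waveform = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    result = asr({"raw": waveform, "sampling_rate": 16000})
    # The transcription goes back to the Unity client on the voice/transcription event.
    return {"event": "voice/transcription", "text": result["text"]}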

2. Can I get to the YAML? (Use an LLM to convert this text into a YAML scene description)

Once the transcription is completed and returned, the text output is further processed by a large language model (LLM). We went with MistralAI’s Mistral-7B-Instruct-v0.3 model, available through Hugging Face. The LLM takes the transcription and converts it into a YAML scene description. This output from the LLM is streamed back to the client in small, manageable chunks, each containing up to 10 tokens at a time. These chunks are sent via a text/chunk event to ensure a smooth, progressive display of text on the client side, which enhances the user experience by providing near-real-time feedback.

When the LLM has completed processing the entire transcription, it sends out two distinct events: text/end and text/full. The text/end event signifies that all chunks have been transmitted, while the text/full event delivers the complete output as a single, consolidated piece of text. On the client side, this output is then parsed as YAML (YAML Ain't Markup Language), which is a human-readable data serialization standard commonly used for configuration files and data exchange (like JSON or XML). In our case, it's particularly useful because Unity uses YAML to represent scenes and prefabs under the hood. This parsing allows for further manipulation or display of the final transcribed and generated text in the desired format. 
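
Here is a minimal sketch of what that streaming step could look like, assuming the Hugging Face transformers streaming API drives the Mistral model. The send() callback and the prompt wording are stand-ins for the project’s real plumbing; only the model name, the event names, and the 10-token chunking come from the description above.

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate_scene_yaml(transcript: str, send) -> str:
    prompt = f"Convert this request into a YAML scene description:\n{transcript}"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512)).start()

    full, buffer, buffered = "", "", 0
    for piece in streamer:              # decoded text, roughly one token per piece
        full += piece
        buffer += piece
        buffered += 1
        if buffered >= 10:              # flush up to ~10 tokens per chunk
            send("text/chunk", buffer)
            buffer, buffered = "", 0
    if buffer:
        send("text/chunk", buffer)
    send("text/end", None)              # all chunks have been transmitted
    send("text/full", full)             # consolidated output the client parses as YAML
    return full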

The Unity client receives the YAML and, using the object map (described below), updates the YAML slightly and sends it off for one last WebSocket call, change-components. The change-components LLM prompt has some constraints:

  • Convert player requests to YAML modifications.

  • Do not remove or replace objects unless explicitly asked.

  • New or changed components should be described in the path property.

The response from the model is received as modified YAML. Here is an example of how some of the YAML might look at this point:

components:
- guid: 123
  path: shop/shop_banner_holo_vertical_a
  position:
    x: 1.000
    y: 0.000
    z: 0.000

In this snippet you can see a GameObject made from a vertical banner prefab, located at (1,0,0).
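
For illustration, here is a minimal sketch of how the change-components request might be assembled on the server, assuming the three constraints above become the instruction portion of the prompt. The llm parameter is any text-in/text-out callable (for example, a wrapper around the streaming call sketched earlier); the exact prompt wording is our own.

import yaml

CHANGE_COMPONENTS_RULES = """\
Convert player requests to YAML modifications.
Do not remove or replace objects unless explicitly asked.
New or changed components should be described in the path property."""

def change_components(request_text: str, scene_yaml: str, llm) -> list:
    prompt = (f"{CHANGE_COMPONENTS_RULES}\n\n"
              f"Current scene YAML:\n{scene_yaml}\n\n"
              f"Player request: {request_text}")
    modified_yaml = llm(prompt)
    # Parse the model's answer into component dicts like the example above,
    # e.g. [{"guid": 123, "path": "shop/shop_banner_holo_vertical_a", ...}]
    return yaml.safe_load(modified_yaml)["components"]

Keeping the constraints in a single constant makes it easy to tweak the prompt without touching the rest of the request plumbing.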

The Object Map

Imagine a vast, complex game world filled with a myriad of objects (trees, stones, creatures, buildings), each with its own unique properties and specifications. To manage and make sense of all these objects, the system uses an object map: a structured dictionary that serves as a catalog of every valid object in the game. The object map isn’t conjured out of thin air. It originates from a predefined data file, object-collection.json, which contains detailed information about all objects that could potentially exist in the game world. This JSON file is loaded at the start of the application and contains dozens of prefab paths along with tags that might describe them, like "wall", "door", "ladder". For example, a wall panel might have tags:

{
  "path": "shop/shop_wall_panel_column_a",
  "tags": "shop, wall, panel, column, a"
}
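
Here is a minimal sketch of what loading that catalog could look like, assuming object-collection.json is a flat list of such path/tags entries; the ObjectMap wrapper itself is our own illustrative structure, not the project’s actual class.

import json

class ObjectMap:
    def __init__(self, json_path: str = "object-collection.json"):
        with open(json_path) as f:
            entries = json.load(f)          # assumed: a list of {"path", "tags"} objects
        # Index every valid prefab path to its descriptive tags, e.g.
        # "shop/shop_wall_panel_column_a" -> "shop, wall, panel, column, a"
        self.tags_by_path = {entry["path"]: entry["tags"] for entry in entries}

    def is_known(self, path: str) -> bool:
        return path in self.tags_by_path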

3. Resolutionary Tactics (Resolve any unknown/ambiguous terms in the YAML into specific, known GameObjects)

For each component in the YAML response, the system checks if it already has a valid path. If the path is already known to the game, meaning it exists in our predefined object map, no further action is needed. The system can confidently skip over it, knowing it’s already in good shape. But if the path isn’t recognized, that’s where things get interesting. For those unrecognized paths, we need to work out, as closely as we can, what the player meant. Here, the system taps into a powerful tool: embeddings.

By converting the ambiguous path into a high-dimensional vector (an embedding) using a pre-trained model like SentenceTransformer, the system creates a numerical representation of this new path. To make the comparisons fair and effective, the system normalizes these embeddings, ensuring they’re on the same scale. Next comes the clever part: comparing this new embedding with a catalog of embeddings for all existing, valid paths in the game. Using cosine similarity, a technique that measures how close two vectors (or paths) are, the system calculates which existing path in the catalog is most similar to the new one. (Imagine scanning through a library for a book title that most closely matches a vague description you were given.) 

Once the system finds the closest match, it replaces the ambiguous or new path with this best match. Now, the component has a validated, known path that fits neatly within the existing game world. Finally, after all the components have been reviewed and corrected, the updated YAML is packaged up and converted back into a string format. This refined YAML, now fully validated and ready to be interpreted by the Unity engine, is sent back as the method’s output and rendered in the Unity scene view!
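
Putting that matching process into code, here is a minimal sketch, assuming the catalog is the ObjectMap from the earlier sketch and a SentenceTransformer checkpoint of our choosing (all-MiniLM-L6-v2 below; the demo may well use a different one).

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative checkpoint choice

def build_catalog(object_map):
    paths = list(object_map.tags_by_path)
    # Embed each known path together with its tags; normalizing means a plain
    # dot product below is exactly the cosine similarity.
    texts = [f"{path} {object_map.tags_by_path[path]}" for path in paths]
    vectors = encoder.encode(texts, normalize_embeddings=True)
    return paths, vectors

def resolve_path(unknown_path: str, paths, vectors) -> str:
    query = encoder.encode([unknown_path], normalize_embeddings=True)[0]
    similarities = vectors @ query                  # cosine similarity against every known path
    return paths[int(np.argmax(similarities))]      # closest match replaces the unknown path

With that in place, an unfamiliar term the LLM produced, say something the player described only by vibe, gets snapped to the nearest real prefab path in the catalog.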

Close-up on the resulting YAML that the scene view renders!

action: createObject
type: racoon
color: "#FFD7D7"
position: cursor

What’s Next?

So, what’s next on the horizon? This was just a demo of some potential technology, but we’re not stopping here. We’re continuing to refine and expand these tools, with some exciting updates and new features in the pipeline, along with working on our next title. Keep an eye out for announcements on Twitter and Discord when we’re ready to share more and possibly even let you test out some of these tools yourself.

And if you’re eager to get even more involved, we’ve got some good news: we’re hiring! We’re looking for talented engineers to help us push the boundaries of what’s possible in XR. 

The future of XR is bright, and we can’t wait to share it with you.
