Beyond the Meow: Infusing Virtual Cats with Real Purr-sonality
Introduction
Imagine having a virtual pet that doesn't just respond with preset reactions, but truly talks back, reacts to your mood, and even surprises you with witty or caring responses. With advances in AI technology, this is no longer just a fantasy—large language models (LLMs) are transforming the world of pet simulators, bringing digital pets to life with dynamic, engaging personalities.
In this post, we’ll take you behind the scenes of how we integrated an LLM into a pet simulator to create a fully interactive, "talking" pet. By using an LLM, our pet doesn’t just reply; it responds in ways that feel spontaneous and personalized, allowing players to build real connections with their virtual companions.
We'll dive into the challenges and decisions that shaped this feature, from choosing the right AI model to designing a voice that fits each pet’s unique personality. Whether you're an AI enthusiast or simply curious about the future of virtual companionship, you'll discover the potential of LLMs to make digital pets feel like part of the family.
Section 1: Why Give Pets a Voice?
The Value of Personality
A talking pet adds an entirely new dimension to the virtual pet experience. Rather than relying on canned responses or limited actions, an LLM-powered pet can engage in real-time, personalized interactions. This deeper engagement goes beyond basic care routines, letting players build a stronger emotional bond with their pets. A talking pet doesn’t just react; it converses, listens, and adapts, creating a sense of companionship that feels both interactive and rewarding.
Current Limitations
In many traditional pet simulators, interactions are confined to a few predictable behaviors or reactions that quickly become repetitive. These pets may wag their tails, bat at a toy mouse, or make sounds, but there's little opportunity for actual conversation or realistic interaction. Without meaningful dialogue, interactions can feel mechanical, lacking the spontaneity and connection of a real pet. As a result, player engagement tends to diminish over time, because predictable interactions never build the kind of bond you form with a real pet.
LLM Potential
This is where large language models (LLMs) come in, with their ability to process and generate human-like language based on context. With an LLM, a virtual pet can respond with phrases and reactions that are nuanced, witty, and often surprising. LLMs analyze the input from players, crafting responses that feel specific to the player’s actions or mood. This conversational ability opens up a range of new possibilities, allowing virtual pets to express a unique “personality” and deepening the sense of companionship for the player.
By giving pets a voice through an LLM, we’re adding a layer of personality that feels real and engaging, transforming what a pet simulator can offer in terms of depth and connection.
Section 2: Choosing the Right LLM for the Job
Model Selection
Selecting the right model to bring your furry friend to life is no small task. The model needed to run as close to real time as possible, since it powers a spoken conversation. After trying a wide range of candidates, we narrowed the field to Meta's Llama-3.2-1B-Instruct and OpenAI's gpt-4o, and after testing for performance and cost (see Section 4) we settled on gpt-4o.
Section 3: Architecture and Integration Process
System Overview
At a high level, the talking cat simulator is composed of several parts. The flow is as follows (a simplified sketch of the server-side round trip follows this list):
- The cat model, including animations and basic agent behavior in Unity
- The voice input, implemented as a watch with a microphone button
- Speech-to-text, in order to convert the player's request into text
- The LLM itself!
- Text-to-speech, to convert the LLM result into an audio file, which is sent back to Unity
- Lip syncing, playing the resulting audio file and moving the cat's mouth to match the sound
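Put together, one round trip on the Python server boils down to a handful of calls. The sketch below is a simplified composition of the helpers fleshed out under Key Components; the function names are ours, not part of any particular library.

```python
def handle_voice_request() -> str:
    """One end-to-end round trip: Unity audio in, synthesized cat reply out."""
    samples = receive_audio()                # base64 audio from the Unity client
    user_text = speech_to_text(samples)      # whisper-large-v3, run locally
    reply_text = generate_reply(user_text)   # gpt-4o, primed with the cat system prompt
    audio_path = text_to_speech(reply_text)  # tts-1 with the Nova voice
    return audio_path                        # sent back over the socket for lip syncing
```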
Key Components:
1. The cat model
The cat model, rig, animations, and agent were made internally as a part of another project and repurposed for this demo.
2. The voice input
When the user presses the microphone button, we begin recording an AudioClip. While the button is held down, we convert the audio into chunks of float values. Once the user releases the button, we concatenate everything into a base64 string and send the audio to a Python server via a socket.
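On the Python side, the server simply listens on a socket, buffers the incoming data, and decodes the base64 payload back into float samples. Here's a minimal sketch of that receiving end, assuming a plain TCP socket, a hard-coded port, and 32-bit float samples; the framing in the actual demo differs slightly.

```python
import base64
import socket

import numpy as np

HOST, PORT = "0.0.0.0", 9000  # placeholder address/port for illustration

def receive_audio() -> np.ndarray:
    """Accept one connection, read the base64 payload, and return float samples."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((HOST, PORT))
        server.listen(1)
        conn, _ = server.accept()
        chunks = []
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
    raw = base64.b64decode(b"".join(chunks))
    # Unity sends the AudioClip as concatenated 32-bit floats
    return np.frombuffer(raw, dtype=np.float32)
```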
3. Speech-to-text
We used OpenAI's whisper-large-v3 to process the audio and convert it to text.
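Running the model locally with the open-source whisper package looks roughly like this; we load the model once at startup and keep it resident between requests.

```python
import numpy as np
import whisper

# Load once at startup; large-v3 is the slowest step in the pipeline.
stt_model = whisper.load_model("large-v3")

def speech_to_text(samples: np.ndarray) -> str:
    """Transcribe the player's recording (float32 samples at 16 kHz) into text."""
    result = stt_model.transcribe(samples)
    return result["text"].strip()
```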
4. The LLM itself!
The text is passed into gpt-4o as the user prompt. In order to get it to return an appropriate response, we also prime it with a system prompt:
You are a cat!
You speak in a concise, whimsical, and catlike manner.
You are sometimes a little humorous, but never mean or rude.
You love your owner and they love you.
You never include roleplay actions between asterisks (ex. "*swats at a fly*"), only spoken dialogue, but your owner might use them.
The system prompt ensures that the tone and length of the response are appropriate for the use-case (pet simulator, as opposed to an assistant or other common use-cases for LLMs).
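The call itself is a standard chat completion. Here's a minimal sketch using the official openai Python client, which reads OPENAI_API_KEY from the environment; error handling is omitted.

```python
from openai import OpenAI

client = OpenAI()

CAT_SYSTEM_PROMPT = """You are a cat!
You speak in a concise, whimsical, and catlike manner.
You are sometimes a little humorous, but never mean or rude.
You love your owner and they love you.
You never include roleplay actions between asterisks (ex. "*swats at a fly*"), only spoken dialogue, but your owner might use them."""

def generate_reply(user_text: str) -> str:
    """Send the transcribed request to gpt-4o and return the cat's reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CAT_SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    )
    return response.choices[0].message.content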
5. Text-to-speech
Before we continue, we remove any actions, emojis, or special characters from the response. Then, similar to the speech-to-text step, we pass the sanitized response into OpenAI's tts-1 model. Because this is a demo, we opted for the built-in Nova voice; if this were to become a full-fledged product, we would train a custom voice from recordings by a voice actor to give the cat even more personality. The resulting audio file is sent back to the Unity application via the socket.
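Here's a sketch of the sanitization and synthesis step, again using the openai client. The regexes are illustrative rather than the exact ones in our demo code.

```python
import re

from openai import OpenAI

client = OpenAI()

def sanitize(reply: str) -> str:
    """Strip roleplay actions, emojis, and other non-speakable characters."""
    reply = re.sub(r"\*[^*]*\*", "", reply)        # remove *actions*
    reply = re.sub(r"[^\w\s.,!?'\-]", "", reply)   # drop emojis and special characters
    return re.sub(r"\s+", " ", reply).strip()

def text_to_speech(reply: str, out_path: str = "cat_reply.mp3") -> str:
    """Synthesize the sanitized reply with tts-1 and save it for the Unity client."""
    speech = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=sanitize(reply),
    )
    speech.write_to_file(out_path)
    return out_path
```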
6. Lip syncing
Upon receiving the audio file, the Unity client immediately begins to play it. We use the SpeechBlend lip-sync library to pull out basic information about the audio as it plays, such as its volume and intensity. Because cats don't normally talk, we don't have to match phonemes as precisely as we would with a human model; avoiding the uncanny valley still matters, but a player won't immediately notice a mismatch between phoneme and lip shape on a cat the way they would on a human. As a result, we use the volume/intensity alone to determine how far the cat's mouth should open (often referred to as a "lip-flapping" style of animation), rather than a more involved mapping of phonemes to specific blendshapes.
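In the demo this mapping happens inside Unity via SpeechBlend, but the underlying idea is simple enough to sketch on its own: compute a smoothed loudness value per frame of audio and use it directly as the "mouth open" amount. The snippet below is purely illustrative and is not SpeechBlend's API; the scale and smoothing factors are tuning knobs chosen by eye.

```python
import numpy as np

def mouth_open_curve(samples: np.ndarray, sample_rate: int = 16_000,
                     frame_ms: int = 50, smoothing: float = 0.6) -> np.ndarray:
    """Map audio loudness to a 0-1 'mouth open' value per frame (lip flapping)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    opens, prev = [], 0.0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2))) if len(frame) else 0.0
        target = min(rms * 10.0, 1.0)                        # louder audio -> wider mouth
        prev = smoothing * prev + (1 - smoothing) * target   # smooth to avoid jitter
        opens.append(prev)
    return np.array(opens)  # drive the jaw-open blendshape with these values
```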
Section 4: Challenges and Solutions
Local vs. Cloud?
The biggest challenge we faced building this demo was deciding whether to run the models locally or in the cloud; there was an immediate tradeoff between speed and cost to take into account. The Python server ran on a laptop with an AMD Ryzen 9 6900HS and an NVIDIA GeForce RTX 3060 Laptop GPU, since it was important to us that the demo run on a fairly average consumer laptop rather than a professional AI workstation.
We ended up deciding on a mix of both, running whisper-large-v3 locally while running gpt-4o and tts-1 in the cloud. This allowed the whole experience to run end to end in approximately 8 seconds, at a cost of less than $0.003 per run.
Average inference times per model:
| Model | Time (seconds) |
| --- | --- |
| whisper-large-v3 (local) | 5.52 |
| gpt-4o (cloud) | 0.71 |
| tts-1 (cloud) | 1.69 |
| Total | 7.92 |
Polyglot Cat?
The whisper-large-v3, gpt-4o, and tts-1 models can all run in single-language or multilingual mode. This let us speak to the cat in any language, switching languages in real time with each request: if the player addressed the cat in a particular language, the cat would respond in the same language.
Unfortunately, multilingual mode added roughly 5 more seconds to each request, which made the whole experience feel slow and unresponsive. For now, we opted to choose the language up front rather than having the cat interpret the language upon hearing it.
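Concretely, for the speech-to-text step this meant pinning the transcription language instead of letting whisper auto-detect it. With the open-source whisper package it's a single argument; the file name below is just a placeholder.

```python
import whisper

stt_model = whisper.load_model("large-v3")

# Auto-detect the spoken language (multilingual): flexible, but slower per request.
auto = stt_model.transcribe("player_request.wav")

# Language pinned up front: faster, but the cat only listens in one language.
pinned = stt_model.transcribe("player_request.wav", language="en")
```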
Section 5: Conclusion
A New Era of Virtual Pets
Creating a "talking" pet through the integration of an LLM has opened up exciting possibilities for our pet simulator. By merging advanced AI with creative character design, we’ve brought players closer to a digital pet that feels responsive, engaging, and uniquely personal. This isn't just a step forward for pet simulators—it hints at a future where digital companions can develop meaningful connections with users, responding dynamically to emotions, preferences, and individual interactions.
Key Takeaways
We’ve covered the many technical and creative decisions that brought this project to life, from selecting the best LLM and optimizing for latency to designing interactions that feel fun and natural. Each decision was a balance between performance, user experience, and personality, aiming to make the interaction as smooth and enjoyable as possible.
Looking Ahead
This project is just the beginning. As LLMs continue to evolve, we see potential for adding even deeper, real-time personality adjustments and enhanced voice interactions. Future iterations could include more expressive emotion recognition or even custom voice synthesis to make every pet truly unique.
Join Our Team!
If this project excites you, imagine what we can build together. We’re hiring for multiple engineering roles to help us push the boundaries of interactive, AI-driven experiences! Join us as we explore new frontiers in gaming, AI, and real-time interactivity. We’re looking for talented people for roles such as:
- ML Product Engineer – Shape how AI interacts with users in real-time.
- Senior Gameplay Engineer – Create immersive and engaging game features that captivate and delight users.
- Lead Engine Programmer – Drive performance and innovation, pushing the limits of what our engine can achieve.
- Technical Artist – Bring the visual magic to life, bridging the gap between art and engineering.
If you’re passionate about building the next generation of interactive experiences, we’d love to hear from you! Check out our careers page to apply.