Introducing audio2blendshapes: Turn Voice into Motion
- Daniela Brenes
- Jul 23
- 2 min read

Meet audio2blendshapes, a deep learning model by RidgeRun.ai that converts raw audio into a sequence of blendshape values — the foundational data for facial animation.
Provide an audio signal, and the model outputs frame-by-frame blendshape movement that reflects the general pacing and dynamics of spoken language. It’s a lightweight, flexible solution for animating avatars, digital characters, or virtual humans using only voice input.
What Are Blendshapes?
Blendshapes are a standard technique in 3D animation for controlling facial expressions and movements. Each blendshape represents a specific facial movement, from cheek puffing and nose sneering to jaw opening.
By blending these poses in different amounts, you can create realistic, expressive facial motion for digital characters. This approach is widely used in games, virtual avatars, film, and augmented reality.
Our model follows the Apple ARKit standard, which defines 52 facial blendshapes to represent a broad range of expressions and mouth movements. These are the same parameters used in many 3D engines and facial animation systems, making it easy to integrate the output of our model into your existing pipeline.
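As a rough illustration of the idea (toy data, not tied to any particular engine or to RidgeRun's implementation), each blendshape can be stored as per-vertex offsets from a neutral mesh, and the animated face is the neutral mesh plus the weighted sum of those offsets:

```python
import numpy as np

# Toy data: a neutral face mesh with 3 vertices (x, y, z), and two blendshape
# targets stored as per-vertex offsets from the neutral pose.
neutral = np.zeros((3, 3))
jaw_open = np.array([[0.0, -0.2, 0.0]] * 3)     # offsets that lower the jaw
cheek_puff = np.array([[0.1, 0.0, 0.05]] * 3)   # offsets that puff the cheeks

def blend(neutral, targets, weights):
    """Deform the neutral mesh: neutral + sum(weight_i * offsets_i)."""
    result = neutral.copy()
    for offsets, weight in zip(targets, weights):
        result += weight * offsets
    return result

# Weights are floats in [0, 1], as in the ARKit convention:
# a half-open jaw combined with a slight cheek puff.
frame = blend(neutral, [jaw_open, cheek_puff], [0.5, 0.2])
print(frame)
```

In a real pipeline the mesh and offsets come from your character rig; the model only supplies the per-frame weights.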
How Does audio2blendshapes Work?
Using audio2blendshapes is simple: just provide an audio signal (from a file or a microphone capture, for example), and the model takes care of the rest.
It processes the audio and outputs a frame-by-frame sequence of blendshape values, each one representing the facial motion associated with a slice of the audio. You can configure the output frame rate to match your needs, whether that's 30 FPS, 60 FPS, or any other value.
The model receives an audio signal as input:
- Length: variable
- Channels: 1
- Sample rate: 16 kHz
- Format: 16 bits per sample
For each audio signal received, it produces a series of blendshape frames, each one consisting of:
- Framerate: customizable, such as 60 FPS
- Timestamp: in seconds
- Blendshape format: 51 ARKit blendshapes
- Blendshape range: floating point between 0 and 1
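To make the input and output formats concrete, here is a minimal sketch of how a client could prepare audio and consume the resulting frames. The WAV loading uses only the Python standard library; the `run_audio2blendshapes` function is a hypothetical placeholder standing in for the model, not RidgeRun's actual API.

```python
import wave
import numpy as np

def load_input_audio(path):
    """Load a WAV file and check it matches the expected input format:
    mono, 16 kHz, 16 bits per sample, variable length."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1, "expected a single channel"
        assert wav.getframerate() == 16000, "expected a 16 kHz sample rate"
        assert wav.getsampwidth() == 2, "expected 16 bits per sample"
        return np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

samples = load_input_audio("speech.wav")

# Hypothetical stand-in for the model call -- illustrative only.
# It returns one frame per 1/fps seconds of audio, each with a timestamp
# in seconds and 51 ARKit blendshape weights in [0, 1].
def run_audio2blendshapes(samples, fps=60):
    num_frames = int(len(samples) / 16000 * fps)
    return [(i / fps, np.zeros(51)) for i in range(num_frames)]  # placeholder output

for timestamp, weights in run_audio2blendshapes(samples, fps=60):
    pass  # drive your character rig with `weights` at `timestamp`
```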
Add a Voice and a Face to Your Characters
Bring your virtual characters to life by combining audio2blendshapes with our real-time Voice Agent.
While Voice Agent powers natural, low-latency conversations — letting users speak, interrupt, and get instant responses — audio2blendshapes adds facial movement driven directly by the voice. The result? A talking avatar that not only sounds human, but looks engaged.
Perfect for embedded systems, cloud solutions, or web interfaces, this duo lets you build expressive, voice-driven characters that feel truly interactive.
🚀 Ready to Animate with Just Audio? Let’s Talk!
audio2blendshapes makes it easy to add expressive facial motion to your projects: just drop in an audio file and bring your avatars, agents, or digital humans to life.
Whether you're prototyping, launching a product, or exploring new ways to enhance user interaction, we’re here to help you make it happen.
Reach out to us at contactus@ridgerun.ai and let’s start planning your project!