Building Talking Characters with Lip Sync in Babylon.js
Lip sync is the intersection of several technologies: 3D character modeling with facial blend shapes, text-to-speech or speech recognition, phoneme-to-viseme mapping, and real-time animation. Each component has its own complexity, but the overall system follows a clear pipeline. Text goes in, audio and timing data come out of a TTS service, and the timing data drives morph targets on the character mesh while the audio plays through the browser's audio system. Babylon.js handles the rendering and animation side, while the speech processing happens server-side through cloud APIs.
Step 1: Prepare the Character Model
The character model needs morph targets (called Shape Keys in Blender) for different mouth positions. The standard set includes 15 viseme shapes that cover all phonemes in English speech. The most important are: mouth open wide (viseme_aa), lips pursed forward (viseme_O), teeth together lips open (viseme_I), lips wide flat (viseme_E), upper teeth on lower lip (viseme_FF), tongue behind upper teeth (viseme_TH), lips closed (viseme_PP), and a neutral rest position.
In Blender, create these shapes by duplicating the base mesh and sculpting each mouth position. Name them consistently, for example with the viseme_ prefix convention shown above, so they are easy to map from whatever viseme identifiers your TTS service emits. Ready Player Me avatars come with these viseme morph targets pre-built, making them a fast starting point for prototyping. Export the model as GLB with "Shape Keys" enabled in the glTF export settings.
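Before wiring up animation, it can help to check that the exported model actually contains every viseme shape. The following is a minimal sketch, assuming the model uses the 15 viseme_* names found on Ready Player Me avatars; findMissingVisemes is a hypothetical helper, not part of Babylon.js.

```typescript
// Assumption: the model follows the 15-name viseme_* convention used by
// Ready Player Me avatars. Adjust the list if your model is named differently.
const VISEME_NAMES = [
  "viseme_sil", "viseme_PP", "viseme_FF", "viseme_TH", "viseme_DD",
  "viseme_kk", "viseme_CH", "viseme_SS", "viseme_nn", "viseme_RR",
  "viseme_aa", "viseme_E", "viseme_I", "viseme_O", "viseme_U",
] as const;

// Hypothetical helper: returns the viseme names that are absent from the
// list of morph target names found on the loaded mesh.
function findMissingVisemes(availableTargetNames: readonly string[]): string[] {
  const available = new Set<string>(availableTargetNames);
  return VISEME_NAMES.filter((name) => !available.has(name));
}
```

In a Babylon.js scene, the available names could be collected by looping from 0 to morphTargetManager.numTargets and reading getTarget(i).name on the head mesh. Logging the missing names right after loading surfaces export mistakes early.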
Beyond mouth shapes, include morph targets for brow raising, brow furrowing, eye squinting, eye widening, and blinking. These expression targets add emotional range to the character. A character that only moves its mouth looks robotic; one that also raises its eyebrows when surprised and furrows its brow when confused feels alive.
Step 2: Understand the Viseme Standard
A phoneme is a unit of speech sound, like the "b" in "bat" or the "ah" in "father." A viseme is the corresponding mouth shape for that sound. Multiple phonemes can map to the same viseme because some sounds look identical on the lips. The sounds "b," "m," and "p" all produce the same closed-lips viseme. The sounds "f" and "v" both produce the upper-teeth-on-lower-lip viseme.
The standard mapping groups approximately 40 English phonemes into 15 viseme categories. When a TTS service generates speech, it also generates a sequence of viseme events with timestamps. Each event says "at time 0.342 seconds, the mouth should be in viseme position 6." Your animation system reads these events and sets the corresponding morph target weights at the right times.
Different TTS services use different viseme numbering or naming schemes. Amazon Polly identifies visemes with short string symbols such as "p", "t", "S", and "sil". Azure Speech Services uses numeric IDs (0 through 21). ElevenLabs returns character-level timestamps rather than visemes, so you derive the mouth shapes yourself. Whichever service you use, you need a lookup table that translates the service's viseme identifiers to your model's morph target names.
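A lookup table of this kind might look like the sketch below. It assumes Amazon Polly's English viseme symbols on the service side and viseme_* morph target names on the model side. The pairing is one plausible assignment chosen for illustration, not an official table, and morphTargetForViseme is a hypothetical helper.

```typescript
// Assumption: keys are Amazon Polly viseme symbols, values are viseme_* morph
// target names on the model. The pairing is illustrative, not official.
const POLLY_TO_MORPH_TARGET = new Map<string, string>([
  ["sil", "viseme_sil"],
  ["p", "viseme_PP"],
  ["f", "viseme_FF"],
  ["T", "viseme_TH"],
  ["t", "viseme_DD"],
  ["k", "viseme_kk"],
  ["S", "viseme_CH"],
  ["s", "viseme_SS"],
  ["r", "viseme_RR"],
  ["a", "viseme_aa"],
  ["@", "viseme_aa"],
  ["e", "viseme_E"],
  ["E", "viseme_E"],
  ["i", "viseme_I"],
  ["o", "viseme_O"],
  ["O", "viseme_O"],
  ["u", "viseme_U"],
]);

// Unknown identifiers fall back to the rest pose so an unexpected symbol
// closes the mouth instead of throwing in the middle of playback.
function morphTargetForViseme(serviceId: string): string {
  return POLLY_TO_MORPH_TARGET.get(serviceId) ?? "viseme_sil";
}
```

Keeping the table as data rather than a switch statement makes it easy to swap in a second table for another service without touching the animation code.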
Step 3: Connect a Text-to-Speech Service
The TTS service converts text into speech audio and, critically, provides viseme timing data synchronized with that audio. Without timing data, you would need to analyze the audio in real time to detect mouth positions, which is computationally expensive and less accurate.
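Amazon Polly is a straightforward choice, but it takes two SynthesizeSpeech requests for the same text and voice, because Polly does not return audio and speech marks in a single response. Send one request with OutputFormat set to "mp3" to get the audio, and a second with OutputFormat set to "json" and SpeechMarkTypes set to ["viseme"] to get the timing data. The speech marks arrive as a JSON Lines stream where each line contains a viseme value and its timestamp in milliseconds. Parse the lines into an array of viseme events sorted by time.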
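Parsing the speech marks is mostly a matter of splitting on newlines. The sketch below assumes the response body has already been read into a string, and that each mark has the time, type, and value fields Polly documents for speech marks. VisemeEvent and parseSpeechMarks are hypothetical names introduced for illustration.

```typescript
// Hypothetical event shape used by the animation code.
interface VisemeEvent {
  timeMs: number;  // offset from the start of the audio, in milliseconds
  viseme: string;  // the service's viseme identifier, e.g. "p" or "sil"
}

// Parses a JSON Lines speech mark stream (one JSON object per line), keeps
// only viseme marks, and returns them sorted by time.
function parseSpeechMarks(jsonLines: string): VisemeEvent[] {
  const events: VisemeEvent[] = [];
  for (const line of jsonLines.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    const mark = JSON.parse(trimmed) as { time: number; type: string; value: string };
    if (mark.type === "viseme") {
      events.push({ timeMs: mark.time, viseme: mark.value });
    }
  }
  events.sort((a, b) => a.timeMs - b.timeMs);
  return events;
}
```

Filtering on the type field means the same parser still works if word or sentence marks are requested alongside visemes.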
Azure Speech Services provides a similar capability through the Speech SDK. The SDK fires viseme events as the audio is synthesized, each carrying a viseme ID and an audio offset. When requested through SSML, the events can also carry blend shape frames for fine-grained facial animation. The blend shape output is more granular than Polly's discrete visemes and gives smoother transitions, because it supplies intermediate blend values rather than a single viseme ID per event.
For the audio playback, use the Web Audio API or an HTML5 audio element. Babylon.js has a built-in Sound class that wraps Web Audio and provides playback controls, volume management, and spatial audio positioning. Create a BABYLON.Sound from the TTS audio data, and when you start playback, simultaneously start the viseme animation timeline.
Step 4: Drive Morph Targets in Real Time
The animation system needs to read the viseme timeline and update morph target weights on each frame. Use scene.onBeforeRenderObservable to run a function every frame that checks the current audio playback time against the viseme event list. Find the current viseme (the last event whose timestamp is less than or equal to the current playback time) and the next viseme, then interpolate between them.
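Because the event list is sorted by time, the lookup for the current viseme can be a binary search rather than a linear scan. This is a minimal sketch, assuming events shaped like the hypothetical VisemeEvent below with timestamps in milliseconds; findCurrentEventIndex is likewise a name chosen for illustration.

```typescript
// Hypothetical event shape; timestamps are in milliseconds.
interface VisemeEvent {
  timeMs: number;
  viseme: string;
}

// Returns the index of the last event whose timestamp is <= nowMs, or -1 if
// playback has not yet reached the first event. Assumes `events` is sorted
// by timeMs in ascending order.
function findCurrentEventIndex(events: readonly VisemeEvent[], nowMs: number): number {
  let lo = 0;
  let hi = events.length - 1;
  let result = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (events[mid].timeMs <= nowMs) {
      result = mid;   // candidate found; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return result;
}
```

The next viseme is then simply the event at index + 1, if one exists. For short utterances a linear scan from the previous frame's index would work just as well; the binary search mainly helps when playback can seek.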
Interpolation is essential for smooth lip movement. Jumping instantly from one viseme to another creates a jerky, mechanical appearance. Instead, calculate a blend factor based on how far between two viseme events the current time is, and use that factor to lerp the morph target weights. The outgoing viseme's weight fades from 1.0 to 0.0 while the incoming viseme's weight rises from 0.0 to 1.0 over a transition period of about 60 to 100 milliseconds.
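One way to compute the blend factor is sketched below. It is an interpretation rather than the only option: it assumes the transition happens in the final stretch before the next event's timestamp, so the mouth arrives at the new shape on time, and it shortens the transition when two events are closer together than the transition period. The function names and the 80 ms default are illustrative.

```typescript
// Returns 0 while the current viseme should be held, rising linearly to 1 as
// playback approaches the next event. All times are in milliseconds.
function blendFactor(
  currentEventMs: number,
  nextEventMs: number,
  nowMs: number,
  transitionMs: number = 80, // assumption: inside the 60 to 100 ms range
): number {
  const gap = nextEventMs - currentEventMs;
  if (gap <= 0) return 1; // degenerate timing: snap to the next viseme
  const span = Math.min(transitionMs, gap); // never longer than the gap itself
  const transitionStart = nextEventMs - span;
  const t = (nowMs - transitionStart) / span;
  return Math.min(1, Math.max(0, t)); // clamp to [0, 1]
}

// Standard linear interpolation between two weights.
function lerp(from: number, to: number, t: number): number {
  return from + (to - from) * t;
}
```

With a factor t, the outgoing viseme's weight is lerp(1, 0, t) and the incoming viseme's weight is lerp(0, 1, t). Replacing the linear ramp with an ease-in-out curve is a common refinement if the motion still looks stiff.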
Reset all viseme morph targets to zero before applying the current blend. This prevents morph targets from accumulating, which would distort the face. Set each viseme target's influence to zero, then set only the current and transitioning viseme targets to their interpolated values. This clean-slate approach avoids bugs where old viseme influences linger.
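The reset-then-apply step can be written as a single pass, as in the sketch below. InfluenceTarget is a hypothetical structural type; Babylon.js MorphTarget objects expose name and influence properties, so they should fit it. The function assumes it is handed only the viseme targets, so expression targets are left untouched.

```typescript
// Hypothetical structural type: anything with a name and a writable influence.
interface InfluenceTarget {
  name: string;
  influence: number;
}

// Writes the blend for this frame. Every viseme target that is not part of
// the blend ends up at zero, so stale influences cannot accumulate.
function applyVisemeBlend(
  visemeTargets: InfluenceTarget[],
  outgoingName: string,
  incomingName: string,
  t: number, // blend factor in [0, 1]
): void {
  const weights = new Map<string, number>();
  weights.set(outgoingName, 1 - t);
  // If both names are the same viseme, the two weights sum to 1.
  weights.set(incomingName, (weights.get(incomingName) ?? 0) + t);
  for (const target of visemeTargets) {
    target.influence = weights.get(target.name) ?? 0;
  }
}
```

Passing only the viseme subset of the morph targets matters: zeroing every target on the mesh would also wipe out brow, eye, and blink values set elsewhere.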
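Synchronization between audio and animation is critical. If the morph targets run ahead of or behind the audio, the lip sync looks wrong. Use the playback position reported by the audio source, such as an audio element's currentTime property, as the authoritative time source, and derive all morph target updates from it rather than from a separate timer. Note that currentTime is in seconds while Polly's timestamps are in milliseconds, so convert before comparing. If the audio buffers or stutters, the morph targets will pause with it, maintaining sync.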
Step 5: Add Expression Layers
Lip sync alone produces a character that speaks but feels emotionally flat. Expression layers add personality by modulating non-mouth morph targets based on the content or context of the dialogue. When the character asks a question, raise the inner brows slightly. When it makes a joke, add a subtle smile. When it delivers bad news, furrow the brows and lower the mouth corners.
Implement expressions as a separate animation channel that runs independently from lip sync. The lip sync channel controls mouth-related morph targets, while the expression channel controls brow, eye, and cheek targets. Because morph target offsets are additive and the two channels affect different regions of the face, both channels can update simultaneously without interfering with each other, as long as the lip sync reset only zeroes the viseme targets.
Automatic expression detection can be powered by sentiment analysis of the dialogue text. Send the text to a sentiment analysis API or use keyword matching to assign emotional categories: positive, negative, questioning, emphatic, neutral. Each category maps to a set of morph target presets. The expression system fades between presets as the emotional tone of the dialogue changes.
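The keyword-matching variant can be very small. The sketch below is deliberately naive: the keyword lists, the preset weights, and morph target names such as browInnerUp and mouthSmile are all assumptions made for illustration, and a real project would substitute its own names or a proper sentiment API.

```typescript
type Emotion = "positive" | "negative" | "questioning" | "emphatic" | "neutral";

// Naive classifier: punctuation first, then a few keywords (all illustrative).
function classifyEmotion(text: string): Emotion {
  const t = text.trim().toLowerCase();
  if (/\?$/.test(t)) return "questioning";
  if (/!$/.test(t)) return "emphatic";
  if (/\b(sorry|unfortunately|sad|afraid)\b/.test(t)) return "negative";
  if (/\b(great|glad|happy|thanks)\b/.test(t)) return "positive";
  return "neutral";
}

// Hypothetical presets: morph target name -> weight for each emotion.
const EXPRESSION_PRESETS: Record<Emotion, Record<string, number>> = {
  positive: { mouthSmile: 0.5 },
  negative: { browDown: 0.5, mouthFrown: 0.4 },
  questioning: { browInnerUp: 0.4 },
  emphatic: { eyeWide: 0.3, browInnerUp: 0.2 },
  neutral: {},
};

// Moves a weight toward its target by at most maxDelta per call, which gives
// a simple per-frame fade between presets.
function stepToward(current: number, target: number, maxDelta: number): number {
  const delta = target - current;
  if (Math.abs(delta) <= maxDelta) return target;
  return current + (delta > 0 ? maxDelta : -maxDelta);
}
```

Calling stepToward once per frame for each expression target, with any target missing from the active preset treated as 0, fades the face from one preset to the next instead of snapping.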
Blinking is a small detail that makes a large difference. Characters that never blink look uncanny. Add a periodic blink animation that fires every 2 to 5 seconds at random intervals, closing the eyes for about 150 milliseconds. Layer this on top of both lip sync and expression channels. The blink animation simply ramps the eye-close morph target from 0 to 1 and back over a short duration.
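A blink scheduler along these lines needs only a next-blink timestamp and a ramp function. The sketch below assumes times in milliseconds and a linear ramp up and back down; the function names are hypothetical, and the random source is injectable so the timing can be made deterministic in tests.

```typescript
const BLINK_DURATION_MS = 150;

// Eye-close weight at `elapsedMs` into a blink: ramps 0 -> 1 -> 0.
function blinkWeight(elapsedMs: number, durationMs: number = BLINK_DURATION_MS): number {
  if (elapsedMs <= 0 || elapsedMs >= durationMs) return 0;
  const half = durationMs / 2;
  return elapsedMs <= half ? elapsedMs / half : (durationMs - elapsedMs) / half;
}

// Random delay between blinks, 2 to 5 seconds.
function nextBlinkDelayMs(random: () => number = Math.random): number {
  return 2000 + random() * 3000;
}

// Returns a function to call once per frame with the current time; it yields
// the eye-close weight and schedules the next blink when one finishes.
function createBlinker(startMs: number, random: () => number = Math.random) {
  let nextBlinkAtMs = startMs + nextBlinkDelayMs(random);
  return (nowMs: number): number => {
    const elapsed = nowMs - nextBlinkAtMs;
    if (elapsed >= BLINK_DURATION_MS) {
      nextBlinkAtMs = nowMs + nextBlinkDelayMs(random);
      return 0;
    }
    return blinkWeight(elapsed);
  };
}
```

Each frame, the returned weight would be written to the eye-close morph target (or targets, if the model has one per eye) after the lip sync and expression channels have run.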
Step 6: Optimize for Real-Time Performance
Morph targets are GPU operations, but updating many targets every frame has a cost. Limit the number of active morph targets to the ones currently in use. If the character is not speaking, disable the lip sync update loop entirely. If the character is far from the camera, reduce the update frequency to every other frame or stop facial animation completely, since the player cannot see the details at distance.
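These rules reduce to a small policy function. In the sketch below, the distance thresholds are placeholders to tune per scene, and the function and type names are hypothetical.

```typescript
type FacialUpdateMode = "off" | "everyOtherFrame" | "everyFrame";

// Chooses how often to run the lip sync update for one character.
function chooseFacialUpdateMode(
  isSpeaking: boolean,
  distanceToCamera: number,
  halfRateDistance: number = 8, // assumption: scene units, tune per project
  cutoffDistance: number = 20,  // assumption: scene units, tune per project
): FacialUpdateMode {
  if (!isSpeaking || distanceToCamera >= cutoffDistance) return "off";
  return distanceToCamera >= halfRateDistance ? "everyOtherFrame" : "everyFrame";
}

// Decides whether the update should run on a given frame number.
function shouldUpdateOnFrame(mode: FacialUpdateMode, frameIndex: number): boolean {
  if (mode === "off") return false;
  if (mode === "everyOtherFrame") return frameIndex % 2 === 0;
  return true;
}
```

When a character switches to "off", it is worth resetting its viseme influences once so the mouth does not stay frozen in the last shape.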
Audio latency is a concern on mobile browsers. The Web Audio API has a context creation latency, and the first audio playback on mobile requires a user gesture. Pre-create the audio context on the first touch event, and pre-decode audio buffers so they are ready for instant playback when the character needs to speak. This prevents the noticeable delay between the character starting to move its mouth and the audio actually playing.
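Unlocking audio on the first gesture could be sketched as follows. The two interfaces are hypothetical structural types, written so the function has no hard dependency on browser globals; in a browser, an AudioContext and the window object are intended to be passed in. The choice of touchend and click as the gesture events is an assumption.

```typescript
// Hypothetical structural types covering the parts of AudioContext and
// EventTarget that this sketch uses.
interface ResumableContext {
  state: string;
  resume(): Promise<void>;
}
interface GestureTarget {
  addEventListener(type: string, listener: () => void, options?: { once?: boolean }): void;
}

// Resumes a suspended audio context the first time the user taps or clicks.
function unlockAudioOnFirstGesture(context: ResumableContext, target: GestureTarget): void {
  const unlock = (): void => {
    if (context.state === "suspended") {
      void context.resume(); // fire and forget; playback code can await later
    }
  };
  target.addEventListener("touchend", unlock, { once: true });
  target.addEventListener("click", unlock, { once: true });
}
```

Once the context is running, fetched audio can be decoded ahead of time with the Web Audio decodeAudioData method so that playback starts without a decoding pause.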
Memory management matters when the character speaks frequently. Each speech response generates audio data and viseme timing data. Store these in a cache with a maximum size, evicting the oldest entries when the cache is full. For long conversations, stream the audio rather than loading the entire file at once, and process viseme events as they arrive rather than waiting for the complete list.
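A size-bounded cache with oldest-first eviction can lean on the fact that a JavaScript Map iterates in insertion order. The sketch below is generic over the cached value; SpeechCache is a hypothetical name, and the key could be the dialogue text or a hash of text plus voice.

```typescript
// Bounded cache: when full, the entry that was stored longest ago is evicted.
class SpeechCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    return this.entries.get(key);
  }

  set(key: string, value: V): void {
    // Re-storing a key moves it to the newest position.
    this.entries.delete(key);
    this.entries.set(key, value);
    while (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A cached value would typically bundle the decoded audio with its viseme event list, so a repeated line can replay without another round trip to the TTS service.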
Talking characters in Babylon.js work by synchronizing morph target animation with TTS audio using viseme timing data. The technical pipeline is clear: text goes to a TTS service, audio and timing come back, and the game drives face blend shapes in sync with playback. The quality of the result depends on smooth interpolation, emotional expression layers, and tight audio synchronization.