Lip Sync for Characters in Babylon.js
This guide assumes you have a Babylon.js scene running in the browser and a 3D character model exported from Blender (or another tool) in GLB format with morph targets for the standard 15 Oculus visemes. If your model uses ARKit blend shapes from Ready Player Me, the same principles apply, but you need an additional mapping step from visemes to ARKit shape combinations.
Prepare a Character Model with Viseme Morph Targets
The foundation of lip sync is the model itself. In Blender, open your character, select the head mesh, go to the Shape Keys panel in the Properties editor, and create a shape key for each viseme. The base shape key ("Basis") is the neutral resting face. Add shape keys named viseme_sil, viseme_PP, viseme_FF, viseme_TH, viseme_DD, viseme_kk, viseme_CH, viseme_SS, viseme_nn, viseme_RR, viseme_aa, viseme_E, viseme_ih, viseme_oh, and viseme_ou.
For each shape key, enter edit mode and sculpt the mouth into the correct position. The PP shape key presses the lips together firmly. The aa shape key drops the jaw open and stretches the corners of the mouth slightly. The oh shape key rounds the lips into a medium circle. Pay close attention to the area around the corners of the mouth and the chin, since these regions define the visual distinctiveness of each viseme. Export the model as GLB with "Shape Keys" enabled in the glTF export settings. Babylon.js reads these as morph targets automatically.
Load the Model and Map Morph Targets
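Load the GLB file using BABYLON.SceneLoader.ImportMesh or BABYLON.SceneLoader.ImportMeshAsync. The glTF loader plugin must be included for this to work (the babylonjs-loaders script, or @babylonjs/loaders/glTF when using ES modules). Once loaded, find the mesh that contains the morph targets. This is typically the head or face mesh. A mesh that contains blend shapes has a non-null morphTargetManager property.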
Build a lookup dictionary that maps viseme names to MorphTarget indices. Iterate through the MorphTargetManager's targets using morphTargetManager.numTargets and morphTargetManager.getTarget(index). Each MorphTarget has a name property that matches the shape key name from Blender. Strip the viseme_ prefix and store the mapping in an object for quick access during the animation loop. For example, if viseme_PP is at index 3, your dictionary stores {"PP": 3}. This initialization runs once and makes the per-frame lookup a simple property access rather than an iteration.
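A minimal sketch of that initialization follows. It is written against small structural interfaces rather than the Babylon.js classes so that it runs outside a browser; the meshes returned by the loader should fit the same shape. The helper name buildVisemeIndex and the prefix-stripping convention are assumptions made for illustration.

```typescript
// Structural stand-ins for the parts of Babylon.js used here. Assumed shapes:
// a MorphTargetManager exposes numTargets and getTarget(i), a MorphTarget has name.
interface MorphTargetLike { name: string; influence: number; }
interface MorphTargetManagerLike {
  numTargets: number;
  getTarget(index: number): MorphTargetLike;
}
interface MeshLike { name: string; morphTargetManager?: MorphTargetManagerLike | null; }

const VISEME_PREFIX = "viseme_";

// Hypothetical helper: find the first mesh that carries viseme morph targets
// and map each short viseme name ("PP", "aa", ...) to its target index.
function buildVisemeIndex(
  meshes: MeshLike[]
): { manager: MorphTargetManagerLike; index: Record<string, number> } | null {
  for (const mesh of meshes) {
    const manager = mesh.morphTargetManager;
    if (!manager) continue; // this mesh has no blend shapes
    const index: Record<string, number> = {};
    for (let i = 0; i < manager.numTargets; i++) {
      const name = manager.getTarget(i).name;
      if (name.startsWith(VISEME_PREFIX)) {
        index[name.slice(VISEME_PREFIX.length)] = i;
      }
    }
    if (Object.keys(index).length > 0) return { manager, index };
  }
  return null; // no mesh with viseme targets was found
}
```

Returning null instead of throwing lets the caller decide whether a model without visemes should fall back to a simpler mouth animation.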
If your model uses ARKit-style blend shape names, as Ready Player Me avatars can (jawOpen, mouthOpen, mouthSmileLeft, etc.), you need an additional mapping layer. Create a table that maps each Oculus viseme to a combination of ARKit blend shapes with weights. For instance, the aa viseme might map to jawOpen at 0.7 and mouthOpen at 0.5. During animation, you apply multiple morph targets per viseme rather than a single one.
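One way to express such a table is sketched below. The shape names follow ARKit conventions (plus mouthOpen from the example above), but every weight other than the aa entry is a placeholder invented for illustration and would need tuning against the actual model. The names VISEME_TO_ARKIT and expandViseme are hypothetical.

```typescript
type ShapeWeights = Record<string, number>;

// Hypothetical viseme -> blend shape recipes. Only a few visemes are listed;
// the weights are starting points, not measured values.
const VISEME_TO_ARKIT: Record<string, ShapeWeights> = {
  sil: {},
  PP: { mouthClose: 0.6, mouthPressLeft: 0.5, mouthPressRight: 0.5 },
  aa: { jawOpen: 0.7, mouthOpen: 0.5 },
  oh: { jawOpen: 0.4, mouthFunnel: 0.6 },
  ou: { jawOpen: 0.2, mouthPucker: 0.8 },
};

// Scale a viseme's recipe by the current intensity and clamp to [0, 1].
// Unknown visemes expand to an empty set, which leaves the face neutral.
function expandViseme(viseme: string, intensity: number): ShapeWeights {
  const recipe = VISEME_TO_ARKIT[viseme] ?? {};
  const out: ShapeWeights = {};
  for (const [shape, weight] of Object.entries(recipe)) {
    out[shape] = Math.min(1, Math.max(0, weight * intensity));
  }
  return out;
}
```

The expanded weights can then be fed into the same per-target smoothing used for single-shape visemes, with each ARKit shape name looked up in its own index dictionary.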
Set Up Audio Playback and Analysis
Create an AudioContext and load your dialogue audio. For pre-recorded audio, use the fetch API to load an ArrayBuffer and decode it with audioContext.decodeAudioData(). Create a source node with audioContext.createBufferSource() and set its buffer. To analyze the signal on its way to the speakers, create an AnalyserNode with audioContext.createAnalyser(), connect the source to the analyser, and connect the analyser to audioContext.destination. Set the AnalyserNode's fftSize to 256 or 512 for a good balance between frequency resolution and performance. The analyser reports fftSize / 2 frequency bins, each sampleRate / fftSize Hz wide, so at a 48 kHz sample rate an fftSize of 512 yields 256 bins of about 94 Hz each.
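The wiring can be sketched as follows. To keep the example runnable outside a browser it declares minimal structural types; a real AudioContext, decoded AudioBuffer, and the nodes it creates should satisfy them. The function names startWithAnalyser and playbackTime are hypothetical, and fetching and decoding the audio is assumed to have happened already.

```typescript
// Minimal structural types (assumptions) covering only the calls used below.
interface NodeLike { connect(destination: unknown): unknown; }
interface AnalyserLike extends NodeLike { fftSize: number; readonly frequencyBinCount: number; }
interface SourceLike extends NodeLike { buffer: unknown; start(when?: number): void; }
interface ContextLike {
  readonly destination: unknown;
  readonly currentTime: number;
  createAnalyser(): AnalyserLike;
  createBufferSource(): SourceLike;
}

// Build source -> analyser -> destination and start playback immediately.
// startedAt is remembered so the playback position can be derived later.
function startWithAnalyser(ctx: ContextLike, buffer: unknown, fftSize = 512) {
  const analyser = ctx.createAnalyser();
  analyser.fftSize = fftSize; // must be a power of two

  const source = ctx.createBufferSource();
  source.buffer = buffer;

  source.connect(analyser);
  analyser.connect(ctx.destination);

  const startedAt = ctx.currentTime;
  source.start(startedAt);
  return { source, analyser, startedAt };
}

// Seconds of audio played so far, measured on the audio clock.
function playbackTime(ctx: ContextLike, startedAt: number): number {
  return Math.max(0, ctx.currentTime - startedAt);
}
```

Browsers usually keep an AudioContext suspended until a user gesture, so in practice this would be called from a click or key handler.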
The AnalyserNode gives you two types of data each frame: frequency data via getByteFrequencyData() and time-domain (waveform) data via getByteTimeDomainData(). For simple amplitude-based lip sync, the time-domain data is sufficient. The samples are unsigned bytes centered on 128, so subtract 128 and divide by 128 before computing the RMS (root mean square), which gives a volume level between 0 and 1. For frequency-based phoneme estimation, the frequency data lets you identify formant regions that correspond to different vowel sounds. The fundamental pitch of the voice (roughly 85 to 250 Hz) says little about mouth shape; the useful bands sit higher. Energy concentrated in the first formant region (roughly 300 to 900 Hz) with little above 2000 Hz correlates with open and back vowels, energy in the second formant region around 2000 Hz and above with front vowels, and energy above about 4000 Hz with sibilant consonants.
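Both measurements reduce to short loops over the byte arrays that the AnalyserNode fills. The sketch below assumes the arrays come from getByteTimeDomainData() and getByteFrequencyData(); the function names rmsLevel and bandEnergy are hypothetical.

```typescript
// RMS level in [0, 1] from time-domain bytes, where 128 means zero amplitude.
function rmsLevel(timeDomain: Uint8Array): number {
  if (timeDomain.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < timeDomain.length; i++) {
    const sample = (timeDomain[i] - 128) / 128; // now in [-1, 1)
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / timeDomain.length);
}

// Mean bin value, scaled to 0..1, between lowHz and highHz.
// The byte values are dB-scaled, so this is a relative measure, not power.
function bandEnergy(
  freq: Uint8Array,
  sampleRate: number,
  lowHz: number,
  highHz: number
): number {
  const fftSize = freq.length * 2;       // frequencyBinCount is fftSize / 2
  const binHz = sampleRate / fftSize;    // width of one bin in Hz
  const first = Math.max(0, Math.floor(lowHz / binHz));
  const last = Math.min(freq.length - 1, Math.ceil(highHz / binHz));
  if (last < first) return 0;
  let sum = 0;
  for (let i = first; i <= last; i++) sum += freq[i];
  return sum / ((last - first + 1) * 255);
}
```

Allocate the Uint8Array buffers once, sized to analyser.fftSize for the waveform and analyser.frequencyBinCount for the spectrum, and reuse them every frame.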
Build the Viseme Timeline or Real-Time Detector
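For pre-recorded dialogue, use Rhubarb Lip Sync to generate a JSON timeline offline. Supplying the transcript as a dialog file is optional but improves recognition. Rhubarb processes a WAV file and, with JSON export selected, writes a mouthCues array of objects, each containing a start time, an end time, and a mouth shape value. Those values are Rhubarb's own mouth shape letters (A through H, plus X for the rest position), not Oculus viseme names, so add a small lookup from the letters to the visemes on your model. Load this JSON in the browser alongside the audio file. During playback, binary search the array to find the cue that is active at the current playback position. The AudioContext's currentTime is the context clock, not the position within the clip, so record its value when you call start() and subtract that from currentTime each frame.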
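The lookup can be sketched as a standard binary search over cues sorted by start time. The cue fields below match Rhubarb's JSON output (start, end, value); the letter-to-viseme table is one plausible assignment chosen for illustration, not something Rhubarb defines, and the function names are hypothetical.

```typescript
interface MouthCue { start: number; end: number; value: string; }

// Assumed mapping from Rhubarb mouth shapes to Oculus viseme names.
const RHUBARB_TO_VISEME: Record<string, string> = {
  X: "sil", // rest
  A: "PP",  // closed lips (P, B, M)
  B: "SS",  // teeth together, lips slightly open
  C: "E",   // open mouth
  D: "aa",  // wide open
  E: "oh",  // slightly rounded
  F: "ou",  // puckered
  G: "FF",  // teeth on lower lip
  H: "nn",  // tongue raised (L)
};

// Return the cue whose [start, end) interval contains t, or null for a gap.
// Cues must be sorted by start time and must not overlap.
function findCue(cues: MouthCue[], t: number): MouthCue | null {
  let lo = 0;
  let hi = cues.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = cues[mid];
    if (t < cue.start) hi = mid - 1;
    else if (t >= cue.end) lo = mid + 1;
    else return cue;
  }
  return null;
}

// Viseme name for playback time t, falling back to silence.
function visemeAt(cues: MouthCue[], t: number): string {
  const cue = findCue(cues, t);
  return cue ? RHUBARB_TO_VISEME[cue.value] ?? "sil" : "sil";
}
```

Because playback time only moves forward, caching the last matched index and checking it first would avoid most searches; the binary search remains useful after seeks or restarts.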
For real-time lip sync without pre-computed data, build a simple phoneme estimator using the AnalyserNode frequency data. Divide the frequency spectrum into bands and map each band's energy to a viseme. High energy in the 200 to 800 Hz range with low energy above 2000 Hz suggests an open vowel (viseme aa). Strong energy above 4000 Hz suggests a sibilant (viseme SS). Broad energy across the spectrum suggests a plosive or fricative. This approach is approximate, but when combined with smooth blending, it produces acceptable results for most game scenarios. For higher accuracy, consider a WebAssembly port of a proper phoneme recognizer running in a Web Worker.
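A rough sketch of such an estimator is shown below. It takes three band levels already normalized to 0..1 and applies the rules from the paragraph above. The thresholds, the band edges in the comments, the choice of FF to stand in for "plosive or fricative", and the null result for uncertain frames are all assumptions made for illustration.

```typescript
interface BandLevels {
  low: number;  // about 200 to 800 Hz
  mid: number;  // about 2000 to 4000 Hz
  high: number; // above about 4000 Hz
}

// Hypothetical thresholds; tune them against real recordings.
const SILENCE_LEVEL = 0.05;
const STRONG_LEVEL = 0.5;
const WEAK_LEVEL = 0.25;

// Returns a viseme name, or null when no rule matches confidently.
function estimateViseme(bands: BandLevels): string | null {
  const peak = Math.max(bands.low, bands.mid, bands.high);
  if (peak < SILENCE_LEVEL) return "sil";

  // Broad energy across the whole spectrum: treat as a fricative.
  if (bands.low >= STRONG_LEVEL && bands.mid >= STRONG_LEVEL && bands.high >= STRONG_LEVEL) {
    return "FF";
  }
  // Energy dominated by the top band: sibilant.
  if (bands.high >= STRONG_LEVEL) return "SS";

  // Strong low band with little above 2000 Hz: open vowel.
  if (bands.low >= STRONG_LEVEL && bands.mid < WEAK_LEVEL && bands.high < WEAK_LEVEL) {
    return "aa";
  }
  return null; // uncertain; the caller can fall back to amplitude only
}
```

Returning null rather than guessing gives the render loop a clean signal to keep the previous viseme or to switch to an amplitude-driven jaw for that frame.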
Drive Morph Targets in the Render Loop
Register a callback with scene.registerBeforeRender() that runs every frame before the scene renders. Inside this callback, determine the current target viseme (from the timeline or the real-time detector) and its intensity. Look up the corresponding MorphTarget index from your dictionary. Set that target's influence to the desired weight.
Apply smoothing by interpolating between the current influence values and the target values rather than setting them directly. For each morph target, compute the new influence as: currentInfluence + (targetInfluence - currentInfluence) * blendFactor, where blendFactor is a value between 0.1 and 0.5. A blend factor of 0.3 provides natural-looking transitions at 60 fps. Because the factor is applied once per frame, the same value blends faster at higher frame rates, so scale it by the frame's delta time if your frame rate varies. Store the current influence values in a separate array that persists between frames, since reading back from the MorphTarget object every frame adds unnecessary overhead.
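The interpolation can be sketched as below. The delta-time correction is an addition not described in the text: it converts a blend factor tuned at 60 fps into the equivalent factor for a frame of any length, so that the mouth moves at the same speed on a 30 fps or 144 fps display. The function names are hypothetical.

```typescript
// Convert a per-frame blend factor tuned at 60 fps into the factor that
// produces the same motion over a frame lasting dtSeconds.
function blendForFrame(blendAt60: number, dtSeconds: number): number {
  return 1 - Math.pow(1 - blendAt60, dtSeconds * 60);
}

// Move each current influence toward its target, in place:
// current + (target - current) * blend
function lerpInfluences(
  current: Float32Array,
  target: Float32Array,
  blend: number
): void {
  const n = Math.min(current.length, target.length);
  for (let i = 0; i < n; i++) {
    current[i] += (target[i] - current[i]) * blend;
  }
}
```

In Babylon.js the frame time is available from engine.getDeltaTime() in milliseconds, so a render callback would pass that value divided by 1000 as dtSeconds.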
Each frame, set the target influence of every viseme to zero before assigning targets for the active visemes, then run the same interpolation over all of them. This prevents stale viseme shapes from lingering when the speech moves to a different mouth position. Because inactive shapes decay through the lerp rather than snapping to a hard zero, the mouth also returns smoothly to neutral during pauses in speech.
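Put together, one frame of the update might look like the sketch below. It keeps its own arrays of current and goal influences and writes the result to objects that have an influence property, which is the part of a Babylon.js MorphTarget it relies on. The factory name createVisemeBlender and the shape of the active list are assumptions.

```typescript
interface InfluenceTarget { influence: number; }
interface ActiveViseme { index: number; weight: number; }

// Returns an update function to call once per frame, for example from a
// callback registered with scene.registerBeforeRender().
function createVisemeBlender(targets: InfluenceTarget[]) {
  const current = new Float32Array(targets.length); // persists between frames
  const goal = new Float32Array(targets.length);

  return function update(active: ActiveViseme[], blend: number): void {
    // 1. Every shape aims for zero unless it is active this frame.
    goal.fill(0);
    for (const { index, weight } of active) {
      if (index >= 0 && index < goal.length) goal[index] = weight;
    }
    // 2. Ease all shapes toward their goals and write them to the targets.
    for (let i = 0; i < current.length; i++) {
      current[i] += (goal[i] - current[i]) * blend;
      targets[i].influence = current[i];
    }
  };
}
```

Calling update with an empty active list is enough to relax the mouth during a pause, since every goal is then zero.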
Add Smoothing and Polish
Tune the blend factor by testing with actual dialogue. Fast speech benefits from a higher blend factor (0.4 to 0.5) so the mouth keeps up with rapid phoneme changes. Slow, deliberate speech works better with a lower factor (0.2 to 0.3) so the transitions look smooth rather than twitchy. Some implementations adapt the blend factor based on speech rate, detected from the gap between consecutive viseme changes in the timeline.
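One simple way to adapt the factor is to map the gap between viseme changes linearly onto the 0.2 to 0.5 range mentioned above. The gap limits of 60 ms and 200 ms in this sketch are placeholder values, and the function name adaptiveBlend is hypothetical.

```typescript
// Short gaps (fast speech) give a high blend factor, long gaps a low one.
// fastGap and slowGap are assumed limits in seconds.
function adaptiveBlend(
  gapSeconds: number,
  fastGap = 0.06,
  slowGap = 0.2,
  minBlend = 0.2,
  maxBlend = 0.5
): number {
  const span = slowGap - fastGap;
  if (span <= 0) return maxBlend; // degenerate configuration
  const t = Math.min(1, Math.max(0, (gapSeconds - fastGap) / span));
  return maxBlend + (minBlend - maxBlend) * t;
}
```

With a pre-computed timeline, the gap can be taken as the duration of the current cue, or averaged over the last few cues to avoid sudden jumps in responsiveness.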
Add a jaw amplitude fallback that runs when the real-time phoneme detector is uncertain or when the character is a background NPC that does not need full viseme resolution. This fallback uses the RMS amplitude from the AnalyserNode to drive a single jawOpen morph target, producing basic mouth movement that is cheap to compute. You can blend between the full viseme system and the amplitude fallback based on camera distance, giving close-up characters high-quality lip sync while distant characters use the simpler approach.
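The distance-based blend can be sketched as a crossfade between the two systems. The near and far distances, the jaw gain, and the function names below are assumptions for illustration; jawOpen is the single fallback morph target described above.

```typescript
// 1 at or inside `near`, 0 at or beyond `far`, linear in between.
function detailWeight(distance: number, near: number, far: number): number {
  if (far <= near) return distance <= near ? 1 : 0;
  const t = (distance - near) / (far - near);
  return 1 - Math.min(1, Math.max(0, t));
}

// Crossfade full viseme weights against an amplitude-driven jaw.
// rms is the 0..1 volume level; jawGain is a hypothetical tuning value.
function mixLipSync(
  visemeWeights: Record<string, number>,
  rms: number,
  distance: number,
  near = 3,
  far = 12,
  jawGain = 2.5
): Record<string, number> {
  const detail = detailWeight(distance, near, far);
  const out: Record<string, number> = {};
  for (const [name, value] of Object.entries(visemeWeights)) {
    out[name] = value * detail;
  }
  const jaw = Math.min(1, rms * jawGain) * (1 - detail);
  out["jawOpen"] = Math.min(1, (out["jawOpen"] ?? 0) + jaw);
  return out;
}
```

When the detail weight reaches zero, the frequency analysis for that character can be skipped entirely, which is where most of the saving comes from.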
Handle the end of speech gracefully. When the audio finishes or the viseme timeline runs out, smoothly interpolate all morph target influences back to zero over 200 to 300 milliseconds rather than snapping the mouth shut. Register an onended callback on the AudioBufferSourceNode that triggers this return-to-neutral transition. If using streaming TTS audio, monitor the audio buffer state and begin the transition when no new audio chunks have arrived for a configurable timeout period.
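The return-to-neutral transition can be sketched as a timed fade that starts from whatever influences were active when the audio ended. The 250 ms default falls inside the range given above; the smoothstep easing and the function name createNeutralFade are assumptions.

```typescript
// Capture the influences at the moment speech ends and return a function
// that gives the faded values for any elapsed time since then.
function createNeutralFade(startInfluences: number[], durationMs = 250) {
  const start = startInfluences.slice(); // copy, so later changes do not leak in

  return function influencesAt(elapsedMs: number): number[] {
    const t = durationMs > 0
      ? Math.min(1, Math.max(0, elapsedMs / durationMs))
      : 1;
    const remaining = 1 - t * t * (3 - 2 * t); // smoothstep ease toward zero
    return start.map((value) => value * remaining);
  };
}
```

A handler assigned to the source node's onended property would create the fade and record the time, and the render loop would then apply influencesAt(now - endTime) until the duration has passed.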
Babylon.js lip sync comes down to three things: a model with well-sculpted viseme morph targets, a mapping from viseme names to MorphTarget indices, and a render loop that smoothly interpolates influence values based on audio timing data. The MorphTargetManager handles GPU blending automatically, so your code focuses on which shapes should be active and how strongly.