Lip Sync for Characters in Three.js
This guide works with both vanilla Three.js and React Three Fiber. The core morph target API is the same in both; the only difference is where you put the animation loop code. In vanilla Three.js, you write the update logic inside the requestAnimationFrame callback, usually with a THREE.Clock to measure the time between frames. In React Three Fiber, you use the useFrame hook. The concepts and math are identical.
Export a Character with Morph Targets
In Blender, select your character's head mesh and open the Shape Keys panel under Object Data Properties. The first shape key, "Basis," is the neutral face. Add a new shape key for each viseme in the Oculus 15-viseme set. Name them clearly: viseme_sil, viseme_PP, viseme_FF, viseme_TH, viseme_DD, viseme_kk, viseme_CH, viseme_SS, viseme_nn, viseme_RR, viseme_aa, viseme_E, viseme_ih, viseme_oh, viseme_ou.
For each shape key, enter edit mode and adjust the vertices around the mouth to form the correct shape. Keep the edits symmetrical, either with a mirror modifier or with X-axis mirroring in edit mode, so both sides of the face move together. Focus on the lips, jaw, chin, and the corners of the mouth. When all 15 are sculpted, export to GLB format. In Blender's glTF export dialog, make sure the "Shape Keys" option is checked (its exact location in the dialog varies between Blender versions). The Three.js GLTFLoader reads these as morph targets and populates the morphTargetDictionary object and the morphTargetInfluences array on the loaded mesh.
If you are using Ready Player Me avatars, the GLB comes pre-loaded with ARKit-compatible blend shapes instead of Oculus visemes. These include names like jawOpen, mouthClose, mouthFunnel, mouthPucker, mouthLeft, mouthRight, and about 46 more. You can still do lip sync with these shapes by creating a mapping table that converts each Oculus viseme into a weighted combination of ARKit shapes.
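A minimal sketch of such a mapping table follows. The blend shape names are standard ARKit names, but the weights and the `visemesToArkit` helper are illustrative assumptions, not values from any official table, and only a few visemes are covered. Tune the numbers against your own avatar.

```typescript
// One recipe: ARKit blend shape name -> contribution at full viseme weight.
type ArkitRecipe = { [shapeName: string]: number };

// Hypothetical weights for a handful of Oculus visemes (assumption: tune per avatar).
const VISEME_TO_ARKIT: { [viseme: string]: ArkitRecipe } = {
  sil: {},
  PP: { mouthClose: 0.6, mouthPressLeft: 0.5, mouthPressRight: 0.5 },
  aa: { jawOpen: 0.7 },
  oh: { jawOpen: 0.4, mouthFunnel: 0.6 },
  ou: { jawOpen: 0.2, mouthPucker: 0.8 },
};

// Convert weighted visemes into ARKit blend shape weights, clamped to [0, 1].
function visemesToArkit(
  visemeWeights: { [viseme: string]: number }
): { [shapeName: string]: number } {
  const result: { [shapeName: string]: number } = {};
  for (const viseme of Object.keys(visemeWeights)) {
    const recipe = VISEME_TO_ARKIT[viseme];
    if (!recipe) continue; // viseme not covered by the table: skip it
    for (const shape of Object.keys(recipe)) {
      const combined = (result[shape] || 0) + visemeWeights[viseme] * recipe[shape];
      result[shape] = Math.min(1, combined);
    }
  }
  return result;
}
```

The returned object can then be written to the mesh by looking up each ARKit name in morphTargetDictionary, exactly as you would for native viseme shapes.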
Load the GLB and Build the Viseme Map
Load the GLB with GLTFLoader, which ships with Three.js as an addon module rather than on the THREE namespace (or use useGLTF from drei in React Three Fiber). The loaded scene contains a hierarchy of Object3D nodes. The mesh with morph targets is typically a SkinnedMesh. Traverse the scene graph to find it.
Call gltf.scene.traverse() and check each node for the presence of morphTargetDictionary. When you find the mesh, store a reference to it. The morphTargetDictionary is an object where keys are shape key names (viseme_PP, viseme_aa, etc.) and values are integer indices into the morphTargetInfluences array.
Build your viseme lookup by extracting the indices you need. Create an object like {PP: 3, FF: 5, aa: 12, ...} where each value is the index from morphTargetDictionary for the corresponding viseme name. This lookup is built once during initialization, so the animation loop works with short, fixed keys (or plain indices) instead of assembling and resolving full shape key names on every frame. If the morphTargetDictionary uses a different naming convention (no viseme_ prefix, ARKit names, etc.), your initialization code handles the translation once.
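The traversal and lookup construction described above might look like the sketch below. To keep it runnable without the three package, it uses small structural interfaces (`MorphMeshLike`, `SceneLike`) as stand-ins for the Three.js types; those interfaces and both function names are assumptions made for this example. In a real project you would pass gltf.scene and work with THREE.Mesh directly.

```typescript
// Stand-ins for the Three.js fields used here (assumption: mirrors THREE.Mesh).
interface MorphMeshLike {
  morphTargetDictionary?: { [name: string]: number };
  morphTargetInfluences?: number[];
}
interface SceneLike {
  traverse(callback: (node: MorphMeshLike) => void): void;
}

const VISEME_NAMES = [
  "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
  "nn", "RR", "aa", "E", "ih", "oh", "ou",
];

// Return the first node that carries morph targets, or null if none does.
function findMorphMesh(scene: SceneLike): MorphMeshLike | null {
  const matches: MorphMeshLike[] = [];
  scene.traverse((node) => {
    if (node.morphTargetDictionary && node.morphTargetInfluences) {
      matches.push(node);
    }
  });
  return matches.length > 0 ? matches[0] : null;
}

// Map short viseme names to morph target indices, resolving the prefix once.
// Visemes missing from the dictionary are simply left out of the result.
function buildVisemeMap(
  dictionary: { [name: string]: number },
  prefix: string = "viseme_"
): { [viseme: string]: number } {
  const map: { [viseme: string]: number } = {};
  for (const name of VISEME_NAMES) {
    const index = dictionary[prefix + name];
    if (index !== undefined) map[name] = index;
  }
  return map;
}
```

If a character has several meshes with morph targets (head, teeth, tongue), collect all matches instead of the first and build one map per mesh.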
Connect Audio to an AnalyserNode
Create an AudioContext using the standard Web Audio API. Note that browsers require a user gesture (a click or keypress) before the AudioContext can start, so create and resume the context inside a click handler. Load your audio file with fetch(), read the response as an ArrayBuffer, decode it with audioContext.decodeAudioData(), and create a source node with audioContext.createBufferSource(). Assign the decoded AudioBuffer to the source's buffer property before starting playback.
Create an AnalyserNode with audioContext.createAnalyser(). Set fftSize to 256 for quick, low-resolution frequency analysis or 512 for more detail. Connect the source to the analyser, and the analyser to the destination: source.connect(analyser) followed by analyser.connect(audioContext.destination). Allocate a Uint8Array of length analyser.frequencyBinCount (half of fftSize) to hold frequency data, and another of length analyser.fftSize for time-domain data. These arrays are reused every frame, so allocate them once during setup.
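The wiring described above can be sketched as a single setup function. The function name `connectAnalyser` and the `AnalyserSetup` shape are assumptions for this example; the Web Audio calls themselves are standard. It expects an AudioContext that has already been created and resumed inside a user gesture, and an AudioBuffer that has already been decoded.

```typescript
interface AnalyserSetup {
  source: AudioBufferSourceNode;
  analyser: AnalyserNode;
  frequencyData: Uint8Array;  // one byte per frequency bin
  timeDomainData: Uint8Array; // one byte per time-domain sample
}

// Wire source -> analyser -> speakers and allocate the reusable buffers.
function connectAnalyser(
  audioContext: AudioContext,
  decodedAudio: AudioBuffer,
  fftSize: number = 256
): AnalyserSetup {
  const source = audioContext.createBufferSource();
  source.buffer = decodedAudio;

  const analyser = audioContext.createAnalyser();
  analyser.fftSize = fftSize;

  // The analyser passes audio through unchanged while exposing its data.
  source.connect(analyser);
  analyser.connect(audioContext.destination);

  return {
    source,
    analyser,
    frequencyData: new Uint8Array(analyser.frequencyBinCount),
    timeDomainData: new Uint8Array(analyser.fftSize),
  };
}
```

Call source.start() on the returned source when playback should begin. Each frame, fill the two arrays with analyser.getByteFrequencyData() and analyser.getByteTimeDomainData().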
For streaming TTS audio, use an AudioWorkletNode or a MediaStreamAudioSourceNode (created with audioContext.createMediaStreamSource()) instead of a buffer source. The analyser connection works the same way regardless of the source type. The key is that the analyser sits between the source and the speakers, passively reading audio data without modifying it.
Load or Generate Viseme Timing Data
For pre-baked lip sync, load the JSON timeline generated by Rhubarb Lip Sync or a similar tool. The timeline is an array of entries, each with a start time (in seconds), an end time, and a mouth shape identifier. Rhubarb stores these entries under a mouthCues key and labels them with its own mouth shape letters, so map those to your viseme names when you load the file. At runtime, track the audio playback position and find the entry that contains it. Because playback is sequential, the next entry is almost always the current one or its neighbor, so cache the last matched index and check there first. Fall back to a binary search only when the cached position misses, for example after a seek.
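A minimal sketch of that lookup, with the cached index as the fast path and a binary search as the fallback. The `VisemeCue` shape and the `createTimelineLookup` name are assumptions for this example; the cues are assumed to be sorted by start time and non-overlapping, already converted from whatever format your tool emits.

```typescript
interface VisemeCue {
  start: number;  // seconds
  end: number;    // seconds, exclusive
  viseme: string; // e.g. "aa", "PP"
}

// Returns a function that finds the cue active at a given playback time.
function createTimelineLookup(
  cues: VisemeCue[]
): (time: number) => VisemeCue | null {
  let lastIndex = 0;
  return (time: number): VisemeCue | null => {
    // Fast path: the cue from the previous call, or the one right after it.
    const fastEnd = Math.min(lastIndex + 1, cues.length - 1);
    for (let i = lastIndex; i <= fastEnd; i++) {
      if (time >= cues[i].start && time < cues[i].end) {
        lastIndex = i;
        return cues[i];
      }
    }
    // Slow path: binary search for the last cue whose start is <= time.
    let lo = 0;
    let hi = cues.length - 1;
    let candidate = -1;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      if (cues[mid].start <= time) {
        candidate = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    if (candidate < 0) return null; // before the first cue
    lastIndex = candidate;
    // A null result here means the time falls in a gap or past the end.
    return time < cues[candidate].end ? cues[candidate] : null;
  };
}
```

For buffer playback, the time passed in is typically audioContext.currentTime minus the context time at which source.start() was called.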
For real-time lip sync without a pre-computed timeline, build a simple estimator from the AnalyserNode data. Each frame, call analyser.getByteFrequencyData(frequencyArray). Compute the energy in several frequency bands. Each bin is sampleRate / fftSize wide, which is about 172 Hz with fftSize 256 at 44100 Hz; check audioContext.sampleRate, since many devices run at 48000 Hz. The lowest band (indices 0 to 3) covers roughly 0 to 700 Hz, which is where the voice fundamental and the first vowel formant live. The mid band (indices 4 to 8) covers roughly 700 to 1550 Hz. The high band (indices 9 to 20) covers roughly 1550 to 3600 Hz, where sibilants and fricatives are prominent. Map these energy values to viseme weights: high low-band energy with low high-band energy suggests an open vowel (aa or oh). High energy across all bands suggests a plosive. Dominant high-band energy suggests a sibilant (SS or CH). This is crude but functional for real-time scenarios where no transcript is available.
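The band heuristic can be sketched as two small functions. The bin ranges assume fftSize 256 at 44100 Hz, and the high band starts at bin 9 so that no bin is counted in two bands. The thresholds, the function names, and the choice of "E" as a fallback viseme are all illustrative assumptions that need tuning against real speech.

```typescript
interface BandEnergy {
  low: number;  // each value is normalized to 0..1
  mid: number;
  high: number;
}

// Mean byte magnitude over bins [from, to] inclusive, scaled to 0..1.
function bandAverage(bins: ArrayLike<number>, from: number, to: number): number {
  const last = Math.min(to, bins.length - 1);
  if (last < from) return 0;
  let sum = 0;
  for (let i = from; i <= last; i++) sum += bins[i];
  return sum / ((last - from + 1) * 255);
}

function measureBands(bins: ArrayLike<number>): BandEnergy {
  return {
    low: bandAverage(bins, 0, 3),
    mid: bandAverage(bins, 4, 8),
    high: bandAverage(bins, 9, 20),
  };
}

// Crude classifier; thresholds are guesses, not measured values.
function estimateViseme(energy: BandEnergy, silenceLevel: number = 0.1): string {
  const mean = (energy.low + energy.mid + energy.high) / 3;
  if (mean < silenceLevel) return "sil";
  // Dominant high band: treat as a sibilant.
  if (energy.high > energy.low && energy.high > energy.mid) return "SS";
  // Strong energy everywhere: treat as a plosive burst.
  if (energy.low > 0.5 && energy.mid > 0.5 && energy.high > 0.5) return "PP";
  // Low band clearly stronger than high band: open vowel.
  if (energy.low > energy.high * 2) return "aa";
  return "E"; // arbitrary fallback for everything else (assumption)
}
```

A natural refinement is to use the mean energy as the target weight for the chosen viseme, so quiet speech opens the mouth less than loud speech.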
Update morphTargetInfluences Each Frame
In your animation loop (or useFrame hook in React Three Fiber), execute the following sequence. First, determine the active viseme and its target weight from either the timeline lookup or the real-time estimator. Second, for each morph target in your viseme set, compute the new influence value using lerp: newValue = currentValue + (targetValue - currentValue) * blendFactor. The blendFactor controls transition speed. A value of 0.3 at 60fps closes about 90% of the gap to the target in roughly 100ms, since only 0.7^6, or about 12%, of the gap remains after six frames. Because the factor is applied once per frame, the same value blends more slowly at lower frame rates; derive it from the frame's delta time if you need consistent timing. Third, write the computed values to mesh.morphTargetInfluences[index] for each active viseme.
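The lerp step can be sketched as below, together with one common way to make it independent of frame rate: treat the tuned value as the per-frame factor at a reference rate and raise the remainder to the power of the elapsed frames. The function names and the 60fps reference are assumptions for this example, not part of the Three.js API.

```typescript
// Convert a blend factor tuned at referenceFps into one for this frame's delta.
// At exactly one reference frame of elapsed time, it returns baseFactor.
function blendFactorForDelta(
  baseFactor: number,
  deltaSeconds: number,
  referenceFps: number = 60
): number {
  return 1 - Math.pow(1 - baseFactor, deltaSeconds * referenceFps);
}

// Move each current influence toward its target by the given factor (in place).
function stepInfluences(
  current: Float32Array,
  target: ArrayLike<number>,
  factor: number
): void {
  for (let i = 0; i < current.length; i++) {
    current[i] += (target[i] - current[i]) * factor;
  }
}

// Copy smoothed values into the mesh's influence array.
// morphIndices[i] is the morph target index for local slot i.
function writeInfluences(
  current: Float32Array,
  morphIndices: ArrayLike<number>,
  morphTargetInfluences: number[]
): void {
  for (let i = 0; i < current.length; i++) {
    morphTargetInfluences[morphIndices[i]] = current[i];
  }
}
```

In the loop, call stepInfluences with blendFactorForDelta(0.3, delta), where delta comes from THREE.Clock's getDelta() or the second argument of useFrame, then call writeInfluences with mesh.morphTargetInfluences.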
Keep a local array of current influence values that persists between frames. Reading back from morphTargetInfluences each frame also works, but a local Float32Array of 15 elements keeps the smoothing state independent of anything else that writes to the mesh and gives you full control over it. After computing new values, write them both to your local array (for next frame's lerp input) and to morphTargetInfluences (for rendering).
When multiple visemes are active simultaneously (during transitions), write non-zero influences for all active visemes. The Three.js shader blends them additively, so the visual result is a weighted combination of all active shapes. Ensure the total active influence stays reasonable. If three visemes are each at 0.5, the combined effect is stronger than any single viseme at 1.0, which can produce exaggerated mouth shapes. Normalize the weights if the sum exceeds 1.0, or use a dominance system where only the top 2 to 3 visemes by weight are applied per frame.
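Both safeguards can be combined in one helper: keep only the strongest few weights, then scale them down if their sum still exceeds 1.0. The name `limitWeights` and its default of three active visemes are assumptions for this sketch.

```typescript
// Keep the top `maxActive` weights and normalize them if they sum above 1.
// Returns a new array; the input is not modified.
function limitWeights(weights: number[], maxActive: number = 3): number[] {
  const ranked = weights
    .map((weight, index) => ({ weight, index }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, maxActive);

  const limited: number[] = weights.map(() => 0);
  let sum = 0;
  for (const entry of ranked) {
    if (entry.weight > 0) {
      limited[entry.index] = entry.weight;
      sum += entry.weight;
    }
  }

  if (sum > 1) {
    for (let i = 0; i < limited.length; i++) limited[i] /= sum;
  }
  return limited;
}
```

This version allocates on every call for readability. In a hot loop with only 15 visemes that is rarely a problem, but the same logic can be rewritten to reuse preallocated arrays.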
Refine Transitions and Handle Edge Cases
Tune the blendFactor by testing with a variety of speech samples. Dialogue with varied pacing exposes blending issues that a single test clip might not. Fast speech with many short phonemes needs a higher blend factor (0.4 to 0.5) so the mouth keeps pace. Slow narration benefits from a lower factor (0.2 to 0.25) for smoother, more deliberate transitions. Consider making the blend factor a configurable parameter or even adapting it based on detected speech rate.
When speech ends, do not snap all morph targets to zero. Instead, set the target state to all zeros and let the lerp smoothing handle the transition over the next several frames. This creates a natural mouth-closing motion over roughly 200 milliseconds. Detect speech end either from the audio source's ended event, from reaching the end of the viseme timeline, or from the AnalyserNode reporting silence (RMS below a threshold) for more than 100 milliseconds.
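The silence-based detection can be sketched as a small stateful helper that accumulates how long the signal has stayed below the threshold. The name `createSilenceDetector` and the default threshold of 0.02 are assumptions for this example; the RMS value is expected on a 0 to 1 scale.

```typescript
// Returns a function that reports true once the RMS level has stayed below
// `threshold` for more than `holdSeconds` of accumulated frame time.
function createSilenceDetector(
  threshold: number = 0.02,
  holdSeconds: number = 0.1
): (rms: number, deltaSeconds: number) => boolean {
  let silentFor = 0;
  return (rms: number, deltaSeconds: number): boolean => {
    // Any frame above the threshold resets the timer.
    silentFor = rms < threshold ? silentFor + deltaSeconds : 0;
    return silentFor > holdSeconds;
  };
}
```

When the detector returns true, set every viseme target to zero and keep running the normal smoothing step so the mouth eases shut instead of snapping.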
For characters at a distance from the camera, switch to a simpler amplitude-based system. Compute the RMS volume from the AnalyserNode time-domain data each frame: subtract 128 from each sample (128 is the midpoint that represents silence in unsigned byte data), square the results, average them, take the square root, and divide by 128 to normalize to a 0 to 1 range. Use this value to drive a single jawOpen morph target. This looks adequate at distances where the player cannot see detailed mouth shapes, and it saves the CPU work of full viseme calculation. Use the camera distance to the character as the threshold, switching from full visemes to amplitude mode at a configurable distance, typically 5 to 10 world units (meters, if the scene is modeled to scale).
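The RMS calculation and the distance switch can be sketched as follows. The function names and the default threshold of 8 units are assumptions for this example. The input is the byte array filled by analyser.getByteTimeDomainData().

```typescript
// RMS volume of unsigned byte time-domain samples, normalized to 0..1.
function rmsVolume(samples: ArrayLike<number>): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const centered = samples[i] - 128; // 128 represents silence
    sumSquares += centered * centered;
  }
  return Math.min(1, Math.sqrt(sumSquares / samples.length) / 128);
}

type LipSyncMode = "viseme" | "amplitude";

// Pick the cheaper amplitude mode once the character is far enough away.
function chooseLipSyncMode(
  cameraDistance: number,
  switchDistance: number = 8
): LipSyncMode {
  return cameraDistance > switchDistance ? "amplitude" : "viseme";
}
```

Normal speech rarely approaches full scale, so the raw RMS often needs a gain factor (clamped to 1) before it is used as the jawOpen target. Passing that target through the same lerp smoothing as the visemes avoids a jittery jaw.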
Three.js lip sync revolves around writing correct values to morphTargetInfluences at the right time. The morphTargetDictionary gives you the name-to-index mapping, the AnalyserNode or a pre-computed timeline gives you the timing, and lerp smoothing makes the transitions natural. Everything else is tuning and edge-case handling.