How Lip Sync and Visemes Work
Phonemes: The Sounds of Speech
A phoneme is the smallest distinct unit of sound in a language. English uses approximately 44 phonemes, which combine to form every word in the language. The word "cat" contains three phonemes: /k/, /ae/, and /t/. The word "thought" also contains three phonemes (/th/, /aw/, /t/) despite having seven letters, because phonemes describe sounds rather than spelling.
Phoneme detection is the first step in any lip sync pipeline. Given an audio recording or real-time audio stream, the system must identify which phonemes occur and precisely when they start and end. There are two broad approaches to phoneme detection. The first is forced alignment, where the system knows the text being spoken (because it was scripted or generated by a TTS engine) and aligns each phoneme in the expected sequence to the matching moment in the audio. This approach is highly accurate because the system already knows what words are being said and only needs to find the timing. The second approach is recognition, where the system has no text reference and must identify phonemes purely from the audio signal. This is harder and less accurate, but necessary when working with audio that has no accompanying transcript.
For game development, forced alignment is the preferred method whenever the text is available. Tools like Rhubarb Lip Sync take an audio file and can also be given the text of the dialog, which they use to improve detection accuracy. When no text is available, spectral analysis techniques examine the frequency content of the audio to identify formant patterns characteristic of each phoneme. Vowels have strong, stable formant frequencies that make them relatively easy to detect. Consonants, particularly plosives like /p/ and /t/ that involve brief bursts of energy, are harder to isolate.
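Whichever detection approach is used, its output can be thought of as a timed list of phonemes. The following is a minimal sketch of such a timeline and a lookup for the phoneme active at a given playback time. The `PhonemeCue` shape, the `findActiveCue` helper, and the timings for "cat" are assumptions made for illustration, not the output format of any particular tool.

```typescript
// One entry in a phoneme timeline: which sound, and when it starts and ends.
interface PhonemeCue {
  phoneme: string; // informal label such as "k" or "ae" (illustrative)
  startMs: number; // inclusive
  endMs: number;   // exclusive
}

// Hypothetical timeline for the word "cat"; the timings are invented.
const catTimeline: PhonemeCue[] = [
  { phoneme: "k", startMs: 0, endMs: 80 },
  { phoneme: "ae", startMs: 80, endMs: 230 },
  { phoneme: "t", startMs: 230, endMs: 300 },
];

// Binary search for the cue that covers timeMs. Assumes the cues are sorted
// by startMs and do not overlap. Returns null in gaps or outside the timeline.
function findActiveCue(timeline: PhonemeCue[], timeMs: number): PhonemeCue | null {
  let lo = 0;
  let hi = timeline.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = timeline[mid];
    if (timeMs < cue.startMs) {
      hi = mid - 1;
    } else if (timeMs >= cue.endMs) {
      lo = mid + 1;
    } else {
      return cue;
    }
  }
  return null;
}
```

A pre-computed timeline like this is what a forced-alignment tool would be expected to supply ahead of time, leaving only the cheap lookup to run each frame.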
Visemes: The Shapes of Speech
A viseme is the visual equivalent of a phoneme. It represents a distinct mouth shape associated with one or more speech sounds. The critical insight is that many phonemes that sound different look the same on the face. The sounds /b/, /p/, and /m/ are acoustically distinct (voiced, voiceless, and nasal, respectively) but they all produce the same mouth shape: lips pressed together. This means the number of visemes needed to represent all of speech is significantly smaller than the number of phonemes. While English has approximately 44 phonemes, a typical viseme set contains only 12 to 15 entries.
The Oculus (Meta) viseme set has become the de facto standard in game development. It defines 15 visemes that cover the full range of English mouth shapes with enough resolution for convincing lip sync. The set includes: sil (silence, the neutral resting position), PP (lips pressed together for /p/, /b/, /m/), FF (lower lip tucked under upper teeth for /f/, /v/), TH (tongue visible between teeth for the "th" sounds), DD (tongue pressed behind upper teeth for /d/, /t/), kk (tongue raised at back of mouth for /k/, /g/), CH (lips slightly rounded and open for /ch/, /j/, /sh/), SS (teeth nearly closed with slight gap for /s/, /z/), nn (lips slightly parted with tongue behind teeth for /n/, /l/), RR (lips slightly pursed for /r/), aa (jaw dropped wide for the "ah" vowel), E (lips pulled back laterally for /eh/, /ae/), ih (slight jaw drop with relaxed lips for the "ih" sound), oh (lips rounded at medium aperture for "oh"), and ou (lips tightly rounded with small aperture for "oo").
The mapping from phonemes to visemes is stored as a lookup table. When the phoneme detector identifies a /b/ sound at time 320ms, the mapping table resolves it to viseme PP. When it identifies /ah/ at time 350ms, the mapping resolves to viseme aa. This phoneme-to-viseme conversion is a simple dictionary lookup, making it extremely fast even in real-time systems. The complexity lies not in the mapping itself but in the transitions between consecutive visemes, which is where interpolation and coarticulation come in.
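The lookup described above might be sketched as follows. The viseme names come from the set listed earlier; the phoneme keys are informal labels chosen for this sketch rather than a standard phoneme alphabet, the table covers only the sounds named in this article, and /n/ is placed under nn. The `toViseme` helper and its fallback to sil are assumptions.

```typescript
type Viseme =
  | "sil" | "PP" | "FF" | "TH" | "DD" | "kk" | "CH" | "SS"
  | "nn" | "RR" | "aa" | "E" | "ih" | "oh" | "ou";

// Each viseme with the (informally labeled) phonemes that share its shape.
const VISEME_GROUPS: Array<[Viseme, string[]]> = [
  ["PP", ["p", "b", "m"]],
  ["FF", ["f", "v"]],
  ["TH", ["th"]],
  ["DD", ["d", "t"]],
  ["kk", ["k", "g"]],
  ["CH", ["ch", "j", "sh"]],
  ["SS", ["s", "z"]],
  ["nn", ["n", "l"]],
  ["RR", ["r"]],
  ["aa", ["ah"]],
  ["E", ["eh", "ae"]],
  ["ih", ["ih"]],
  ["oh", ["oh"]],
  ["ou", ["oo"]],
];

// Invert the groups once at startup into a phoneme -> viseme dictionary.
const PHONEME_TO_VISEME = new Map<string, Viseme>();
for (const [viseme, phonemes] of VISEME_GROUPS) {
  for (const phoneme of phonemes) {
    PHONEME_TO_VISEME.set(phoneme, viseme);
  }
}

// Unknown labels fall back to the neutral mouth (a choice made for this sketch).
function toViseme(phoneme: string): Viseme {
  return PHONEME_TO_VISEME.get(phoneme) ?? "sil";
}
```

Building the dictionary from per-viseme groups keeps the table readable for artists and makes it hard to assign one phoneme to two visemes by accident.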
Morph Targets: The Geometry of Expression
Morph targets, also called blend shapes, are the mechanism by which visemes become visible on a 3D character. A morph target is a stored variant of the character's face mesh where specific vertices have been moved to create a particular shape. The neutral face mesh is the base shape. Each morph target stores the displacement of every affected vertex from its base position. At runtime, the engine blends between the base shape and multiple morph targets simultaneously by applying a weight (0.0 to 1.0) to each target.
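The blend itself is a weighted sum: each vertex ends up at its base position plus the weighted displacement from every active target. The sketch below performs that sum on the CPU purely to illustrate the arithmetic. In an engine this work happens on the GPU, and the `blendMorphTargets` function and flat-array layout are assumptions for the example.

```typescript
interface WeightedTarget {
  deltas: number[]; // per-vertex displacement from the base, flat [x0, y0, z0, x1, ...]
  weight: number;   // 0.0 (no influence) to 1.0 (full shape)
}

// result = base + sum over targets of (weight * deltas)
function blendMorphTargets(base: number[], targets: WeightedTarget[]): number[] {
  const result = base.slice();
  for (const { deltas, weight } of targets) {
    if (deltas.length !== base.length) {
      throw new Error("target deltas must match the base vertex array length");
    }
    if (weight === 0) continue; // inactive targets contribute nothing
    for (let i = 0; i < result.length; i++) {
      result[i] += weight * deltas[i];
    }
  }
  return result;
}

// A single vertex at (0, 1, 0) pulled by two made-up targets.
const blended = blendMorphTargets(
  [0, 1, 0],
  [
    { deltas: [0, -0.5, 0], weight: 0.5 },  // moves the vertex down
    { deltas: [0.5, 0, 0], weight: 0.25 },  // moves the vertex sideways
  ],
);
```

Because each target stores displacements rather than absolute positions, several targets can be active at once and their effects simply add.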
For lip sync, each viseme in your chosen set corresponds to one morph target on the character mesh. The PP viseme maps to a morph target where the lips are pressed together. The aa viseme maps to a morph target where the jaw is dropped and the mouth is wide open. When the lip sync system determines that the current audio position should show viseme aa at 80% and viseme PP at 20% (because the character is transitioning between the two), the engine sets the aa morph target influence to 0.8 and the PP morph target influence to 0.2. The GPU adds the weighted vertex displacements of both targets to the base mesh, producing a shape that is mostly jaw-open but with a slight lip-press influence from the outgoing viseme.
Creating morph targets happens during character authoring in a 3D tool like Blender, Maya, or 3ds Max. In Blender, morph targets are called "shape keys." The artist creates the base mesh, then adds a shape key for each viseme, sculpting the mouth into the correct position for each one. When the model is exported to glTF or GLB format, the shape keys are included as morph targets. Both Babylon.js and Three.js read these targets when loading the model and make them available through their respective morph target APIs.
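Once a model is loaded, writing viseme weights might look like the sketch below. It is written against a small structural type carrying the two fields a Three.js Mesh exposes for morph targets (`morphTargetDictionary` and `morphTargetInfluences`), so it runs without the library. The `applyVisemeWeights` helper is hypothetical, and it assumes the artist named each shape key after its viseme.

```typescript
// Structural stand-in for the morph target fields on a Three.js Mesh.
interface MorphMesh {
  morphTargetDictionary?: { [name: string]: number }; // shape key name -> index
  morphTargetInfluences?: number[];                   // weight per index
}

// Write weights for the given visemes, clamped to [0, 1]. Visemes missing from
// `weights` are set to 0; morph targets that are not visemes (a blink, for
// example) are left untouched.
function applyVisemeWeights(
  mesh: MorphMesh,
  visemeNames: string[],
  weights: { [name: string]: number },
): void {
  const dict = mesh.morphTargetDictionary;
  const influences = mesh.morphTargetInfluences;
  if (!dict || !influences) return; // mesh has no morph targets

  for (const name of visemeNames) {
    const index = dict[name];
    if (index === undefined) continue; // model lacks this shape key
    const weight = weights[name] ?? 0;
    influences[index] = Math.min(1, Math.max(0, weight));
  }
}

// Stand-in for a loaded mesh with two viseme targets and one unrelated target.
const face: MorphMesh = {
  morphTargetDictionary: { aa: 0, PP: 1, blink: 2 },
  morphTargetInfluences: [0, 0, 0.5],
};
applyVisemeWeights(face, ["aa", "PP", "ou"], { aa: 0.8, PP: 0.2 });
```

Babylon.js exposes an equivalent per-target influence value through the mesh's morph target manager, so the same loop carries over with different property names.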
The quality of lip sync is heavily dependent on the quality of the morph targets themselves. A morph target for the aa viseme that only opens the jaw without pulling the cheek muscles or adjusting the lip corners will look stiff. Good morph targets model the full facial response to each mouth position, including subtle movements in the cheeks, chin, and lower nose area. This is why the character artist's work, more than the programmer's, largely determines lip sync quality. The best phoneme detection and smoothest interpolation cannot compensate for poorly sculpted mouth shapes.
Coarticulation and Blending
Human speech is not a sequence of isolated mouth positions. When you say the word "spoon," your lips begin rounding for the "oo" vowel while you are still producing the "sp" consonant cluster. This phenomenon is called coarticulation, and it means that the shape of the mouth at any moment is influenced by the sounds that precede and follow the current sound, not just the current sound itself.
Lip sync systems model coarticulation through overlapping blend weights. Rather than switching instantly from one viseme to the next, the system creates a smooth transition where the outgoing viseme's weight decreases while the incoming viseme's weight increases over a short time window. A common approach is cosine interpolation over a blend window of 60 to 100 milliseconds. This produces natural-looking transitions without the jarring "snapping" effect that occurs when visemes change instantaneously.
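A cosine crossfade of this kind can be sketched in a few lines. The function names and the 80 ms default window (a value inside the 60 to 100 ms range above) are choices made for this example.

```typescript
// Cosine ease: 0 at t = 0, 1 at t = 1, with zero slope at both ends.
function cosineEase(t: number): number {
  const clamped = Math.min(1, Math.max(0, t));
  return (1 - Math.cos(Math.PI * clamped)) / 2;
}

// Weights for the outgoing and incoming viseme during a transition that
// begins at transitionStartMs and lasts blendMs. The two always sum to 1.
function crossfade(
  timeMs: number,
  transitionStartMs: number,
  blendMs: number = 80,
): { outgoing: number; incoming: number } {
  const incoming = cosineEase((timeMs - transitionStartMs) / blendMs);
  return { outgoing: 1 - incoming, incoming };
}
```

Where the window sits relative to the phoneme boundary is a design choice. Starting it slightly before the boundary lets the mouth begin moving toward the next shape early, which is one simple way to approximate coarticulation.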
More advanced systems use dominance functions to handle coarticulation. Each viseme is assigned a dominance value that represents how strongly it asserts its shape during blending. Visemes with high visual salience, like the tightly rounded "ou" or the wide-open "aa," have high dominance values. Visemes with subtle shapes, like the slight jaw drop of "ih," have lower dominance. During blending, dominant visemes retain more of their shape when adjacent to non-dominant visemes, preventing the averaging effect that can wash out distinctive mouth positions into a generic partially-open shape.
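A heavily simplified sketch of the idea follows: each viseme's blend weight is scaled by its dominance, and the results are renormalized so the total weight is unchanged. Full dominance models also vary dominance over time, which this example omits. The dominance numbers and the `applyDominance` helper are invented for illustration.

```typescript
// Illustrative dominance values; these numbers are made up for the sketch.
const DOMINANCE: Record<string, number> = { ou: 1.0, aa: 0.9, ih: 0.4 };
const DEFAULT_DOMINANCE = 0.6; // used for visemes not listed above

// Scale each raw blend weight by its viseme's dominance, then renormalize so
// the weights sum to the same total as before.
function applyDominance(raw: Record<string, number>): Record<string, number> {
  const names = Object.keys(raw);
  const scaled: Record<string, number> = {};
  let rawTotal = 0;
  let scaledTotal = 0;
  for (const name of names) {
    const s = raw[name] * (DOMINANCE[name] ?? DEFAULT_DOMINANCE);
    scaled[name] = s;
    rawTotal += raw[name];
    scaledTotal += s;
  }
  if (scaledTotal === 0) return scaled; // nothing active
  for (const name of names) {
    scaled[name] = (scaled[name] / scaledTotal) * rawTotal;
  }
  return scaled;
}

// Halfway through a transition from "ih" to "ou", the rounded "ou" shape
// keeps more than half of the total weight.
const midTransition = applyDominance({ ih: 0.5, ou: 0.5 });
```

With equal raw weights, the more dominant viseme ends up with the larger share, which is the behavior that stops distinctive shapes from being averaged away.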
The interpolation method also matters. Linear interpolation (lerp) produces functional results but can look mechanical because the transition speed is constant. Ease-in-ease-out curves (smoothstep or cosine) produce more natural transitions by accelerating into the new shape and decelerating as it arrives. The difference is subtle in isolation but compounds over an entire sentence, giving the face a more organic feel. Many implementations call THREE.MathUtils.lerp in Three.js or Scalar.Lerp in Babylon.js once per frame with a fixed blend factor instead of interpolating over elapsed time. This is simple, but it ties the smoothing to frame rate: each frame closes a fixed fraction of the remaining gap, so the same factor blends faster at 120fps than at 30fps. With a blend factor of 0.3 at 60fps, the weight covers about 88% of the distance to its target in 100ms (six frames, since 0.7^6 is about 0.12), which is in line with the 60 to 100 millisecond blend windows used for speech transitions.
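The frame-rate dependence of a per-frame blend factor can be made concrete, and removed, with a little arithmetic. After n frames with factor k, the remaining distance to the target is (1 - k)^n. The sketch below computes that progress and shows one common way to rescale a factor tuned at 60fps for any frame time. The helper names are assumptions for this example.

```typescript
function lerp(a: number, b: number, t: number): number {
  return a + (b - a) * t;
}

// Fraction of the distance to the target covered after `frames` per-frame
// lerp steps with blend factor k.
function progressAfterFrames(k: number, frames: number): number {
  return 1 - Math.pow(1 - k, frames);
}

// Rescale a blend factor tuned at refFps so that smoothing takes the same
// wall-clock time at any frame rate. dtSeconds is the current frame time.
function frameIndependentFactor(
  factorAtRef: number,
  dtSeconds: number,
  refFps: number = 60,
): number {
  return 1 - Math.pow(1 - factorAtRef, dtSeconds * refFps);
}

// One smoothing step toward a target weight, safe to call at any frame rate.
function smoothToward(current: number, target: number, dtSeconds: number): number {
  return lerp(current, target, frameIndependentFactor(0.3, dtSeconds));
}
```

With this rescaling, two steps at 120fps move the weight exactly as far as one step at 60fps, so the mouth animates at the same speed on fast and slow machines.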
The Complete Pipeline
Putting all the pieces together, the lip sync pipeline runs as a continuous loop during character speech. At each frame, the system reads the current playback position of the audio. It looks up which phoneme is active at that position (from the pre-computed timeline or the real-time detector). It maps the phoneme to a viseme through the lookup table. It calculates blend weights for the current and adjacent visemes based on the interpolation curve. It writes these weights to the morph target system. The GPU applies the weights and renders the deformed mesh.
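The per-frame steps above can be condensed into one small function. This sketch assumes a pre-computed timeline, a deliberately tiny phoneme table, and an 80 ms cosine blend from the previous viseme at the start of each cue. All names and timings are illustrative, and writing the weights to the mesh is left to a caller-supplied callback.

```typescript
interface TimedPhoneme { phoneme: string; startMs: number; endMs: number; }

// Tiny illustrative phoneme -> viseme table; a real one covers every phoneme.
const TABLE: Record<string, string> = { b: "PP", p: "PP", m: "PP", ah: "aa", oo: "ou" };
const VISEMES = ["sil", "PP", "aa", "ou"];

function visemeOf(cue: TimedPhoneme): string {
  return TABLE[cue.phoneme] ?? "sil";
}

// Steps 1-4: playback time -> active phoneme -> viseme -> blend weights.
function visemeWeightsAt(
  timeline: TimedPhoneme[],
  timeMs: number,
  blendMs: number = 80,
): Record<string, number> {
  const index = timeline.findIndex((c) => timeMs >= c.startMs && timeMs < c.endMs);
  if (index === -1) return { sil: 1 }; // no speech at this time

  const current = timeline[index];
  const previous = index > 0 ? visemeOf(timeline[index - 1]) : "sil";

  // Cosine ease from the previous viseme into the current one.
  const t = Math.min(1, (timeMs - current.startMs) / blendMs);
  const incoming = (1 - Math.cos(Math.PI * t)) / 2;

  const weights: Record<string, number> = {};
  weights[previous] = (weights[previous] ?? 0) + (1 - incoming);
  weights[visemeOf(current)] = (weights[visemeOf(current)] ?? 0) + incoming;
  return weights;
}

// Step 5: hand every viseme's weight to the morph target system.
function lipSyncFrame(
  timeline: TimedPhoneme[],
  audioTimeMs: number,
  writeWeight: (viseme: string, weight: number) => void,
): void {
  const weights = visemeWeightsAt(timeline, audioTimeMs);
  for (const name of VISEMES) {
    writeWeight(name, weights[name] ?? 0);
  }
}

// Hypothetical timeline for the syllable "bah".
const bah: TimedPhoneme[] = [
  { phoneme: "b", startMs: 0, endMs: 100 },
  { phoneme: "ah", startMs: 100, endMs: 300 },
];
```

In a browser, `lipSyncFrame` would typically be called from the render loop with the audio element's current playback position converted to milliseconds, and `writeWeight` would set the matching morph target influence.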
The entire loop must complete within a single frame budget, roughly 16.7 milliseconds at 60fps. The phoneme lookup and viseme mapping are trivial operations, effectively free. The weight calculation involves a few multiplications and lerp calls, also negligible. The morph target application is a GPU operation that scales with vertex count and active target count but is parallelized across shader cores. For typical game characters with 5,000 to 15,000 face vertices and 15 active viseme targets, the GPU cost is typically a fraction of a millisecond. The bottleneck, if any, is the real-time audio analysis step, which is why offloading it to a Web Worker or using pre-computed data is the standard practice.
Lip sync is a pipeline with four stages: phoneme detection, viseme mapping, weight interpolation, and morph target application. The quality comes from the morph target artwork and the interpolation smoothness, not from detecting every phoneme perfectly. Close enough timing with well-sculpted shapes produces better results than perfect timing with stiff shapes.