Tools for Game Character Lip Sync

Updated June 2026
The right lip sync tool depends on whether your dialogue is pre-recorded or generated at runtime, whether you need viseme-level precision or simple jaw animation, and whether you can afford a cloud service or need everything to run client-side. This guide covers the major tools available for web game developers, what each one does well, and where each one falls short.

Rhubarb Lip Sync

Rhubarb Lip Sync is an open-source command-line tool that takes an audio file (WAV or Ogg Vorbis) and an optional text transcript as input and produces a timestamped sequence of mouth shapes as output. It runs a speech recognizer over the recording (PocketSphinx by default, guided by the transcript when one is provided) to detect phonemes, then maps them to a small set of mouth shapes. The output can be written as TSV, XML, or JSON. In the JSON format each cue has a start time, an end time, and a shape letter, for example: start 0.320, end 0.400, shape A (lips closed).

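Rhubarb's shape set uses letters rather than the Oculus viseme names. Six basic shapes (A through F) are always produced, and three extended shapes (G, H, and X) are optional. Shape A is the closed mouth for p, b, and m. B is a slightly open mouth with clenched teeth, used for most consonants (k, s, t) and some vowels such as "ee". C is an open mouth for vowels like "eh" and "ae". D is a wide-open mouth for "aa". E is a slightly rounded mouth for "ao" and "er". F is puckered lips for "oo" and w. G is the upper-teeth-on-lower-lip shape for f and v, H is the "L" sound with the tongue raised behind the upper teeth, and X is the idle or rest pose. Because this set is coarser than the 15 Oculus visemes, any mapping between the two is approximate, but a small translation table from Rhubarb letters to your morph target names, built during initialization, is all you need.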
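A translation table of this kind might look like the sketch below. The shape comments follow the descriptions in Rhubarb's README. The morph target names (viseme_PP and so on) are hypothetical Oculus-style names, and the choice of target for each letter is a judgment call rather than an official mapping, so adjust both to your model.

```typescript
// Rhubarb mouth shapes: six basic (A-F) plus the optional extended G, H, X.
type RhubarbShape = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "X";

// Hypothetical morph target names; replace with the names your model exposes.
const RHUBARB_TO_MORPH: Record<RhubarbShape, string> = {
  A: "viseme_PP",  // closed lips: p, b, m
  B: "viseme_SS",  // slightly open, teeth together: most consonants
  C: "viseme_E",   // open mouth: "eh", "ae"
  D: "viseme_aa",  // wide open: "aa"
  E: "viseme_oh",  // slightly rounded: "ao", "er"
  F: "viseme_ou",  // puckered lips: "oo", w
  G: "viseme_FF",  // upper teeth on lower lip: f, v
  H: "viseme_nn",  // tongue raised behind upper teeth: l
  X: "viseme_sil", // idle / rest pose
};

// Unknown letters fall back to the rest pose instead of throwing.
function morphTargetFor(shape: string): string {
  return RHUBARB_TO_MORPH[shape as RhubarbShape] ?? "viseme_sil";
}
```

Falling back to the rest pose keeps a malformed cue file from crashing the animation loop.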

Rhubarb is designed for offline pre-processing, not real-time use. You run it as a build step over your audio files and ship the resulting JSON alongside the audio in your game assets. At runtime, you load the JSON, play the audio, and look up the active viseme for the current playback position. This is the highest-quality lip sync approach for pre-recorded dialogue because the analysis runs without time constraints and can use the transcript for improved accuracy.
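The runtime lookup can be sketched as a binary search over the cue list. This assumes the cues come from Rhubarb's JSON output (a mouthCues array of start, end, and value fields, sorted by time); the function name is illustrative.

```typescript
// One entry of the "mouthCues" array in Rhubarb's JSON output.
interface MouthCue {
  start: number; // seconds
  end: number;   // seconds
  value: string; // shape letter
}

// Binary search for the cue whose [start, end) interval contains timeSec.
// Returns null when the time falls outside every cue.
function activeCue(cues: MouthCue[], timeSec: number): MouthCue | null {
  let lo = 0;
  let hi = cues.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = cues[mid]!;
    if (timeSec < cue.start) hi = mid - 1;
    else if (timeSec >= cue.end) lo = mid + 1;
    else return cue;
  }
  return null;
}
```

In the render loop you would call something like activeCue(data.mouthCues, audioElement.currentTime) once per frame and feed the resulting letter through your translation table.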

For web developers, the main limitation is that Rhubarb is a native binary (C++), not a JavaScript library. You cannot run it in the browser directly. However, it can be compiled to WebAssembly, and some community ports exist for this purpose. If you need client-side Rhubarb analysis (for example, to process audio from a microphone or TTS service), a WASM build running in a Web Worker is the path, though it requires more setup than the standard command-line workflow. For most web games with scripted dialogue, running Rhubarb offline during development and bundling the JSON output is the simpler and better approach.

Oculus (Meta) Lipsync SDK

The Oculus Lipsync SDK is a real-time lip sync library developed by Meta, primarily for Unity and Unreal Engine in VR applications. It analyzes audio input frame by frame and outputs a set of 15 viseme weights that can be applied directly to morph targets. Unlike Rhubarb, which produces discrete viseme events, Oculus Lipsync outputs a continuous weight vector where multiple visemes can have non-zero values simultaneously, producing smooth blended mouth shapes.

The Oculus Lipsync SDK is not directly available for web platforms. It ships as a native library for Windows, Android, and Quest, with engine integrations for Unity and Unreal. There is no official JavaScript or WebAssembly port. However, its 15-viseme standard has been widely adopted as a reference viseme set across the game industry, and the output of most other tools (including Azure Speech) can be mapped onto this set with a lookup table. When you see references to "Oculus visemes" in web development contexts, it typically means the viseme naming convention rather than the actual SDK.

For web developers, the Oculus Lipsync SDK serves as a conceptual reference rather than a practical tool. Its architecture, real-time audio analysis producing per-frame viseme weights, is the model that client-side web implementations replicate using the Web Audio API's AnalyserNode. The key insight from Oculus Lipsync is that real-time lip sync should output weighted blends of multiple visemes per frame rather than discrete single-viseme selections, because blended output produces smoother and more natural-looking results.
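To illustrate what blended per-frame output means in practice, here is a minimal sketch that eases a set of viseme weights toward a new target each frame, so several visemes are partially active during a transition. This is not how the Oculus SDK computes its weights; the function name and the half-life parameter are assumptions made for the example.

```typescript
type VisemeWeights = Record<string, number>;

// Move each weight toward its target with exponential smoothing.
// halfLifeSec is the time it takes to close half of the remaining gap.
function smoothWeights(
  current: VisemeWeights,
  target: VisemeWeights,
  dtSec: number,
  halfLifeSec = 0.05,
): VisemeWeights {
  const k = 1 - Math.pow(0.5, dtSec / halfLifeSec);
  const out: VisemeWeights = {};
  const names = Object.keys(current).concat(Object.keys(target));
  for (const name of names) {
    const from = current[name] ?? 0; // missing entries count as 0
    const to = target[name] ?? 0;
    out[name] = from + (to - from) * k;
  }
  return out;
}
```

Calling this once per frame with the frame's delta time makes the result independent of frame rate, which a fixed per-frame lerp factor would not be.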

Azure Cognitive Services Speech Visemes

Azure Cognitive Services Speech is one of the few major cloud TTS services that delivers viseme timing data as a built-in feature (Amazon Polly's speech marks are another). When you use the Azure Speech SDK to synthesize speech, you can subscribe to a visemeReceived event that fires for each viseme in the utterance. Each event provides a viseme ID (integer 0 to 21) and an audio offset in 100-nanosecond ticks, so dividing the offset by 10,000 gives milliseconds.

Azure's viseme set contains 22 entries (IDs 0 through 21), which is finer-grained than the Oculus 15-viseme set. The extra entries mostly distinguish vowels and diphthongs that the Oculus set groups together. Mapping Azure viseme IDs to the Oculus 15-viseme morph targets requires a simple lookup table. ID 0 is silence and maps to the neutral viseme. IDs 1 through 21 map to the 14 active Oculus visemes, with some IDs sharing a target (for example, Azure's two open-vowel entries, IDs 1 and 2, can both drive the "aa" viseme because the difference between them is hard to see on most 3D models). Check the ID-to-phoneme table in Azure's documentation when you build the mapping, because the IDs are not grouped by mouth shape.
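One possible lookup table is sketched below. The phoneme groups in the comments are taken from the ID table in Azure's viseme documentation as I understand it, and the choice of Oculus target for each ID is a judgment call (especially for diphthongs and h), so verify both against the current documentation and your own model.

```typescript
type OculusViseme =
  | "sil" | "PP" | "FF" | "TH" | "DD" | "kk" | "CH" | "SS"
  | "nn" | "RR" | "aa" | "E" | "ih" | "oh" | "ou";

// Index = Azure viseme ID (0-21). Illustrative mapping, not an official one.
const AZURE_TO_OCULUS: readonly OculusViseme[] = [
  "sil", // 0  silence
  "aa",  // 1  ae, ax, ah
  "aa",  // 2  aa
  "oh",  // 3  ao
  "E",   // 4  eh, uh
  "RR",  // 5  er
  "ih",  // 6  y, iy, ih
  "ou",  // 7  w, uw
  "oh",  // 8  ow
  "aa",  // 9  aw (diphthong, starts open)
  "oh",  // 10 oy
  "aa",  // 11 ay
  "kk",  // 12 h (no distinctive lip shape)
  "RR",  // 13 r
  "nn",  // 14 l
  "SS",  // 15 s, z
  "CH",  // 16 sh, ch, jh, zh
  "TH",  // 17 dh
  "FF",  // 18 f, v
  "DD",  // 19 d, t, n, th
  "kk",  // 20 k, g, ng
  "PP",  // 21 p, b, m
];

// Out-of-range IDs fall back to silence.
function azureToOculus(visemeId: number): OculusViseme {
  return AZURE_TO_OCULUS[visemeId] ?? "sil";
}
```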

The integration workflow with Azure is: create a SpeechSynthesizer, subscribe to visemeReceived, call speakTextAsync or speakSsmlAsync, collect the viseme events into a timeline array as they fire, play the resulting audio through the Web Audio API, and drive morph targets from the timeline synchronized with the audio playback position. The viseme events arrive before or concurrent with the audio data, so if you start playback from the synthesis completion callback, the timeline is fully built by the time playback begins. This is the most straightforward lip sync integration available for web developers because no client-side audio analysis is required.
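The timeline part of that workflow can be sketched independently of the SDK. The class below only assumes that each event exposes visemeId and audioOffset (in 100-nanosecond ticks) and that events arrive in time order; the class and method names are illustrative. The SDK wiring is shown in comments only, assuming the microsoft-cognitiveservices-speech-sdk package.

```typescript
// The two fields this sketch reads from each visemeReceived event.
interface VisemeEvent {
  visemeId: number;
  audioOffset: number; // 100-nanosecond ticks from the start of the audio
}

const TICKS_PER_SECOND = 10_000_000;

class VisemeTimeline {
  private entries: { timeSec: number; visemeId: number }[] = [];

  // Call from the visemeReceived handler; assumes events arrive in order.
  push(e: VisemeEvent): void {
    this.entries.push({
      timeSec: e.audioOffset / TICKS_PER_SECOND,
      visemeId: e.visemeId,
    });
  }

  // Latest viseme whose offset is at or before timeSec (0 = silence).
  visemeAt(timeSec: number): number {
    let id = 0;
    for (const entry of this.entries) {
      if (entry.timeSec > timeSec) break;
      id = entry.visemeId;
    }
    return id;
  }
}

// Wiring sketch (not executed here):
//   const timeline = new VisemeTimeline();
//   synthesizer.visemeReceived = (_sender, e) => timeline.push(e);
//   synthesizer.speakTextAsync(text, (result) => {
//     /* decode result.audioData and start playback, then each frame call
//        timeline.visemeAt(elapsedPlaybackSeconds) */
//   });
```

A linear scan is fine for a single utterance; for long timelines a binary search or a moving cursor avoids rescanning from the start each frame.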

Azure also offers blend shape output through the visemeReceived event. Instead of relying on a single viseme ID, you can request blend shape data in your SSML (an mstts:viseme element with type FacialExpression), and each event then carries an animation payload with frames of 55 blend shape weights. This is more data-intensive but produces smoother animation because you receive pre-calculated per-frame weights rather than discrete viseme selections. The blend shape output is especially useful if your character model uses the ARKit blend shape set, since Azure's 55 shapes correspond to the 52 ARKit shapes plus three additional head and eye rotation values.
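Unpacking the blend shape data might look like the sketch below. It assumes the event's animation payload is a JSON string with a FrameIndex number and a BlendShapes array of per-frame weight arrays, delivered at 60 frames per second, which is how Azure's viseme documentation describes it; confirm the field names and frame rate against the current docs. The function name is illustrative.

```typescript
// Assumed shape of the JSON carried in the event's animation payload.
interface BlendShapeChunk {
  FrameIndex: number;      // index of the first frame in this chunk
  BlendShapes: number[][]; // one array of weights per frame
}

interface BlendShapeFrame {
  timeSec: number;
  weights: number[];
}

const BLEND_SHAPE_FPS = 60; // assumed output frame rate

// Convert one animation chunk into timestamped frames.
function parseAnimationChunk(animationJson: string): BlendShapeFrame[] {
  const chunk = JSON.parse(animationJson) as BlendShapeChunk;
  return chunk.BlendShapes.map((weights, i) => ({
    timeSec: (chunk.FrameIndex + i) / BLEND_SHAPE_FPS,
    weights,
  }));
}
```

Appending the frames from each event to one array gives a timeline you can index by playback time in the same way as the viseme ID timeline.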

The cost of Azure Speech TTS is approximately $16 per million characters for neural voices (check the current pricing page, since rates change). At that rate, 250,000 characters of synthesized dialogue in a month costs about $4, so a game with moderate dialogue volume typically pays a few dollars per month. The quality-to-integration-effort ratio is the best available: good voices, built-in viseme data, and a mature JavaScript SDK that handles WebSocket connections and audio buffering.

Client-Side Audio Analysis with the Web Audio API

When you cannot use an external tool or service for viseme data, the Web Audio API's AnalyserNode provides the raw material to build your own lip sync system entirely in the browser. This is the approach you use for TTS providers that return audio without viseme timing (ElevenLabs, OpenAI) and for pre-recorded audio where you have not run an offline tool like Rhubarb. It requires the audio to pass through a Web Audio graph, so it does not work with the browser's Web Speech API, whose speechSynthesis output is not exposed to Web Audio nodes.

The simplest client-side approach is amplitude-based jaw animation. Compute the RMS volume each frame and map it to a single jawOpen morph target weight. Silence produces a closed mouth, loud audio produces an open mouth, and the smooth transition between them creates a basic talking effect. This looks passable at distance and for minor characters. It does not produce recognizable mouth shapes for specific sounds, so it fails for close-up dialogue where players expect to see distinct lip positions.
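A minimal sketch of the amplitude approach follows. It takes a buffer of time-domain samples (in the browser you would fill it each frame with AnalyserNode's getFloatTimeDomainData) and returns a smoothed jawOpen weight. The gain and smoothing defaults are arbitrary starting points, not tuned values.

```typescript
// Root-mean-square level of a block of samples in the range [-1, 1].
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  const sumSquares = samples.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumSquares / samples.length);
}

// Map loudness to a jawOpen weight in [0, 1], easing from the previous
// frame's weight so the jaw does not snap open and shut.
function jawOpenWeight(
  samples: Float32Array,
  previousWeight: number,
  gain = 4,        // speech RMS is usually well below 1, so scale it up
  smoothing = 0.5, // fraction of the gap to close per frame
): number {
  const target = Math.min(1, rms(samples) * gain);
  return previousWeight + (target - previousWeight) * smoothing;
}

// Per-frame usage in the browser (not executed here):
//   analyser.getFloatTimeDomainData(buffer);
//   jaw = jawOpenWeight(buffer, jaw);
//   mesh.morphTargetInfluences[jawOpenIndex] = jaw; // three.js-style, hypothetical
```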

A more sophisticated approach is frequency-band viseme estimation. Divide the AnalyserNode's frequency data into bands corresponding to formant regions of speech. Low frequencies (200 to 800 Hz) with strong energy suggest open vowels (viseme aa). Mid frequencies (800 to 2000 Hz) suggest mid vowels (viseme E, ih). Strong high-frequency energy (2000 to 5000 Hz) suggests sibilants (viseme SS) or fricatives (viseme FF). Silence across all bands maps to viseme sil. This approach produces recognizable mouth shape variation that is noticeably better than pure amplitude, though it cannot match the accuracy of phoneme-based tools.
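The band heuristic above can be sketched as follows. The input is the byte array that AnalyserNode's getByteFrequencyData fills (one value from 0 to 255 per bin, with fftSize / 2 bins spanning 0 Hz to half the sample rate). The band edges come from the paragraph above; the silence threshold, the function names, and the choice of aa, E, and SS as the representative viseme for each band are assumptions for illustration.

```typescript
// Average normalized energy (0-1) of the bins between loHz and hiHz.
function bandEnergy(
  bins: Uint8Array,
  sampleRate: number,
  loHz: number,
  hiHz: number,
): number {
  const hzPerBin = sampleRate / (bins.length * 2); // bins.length = fftSize / 2
  const start = Math.max(0, Math.floor(loHz / hzPerBin));
  const end = Math.min(bins.length, Math.ceil(hiHz / hzPerBin));
  if (end <= start) return 0;
  const sum = bins.subarray(start, end).reduce((acc, v) => acc + v, 0);
  return sum / ((end - start) * 255);
}

// Rough viseme weights from three speech bands. A heuristic, not phoneme
// recognition: it only says which band dominates this frame.
function estimateVisemes(
  bins: Uint8Array,
  sampleRate: number,
  silenceThreshold = 0.05,
): Record<string, number> {
  const low = bandEnergy(bins, sampleRate, 200, 800);    // open vowels
  const mid = bandEnergy(bins, sampleRate, 800, 2000);   // mid vowels
  const high = bandEnergy(bins, sampleRate, 2000, 5000); // sibilants, fricatives
  if (Math.max(low, mid, high) < silenceThreshold) {
    return { sil: 1, aa: 0, E: 0, SS: 0 };
  }
  const total = low + mid + high;
  return { sil: 0, aa: low / total, E: mid / total, SS: high / total };
}
```

Returning normalized weights rather than a single winner lets the result feed a blended morph target pipeline directly.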

For the best client-side results without an external service, consider a WebAssembly phoneme detector. Several open-source speech recognition models can be compiled to WASM and run in a Web Worker. Vosk, a lightweight speech recognition toolkit, has community WASM builds that run recognition in near real time. Note that Vosk's standard results carry word-level timestamps, so getting phoneme-level timing may require expanding each word through a pronunciation dictionary or using a model built to emit phones. The phoneme output feeds into the standard viseme mapping pipeline. This approach is heavier than frequency-band estimation (the WASM model is typically 10 to 50 MB depending on language and accuracy level) but produces significantly better lip sync because it actually identifies speech sounds rather than guessing from spectral shapes.
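The phoneme-to-viseme step at the end of that pipeline might be sketched like this. It assumes the detector emits ARPAbet-style labels (optionally with stress digits, such as AA1), which depends entirely on the model you use, and the grouping into Oculus viseme names is an approximation chosen for the example.

```typescript
// Oculus viseme name -> ARPAbet phonemes that roughly share its mouth shape.
const VISEME_GROUPS: Record<string, string[]> = {
  PP: ["P", "B", "M"],
  FF: ["F", "V"],
  TH: ["TH", "DH"],
  DD: ["T", "D"],
  kk: ["K", "G", "NG"],
  CH: ["CH", "JH", "SH", "ZH"],
  SS: ["S", "Z"],
  nn: ["N", "L"],
  RR: ["R", "ER"],
  aa: ["AA", "AE", "AH", "AY", "AW"],
  E: ["EH", "EY"],
  ih: ["IH", "IY", "Y"],
  oh: ["AO", "OW", "OY"],
  ou: ["UW", "UH", "W"],
};

// Invert the groups once into a phoneme -> viseme lookup.
const PHONEME_TO_VISEME = new Map<string, string>();
for (const viseme of Object.keys(VISEME_GROUPS)) {
  for (const phoneme of VISEME_GROUPS[viseme] ?? []) {
    PHONEME_TO_VISEME.set(phoneme, viseme);
  }
}

// Strip stress digits and case, then look up; unknown labels map to silence.
function visemeForPhoneme(phoneme: string): string {
  const key = phoneme.toUpperCase().replace(/[0-9]/g, "");
  return PHONEME_TO_VISEME.get(key) ?? "sil";
}
```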

Choosing the Right Tool

The decision tree is straightforward.

If your dialogue is pre-recorded and you have the audio files during development, use Rhubarb Lip Sync to generate offline viseme timelines. The quality is the highest, the runtime cost is zero, and the integration is simple JSON file loading.

If your dialogue is generated dynamically and you can use a cloud TTS service, use Azure Speech for its built-in viseme data. The integration is clean and the per-character cost is manageable.

If you must use a TTS provider without viseme output (ElevenLabs for voice quality, OpenAI for API simplicity), add client-side frequency-band estimation for acceptable lip sync, or invest in a WASM phoneme detector for good lip sync.

If you need the absolute simplest implementation with minimal code, use amplitude-based jaw animation. It works, it is cheap, and it is better than a mouth that does not move at all.

Production games often combine several approaches. For example, major story characters can get pre-baked Rhubarb lip sync for recorded lines and Azure viseme data for dynamic LLM dialogue, while background NPCs get amplitude-based jaw flapping. If the lip sync system accepts data from any source and applies it through the same morph target pipeline, mixing approaches is an architectural decision, not a technical limitation.
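One way to structure such a pipeline is a small source interface that every approach implements, sketched below. All names here (VisemeSource, CueSource, AmplitudeSource, applyWeights) are hypothetical, and setMorph stands in for whatever your engine uses to set a morph target influence.

```typescript
type MorphWeights = Record<string, number>;

// Anything that can report morph target weights for a playback time.
interface VisemeSource {
  weightsAt(timeSec: number): MorphWeights;
}

// Pre-baked cues (for example offline tool output): one viseme at a time.
class CueSource implements VisemeSource {
  private cues: { start: number; end: number; viseme: string }[];
  constructor(cues: { start: number; end: number; viseme: string }[]) {
    this.cues = cues;
  }
  weightsAt(timeSec: number): MorphWeights {
    const cue = this.cues.find((c) => timeSec >= c.start && timeSec < c.end);
    return cue ? { [cue.viseme]: 1 } : { sil: 1 };
  }
}

// Amplitude-only source for background characters.
class AmplitudeSource implements VisemeSource {
  private level: () => number;
  constructor(level: () => number) {
    this.level = level;
  }
  weightsAt(_timeSec: number): MorphWeights {
    return { jawOpen: Math.min(1, Math.max(0, this.level())) };
  }
}

// The shared last step: push whatever the source reports onto the model.
function applyWeights(
  source: VisemeSource,
  timeSec: number,
  setMorph: (name: string, weight: number) => void,
): void {
  const weights = source.weightsAt(timeSec);
  for (const name of Object.keys(weights)) {
    setMorph(name, weights[name] ?? 0);
  }
}
```

With this shape, assigning a different lip sync tier to a character means swapping the source object, not changing the animation code.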

Key Takeaway

Rhubarb is the best tool for pre-recorded dialogue. Azure Speech is the best option for dynamic TTS dialogue because it delivers viseme data with the audio. Client-side frequency analysis is the fallback when no external tool is available. Amplitude-based jaw animation is the minimum viable approach. Most games combine multiple tools for different character tiers.