Combining LLM Dialogue with Voice and Lip Sync
The pipeline has four stages, each producing output that feeds the next: the player provides input, the LLM generates a text response, the TTS service converts text to audio, and the lip sync system converts audio to facial animation. Without streaming, each stage must complete before the next begins, and the total latency is the sum of all four. With streaming, the stages overlap, and the player hears the character begin speaking while the LLM is still generating the end of its response.
Design the Pipeline Architecture
The core data flow is: player text input -> LLM API (streaming) -> sentence buffer -> TTS API (streaming) -> audio chunks -> Web Audio playback + viseme extraction -> morph target animation. Each arrow represents a producer-consumer relationship where the upstream stage streams data to the downstream stage.
The LLM stage uses the streaming API of your chosen model provider. For OpenAI-compatible APIs, this means setting stream: true in the request and reading server-sent events (SSE) from the response. For Anthropic's API, use the messages endpoint with stream: true and read the content_block_delta events. Each event delivers a small piece of text, typically anywhere from a word fragment to a few words. These pieces arrive at an irregular rate, with bursts of fast tokens interspersed with pauses of 50 to 200 milliseconds.
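The SSE reading step can be sketched as a small incremental parser. This is a minimal sketch, not a full SSE implementation: it assumes an OpenAI-compatible stream where each event is a single "data:" line whose JSON carries the text at choices[0].delta.content, and where the stream ends with "data: [DONE]". The class name SseDeltaParser is hypothetical.

```typescript
// Incremental parser for an OpenAI-compatible SSE stream (sketch).
// Assumption: text deltas live at choices[0].delta.content and the stream
// ends with a "data: [DONE]" line.
class SseDeltaParser {
  private pending = ""; // partial line carried over between network chunks
  done = false;

  // Feed one decoded network chunk; returns the text deltas it completed.
  push(chunk: string): string[] {
    const deltas: string[] = [];
    this.pending += chunk;
    const lines = this.pending.split("\n");
    this.pending = lines.pop() ?? ""; // the last element may be incomplete
    for (const raw of lines) {
      const line = raw.trim();
      if (line.indexOf("data:") !== 0) continue; // blank lines, comments, other fields
      const payload = line.slice(5).trim();
      if (payload === "[DONE]") {
        this.done = true;
        continue;
      }
      try {
        const event = JSON.parse(payload);
        const text = event?.choices?.[0]?.delta?.content;
        if (typeof text === "string" && text.length > 0) deltas.push(text);
      } catch {
        // Malformed payloads are skipped in this sketch.
      }
    }
    return deltas;
  }
}
```

Network chunks do not respect event boundaries, which is why the parser keeps the unfinished tail of each chunk and prepends it to the next one. Feed it the decoded text of each chunk read from the response body.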
The sentence buffer sits between the LLM and TTS stages. It accumulates tokens until a sentence boundary is detected, then dispatches the complete sentence to TTS. Sentence boundaries are detected by looking for period, exclamation mark, question mark, or colon characters followed by a space or end of stream. This buffering is important because TTS produces higher quality output and more accurate timing data when given complete sentences rather than fragments. Sending individual words or phrases to TTS results in unnatural prosody and wasted API calls.
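The boundary rule above can be sketched as follows. This is a minimal sketch under the stated rule (period, exclamation mark, question mark, or colon followed by whitespace); the class name SentenceBuffer and its method names are hypothetical.

```typescript
// Accumulates streamed tokens and emits complete sentences (sketch).
class SentenceBuffer {
  private buffer = "";

  // Append streamed text; returns every sentence completed by this append.
  push(tokens: string): string[] {
    this.buffer += tokens;
    const sentences: string[] = [];
    const boundary = /[.!?:]\s/; // punctuation followed by whitespace
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      const sentence = this.buffer.slice(0, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.buffer = this.buffer.slice(end + 1); // drop the whitespace too
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call at end of stream; returns any trailing text, or null if none.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Requiring whitespace after the punctuation keeps decimals such as "3.5" intact, but abbreviations such as "Dr. Smith" will still split. If that matters for your dialogue, add an exception list before dispatching.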
The TTS stage accepts each sentence, synthesizes audio, and streams the audio back. If using Azure, the viseme events arrive alongside the audio. If using ElevenLabs or OpenAI, the audio arrives as chunks that must be analyzed for lip sync data on the client side. The audio chunks are queued for sequential playback so that sentences play back-to-back without gaps or overlaps.
Stream LLM Text to TTS in Sentence Chunks
Set up the LLM request with streaming enabled. For a fetch-based implementation, read the response body as a ReadableStream and parse the SSE format manually, or use a library like eventsource-parser. As tokens arrive, append them to a string buffer. After each append, check if the buffer contains a sentence-ending punctuation mark.
When a sentence boundary is found, extract the complete sentence from the buffer and immediately send it to the TTS service as a new request. Do not wait for the LLM to finish its entire response. This pipelining is the single most important latency optimization. If the LLM generates a 3-sentence response over 2 seconds, the first sentence is sent to TTS at roughly 600ms, the second at 1200ms, and the third at 2000ms. The TTS for the first sentence begins processing at 600ms rather than 2000ms.
Track the TTS requests as an ordered queue. Each request corresponds to one sentence and will produce one audio chunk. The audio playback system reads from this queue in order, starting each sentence's audio immediately after the previous one finishes. If the TTS for a later sentence finishes before the audio for an earlier sentence has played, the later audio waits in the queue. This guarantees correct ordering even if TTS processing times vary.
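One way to get this ordering guarantee is a small reorder buffer keyed by sentence index. The sketch below is illustrative; the class name OrderedAudioQueue and its methods are hypothetical, and T stands for whatever your playback code consumes (for example a decoded AudioBuffer).

```typescript
// Reorder buffer: TTS results may finish out of order, playback must not (sketch).
class OrderedAudioQueue<T> {
  private ready: { [index: number]: T } = {};
  private nextToIssue = 0;
  private nextToPlay = 0;

  // Reserve a slot when a sentence is dispatched to TTS.
  reserve(): number {
    return this.nextToIssue++;
  }

  // Store the finished audio for a previously reserved slot.
  deliver(index: number, audio: T): void {
    this.ready[index] = audio;
  }

  // Return all audio that may be scheduled now, in sentence order.
  drain(): T[] {
    const out: T[] = [];
    while (this.nextToPlay in this.ready) {
      out.push(this.ready[this.nextToPlay] as T);
      delete this.ready[this.nextToPlay];
      this.nextToPlay++;
    }
    return out;
  }
}
```

Call reserve() when a sentence leaves the sentence buffer, deliver() when its TTS response resolves, and drain() after every delivery. A later sentence that finishes early simply waits until the earlier slots are filled.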
Handle the edge case where the LLM generates a very long sentence without punctuation. Set a maximum buffer length (around 200 characters) and dispatch the buffer contents at that threshold even without a sentence boundary. This prevents the buffer from growing indefinitely and ensures the player does not wait through an unusually long silence.
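A minimal sketch of the length cap follows. Cutting at the last space before the threshold, so that a word is not split in half, is an assumption added here rather than part of the rule above; the function name forceFlush is hypothetical.

```typescript
// Returns [textToDispatch, remainingBuffer]. textToDispatch is null while
// the buffer is still under the limit (sketch).
function forceFlush(buffer: string, maxLength: number = 200): [string | null, string] {
  if (buffer.length < maxLength) return [null, buffer];
  // Assumption: prefer cutting at the last space at or before the limit.
  const lastSpace = buffer.lastIndexOf(" ", maxLength);
  const cut = lastSpace > 0 ? lastSpace : maxLength;
  const head = buffer.slice(0, cut).trim();
  const tail = buffer.slice(cut).replace(/^\s+/, "");
  return [head, tail];
}
```

Run this check after each append that did not produce a sentence boundary, and send the returned text to TTS as if it were a sentence.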
Stream TTS Audio to the Player with Immediate Playback
When the first TTS response arrives, begin audio playback immediately. Use the Web Audio API to create an AudioBufferSourceNode for each audio chunk. Connect each source through an AnalyserNode (for lip sync data) to the audio destination. Schedule each source with start(when) on the audioContext.currentTime clock, setting when to the moment the previous sentence ends, so that consecutive sentences play without gaps.
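The scheduling arithmetic can be kept separate from the Web Audio objects, which makes it easy to test. This is a minimal sketch; the class name GaplessScheduler and the leadTime parameter (a small safety margin so the start time is never in the past) are assumptions.

```typescript
// Computes start times so that consecutive clips play back-to-back (sketch).
class GaplessScheduler {
  private nextStart = 0; // context time at which already-scheduled audio ends

  // `now` is audioContext.currentTime; `duration` is the AudioBuffer's duration.
  // Returns the value to pass to AudioBufferSourceNode.start().
  schedule(now: number, duration: number, leadTime: number = 0.02): number {
    // If the queue has run dry, start shortly after `now` instead of in the past.
    const startAt = Math.max(this.nextStart, now + leadTime);
    this.nextStart = startAt + duration;
    return startAt;
  }
}
```

In the browser this would be used as source.start(scheduler.schedule(audioContext.currentTime, buffer.duration)). The returned start time is also the sentence start time that the lip sync timeline needs.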
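For TTS services that stream audio in chunks within a single sentence (ElevenLabs streaming, OpenAI PCM streaming), accumulate the chunks into a single AudioBuffer before playback, or, for encoded formats such as MP3, use a MediaSource with a SourceBuffer to append chunks as they arrive and play them seamlessly. A SourceBuffer accepts only encoded formats the browser supports, not raw PCM; for PCM streams, wrap small groups of chunks in their own AudioBuffers and schedule them back-to-back instead. The MediaSource approach is more complex to implement but provides true streaming playback where audio begins before the full sentence has been synthesized.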
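For the accumulate-then-play option with raw PCM, the chunks have to be joined and converted to floating-point samples before they can fill an AudioBuffer. The sketch below assumes 16-bit signed little-endian mono PCM; check your provider's documentation for the actual sample format and sample rate. The function name is hypothetical.

```typescript
// Joins raw PCM chunks and converts them to samples in [-1, 1) (sketch).
// Assumption: 16-bit signed little-endian, single channel.
function pcm16ChunksToFloat32(chunks: Uint8Array[]): Float32Array {
  let totalBytes = 0;
  for (const c of chunks) totalBytes += c.length;

  // Join bytes first: a network chunk may end in the middle of a sample.
  const bytes = new Uint8Array(totalBytes);
  let offset = 0;
  for (const c of chunks) {
    bytes.set(c, offset);
    offset += c.length;
  }

  const sampleCount = Math.floor(totalBytes / 2);
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return samples;
}
```

The result can be copied into an AudioBuffer created with audioContext.createBuffer(1, samples.length, sampleRate) using copyToChannel(samples, 0).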
Track the global playback position across all sentences. Each sentence's audio has a start time relative to the beginning of the conversation. The lip sync system needs this global position to look up the correct viseme data. Maintain a running total: when sentence N finishes, sentence N+1's start time is the current global time. The viseme timeline for each sentence uses offsets relative to its own start, so convert them to global time by adding the sentence start time.
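The offset conversion can be sketched as below. The VisemeCue shape, the class name, and the method names are hypothetical; times are in seconds, and cues are assumed to be added in playback order so the merged list stays sorted.

```typescript
// time is in seconds; local cues are relative to their sentence's start.
interface VisemeCue {
  time: number;
  visemeId: number;
}

// Merges per-sentence viseme timelines into one global timeline (sketch).
class GlobalVisemeTimeline {
  private cues: VisemeCue[] = [];

  // sentenceStart is the global time at which this sentence's audio begins.
  addSentence(sentenceStart: number, localCues: VisemeCue[]): void {
    for (const cue of localCues) {
      this.cues.push({ time: sentenceStart + cue.time, visemeId: cue.visemeId });
    }
  }

  // Viseme active at globalTime: the last cue at or before it, or null.
  visemeAt(globalTime: number): number | null {
    let current: number | null = null;
    for (const cue of this.cues) {
      if (cue.time > globalTime) break;
      current = cue.visemeId;
    }
    return current;
  }
}
```

Passing the actual scheduled start time, rather than a running sum of durations, keeps the timeline correct even when an underrun inserts silence between sentences.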
Monitor the audio queue for underruns. If the player has finished listening to all queued audio but the LLM is still generating (the TTS has not delivered the next sentence yet), the character enters a "thinking" silence. During this pause, keep the facial animation running with idle behaviors (blinks, saccades, subtle expression shifts) so the character does not freeze. When the next audio chunk arrives, resume lip sync smoothly.
Extract Viseme Data and Feed the Lip Sync System
If using Azure Speech, the visemeReceived events arrive before or concurrent with the audio. Store them in a per-sentence timeline array. When playback of that sentence begins, activate the timeline and drive morph targets using the same timeline playback mechanism described in the Babylon.js and Three.js lip sync guides. The viseme IDs map directly to morph targets through your viseme-to-index dictionary.
If using a TTS provider without viseme data, connect the audio output to an AnalyserNode and run real-time analysis each frame. The simplest approach is amplitude-based jaw animation: compute RMS from the time-domain data and map it to jawOpen intensity. For better quality, use the frequency-band estimation approach described in the engine-specific guides, mapping low-band energy to open vowels, mid-band to mid vowels, and high-band to consonants.
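The amplitude-based approach comes down to a few lines. This is a minimal sketch: samples is the array filled by AnalyserNode.getFloatTimeDomainData(), and the gain parameter is a hypothetical tuning value, since speech RMS rarely approaches 1.

```typescript
// Maps the RMS level of one analysis frame to a jawOpen weight in [0, 1] (sketch).
function rmsToJawOpen(samples: Float32Array, gain: number = 4): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  return Math.min(1, rms * gain); // gain is an assumed tuning value; clamp to 1
}
```

Call this once per rendered frame and feed the result through the same smoothing you use for other morph targets, otherwise the jaw will flutter.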
For ElevenLabs with alignment data, use the timestamps to create a hybrid approach. The alignment response reports start and end times per character; group the characters between spaces to recover when each word starts and ends in the audio. Look up each word in a pronunciation dictionary to get its phoneme sequence, distribute the phonemes evenly across the word's time span, map phonemes to visemes, and build a synthetic viseme timeline. This produces better results than pure audio analysis because it uses linguistic information about what words are being spoken, not just how they sound.
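The even-distribution step can be sketched as follows. The WordTiming and TimedViseme shapes, the dictionary formats, and the fallback viseme name "sil" are all hypothetical; substitute your own pronunciation dictionary and viseme set.

```typescript
interface WordTiming {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

interface TimedViseme {
  time: number; // seconds
  viseme: string;
}

// Builds a synthetic viseme timeline from word timings (sketch).
function wordsToVisemes(
  words: WordTiming[],
  pronunciations: { [word: string]: string[] },
  phonemeToViseme: { [phoneme: string]: string }
): TimedViseme[] {
  const cues: TimedViseme[] = [];
  for (const w of words) {
    const phonemes = pronunciations[w.word.toLowerCase()];
    if (!phonemes || phonemes.length === 0) continue; // unknown word: no cues
    // Each phoneme gets an equal share of the word's duration.
    const step = (w.end - w.start) / phonemes.length;
    phonemes.forEach((p, i) => {
      cues.push({ time: w.start + i * step, viseme: phonemeToViseme[p] ?? "sil" });
    });
  }
  return cues;
}
```

Words missing from the dictionary produce no cues here, so it is reasonable to fall back to audio analysis for those spans. Strip punctuation from each word before the lookup.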
Regardless of the viseme source, apply the same smoothing and interpolation as with pre-recorded dialogue. The morph target system does not care whether the viseme data came from a pre-baked Rhubarb analysis, an Azure viseme stream, or a real-time frequency estimator. It receives weights and applies them with lerp smoothing.
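A minimal sketch of the smoothing step follows. The exponential blend factor, which makes the lerp independent of frame rate, and the speed parameter are assumptions; a fixed lerp factor per frame also works if your frame rate is stable.

```typescript
// Moves each morph target weight toward its target value (sketch).
// speed is an assumed tuning value: higher means snappier mouth movement.
function smoothWeights(
  current: number[],
  target: number[],
  deltaSeconds: number,
  speed: number = 12
): number[] {
  // Blend factor in [0, 1]; 0 when no time has passed.
  const t = 1 - Math.exp(-speed * deltaSeconds);
  return current.map((w, i) => w + ((target[i] ?? 0) - w) * t);
}
```

Call it once per frame with the frame's delta time, then write the returned weights to the mesh's morph targets. The same function handles the decay to neutral: pass an all-zero target array.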
Handle Interruptions and Conversation Flow
Players expect to interrupt NPCs, especially in game scenarios where they have already heard enough or want to redirect the conversation. When the player submits new input while the NPC is still speaking, implement a clean cutoff: stop the current audio immediately (call stop() on the AudioBufferSourceNode or disconnect it), clear the TTS audio queue, cancel any pending TTS requests (call abort() on each request's AbortController), and optionally cancel the streaming LLM response if the conversation context has changed.
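The cutoff steps can be sketched as one method on a session object. The Abortable and Stoppable interfaces are hypothetical stand-ins that AbortController and AudioBufferSourceNode already satisfy, which keeps the sketch runnable outside a browser; the class and method names are also hypothetical.

```typescript
interface Abortable {
  abort(): void; // AbortController has this shape
}

interface Stoppable {
  stop(): void; // AudioBufferSourceNode has this shape
}

// Tracks everything that must be torn down when the player interrupts (sketch).
class SpeechSession<TAudio> {
  private requests: Abortable[] = [];
  private sources: Stoppable[] = [];
  private queued: TAudio[] = [];

  trackRequest(request: Abortable): void {
    this.requests.push(request);
  }

  trackSource(source: Stoppable): void {
    this.sources.push(source);
  }

  enqueue(audio: TAudio): void {
    this.queued.push(audio);
  }

  queuedCount(): number {
    return this.queued.length;
  }

  interrupt(): void {
    for (const source of this.sources) {
      try {
        source.stop();
      } catch {
        // stop() can throw if the node was never started; safe to ignore here.
      }
    }
    for (const request of this.requests) request.abort();
    this.sources = [];
    this.requests = [];
    this.queued = [];
  }
}
```

Register the LLM stream's AbortController with trackRequest as well if you want the interruption to cancel generation. After interrupt() returns, the session is empty and ready for the next turn.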
After the cutoff, smoothly transition the character's face back to neutral over 200 to 300 milliseconds using the standard lerp decay on all morph targets. Then submit the new player input to the LLM and restart the pipeline from the beginning. Preserve the conversation history so the LLM has context about what was said before the interruption.
Also handle natural conversation turn-taking. When the NPC finishes speaking (all audio has played, the LLM has stopped generating), transition to a listening pose: face forward, eyes on the player, mouth closed, expression attentive. If your game uses voice input, activate the microphone at this point. If it uses text input, display the input field. The character should not freeze after speaking. It should visibly wait, with blinks, subtle gaze shifts, and gentle breathing animation keeping it alive during the player's thinking time.
Optimize for Perceived Responsiveness
The actual latency from player input to first audio, even with full streaming, is typically 500ms to 1700ms. This is the sum of the LLM time-to-first-token (200 to 800ms), the sentence buffer fill time (100 to 400ms), and the TTS processing time (200 to 500ms). Although every stage streams, each one still needs a minimum amount of processing time before it can emit its first output.
Reduce perceived latency with visual cues. When the player submits input, immediately show the character reacting: a slight head tilt, a brow raise, an intake of breath (a short inhale sound effect). These cues signal that the character heard the player and is about to respond. They buy 300 to 500ms of subjective latency tolerance because the player interprets the animation as "thinking" rather than "loading."
Pre-generate conversation starters. If your game can predict likely conversation scenarios (an NPC the player approaches, a quest giver with a known trigger), pre-generate the first sentence of the response before the player speaks. Cache the audio and viseme data. When the player actually initiates conversation, play the cached first sentence immediately while the rest of the response generates in real time. This can reduce perceived first-response latency to near zero for predictable interactions.
Use smaller, faster models for the LLM stage when quality permits. GPT-4o-mini or Claude Haiku produce responses with significantly lower latency than their larger counterparts. For ambient NPC chatter, lore explanations, or simple quest directions, the quality difference is negligible and the latency improvement is substantial. Reserve larger models for narrative-critical dialogue where response quality justifies the wait.
The key to responsive LLM-voiced characters is streaming at every stage. Buffer LLM tokens into sentences, dispatch each sentence to TTS immediately, begin audio playback from the first chunk, and run lip sync analysis in real time. The stages overlap so the player hears the character begin speaking while the rest of the response is still being generated.