Facial Animation: Eyes, Blinks and Expression
Facial animation runs as a set of independent systems that each control a specific region of the face. The lip sync system owns the mouth. The blink system owns the eyelids. The gaze system owns the eyes. The emotion system owns the eyebrows, cheeks, and forehead. Each system writes to its own subset of morph targets, and the GPU blends them all together each frame. This separation is essential because it prevents one system from overwriting another's output.
Set Up a Procedural Blink System
Humans blink every 2 to 6 seconds on average, with significant natural variation. The blink itself is fast: the eyelids close in about 80 milliseconds, stay closed for 50 to 100 milliseconds, then open in about 150 milliseconds. The asymmetry between close speed and open speed is important for realism. A blink that opens as quickly as it closes looks mechanical.
Implement the blink as a state machine with four phases: waiting, closing, holding, and opening. During the waiting phase, a countdown timer runs from a random value between 2.0 and 6.0 seconds. When the timer reaches zero, the system enters the closing phase and drives the eyeBlink morph target (or eyeBlinkLeft and eyeBlinkRight for separate control) from 0.0 to 1.0 over 80 milliseconds. The holding phase keeps the target at 1.0 for 50 to 100 milliseconds, then the opening phase drives it from 1.0 back to 0.0 over 150 milliseconds. The waiting timer then resets to a new random value.
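As a minimal sketch of that state machine, the class below treats the brief hold at full closure as its own state. The class name, field names, and the 75 millisecond hold (the midpoint of the 50 to 100 millisecond range) are assumptions made for illustration; the random source is injectable so the timing can be tested deterministically.

```typescript
// Hypothetical sketch: names are illustrative and not tied to any engine.
type BlinkPhase = "waiting" | "closing" | "holding" | "opening";

class BlinkController {
  phase: BlinkPhase = "waiting";
  weight = 0; // value to write to the eyeBlink morph target(s), 0..1

  private timer: number; // countdown while waiting, elapsed time otherwise
  private random: () => number;
  private closeTime = 0.08; // seconds
  private holdTime = 0.075; // assumed midpoint of the 50 to 100 ms hold
  private openTime = 0.15; // seconds

  constructor(random: () => number = Math.random) {
    this.random = random;
    this.timer = this.nextInterval();
  }

  private nextInterval(): number {
    return 2.0 + this.random() * 4.0; // 2.0 to 6.0 seconds
  }

  private enter(next: BlinkPhase): void {
    this.phase = next;
    this.timer = next === "waiting" ? this.nextInterval() : 0;
  }

  // Advance by dt seconds and return the new eyelid weight.
  update(dt: number): number {
    if (this.phase === "waiting") {
      this.timer -= dt;
      if (this.timer <= 0) this.enter("closing");
    } else {
      this.timer += dt;
      if (this.phase === "closing") {
        this.weight = Math.min(this.timer / this.closeTime, 1);
        if (this.timer >= this.closeTime) this.enter("holding");
      } else if (this.phase === "holding") {
        this.weight = 1;
        if (this.timer >= this.holdTime) this.enter("opening");
      } else {
        this.weight = Math.max(1 - this.timer / this.openTime, 0);
        if (this.timer >= this.openTime) this.enter("waiting");
      }
    }
    return this.weight;
  }
}
```

Call update once per frame with the frame's delta time in seconds and write the returned weight to the eyelid morph targets.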
Add variation to prevent robotic regularity. Occasionally trigger a double blink, where the waiting phase after the first blink is only 200 to 400 milliseconds before blinking again. About 1 in 8 blinks should be a double. Also vary the intensity: some blinks are full lid closure (1.0), while others are partial "sleepy" blinks (0.7 to 0.9). During active conversation, increase the blink rate slightly (2.0 to 4.0 second intervals) because humans blink more frequently when engaged in social interaction. During idle states, decrease it (4.0 to 7.0 second intervals).
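One way to express this variation is a small planning function that decides the delay and intensity of the next blink. The sketch below is illustrative only: the function and type names are hypothetical, the 30% share of partial blinks is an assumption (the ranges above do not specify a ratio), and it never chains two doubles in a row, which is also an assumption.

```typescript
// Hypothetical sketch of blink variation; names and the partial-blink ratio are assumptions.
type BlinkMode = "conversation" | "idle" | "default";

interface BlinkPlan {
  delay: number; // seconds to wait before the blink starts
  intensity: number; // peak eyelid weight, 0..1
  isDouble: boolean; // true when this blink quickly follows the previous one
}

const BLINK_INTERVALS: Record<BlinkMode, [number, number]> = {
  conversation: [2.0, 4.0],
  idle: [4.0, 7.0],
  default: [2.0, 6.0],
};

function randomInRange(random: () => number, min: number, max: number): number {
  return min + random() * (max - min);
}

function planNextBlink(
  mode: BlinkMode,
  previousWasDouble: boolean,
  random: () => number = Math.random,
): BlinkPlan {
  // Roughly 1 in 8 blinks is followed by a quick second blink.
  const isDouble = !previousWasDouble && random() < 1 / 8;
  const [min, max] = BLINK_INTERVALS[mode];
  const delay = isDouble
    ? randomInRange(random, 0.2, 0.4)
    : randomInRange(random, min, max);
  // Assumed: about 30% of blinks are partial "sleepy" blinks.
  const intensity = random() < 0.3 ? randomInRange(random, 0.7, 0.9) : 1.0;
  return { delay, intensity, isDouble };
}
```

The returned delay replaces the fixed 2.0 to 6.0 second countdown, and the intensity scales the peak weight reached during the closing phase.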
The morph targets for blinking should be separate from lip sync targets. If your model uses the ARKit blend shape set, use eyeBlinkLeft and eyeBlinkRight. If you authored custom shapes, add dedicated blink_L and blink_R targets that only affect the eyelid geometry. Never use a generic "eyes closed" target that also affects the brow or cheek, because the lip sync or emotion systems might need those regions simultaneously.
Implement Eye Gaze and Saccades
Eye gaze is controlled through bone rotation rather than morph targets. Each eye in the character rig has a bone (typically named eye_L and eye_R) that rotates the eyeball geometry. To make the character look at a target, calculate the direction vector from each eye bone's world position to the target's world position, convert that to local rotation angles, and apply them to the bone transforms. Both Babylon.js and Three.js expose bone transforms through their skeletal animation systems.
In Babylon.js, access the eye bones through the skeleton's bones array or by name with skeleton.bones.find(). Set the bone's rotation quaternion to face the target direction. In Three.js, find the bones in the loaded model's skeleton and set their quaternion property. Limit the rotation range to a natural maximum, approximately 30 degrees horizontally and 20 degrees vertically. Eyes that rotate beyond this range look unnatural because real humans turn their head rather than rotating their eyes to extreme angles.
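The angle calculation and clamping can be sketched independently of either engine. The function below is a hypothetical helper, not part of Babylon.js or Three.js. It assumes the direction to the target has already been transformed into the eye bone's rest space, and that this space uses +x right, +y up, and +z forward; real rigs often differ, so adjust the axes to match yours.

```typescript
// Hypothetical helper: converts a gaze direction into clamped yaw/pitch angles.
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

const MAX_YAW_DEG = 30; // horizontal limit from the text
const MAX_PITCH_DEG = 20; // vertical limit from the text

function clampValue(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// direction: vector from the eye to the target in the eye's rest space
// (assumed axes: +x right, +y up, +z forward). It need not be normalized.
function gazeAngles(direction: Vec3): { yawDeg: number; pitchDeg: number } {
  const toDegrees = 180 / Math.PI;
  const yaw = Math.atan2(direction.x, direction.z) * toDegrees;
  const horizontalLength = Math.hypot(direction.x, direction.z);
  const pitch = Math.atan2(direction.y, horizontalLength) * toDegrees;
  return {
    yawDeg: clampValue(yaw, -MAX_YAW_DEG, MAX_YAW_DEG),
    pitchDeg: clampValue(pitch, -MAX_PITCH_DEG, MAX_PITCH_DEG),
  };
}
```

The clamped angles can then be converted to a quaternion with the engine's own math utilities and assigned to each eye bone. Clamping in angle space rather than on the finished quaternion keeps the horizontal and vertical limits independent.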
Saccades are the small, rapid eye movements that occur between fixation points. Real eyes never hold perfectly still. They exhibit micro-saccades every 200 to 600 milliseconds, tiny jittering movements of 1 to 3 degrees that are a strong cue for biological life. Implement saccades by adding a small random offset to the gaze target direction at random intervals. Use a timer similar to the blink timer: every 200 to 600 milliseconds, generate a new random offset in both horizontal and vertical axes, each between -2 and +2 degrees. Lerp toward this offset over 50 milliseconds (saccades are fast) then hold until the next saccade fires.
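A possible shape for the saccade timer is sketched below. The class and its fields are hypothetical; it produces yaw and pitch offsets in degrees that would be added to the gaze angles before any range limiting is applied.

```typescript
// Hypothetical sketch of a saccade offset generator.
class SaccadeGenerator {
  yawDeg = 0; // current horizontal offset to add to the gaze direction
  pitchDeg = 0; // current vertical offset

  private fromYaw = 0;
  private fromPitch = 0;
  private toYaw = 0;
  private toPitch = 0;
  private elapsed = 0; // seconds since the last saccade fired
  private nextIn: number; // seconds until the next saccade fires
  private random: () => number;

  constructor(random: () => number = Math.random) {
    this.random = random;
    this.nextIn = this.pickInterval();
  }

  private pickInterval(): number {
    return 0.2 + this.random() * 0.4; // 200 to 600 ms
  }

  private pickOffset(): number {
    return (this.random() * 2 - 1) * 2; // -2 to +2 degrees
  }

  update(dt: number): void {
    this.elapsed += dt;
    if (this.elapsed >= this.nextIn) {
      // Fire a new saccade, starting from wherever the offset currently is.
      this.fromYaw = this.yawDeg;
      this.fromPitch = this.pitchDeg;
      this.toYaw = this.pickOffset();
      this.toPitch = this.pickOffset();
      this.elapsed = 0;
      this.nextIn = this.pickInterval();
    }
    // Move to the new offset over 50 ms, then hold until the next saccade.
    const t = Math.min(this.elapsed / 0.05, 1);
    this.yawDeg = this.fromYaw + (this.toYaw - this.fromYaw) * t;
    this.pitchDeg = this.fromPitch + (this.toPitch - this.fromPitch) * t;
  }
}
```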
During conversation, the character should look at the speaker or the player. Set the gaze target to the camera position (for first-person perspective) or the other character's head bone position (for third-person). When the character itself is speaking, occasionally shift gaze slightly away and then back, which mimics the natural pattern of looking away momentarily while formulating thoughts. This can be triggered randomly every 3 to 8 seconds during active speech.
Add Eyebrow Animation Driven by Audio Prosody
Eyebrows track the pitch contour of speech. When a speaker emphasizes a word, their fundamental frequency (F0) rises, and their eyebrows lift simultaneously. This correlation is strong enough that even a simple mapping from audio pitch to eyebrow height adds noticeable life to a speaking character.
To extract pitch, use the AnalyserNode's frequency data. The fundamental frequency of speech typically falls between 85 Hz (deep male voice) and 300 Hz (high female voice). With an FFT size of 2048 at a sample rate of 44100 Hz, each frequency bin spans about 21.5 Hz, so the 80 to 350 Hz range covers only about a dozen bins (bins 4 through 16). Look at those bins and find the one with the highest energy, as this approximates the current pitch to within one bin width. Alternatively, use an autocorrelation method on the time-domain data, which is more accurate for pitch detection but more computationally expensive.
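The peak-bin approach can be sketched as a pure function over the byte array that AnalyserNode.getByteFrequencyData fills. The function name and the minLevel silence gate are assumptions added for illustration; the gate simply returns null when no bin in the range is loud enough to be trusted.

```typescript
// Hypothetical helper: estimates pitch from one frame of byte frequency data.
// spectrum is expected to hold fftSize / 2 values in the range 0..255.
function estimatePitchHz(
  spectrum: Uint8Array,
  sampleRate: number,
  fftSize: number,
  minHz: number = 80,
  maxHz: number = 350,
  minLevel: number = 40, // assumed silence gate, tune for your audio
): number | null {
  const binWidth = sampleRate / fftSize; // about 21.5 Hz at 44100 / 2048
  const firstBin = Math.ceil(minHz / binWidth);
  const lastBin = Math.min(Math.floor(maxHz / binWidth), spectrum.length - 1);

  let bestBin = -1;
  let bestLevel = minLevel;
  for (let i = firstBin; i <= lastBin; i++) {
    if (spectrum[i] > bestLevel) {
      bestLevel = spectrum[i];
      bestBin = i;
    }
  }
  // Bin i is centered on i * binWidth Hz.
  return bestBin < 0 ? null : bestBin * binWidth;
}
```

In a browser, the spectrum would come from something like `const data = new Uint8Array(analyser.frequencyBinCount); analyser.getByteFrequencyData(data);` called once per frame. Note that the strongest bin is sometimes a harmonic rather than the fundamental, which is one reason the autocorrelation route is more accurate.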
Map the detected pitch to eyebrow height. Establish a baseline pitch by averaging the detected pitch over the first few seconds of speech. When the current pitch exceeds the baseline by more than 20%, raise the eyebrow morph targets proportionally. A pitch 50% above baseline should produce approximately 0.5 eyebrow raise weight. Cap the maximum at 0.7 to prevent cartoonish over-raising. When pitch drops below baseline, do not lower the eyebrows below neutral (0.0), as lowered eyebrows signal specific emotions (anger, concern) that should be handled by the emotion system, not by pitch tracking.
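Read literally, this mapping is a thresholded, capped ratio. The sketch below is one interpretation, with a hypothetical function name: the weight equals the fraction by which pitch exceeds the baseline, is zero inside the 20% band or below baseline, and is capped at 0.7.

```typescript
// Hypothetical helper: maps detected pitch to an eyebrow raise weight.
function browWeightFromPitch(pitchHz: number | null, baselineHz: number): number {
  // No reliable pitch (for example during silence) means neutral brows.
  if (pitchHz === null || baselineHz <= 0) return 0;
  const excess = pitchHz / baselineHz - 1; // 0.5 means 50% above baseline
  if (excess <= 0.2) return 0; // inside the 20% band, or below baseline
  return Math.min(excess, 0.7); // cap to avoid cartoonish over-raising
}
```

This reading jumps from 0 to just above 0.2 as pitch crosses the threshold. Smoothing the output over time hides that step; if it is still visible, remapping the 20% to 70% range onto 0.0 to 0.7 is an alternative, though it no longer gives 0.5 at 50% above baseline.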
The eyebrow target in the ARKit set is browInnerUp, with browOuterUpLeft and browOuterUpRight available for asymmetric control. With custom shapes, use dedicated brow_raise targets. Apply a lerp smoothing factor of 0.15 to 0.2 per frame to prevent twitchy response to frame-by-frame pitch variation. The eyebrows should follow the general prosodic contour, not every micro-fluctuation in the audio.
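The smoothing step is a single lerp toward the target weight each frame. The helpers below are a sketch with hypothetical names. The second function rests on an assumption not stated above: that the 0.15 to 0.2 factor was tuned at 60 frames per second. Under that assumption it rescales the factor so the response is similar at other frame rates.

```typescript
// Hypothetical helpers for smoothing a morph target weight over time.

// Move a fraction of the remaining distance toward the target.
function smoothToward(current: number, target: number, factor: number): number {
  return current + (target - current) * factor;
}

// Assumption: the quoted factor was tuned at 60 fps. This converts it to an
// equivalent factor for a frame that lasted dt seconds.
function factorForFrame(factorAt60Fps: number, dt: number): number {
  return 1 - Math.pow(1 - factorAt60Fps, dt * 60);
}
```

A per-frame update would then look like `brow = smoothToward(brow, targetWeight, factorForFrame(0.15, dt))`.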
Build an Emotion Layer with Blend Shape Presets
Define emotion presets as named configurations of blend shape weights. A "happy" preset might set mouthSmileLeft to 0.6, mouthSmileRight to 0.6, cheekSquintLeft to 0.3, and cheekSquintRight to 0.3. A "concerned" preset might set browInnerUp to 0.4, browDownLeft to 0.2, and browDownRight to 0.2. A "surprised" preset might set browInnerUp to 0.7, eyeWideLeft to 0.5, and eyeWideRight to 0.5. Leave jawOpen out of emotion presets even for surprise, because the lip sync system drives that shape.
Store these presets as plain data objects. To transition between emotions, crossfade from the current preset to the target preset over 500 to 1000 milliseconds using a smoothstep or ease-in-ease-out curve. During the transition, each blend shape interpolates from its value in the outgoing preset to its value in the incoming preset. A neutral preset with all weights at 0.0 serves as the default resting state.
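A minimal sketch of presets as plain data with a smoothstep crossfade follows. The type and function names are hypothetical, and only two presets are shown; others follow the same shape. Any blend shape missing from a preset is treated as 0.0, which is what lets an empty object serve as neutral.

```typescript
// Hypothetical sketch: emotion presets as plain data plus a crossfade.
type EmotionPreset = Record<string, number>;

const EMOTION_PRESETS: Record<string, EmotionPreset> = {
  neutral: {}, // every shape implicitly at 0.0
  happy: {
    mouthSmileLeft: 0.6,
    mouthSmileRight: 0.6,
    cheekSquintLeft: 0.3,
    cheekSquintRight: 0.3,
  },
};

// Ease-in-ease-out curve on 0..1.
function smoothstep(t: number): number {
  const x = Math.min(Math.max(t, 0), 1);
  return x * x * (3 - 2 * x);
}

// t is transition progress: elapsed time divided by transition duration.
function blendPresets(from: EmotionPreset, to: EmotionPreset, t: number): EmotionPreset {
  const k = smoothstep(t);
  const result: EmotionPreset = {};
  // Visit every shape named by either preset; duplicates are harmless.
  for (const name of Object.keys(from).concat(Object.keys(to))) {
    const a = from[name] ?? 0;
    const b = to[name] ?? 0;
    result[name] = a + (b - a) * k;
  }
  return result;
}
```

Each frame during a transition, call blendPresets with the outgoing preset, the incoming preset, and the elapsed fraction of the 500 to 1000 millisecond duration, then write the resulting weights to the model.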
Trigger emotion changes from your dialogue system. If the dialogue text includes emotion tags (many LLMs can output these when prompted), parse the tag and switch to the corresponding preset. If no tags are available, simple keyword analysis can approximate emotion: words like "sorry" or "worried" trigger concern, words like "great" or "wonderful" trigger happiness, questions trigger slight surprise. The emotion layer should change infrequently, once per sentence or dialogue beat, not per word. Rapid emotion switching looks erratic.
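As an illustration of the tag-then-keyword fallback, the sketch below picks one emotion per sentence. The function name and the bracketed tag format such as "[happy]" at the start of a line are assumptions; adapt the pattern to whatever format your dialogue system or LLM prompt actually produces. The keyword lists are just the examples above and would need expanding in practice.

```typescript
// Hypothetical sketch: choose an emotion preset name for one sentence.
const KNOWN_EMOTIONS = ["neutral", "happy", "concerned", "surprised"];

function pickEmotion(sentence: string): string {
  // Assumed tag format: "[happy] Nice to see you".
  const tag = /^\s*\[(\w+)\]/.exec(sentence);
  if (tag) {
    const name = tag[1].toLowerCase();
    if (KNOWN_EMOTIONS.indexOf(name) >= 0) return name;
  }
  // Fallback: simple keyword analysis on the text.
  const text = sentence.toLowerCase();
  if (/\b(sorry|worried)\b/.test(text)) return "concerned";
  if (/\b(great|wonderful)\b/.test(text)) return "happy";
  if (/\?\s*$/.test(text)) return "surprised"; // questions get slight surprise
  return "neutral";
}
```

Calling this once per sentence or dialogue beat, rather than per word, keeps the emotion layer from switching erratically.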
Be careful about which blend shapes the emotion layer controls. It must not write to shapes owned by the lip sync system (mouth viseme targets) or the blink system (eyeBlink and eyeSquint targets). The emotion presets should only use cheek, brow, nose, and forehead shapes, plus eye-widening shapes (eyeWideLeft, eyeWideRight) that the blink system does not drive. If an emotion needs to affect the mouth (like a smile), use the smile-specific shapes (mouthSmileLeft, mouthSmileRight) rather than the jaw or lip shapes used by lip sync. This prevents the emotion from fighting the mouth animation.
Layer All Systems Without Conflict
The key to clean facial animation is strict ownership of blend shapes. Define a manifest that assigns every blend shape on the model to exactly one system. Lip sync owns the jaw and lip shapes (jawOpen, mouthClose, mouthFunnel, mouthPucker, etc., plus all viseme targets). Blink owns the eyelid shapes (eyeBlinkLeft, eyeBlinkRight, eyeSquintLeft, eyeSquintRight). Gaze owns the eye bones (eye_L, eye_R) and optionally the eyeLookDown/Up/In/Out shapes. Emotion owns the brow, cheek, nose, and smile shapes, along with any eye-widening shapes (eyeWideLeft, eyeWideRight) its presets use.
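Such a manifest can be plain data with a small check that runs at load time. The sketch below is illustrative: the names are hypothetical and the shape lists are abbreviated. It reports any shape claimed by two systems and any shape on the model that no system owns.

```typescript
// Hypothetical sketch of an ownership manifest with validation.
type SystemName = "lipSync" | "blink" | "gaze" | "emotion";
type Manifest = Record<SystemName, string[]>;

const MANIFEST: Manifest = {
  lipSync: ["jawOpen", "mouthClose", "mouthFunnel", "mouthPucker"],
  blink: ["eyeBlinkLeft", "eyeBlinkRight", "eyeSquintLeft", "eyeSquintRight"],
  gaze: [], // gaze drives the eye bones, so it owns no blend shapes here
  emotion: [
    "browInnerUp",
    "cheekSquintLeft",
    "cheekSquintRight",
    "mouthSmileLeft",
    "mouthSmileRight",
  ],
};

function validateManifest(manifest: Manifest): {
  owners: Record<string, SystemName>;
  conflicts: string[];
} {
  const owners: Record<string, SystemName> = Object.create(null);
  const conflicts: string[] = [];
  for (const system of Object.keys(manifest) as SystemName[]) {
    for (const shape of manifest[system]) {
      if (owners[shape] !== undefined && owners[shape] !== system) {
        conflicts.push(shape); // claimed by two different systems
      } else {
        owners[shape] = system;
      }
    }
  }
  return { owners, conflicts };
}

// Shapes present on the model that no system has claimed.
function findUnowned(modelShapes: string[], owners: Record<string, SystemName>): string[] {
  return modelShapes.filter((shape) => owners[shape] === undefined);
}
```

Failing fast on a non-empty conflicts list catches ownership mistakes before they show up as two systems fighting over the same shape at runtime.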
Each system runs its update independently each frame and writes only to its assigned shapes. Because the GPU additively blends all morph targets, the outputs of all systems combine naturally. A character can simultaneously have a lip sync viseme active, eyes blinking, gaze directed at the player, and a slight happy expression, all without any system needing to know about the others.
Run the systems in a fixed order: gaze first (it uses bone transforms, not blend shapes), then blink, then emotion, then lip sync. Lip sync runs last because it is the highest-frequency animation and the one players are most likely to scrutinize during dialogue. This ordering does not affect the final visual result since each system writes to different shapes, but it makes debugging easier because you can disable systems from the end of the chain backward to isolate issues.
Tune and Test with Real Dialogue
Testing with actual dialogue sequences is essential because static test animations do not expose timing issues. Play a full conversation where the character speaks multiple sentences with pauses, emphasis, and emotional shifts. Watch for specific problems: blinks that coincide with key words (distracting), eyebrows that twitch too rapidly (lerp factor too high, which means too little smoothing), gaze that snaps to a new target without transition (missing lerp), or emotions that change too abruptly (transition duration too short).
Record screen captures of the character from the front at close range. Play them back at half speed to spot subtle artifacts that are invisible at full speed but register subconsciously as "something feels off." Common issues include: the mouth snapping shut between sentences instead of closing smoothly, eyebrows raising during silence (pitch detector noise on low-amplitude audio), and eye gaze not matching the conversation context (looking away when being addressed).
Tune parameters iteratively. Each system has 2 to 4 key parameters (blink interval range, saccade magnitude, eyebrow smoothing factor, emotion transition duration, lip sync blend speed). Adjust one parameter at a time and compare the result against the previous version. Keep a set of reference dialogue clips with varied emotional content, speaking rates, and pause patterns. Test every parameter change against the full set rather than just one clip, because improvements for fast speech sometimes degrade slow speech and vice versa.
Convincing facial animation is a layered system where each layer owns its region of the face. Blink cycles, eye gaze with saccades, pitch-driven eyebrow motion, and blend shape emotion presets each run independently and combine through the GPU's additive morph target blending. The result is a face that feels alive because every part of it is in subtle, independent motion.