Audio-Driven Character Animation
Audio-driven body animation works on a different principle than lip sync. Lip sync maps specific speech sounds to specific mouth shapes. Body animation maps broad audio features (amplitude, rhythm, and pitch contour) to generalized movements. A loud syllable triggers a slight head nod. A rising pitch triggers a subtle head tilt. A pause in speech triggers a weight shift. These mappings are approximate and procedural, which is why they work for dynamic, unpredictable dialogue where pre-authored animation is not practical.
Extract Audio Features for Animation
The AnalyserNode from the Web Audio API provides everything you need. Connect it to your audio source (the same connection used for lip sync), and each frame extract three categories of data.
Amplitude is the most useful feature. Call analyser.getByteTimeDomainData(timeDomainArray) and compute the RMS (root mean square) value: sum the squared differences of each sample from 128 (the unsigned byte midpoint), divide by the number of samples, and take the square root. Normalize to a 0 to 1 range by dividing by 128. This gives you a single number representing how loud the audio is at this moment. Store the RMS history over the last 10 to 20 frames to compute a moving average and detect peaks relative to the recent baseline.
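The RMS calculation described above can be sketched as follows. The names computeRms and RmsHistory are illustrative, not part of the Web Audio API, and the default history length of 15 frames is simply a value inside the 10 to 20 frame range mentioned above.

```typescript
// Normalized RMS (0..1) of unsigned-byte time-domain samples, as filled by
// analyser.getByteTimeDomainData(buffer). 128 is the silent midpoint.
function computeRms(samples: Uint8Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const centered = samples[i] - 128;
    sumSquares += centered * centered;
  }
  return Math.sqrt(sumSquares / samples.length) / 128;
}

// Hypothetical helper: keeps the last N RMS values for a moving average.
class RmsHistory {
  private values: number[] = [];
  private readonly capacity: number;

  constructor(capacity = 15) {
    this.capacity = capacity;
  }

  push(rms: number): void {
    this.values.push(rms);
    if (this.values.length > this.capacity) this.values.shift();
  }

  average(): number {
    if (this.values.length === 0) return 0;
    return this.values.reduce((sum, v) => sum + v, 0) / this.values.length;
  }
}
```

Each frame, you would fill a Uint8Array of length analyser.fftSize, pass it to computeRms, and push the result into the history before comparing the current value to the average.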
Pitch contour drives head tilt and eyebrow movement. Approximate the fundamental frequency using the autocorrelation method on the time-domain data: for each possible lag from 2 to half the buffer length, compute the correlation between the signal and a delayed copy. The lag with the highest correlation corresponds to the period of the fundamental frequency. Convert from lag to frequency: f = sampleRate / lag. This method is CPU-intensive, so run it every 3 to 5 frames rather than every frame, and cache the result between updates.
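A minimal sketch of the autocorrelation approach follows. It makes two assumptions that go beyond the text: the lag search is narrowed to a plausible speech range (80 to 400 Hz by default, a hypothetical choice), and very quiet buffers return null instead of a pitch. The function name estimatePitchHz is illustrative.

```typescript
// Rough fundamental-frequency estimate from unsigned-byte time-domain data.
// Returns null when the buffer is too quiet or no positive correlation is found.
function estimatePitchHz(
  samples: Uint8Array,
  sampleRate: number,
  minHz = 80, // assumed lower bound for speech pitch
  maxHz = 400 // assumed upper bound for speech pitch
): number | null {
  const n = samples.length;
  if (n === 0) return null;

  // Center the samples around zero and measure their energy.
  const centered = new Float32Array(n);
  let energy = 0;
  for (let i = 0; i < n; i++) {
    centered[i] = (samples[i] - 128) / 128;
    energy += centered[i] * centered[i];
  }
  if (energy / n < 1e-4) return null; // effectively silent

  // Lags run from at least 2 up to at most half the buffer length.
  const minLag = Math.max(2, Math.floor(sampleRate / maxHz));
  const maxLag = Math.min(Math.floor(n / 2), Math.floor(sampleRate / minHz));

  let bestLag = -1;
  let bestCorrelation = 0;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let correlation = 0;
    for (let i = 0; i < n - lag; i++) {
      correlation += centered[i] * centered[i + lag];
    }
    correlation /= n - lag; // normalize by the number of overlapping samples
    if (correlation > bestCorrelation) {
      bestCorrelation = correlation;
      bestLag = lag;
    }
  }
  return bestLag > 0 ? sampleRate / bestLag : null;
}
```

Narrowing the lag range also cuts the cost of the inner loop, which helps when the estimate runs only every few frames. Plain autocorrelation can still lock onto a multiple of the true period, so treat the result as a contour hint rather than an exact pitch.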
Spectral energy distribution helps distinguish speech from silence and identifies emphasis patterns. Call analyser.getByteFrequencyData(frequencyArray) and sum the energy in a few broad bands: low (0 to 500 Hz), mid (500 to 2000 Hz), and high (2000+ Hz). A sudden increase in broadband energy signals the onset of a new phrase or an emphasized word, which is a natural trigger point for a gesture or posture change.
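The band sums can be sketched like this. The interface and function names are illustrative. The sketch relies on the fact that the frequency array has analyser.frequencyBinCount entries spanning 0 Hz to half the sample rate, and it treats the byte values as a relative measure only, since getByteFrequencyData returns decibel-scaled magnitudes rather than linear energy.

```typescript
interface BandEnergy {
  low: number; // 0 to 500 Hz
  mid: number; // 500 to 2000 Hz
  high: number; // 2000 Hz and above
  total: number; // broadband sum
}

// Sums byte frequency data into three broad bands.
// `bins` is the array filled by analyser.getByteFrequencyData(bins).
function computeBandEnergy(bins: Uint8Array, sampleRate: number): BandEnergy {
  const result: BandEnergy = { low: 0, mid: 0, high: 0, total: 0 };
  if (bins.length === 0) return result;

  // Bins span 0 Hz up to the Nyquist frequency (sampleRate / 2).
  const hzPerBin = sampleRate / 2 / bins.length;
  for (let i = 0; i < bins.length; i++) {
    const hz = i * hzPerBin;
    const value = bins[i] / 255; // relative level, 0..1
    if (hz < 500) result.low += value;
    else if (hz < 2000) result.mid += value;
    else result.high += value;
  }
  result.total = result.low + result.mid + result.high;
  return result;
}
```

Keeping the previous frame's total alongside the current one is enough to detect the sudden broadband increase described above.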
Drive Head Movement from Speech Rhythm
Humans nod subtly during their own speech, especially on stressed syllables. The nod is small, typically 2 to 5 degrees of forward pitch, and it correlates with amplitude peaks. Detect amplitude peaks by comparing the current RMS to the moving average: when the current RMS exceeds the average by more than 30%, a peak is occurring. When a peak is detected, apply a small downward rotation to the head bone over 150 milliseconds, then return to neutral over 300 milliseconds.
Limit the nod frequency to prevent twitchy rapid-fire head movement. After triggering a nod, impose a cooldown of 400 to 600 milliseconds before allowing another. This ensures the character nods once per stressed word or phrase rather than on every syllable. The resulting motion looks like natural conversational emphasis rather than rhythmic bobbing.
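One way to combine the peak test, the nod envelope, and the cooldown is sketched below. NodController is a hypothetical class; the 4-degree nod and 500-millisecond cooldown are values picked from the ranges above, and the linear ramps are a simplification you may want to replace with an easing curve.

```typescript
// Returns the forward head pitch (in degrees) to add for the current frame.
class NodController {
  private lastNodMs = -Infinity;
  private readonly peakRatio = 1.3; // current RMS must exceed average by 30%
  private readonly cooldownMs = 500; // within the 400 to 600 ms range
  private readonly downMs = 150;
  private readonly returnMs = 300;
  private readonly maxDegrees = 4; // within the 2 to 5 degree range

  update(rms: number, movingAverage: number, nowMs: number): number {
    const isPeak = movingAverage > 0 && rms > movingAverage * this.peakRatio;
    if (isPeak && nowMs - this.lastNodMs >= this.cooldownMs) {
      this.lastNodMs = nowMs; // start a new nod
    }
    return this.angleAt(nowMs - this.lastNodMs);
  }

  // Linear ramp down over downMs, then back to neutral over returnMs.
  private angleAt(elapsedMs: number): number {
    if (elapsedMs < this.downMs) {
      return this.maxDegrees * (elapsedMs / this.downMs);
    }
    if (elapsedMs < this.downMs + this.returnMs) {
      const t = (elapsedMs - this.downMs) / this.returnMs;
      return this.maxDegrees * (1 - t);
    }
    return 0;
  }
}
```

The returned angle is meant to be added to the head bone's forward rotation each frame. Because the cooldown is measured from the moment the nod starts, a 500-millisecond cooldown leaves only a short gap after the 450-millisecond nod finishes.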
Head tilt tracks pitch variation. When the detected pitch rises compared to the baseline, apply a slight lateral tilt (2 to 4 degrees). When it falls, tilt slightly in the other direction. If the pitch stays on one side of the baseline for several phrases in a row, alternate the tilt direction on successive tilts so the head does not drift permanently to one side. Apply heavy smoothing (blend factor 0.05 to 0.1) so the head drifts slowly rather than snapping. The effect is subtle but adds significant life, giving the impression that the character is actively engaged in their own speech rather than reciting from a script.
During pauses in speech (RMS below 0.05 for more than 300 milliseconds), return the head to a neutral forward position using the same smoothed interpolation. The character should not hold a tilted or nodded position during silence.
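The tilt and the return to neutral during pauses could be sketched together as below. HeadTiltController is a hypothetical name. The mapping from pitch deviation to degrees (a 25% deviation reaching the full 3-degree tilt) is an assumption made for illustration, and this sketch follows the sign of the pitch deviation only, leaving out any direction alternation.

```typescript
// Returns the smoothed lateral head tilt in degrees for the current frame.
class HeadTiltController {
  private tilt = 0;
  private silentMs = 0;
  private readonly maxDegrees = 3; // within the 2 to 4 degree range
  private readonly blend = 0.08; // within the 0.05 to 0.1 range
  private readonly silenceRms = 0.05;
  private readonly silenceHoldMs = 300;
  private readonly fullTiltDeviation = 0.25; // assumed: 25% pitch change = full tilt

  update(pitchHz: number | null, baselineHz: number, rms: number, dtMs: number): number {
    // Track how long the signal has been below the silence threshold.
    this.silentMs = rms < this.silenceRms ? this.silentMs + dtMs : 0;

    let target = 0; // neutral during pauses or when no pitch is available
    const paused = this.silentMs > this.silenceHoldMs;
    if (!paused && pitchHz !== null && baselineHz > 0) {
      const deviation = (pitchHz - baselineHz) / baselineHz;
      const normalized = Math.max(-1, Math.min(1, deviation / this.fullTiltDeviation));
      target = normalized * this.maxDegrees;
    }

    // Heavy smoothing: move a small fraction of the way to the target each frame.
    this.tilt += (target - this.tilt) * this.blend;
    return this.tilt;
  }
}
```

The baseline can be a slow moving average of recent pitch estimates; when the pitch estimator is skipped on a frame, pass the cached value.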
Add Gesture Triggers on Emphasis Peaks
Hand gestures in conversation are not random. They cluster around emphasis points, moments where the speaker stresses a key word or begins a new thought. Detect these moments using a combination of amplitude peaks and spectral onsets. An emphasis point is a frame where the RMS exceeds 150% of the moving average AND the broadband spectral energy has increased by more than 40% from the previous analysis frame.
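The combined test is small enough to express as a single predicate. The function name and parameter names below are illustrative; the two ratios come directly from the thresholds above.

```typescript
// True when both the amplitude peak and the spectral onset conditions hold.
function isEmphasisPoint(
  rms: number,
  movingAverageRms: number,
  spectralTotal: number,
  previousSpectralTotal: number
): boolean {
  // RMS must exceed 150% of the moving average.
  const amplitudePeak = movingAverageRms > 0 && rms > movingAverageRms * 1.5;
  // Broadband energy must rise by more than 40% since the last analysis frame.
  const spectralOnset =
    previousSpectralTotal > 0 && spectralTotal > previousSpectralTotal * 1.4;
  return amplitudePeak && spectralOnset;
}
```

The guards against a zero average or a zero previous total keep the first frames after silence from firing a gesture on a division-free but meaningless comparison.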
When an emphasis point is detected, trigger a gesture from a pool of pre-authored gesture animations. A gesture pool might include: a small open-palm raise, a slight forward lean, a one-handed point, a both-hands-open spread, and a head-and-shoulders shrug. Store these as animation clips or as procedural bone rotation sequences. Select randomly from the pool with weighting to avoid repeating the same gesture consecutively.
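A weighted pick that avoids an immediate repeat might look like the sketch below. It takes the simplest reading of the rule and excludes the previous gesture outright, which is an assumption; you could instead reduce its weight. The gesture names and weights in the usage comment are placeholders.

```typescript
interface GestureOption {
  name: string;
  weight: number; // relative likelihood, must be positive
}

// Picks a gesture by weight, skipping the one that played last.
// `random` is injectable so the selection can be tested deterministically.
function pickGesture(
  pool: GestureOption[],
  lastName: string | null,
  random: () => number = Math.random
): GestureOption {
  if (pool.length === 0) throw new Error("gesture pool is empty");

  const candidates = pool.filter((g) => g.name !== lastName);
  const options = candidates.length > 0 ? candidates : pool;

  const totalWeight = options.reduce((sum, g) => sum + g.weight, 0);
  let roll = random() * totalWeight;
  for (const option of options) {
    roll -= option.weight;
    if (roll < 0) return option;
  }
  return options[options.length - 1]; // fallback for rounding at the upper edge
}

// Example pool (placeholder names and weights):
// const pool = [
//   { name: "palmRaise", weight: 3 },
//   { name: "forwardLean", weight: 2 },
//   { name: "shrug", weight: 1 },
// ];
// const next = pickGesture(pool, "palmRaise");
```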
Gestures must be layered on top of the base pose without disrupting it. Use additive animation blending: the gesture modifies bone transforms relative to the current pose rather than setting absolute positions. In both Babylon.js and Three.js, you can achieve this by storing the gesture as delta rotations applied on top of the current bone state. Apply the gesture with a quick ease-in (100 to 200ms) at the emphasis point, hold for 300 to 500ms, and ease-out over 400 to 600ms. The result is a brief, natural movement that coincides with what the character is saying.
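The ease-in, hold, and ease-out timing can be expressed as a weight between 0 and 1 that scales the gesture's delta rotation. The sketch below is engine-agnostic and uses plain Euler angles in degrees for readability; in Babylon.js or Three.js you would more likely scale a delta quaternion. All names and the default durations (taken from the ranges above) are illustrative.

```typescript
interface EulerDegrees {
  x: number;
  y: number;
  z: number;
}

// Smooth 0..1 easing curve.
function smoothstep(t: number): number {
  const c = Math.max(0, Math.min(1, t));
  return c * c * (3 - 2 * c);
}

// Gesture weight at `elapsedMs` since the emphasis point.
function gestureWeight(
  elapsedMs: number,
  easeInMs = 150,
  holdMs = 400,
  easeOutMs = 500
): number {
  if (elapsedMs < 0) return 0;
  if (elapsedMs < easeInMs) return smoothstep(elapsedMs / easeInMs);
  if (elapsedMs < easeInMs + holdMs) return 1;
  const t = (elapsedMs - easeInMs - holdMs) / easeOutMs;
  return 1 - smoothstep(t);
}

// Adds a weighted delta on top of the current pose without replacing it.
function applyAdditive(
  base: EulerDegrees,
  delta: EulerDegrees,
  weight: number
): EulerDegrees {
  return {
    x: base.x + delta.x * weight,
    y: base.y + delta.y * weight,
    z: base.z + delta.z * weight,
  };
}
```

Because the weight returns to exactly zero after the ease-out, the bone ends where the underlying pose puts it and no cleanup step is needed.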
Limit gesture frequency to one every 2 to 4 seconds. Constant gesturing looks manic. Sparse, well-timed gestures look thoughtful and intentional. If your dialogue tends to be fast-paced with many emphasis points, increase the cooldown period. If it is slow and deliberate, allow more frequent gestures.
Implement Idle Breathing and Body Sway
Breathing is a continuous motion that should run whether the character is speaking or silent. Implement it as a smooth periodic wave driving the chest bone's scale or position. A breathing cycle is approximately 4 seconds (15 breaths per minute), consisting of a 1.5-second inhale and a 2.5-second exhale. Because the two phases differ in length, a single sine wave does not fit; use a sine-eased curve for each phase instead. The inhale raises the chest slightly (scale 1.0 to 1.02 on the Y axis) and the exhale lowers it. The motion is tiny but perceptible, especially when the character is otherwise still.
During speech, breathing should modulate but not stop. Real speakers breathe between phrases, not during them. Detect phrase breaks as moments where RMS drops below a threshold for more than 200 milliseconds. During speech (RMS above threshold), reduce the breathing amplitude by half so it does not conflict with shoulder and chest movement from gestures. During phrase breaks, restore full breathing amplitude briefly. This creates a natural rhythm of speech, pause, breath, speech that makes the character feel physically grounded.
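A sketch of the breathing curve with the speech-dependent amplitude follows. It uses a cosine-eased segment for each phase so that the 1.5-second inhale and 2.5-second exhale join smoothly. The function names are illustrative, and speakingBlend is a hypothetical 0 to 1 value (1 while speaking, 0 during phrase breaks) that you would smooth over a few hundred milliseconds so the amplitude does not jump.

```typescript
// Breathing phase in 0..1: rises during the inhale, falls during the exhale.
function breathingPhase(timeSec: number, inhaleSec = 1.5, exhaleSec = 2.5): number {
  const cycle = inhaleSec + exhaleSec;
  const t = ((timeSec % cycle) + cycle) % cycle; // wrap into [0, cycle)
  if (t < inhaleSec) {
    return 0.5 - 0.5 * Math.cos((Math.PI * t) / inhaleSec); // 0 -> 1
  }
  return 0.5 + 0.5 * Math.cos((Math.PI * (t - inhaleSec)) / exhaleSec); // 1 -> 0
}

// Y-axis chest scale. Amplitude is halved when speakingBlend is 1.
function chestScaleY(timeSec: number, speakingBlend: number, maxExpansion = 0.02): number {
  const blend = Math.max(0, Math.min(1, speakingBlend));
  const amplitude = maxExpansion * (1 - 0.5 * blend);
  return 1 + amplitude * breathingPhase(timeSec);
}
```

With these defaults the chest scale moves between 1.0 and 1.02 in silence and between 1.0 and 1.01 during speech.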
Weight shifting adds another layer of standing-character life. Apply a very slow lateral sway to the hip bone using a sine wave with a period of 6 to 10 seconds and an amplitude of 1 to 2 degrees of rotation. This simulates the natural weight transfer that happens when a person stands and talks. The sway should be so slow and subtle that the player never consciously notices it, but its absence would register as stiffness.
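The sway itself is a single sine evaluation. In this sketch the function name is illustrative, and the defaults of 8 seconds and 1.5 degrees are simply midpoints of the ranges above.

```typescript
// Lateral hip rotation in degrees at a given time.
function hipSwayDegrees(timeSec: number, periodSec = 8, amplitudeDeg = 1.5): number {
  return amplitudeDeg * Math.sin((2 * Math.PI * timeSec) / periodSec);
}
```

Giving each character a slightly different period or a random time offset keeps several characters in the same scene from swaying in unison.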
Blend Audio-Driven Animation with Lip Sync
Audio-driven body animation and lip sync facial animation operate on different parts of the skeleton and mesh. Lip sync writes to face morph targets (viseme shapes). Body animation writes to bone transforms (head, spine, shoulders, arms). Because they operate on different data, they naturally layer without conflict.
The exception is the head bone, which both the gaze system (from facial animation) and the nod/tilt system (from audio-driven animation) want to control. Resolve this by separating concerns: the gaze system controls the eyes (eye bones and optionally eye morph targets), while the audio-driven system controls the head bone. If both need the head bone, apply gaze as the base rotation and audio-driven nod/tilt as additive deltas on top. This way, the character looks at the player (gaze) while also nodding on emphasis (audio-driven), and both behaviors combine correctly.
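Run the animation systems in a defined order each frame: breathing and sway first (lowest frequency, background), then head nod and tilt (medium frequency, speech-driven), then gestures (event-triggered), then gaze (continuous tracking), then lip sync (highest frequency). Each system reads the current bone or morph target state as its input and adds its contribution, so the final state that the engine renders is the sum of all layers. This order assumes every layer is additive. If the gaze system writes an absolute head rotation instead, evaluate it before the nod and tilt layer so that it acts as the base rotation and does not overwrite the audio-driven deltas.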
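The ordered, additive layering can be sketched with a small engine-agnostic pose structure. Everything here (PoseState, AnimationLayer, runLayers, the bone names) is hypothetical scaffolding; in a real project the pose would be your engine's bones and morph targets, and rotations would usually be quaternions rather than Euler degrees.

```typescript
interface EulerDegrees {
  x: number;
  y: number;
  z: number;
}

interface PoseState {
  bones: Map<string, EulerDegrees>; // accumulated rotation per bone
  morphs: Map<string, number>; // morph target weights (for lip sync)
}

interface AnimationLayer {
  name: string;
  apply(pose: PoseState, dtMs: number): void;
}

// Adds a rotation delta to whatever the earlier layers already wrote.
function addBoneRotation(pose: PoseState, bone: string, delta: EulerDegrees): void {
  const current = pose.bones.get(bone) ?? { x: 0, y: 0, z: 0 };
  pose.bones.set(bone, {
    x: current.x + delta.x,
    y: current.y + delta.y,
    z: current.z + delta.z,
  });
}

// Runs the layers in array order; each one reads and extends the same pose.
function runLayers(layers: AnimationLayer[], pose: PoseState, dtMs: number): void {
  for (const layer of layers) {
    layer.apply(pose, dtMs);
  }
}

// Example with constant placeholder contributions:
// const layers: AnimationLayer[] = [
//   { name: "nod", apply: (p) => addBoneRotation(p, "head", { x: 3, y: 0, z: 0 }) },
//   { name: "gaze", apply: (p) => addBoneRotation(p, "head", { x: 0, y: 10, z: 0 }) },
// ];
// runLayers(layers, { bones: new Map(), morphs: new Map() }, 16);
```

Resetting the pose to the rest state at the start of each frame, before running the layers, keeps the contributions from accumulating across frames.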
Tune Responsiveness and Prevent Over-Animation
The most common mistake in audio-driven animation is making it too responsive. When every amplitude spike triggers a head nod, every pitch change causes a tilt, and every emphasis point fires a gesture, the character looks like it is having a seizure. Natural human movement is characterized by selective response: most audio variations produce no visible body movement, and only the strongest peaks trigger noticeable gestures.
Set high thresholds for gesture triggers. The amplitude must exceed the moving average by 50% or more, not just 10%. Spectral onset energy must spike by 40% or more. These thresholds mean that most speech produces only subtle head movement and breathing, with distinct gestures occurring only a few times per long sentence. Test with actual dialogue and count the gestures. Two to three gestures per 10-second utterance is about right. More than five looks frantic.
Apply generous smoothing to all continuous animations (head nod, tilt, breathing, sway). Blend factors of 0.05 to 0.15 produce motion that flows rather than twitches. The animation should feel like it is responding to the general character of the speech, not tracking individual phonemes. If the head movement is visibly syncing to individual syllables, the smoothing is too low.
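A per-frame blend factor behaves differently at different frame rates, so one option is to convert it into a time-based factor. The sketch below assumes the blend factors quoted above were tuned at 60 frames per second, which is an assumption rather than something stated in this article; the function name is illustrative.

```typescript
// Moves `current` toward `target` using a blend factor defined per frame at
// `referenceFps`, adjusted for the actual frame duration `dtSec`.
function smoothToward(
  current: number,
  target: number,
  blendPerFrame: number,
  dtSec: number,
  referenceFps = 60 // assumed tuning frame rate
): number {
  const blend = 1 - Math.pow(1 - blendPerFrame, dtSec * referenceFps);
  return current + (target - current) * blend;
}
```

At exactly the reference frame rate this reduces to the usual current + (target - current) * blendPerFrame, and two half-length steps land on the same value as one full step.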
Compare the animated character to a video of a real person speaking. Human speakers are remarkably still during most of their speech, with movement concentrated at key moments. The goal is not to animate every moment but to animate the right moments. An almost-still character that nods at exactly the right time looks more alive than a constantly moving character that nods at random times.
Audio-driven body animation uses the same audio signal as lip sync but extracts broader features: amplitude peaks for nods and gestures, pitch contour for head tilt, and rhythm for breathing modulation. The key is restraint. Set high thresholds, apply generous smoothing, and let most speech moments pass without visible body reaction. The few well-timed movements that do trigger will look natural and intentional.