Handling VR Controllers and Hands

Updated June 2026

WebXR games need to handle three distinct input modes: tracked controllers with buttons and thumbsticks, bare hand tracking with 25 joints per hand, and gaze-based pointing on devices like the Apple Vision Pro. This guide covers each input type and shows how to build an adaptive system that works across all WebXR-capable devices.

Input handling is the area where WebXR games differ most from traditional web games, and it is also the area most likely to cause compatibility problems. Different headsets ship with different controllers, some devices support hand tracking while others do not, and the Apple Vision Pro uses a gaze-and-pinch model that has no analog in other VR platforms. Building robust input means designing for all of these modes from the start.

Step 1: Understand the WebXR Input Model

WebXR reports input devices through the XRSession's inputSources property. Each input source is an XRInputSource object with several key properties. The handedness property tells you whether this is a left-hand device ("left"), a right-hand device ("right"), or a device not associated with either hand ("none"). The targetRayMode describes how the device points at things: "tracked-pointer" for controllers that project a ray, "gaze" for headset-based pointing, "screen" for touch input on the device screen, or "transient-pointer" for momentary pointing sources like hand pinch rays.
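As a rough illustration of how these properties can drive a decision, the sketch below sorts an input source into one of a few categories. InputSourceLike is a structural stand-in for the XRInputSource fields used here, and DetectedMode and classifyInputSource are hypothetical names invented for this example, not part of WebXR.

```typescript
// Structural stand-in for the XRInputSource fields this sketch reads.
interface InputSourceLike {
  handedness: "none" | "left" | "right";
  targetRayMode: "gaze" | "tracked-pointer" | "screen" | "transient-pointer";
  hand?: object | null; // an XRHand when hand tracking is active
}

// Hypothetical labels for this guide's examples, not a WebXR type.
type DetectedMode = "hand" | "controller" | "gaze" | "transient" | "screen";

function classifyInputSource(source: InputSourceLike): DetectedMode {
  // A hand input source still reports a targetRayMode, so check hand first.
  if (source.hand) return "hand";
  const rayMode = source.targetRayMode;
  if (rayMode === "tracked-pointer") return "controller";
  if (rayMode === "transient-pointer") return "transient";
  return rayMode; // "gaze" or "screen"
}
```

In a real session you would call this for each entry of session.inputSources and let the result decide which interaction code runs.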

Each input source has two coordinate spaces. The targetRaySpace represents the pointing origin and direction, which is what you use for ray-based interaction. For controllers, this ray starts at the front of the controller and extends forward. For gaze input, it starts at the eyes and extends in the look direction. The gripSpace represents the physical location of the device in the player's hand. For controllers, this is roughly where the player grips the handle. For hands, this is the wrist joint.

Button and axis data comes through the standard Gamepad API interface attached to each input source's gamepad property. The gamepad has an array of buttons (each with pressed, touched, and value properties) and an array of axes (thumbstick X/Y values ranging from -1 to 1). The exact number and mapping of buttons varies by controller hardware, which is why the input profile registry exists.
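Because the Gamepad API is polled rather than event-driven, games usually need to detect the frame on which a button goes down. The following is a minimal sketch of that edge detection; ButtonLike mirrors the pressed, touched, and value fields described above, and ButtonEdgeTracker is a hypothetical helper name.

```typescript
// Mirrors the fields of a GamepadButton.
interface ButtonLike {
  pressed: boolean;
  touched: boolean;
  value: number;
}

class ButtonEdgeTracker {
  private previous: boolean[] = [];

  // Call once per frame with the current gamepad.buttons array.
  // Returns the indices of buttons that went from released to pressed.
  update(buttons: ReadonlyArray<ButtonLike>): number[] {
    const justPressed: number[] = [];
    buttons.forEach((button, index) => {
      if (button.pressed && !this.previous[index]) justPressed.push(index);
    });
    this.previous = buttons.map((button) => button.pressed);
    return justPressed;
  }
}
```

Keep one tracker per input source so that the left and right controllers do not overwrite each other's previous state.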

The XRSession fires three key input events: "inputsourceschange" when devices connect or disconnect, "select" (and selectstart/selectend) for the primary action (trigger pull on controllers, pinch on hands, tap on screen), and "squeeze" (and squeezestart/squeezeend) for the secondary action (grip button on controllers). These events are the most portable way to handle input since they work across all device types.
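One way to consume these events is to funnel them into a single callback. The sketch below assumes only that the session exposes addEventListener and that each event carries an inputSource, which is true of XRInputSourceEvent; SessionLike, ActionHandler, and bindPortableActions are hypothetical names for this example.

```typescript
// Minimal stand-ins for XRInputSourceEvent and XRSession.
interface InputEventLike {
  inputSource: { handedness: string };
}
interface SessionLike {
  addEventListener(type: string, listener: (event: InputEventLike) => void): void;
}

type ActionHandler = (
  action: "primary" | "secondary",
  phase: "start" | "end",
  handedness: string
) => void;

// Route the portable select/squeeze events into one game-level callback.
function bindPortableActions(session: SessionLike, onAction: ActionHandler): void {
  session.addEventListener("selectstart", (e) => onAction("primary", "start", e.inputSource.handedness));
  session.addEventListener("selectend", (e) => onAction("primary", "end", e.inputSource.handedness));
  session.addEventListener("squeezestart", (e) => onAction("secondary", "start", e.inputSource.handedness));
  session.addEventListener("squeezeend", (e) => onAction("secondary", "end", e.inputSource.handedness));
}
```

The callback never learns whether the action came from a trigger, a pinch, or a screen tap, which is the property that makes these events portable.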

Step 2: Map Controller Buttons Across Devices

The challenge with controller input is that different hardware has different button layouts. A Meta Quest Touch controller has a trigger, grip, thumbstick, and two face buttons (A/B on right, X/Y on left). A Valve Index controller has a trigger, grip (with force sensing), thumbstick, trackpad, and A/B buttons. Windows Mixed Reality controllers have a trigger, grip, thumbstick, and touchpad.

The WebXR Input Profiles registry addresses this. Each input source includes a profiles array listing compatible profile names in order of specificity. The first profile is the most specific match (e.g., "meta-quest-touch-plus"), and later entries are more generic fallbacks (e.g., "generic-trigger-squeeze-thumbstick"). You can use these profiles to decide which buttons to display in tutorials and how to label controls.
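Since the profiles array is ordered from most to least specific, choosing prompt assets can be as simple as taking the first entry you have art for. This is a sketch under that assumption; pickProfile and the availableAssets set are hypothetical and stand for whatever asset lookup your game uses.

```typescript
// Return the most specific profile the game has assets for, or null.
function pickProfile(
  profiles: ReadonlyArray<string>,
  availableAssets: ReadonlySet<string>
): string | null {
  for (const profileId of profiles) {
    if (availableAssets.has(profileId)) return profileId;
  }
  return null;
}
```

For example, a game that ships only generic prompt art would resolve ["meta-quest-touch-plus", "generic-trigger-squeeze-thumbstick"] to the generic entry, and would fall back to text-only prompts when the function returns null.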

For practical game development, focus on the common denominator. Nearly every VR controller has a trigger (gamepad button 0, also surfaced as the select event), a grip (gamepad button 1, also surfaced as the squeeze event), and a thumbstick. In the "xr-standard" gamepad mapping the thumbstick reports on axes 2 and 3 for X/Y, because axes 0 and 1 are reserved for a touchpad. Design your core interactions around these three inputs and you will cover every major controller on the market.
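A hedged sketch of reading that common denominator follows. It assumes the "xr-standard" layout, where button 0 is the trigger, button 1 is the grip, and the thumbstick sits on axes 2 and 3 after the two touchpad axes; when a gamepad exposes fewer than four axes it falls back to axes 0 and 1. PadLike, CoreInput, readCoreInput, and the 0.15 deadzone are assumptions made for this example.

```typescript
// Structural stand-in for the Gamepad fields used here.
interface PadLike {
  buttons: ReadonlyArray<{ pressed: boolean; value: number }>;
  axes: ReadonlyArray<number>;
}

interface CoreInput {
  trigger: number; // 0..1
  grip: number;    // 0..1
  stickX: number;  // -1..1
  stickY: number;  // -1..1
}

const STICK_DEADZONE = 0.15; // assumed value; tune per game

function applyDeadzone(value: number, deadzone: number = STICK_DEADZONE): number {
  return Math.abs(value) < deadzone ? 0 : value;
}

function readCoreInput(pad: PadLike): CoreInput {
  // xr-standard puts the thumbstick on axes 2/3; otherwise assume axes 0/1.
  const stickStart = pad.axes.length >= 4 ? 2 : 0;
  return {
    trigger: pad.buttons[0]?.value ?? 0,
    grip: pad.buttons[1]?.value ?? 0,
    stickX: applyDeadzone(pad.axes[stickStart] ?? 0),
    stickY: applyDeadzone(pad.axes[stickStart + 1] ?? 0),
  };
}
```

Missing buttons or axes read as zero, so the same function can be called on any controller without bounds checks at the call site.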

If your game needs more buttons, check the gamepad.buttons array length and the profiles list to determine what is available. Provide fallback interactions for controllers that lack specific buttons. For example, if your game uses the A button for jumping, provide an alternative like a thumbstick-click jump for controllers without dedicated face buttons. Display the correct button labels by matching the input profile to your button prompt assets.

Both Babylon.js and Three.js provide controller model factories that load accurate 3D models matching the player's hardware. Use these to show controller hints in tutorials and pause menus, highlighting the relevant button with a glow or color change.

Step 3: Implement Hand Tracking

Hand tracking is enabled through the WebXR Hand Input module. Request "hand-tracking" as an optional feature when creating your session. When the player puts down their controllers (or uses a device that only supports hands), the controller input sources are replaced by hand input sources, and each XRInputSource's hand property provides an XRHand object containing 25 joints.

The 25 joints per hand follow a consistent naming convention. The wrist joint is the base. From there, five finger chains extend: thumb (4 joints: metacarpal, proximal, distal, tip), index finger (5 joints: metacarpal, proximal, intermediate, distal, tip), middle finger (same as index), ring finger (same as index), and pinky (same as index). That gives 1 + 4 + 4 × 5 = 25 joints. Each joint has a position, orientation, and a radius that represents the approximate size of the finger at that point.
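The naming convention is regular enough to generate. The sketch below builds the joint names used by the WebXR Hand Input module (for example "thumb-tip" and "index-finger-phalanx-intermediate") from one wrist joint, four thumb joints, and five joints for each remaining finger; handJointNames is a hypothetical helper.

```typescript
const FINGER_SEGMENTS = [
  "metacarpal",
  "phalanx-proximal",
  "phalanx-intermediate",
  "phalanx-distal",
  "tip",
];

function handJointNames(): string[] {
  const names: string[] = ["wrist"];
  // The thumb has no intermediate phalanx, so it contributes four joints.
  for (const segment of FINGER_SEGMENTS) {
    if (segment !== "phalanx-intermediate") names.push(`thumb-${segment}`);
  }
  for (const finger of ["index-finger", "middle-finger", "ring-finger", "pinky-finger"]) {
    for (const segment of FINGER_SEGMENTS) names.push(`${finger}-${segment}`);
  }
  return names;
}
```

Each name can be passed to the XRHand object's get method to obtain the joint space, whose pose is then read per frame.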

The most common hand gesture for interaction is the pinch. Detect it by measuring the distance between the thumb tip joint and the index finger tip joint. When this distance drops below a threshold (typically 2 to 3 centimeters), the player is pinching. This maps naturally to the select action: pinch start is selectstart, pinch end is selectend. Most WebXR browsers fire the select event automatically during pinch detection, but you may want to implement your own threshold for more control.
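A custom pinch detector along these lines might look like the sketch below. It uses two thresholds rather than one so the pinch does not flicker when the fingers hover near the boundary: the pinch starts under 2 cm and ends above 3 cm. Positions are in metres, as in WebXR poses; PinchDetector and both threshold defaults are assumptions for illustration.

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

function distanceBetween(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

class PinchDetector {
  private pinching = false;
  private readonly startBelow: number;
  private readonly endAbove: number;

  constructor(startBelow: number = 0.02, endAbove: number = 0.03) {
    this.startBelow = startBelow; // metres
    this.endAbove = endAbove;     // metres
  }

  // Call once per frame with the thumb-tip and index-finger-tip positions.
  update(thumbTip: Vec3, indexTip: Vec3): "start" | "end" | null {
    const gap = distanceBetween(thumbTip, indexTip);
    if (!this.pinching && gap < this.startBelow) {
      this.pinching = true;
      return "start";
    }
    if (this.pinching && gap > this.endAbove) {
      this.pinching = false;
      return "end";
    }
    return null;
  }
}
```

Treat "start" and "end" like selectstart and selectend, and keep one detector per hand.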

Grab gestures can be detected by measuring the curl of all four fingers. When the index, middle, ring, and pinky fingers have their tips close to the palm (measured by comparing tip joint positions to the wrist/metacarpal positions), the hand is in a fist or grab pose. Combine this with proximity to a grabbable object for a natural pickup mechanic.
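One possible curl test, sketched below, compares each fingertip's distance from the wrist with the distance from the wrist to that finger's knuckle (the proximal phalanx joint). Using a ratio rather than a fixed length keeps the test independent of hand size. The heuristic, the curlRatio default of 1.0, and the names FingerJoints, isFingerCurled, and isGrabPose are assumptions for this example and would need tuning on real devices.

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

// Positions of one finger's knuckle (phalanx-proximal) and tip joints.
interface FingerJoints {
  proximal: Vec3;
  tip: Vec3;
}

function distanceBetween(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Assumed heuristic: a finger is curled when its tip sits closer to the
// wrist than curlRatio times the wrist-to-knuckle distance.
function isFingerCurled(wrist: Vec3, finger: FingerJoints, curlRatio: number = 1.0): boolean {
  return distanceBetween(finger.tip, wrist) < curlRatio * distanceBetween(finger.proximal, wrist);
}

// Expects the index, middle, ring, and pinky fingers, in any order.
function isGrabPose(wrist: Vec3, fingers: ReadonlyArray<FingerJoints>): boolean {
  return fingers.length === 4 && fingers.every((finger) => isFingerCurled(wrist, finger));
}
```

Combine a true result with a proximity check against the nearest grabbable object before attaching it to the hand.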

Pointing detection checks whether the index finger is extended (tip far from palm) while other fingers are curled. This lets you use the index finger as a natural pointer for menu interaction, replacing the controller ray with a finger ray. Project a ray from the index metacarpal through the index tip for the pointing direction.
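The ray construction described above reduces to normalizing the vector between two joints. This sketch returns null when the joints coincide, which can happen briefly when tracking is lost; indexPointerRay and PointerRay are hypothetical names.

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

interface PointerRay {
  origin: Vec3;
  direction: Vec3; // unit length
}

// Build a ray that starts at the index fingertip and continues along the
// line from the index metacarpal through the tip.
function indexPointerRay(indexMetacarpal: Vec3, indexTip: Vec3): PointerRay | null {
  const dx = indexTip.x - indexMetacarpal.x;
  const dy = indexTip.y - indexMetacarpal.y;
  const dz = indexTip.z - indexMetacarpal.z;
  const length = Math.hypot(dx, dy, dz);
  if (length < 1e-6) return null; // degenerate pose, no usable direction
  return {
    origin: { x: indexTip.x, y: indexTip.y, z: indexTip.z },
    direction: { x: dx / length, y: dy / length, z: dz / length },
  };
}
```

Starting the ray at the fingertip rather than the metacarpal keeps the visible beam from passing through the player's own hand model.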

Hand tracking introduces unique design challenges. There are no buttons to press, so all interactions must be spatial. Provide visual feedback for gesture recognition: show a highlight when the system detects the player is about to pinch, change the color of a grabbed object while held, and animate a release effect when the hand opens. Without this feedback, players cannot tell if their gestures are being recognized.

Step 4: Support Gaze-Based Input

The Apple Vision Pro introduced gaze-and-pinch as a primary input paradigm for WebXR. The player looks at an interactive element (gaze targeting) and pinches their fingers to act on it. For privacy, the browser does not reveal where the player is looking until that pinch happens. At that moment a short-lived input source with targetRayMode set to "transient-pointer" appears, its target ray points toward the gazed-upon location, and the browser fires the select events. The input source is removed again when the pinch ends.

Gaze input changes how you design interactive elements. Objects must be large enough to target comfortably with eye gaze, which typically means minimum 3 to 5 centimeters in apparent size at the expected viewing distance. Small buttons, thin sliders, and densely packed UI elements that work fine with controller ray pointing become frustrating with gaze targeting because the eye cannot focus as precisely as a laser pointer.

For devices without hand tracking or controllers (some cardboard-style viewers, early AR glasses), implement gaze-and-dwell as a fallback. The player looks at an interactive element, a radial timer fills around a reticle at the gaze point, and when the timer completes, the action triggers. Dwell times between 1.5 and 2.5 seconds balance speed against accidental activation. This is the least preferred input method but ensures your game works even on minimal hardware.
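The dwell logic can be kept separate from rendering, as in the sketch below. It tracks which element is under the gaze, reports a 0 to 1 progress value for the radial timer, and fires once when the dwell time elapses. DwellTimer, the string element ids, and the 2 second default (inside the 1.5 to 2.5 second range above) are assumptions for illustration.

```typescript
interface DwellResult {
  progress: number;         // 0..1, drives the radial fill around the reticle
  activated: string | null; // id of the element triggered on this frame
}

class DwellTimer {
  private targetId: string | null = null;
  private elapsed = 0;
  private fired = false;
  private readonly dwellSeconds: number;

  constructor(dwellSeconds: number = 2.0) {
    this.dwellSeconds = dwellSeconds;
  }

  // gazedId: id of the element under the gaze ray, or null when none is hit.
  // deltaSeconds: time since the previous frame.
  update(gazedId: string | null, deltaSeconds: number): DwellResult {
    if (gazedId !== this.targetId) {
      // Looking at something new (or at nothing) restarts the timer.
      this.targetId = gazedId;
      this.elapsed = 0;
      this.fired = false;
    }
    if (gazedId === null) return { progress: 0, activated: null };
    if (this.fired) return { progress: 1, activated: null };
    this.elapsed += deltaSeconds;
    if (this.elapsed >= this.dwellSeconds) {
      this.fired = true;
      return { progress: 1, activated: gazedId };
    }
    return { progress: this.elapsed / this.dwellSeconds, activated: null };
  }
}
```

Because the timer fires only once per continuous look, the player has to glance away and back to trigger the same element again, which helps limit accidental repeats.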

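Show a visible reticle at the gaze point so players know what they are targeting. On devices that report a continuous gaze ray (targetRayMode "gaze"), position a small dot or ring mesh at the intersection of the gaze ray with the nearest interactive surface. The reticle should snap to interactive elements (growing slightly when hovering over a button) to provide clear feedback about what will be activated. Devices that use "transient-pointer" input, such as the Apple Vision Pro, expose the ray only while a pinch is in progress, so a pre-pinch reticle is not possible there; confirm the selection visually once the select event arrives instead.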

Step 5: Build an Adaptive Input System

A well-built WebXR game detects available input at runtime and switches seamlessly between modes. The XRSession fires an "inputsourceschange" event whenever controllers connect, disconnect, or switch to hands. Your input manager should listen for this event and reconfigure the interaction system accordingly.
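The inputsourceschange event carries added and removed arrays, so an input manager can keep its own list current and derive the active mode from it. The sketch below shows one way to do that; SourceRegistry, ActiveMode, and the priority order (hands, then controllers, then gaze) are assumptions for this example.

```typescript
// Structural stand-ins for XRInputSource and XRInputSourcesChangeEvent.
interface TrackedSource {
  targetRayMode: string;
  hand?: object | null;
}
interface SourcesChange {
  added: ReadonlyArray<TrackedSource>;
  removed: ReadonlyArray<TrackedSource>;
}

type ActiveMode = "hands" | "controllers" | "gaze" | "none";

class SourceRegistry {
  private sources: TrackedSource[] = [];

  // Apply one inputsourceschange event and report the resulting mode.
  apply(change: SourcesChange): ActiveMode {
    this.sources = this.sources
      .filter((source) => !change.removed.includes(source))
      .concat(change.added);
    return this.mode();
  }

  mode(): ActiveMode {
    if (this.sources.some((s) => Boolean(s.hand))) return "hands";
    if (this.sources.some((s) => s.targetRayMode === "tracked-pointer")) return "controllers";
    if (this.sources.some((s) => s.targetRayMode === "gaze" || s.targetRayMode === "transient-pointer")) return "gaze";
    return "none";
  }
}
```

In a session, the listener for "inputsourceschange" would pass the event straight to apply and reconfigure feedback whenever the returned mode differs from the previous one.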

Design your input system with three layers. The hardware layer reads raw data from controllers, hands, or gaze, normalizing it into a common format. The action layer maps raw input to game actions (select, grab, move, turn, menu). The feedback layer shows visual cues appropriate to the current input mode (controller ray, hand highlight, gaze reticle). When the input source changes, only the hardware and feedback layers need to reconfigure. The action layer stays the same.
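The three layers can be sketched with a handful of types. Everything named here (DeviceMode, NormalizedInput, HardwareReader, toActions, feedbackCue, InputManager) is hypothetical and reduced to the minimum needed to show the separation; a real game would carry poses and more actions.

```typescript
type DeviceMode = "controller" | "hand" | "gaze";

// Common format produced by the hardware layer.
interface NormalizedInput {
  primaryDown: boolean;
  secondaryDown: boolean;
  moveX: number;
  moveY: number;
}

// Hardware layer: one implementation per input mode.
interface HardwareReader {
  readonly mode: DeviceMode;
  read(): NormalizedInput;
}

type GameAction = "select" | "grab" | "move";

// Action layer: identical for every input mode.
function toActions(input: NormalizedInput): GameAction[] {
  const actions: GameAction[] = [];
  if (input.primaryDown) actions.push("select");
  if (input.secondaryDown) actions.push("grab");
  if (input.moveX !== 0 || input.moveY !== 0) actions.push("move");
  return actions;
}

// Feedback layer: picks the visual cue that matches the active mode.
function feedbackCue(mode: DeviceMode): string {
  const cues: Record<DeviceMode, string> = {
    controller: "controller-ray",
    hand: "hand-highlight",
    gaze: "gaze-reticle",
  };
  return cues[mode];
}

class InputManager {
  private reader: HardwareReader;

  constructor(reader: HardwareReader) {
    this.reader = reader;
  }

  // Swap the hardware layer when the input sources change.
  setReader(reader: HardwareReader): void {
    this.reader = reader;
  }

  frame(): { actions: GameAction[]; cue: string } {
    return { actions: toActions(this.reader.read()), cue: feedbackCue(this.reader.mode) };
  }
}
```

Swapping the reader changes the cue and the raw data source, while toActions is untouched, which mirrors the claim that the action layer stays the same across modes.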

Test with multiple input types during development. Start a session with controllers, then put them down to switch to hand tracking. Try your game with the WebXR Emulator's gaze mode. Check that all core interactions work in each mode and that the visual feedback adapts correctly. The most common bug in adaptive input systems is stale visual state, where a controller ray remains visible after switching to hands or a gaze reticle persists after picking up controllers.

For multiplayer games, consider that different players in the same session might use different input types. One player on a Quest 3 might use controllers while another on a Vision Pro uses gaze-and-pinch. Your game's avatar system needs to represent each input type appropriately, showing controller models for one player and hand models for another.

Key Takeaway

Robust WebXR input handling requires supporting three modes: tracked controllers with button and thumbstick input, hand tracking with joint-based gesture detection, and gaze-based pointing for devices like the Apple Vision Pro. Design around the common denominator (trigger, grip, thumbstick), detect available input at runtime through the inputsourceschange event, and provide clear visual feedback for each input mode so players always know what the system recognizes.