Shader Performance on the Web and Mobile
Web games face a unique performance challenge: they must run in a browser sandbox on hardware ranging from powerful desktop GPUs to low-end mobile phone chipsets. A shader that runs smoothly on a discrete NVIDIA or AMD GPU may consume an entire frame budget on a mobile Adreno, Mali, or Apple GPU. Understanding where shader time goes and how to reduce it is essential for delivering a consistent experience across the devices your players actually use.
Profile GPU Performance with Browser Tools
Before optimizing, you need to measure. Guessing at performance bottlenecks is unreliable because GPU behavior is often counterintuitive. A shader that looks complex might run fast due to hardware-level optimizations, while a simple-looking shader with a few texture reads might be the actual bottleneck.
The Chrome DevTools Performance panel shows frame timing, including a track for GPU activity. Consistently long GPU tasks suggest that rendering (including shaders) is the bottleneck. The flame chart breaks down JavaScript execution on the main thread, which helps distinguish between CPU-side overhead (uniform updates, buffer uploads, draw call submission) and GPU-side work (actual shader execution).
Spector.js captures an entire frame's WebGL call sequence and lets you inspect each draw call individually. You can see which shader program ran, what textures and uniforms were bound, how many triangles were drawn, and the state of the depth and blend settings. This level of detail reveals redundant state changes, unexpected shader switches, and draw calls that should not be happening. Spector.js is the closest thing to a full GPU profiler available in the browser environment.
Engine-specific profilers provide higher-level information. Three.js's renderer.info reports draw calls, triangles, textures, and shader programs per frame. Babylon.js's built-in Inspector shows render time breakdown, active material count, and per-mesh rendering cost. Monitoring these values over time (especially when adding or modifying shader effects) reveals whether a change helped or hurt performance.
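One way to monitor these counters is to snapshot them before and after a change and compare the two. The sketch below assumes an object shaped like the render and memory fields of Three.js's renderer.info; the RendererInfoLike and RenderStats types and both helper functions are illustrative names, not part of any engine API, and the numbers are made up.

```typescript
// Illustrative shape, modelled on the `render` and `memory` fields of
// Three.js's renderer.info. Only the fields used here are declared.
interface RendererInfoLike {
  render: { calls: number; triangles: number };
  memory: { textures: number; geometries: number };
}

interface RenderStats {
  drawCalls: number;
  triangles: number;
  textures: number;
}

// Copy the counters into a plain object so later frames cannot change them.
function snapshotStats(info: RendererInfoLike): RenderStats {
  return {
    drawCalls: info.render.calls,
    triangles: info.render.triangles,
    textures: info.memory.textures,
  };
}

// Positive numbers mean the change made the frame more expensive.
function diffStats(before: RenderStats, after: RenderStats): RenderStats {
  return {
    drawCalls: after.drawCalls - before.drawCalls,
    triangles: after.triangles - before.triangles,
    textures: after.textures - before.textures,
  };
}

// Hand-written numbers standing in for two measured frames.
const baselineStats = snapshotStats({
  render: { calls: 120, triangles: 250000 },
  memory: { textures: 18, geometries: 40 },
});
const withEffectStats = snapshotStats({
  render: { calls: 124, triangles: 250008 },
  memory: { textures: 22, geometries: 41 },
});
const statsDelta = diffStats(baselineStats, withEffectStats);
console.log(statsDelta);
```

In a real project the argument to snapshotStats would be the live renderer.info object, read at the same point in each frame.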
A practical profiling workflow starts by establishing a baseline frame time, then making one change at a time and measuring the impact. If the frame time drops, the optimization worked. If it stays the same, the bottleneck is elsewhere. This methodical approach prevents wasted effort on optimizations that do not address the actual problem.
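The baseline-then-compare workflow can be sketched as a small helper that summarizes recorded frame times. This is a minimal illustration: the function names, the choice of mean and 95th percentile, and the sample numbers are all assumptions, and frame intervals measured from JavaScript reflect the whole frame rather than GPU time alone.

```typescript
interface FrameStats {
  meanMs: number;
  p95Ms: number;
}

// Summarize a list of frame durations given in milliseconds.
function summarizeFrames(frameTimesMs: number[]): FrameStats {
  if (frameTimesMs.length === 0) {
    throw new Error("need at least one frame sample");
  }
  const sorted = [...frameTimesMs].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, t) => acc + t, 0);
  // Nearest-rank 95th percentile.
  const rank = Math.ceil(0.95 * sorted.length);
  return { meanMs: sum / sorted.length, p95Ms: sorted[rank - 1] };
}

// Relative change in mean frame time; a negative value means the change helped.
function relativeChange(baseline: FrameStats, candidate: FrameStats): number {
  return (candidate.meanMs - baseline.meanMs) / baseline.meanMs;
}

// Made-up samples: a baseline run and a run after one optimization.
const baselineRun = summarizeFrames([16, 17, 16, 18, 33, 16, 17, 16, 16, 15]);
const optimizedRun = summarizeFrames([14, 15, 14, 15, 20, 14, 15, 14, 14, 15]);
console.log(baselineRun.meanMs, optimizedRun.meanMs); // 18 15
console.log(relativeChange(baselineRun, optimizedRun));
```

In the browser, the samples could be collected as the differences between successive requestAnimationFrame timestamps. Tracking a high percentile alongside the mean helps catch occasional long frames that an average hides.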
Optimize Precision and Data Types
GLSL ES fragment shaders have no default float precision, so you must declare one using precision mediump float; or precision highp float;. On mobile GPUs, this choice has measurable performance implications. Desktop GPUs typically process all precision levels at the same speed, but many mobile GPUs (especially older Qualcomm Adreno and ARM Mali chips) can execute mediump operations significantly faster than highp because 16-bit values use narrower data paths and take up less register space.
The practical strategy is to use mediump as the default precision in fragment shaders and selectively promote variables to highp only where precision artifacts would be visible. Calculations that need highp include world-space position reconstruction (large coordinate ranges produce jitter at lower precision), depth buffer comparisons (small depth differences require fine resolution), UV calculations on very large tiling textures (accumulated error causes visible texture swimming), and trigonometric functions on large input values.
Calculations that work well at mediump include color operations (the 0 to 1 range has plenty of mediump resolution), normal vector math (direction vectors stay small), lighting dot products and intensity calculations, and UV coordinates for standard non-tiled sampling of small to medium textures. Some calculations even work at lowp: boolean-like values (0.0 or 1.0 flags), small integer emulation, and color channel masks.
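The difference between these two groups of calculations can be illustrated numerically. The sketch below assumes mediump is implemented as a 16-bit float with a 10-bit mantissa, which is common on mobile hardware but not guaranteed by the specification; the function name is hypothetical, and range limits, denormals, and overflow are ignored.

```typescript
// Round x to the nearest value representable with a 10-bit mantissa
// (the assumed mediump format). Range limits are deliberately ignored.
function quantizeMantissa10(x: number): number {
  if (x === 0 || !Number.isFinite(x)) return x;
  const magnitude = Math.abs(x);
  // Find the exponent e such that 2^e <= magnitude < 2^(e+1).
  let exponent = 0;
  while (Math.pow(2, exponent + 1) <= magnitude) exponent++;
  while (Math.pow(2, exponent) > magnitude) exponent--;
  // Neighbouring representable values in this range are 2^(e-10) apart.
  const spacing = Math.pow(2, exponent - 10);
  return Math.sign(x) * Math.round(magnitude / spacing) * spacing;
}

// A color value in the 0 to 1 range loses far less than one 8-bit step (1/255).
const colorError = Math.abs(quantizeMantissa10(0.7) - 0.7);

// A world-space coordinate near 1000 snaps to the nearest 0.5 units.
const worldValue = quantizeMantissa10(1000.3); // 1000.5
const worldError = Math.abs(worldValue - 1000.3);

console.log(colorError, worldValue, worldError);
```

Under this assumption the color error is about 0.0002 while the position error is about 0.2 units, which is why large world-space values are the usual candidates for highp.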
Data packing reduces register pressure and memory bandwidth. Instead of passing four separate float varyings from the vertex shader to the fragment shader, pack them into a single vec4. Instead of sampling separate textures for ambient occlusion, roughness, and metallic, pack all three into one combined ORM (occlusion, roughness, metallic) texture; a lone extra value such as roughness can also ride in an otherwise unused alpha channel of the diffuse texture. Every varying and every texture sample that you eliminate reduces the data the GPU must interpolate and fetch.
Reduce Texture Sampling Cost
Texture sampling is one of the most expensive operations in a fragment shader because it involves memory lookups that can stall the GPU pipeline if the requested data is not in cache. Each sample reads from GPU texture memory, and the latency depends on whether the data is in the texture cache (fast) or must be fetched from VRAM (slow). Minimizing the number of texture samples per pixel and improving cache utilization are among the most effective shader optimizations.
Mipmaps are pre-computed downscaled versions of a texture that the GPU selects based on the screen-space size of the textured surface. For distant or angled surfaces, the GPU reads from a smaller mipmap level, which reduces the area of texture memory accessed and improves cache hit rates. Generating mipmaps costs some extra memory (about 33% more) but dramatically improves performance for any texture that appears at varying distances. Always enable mipmaps for game textures unless they are used at a fixed screen size (like UI elements).
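The memory figure above follows from the geometry of the mip chain: each level has a quarter of the texels of the one before it. A small sketch, with hypothetical function names, that lists the levels and computes the overhead:

```typescript
// List the dimensions of every mip level, halving (rounding down) until 1x1.
// Assumes width and height are positive integers.
function mipChain(width: number, height: number): Array<[number, number]> {
  const levels: Array<[number, number]> = [[width, height]];
  let w = width;
  let h = height;
  while (w > 1 || h > 1) {
    w = Math.max(1, Math.floor(w / 2));
    h = Math.max(1, Math.floor(h / 2));
    levels.push([w, h]);
  }
  return levels;
}

// Extra texels used by all levels below the base, as a fraction of the base.
function mipOverhead(width: number, height: number): number {
  const [base, ...rest] = mipChain(width, height).map(([w, h]) => w * h);
  const extra = rest.reduce((acc, texels) => acc + texels, 0);
  return extra / base;
}

console.log(mipChain(1024, 1024).length); // 11 levels, from 1024 down to 1
console.log(mipOverhead(1024, 1024)); // just under 1/3
```

The overhead approaches one third for square power-of-two textures because the level sizes form the geometric series 1/4 + 1/16 + 1/64 and so on.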
Texture atlas packing combines multiple small textures into one large texture. This reduces the number of texture binds per frame (each bind is a state change that the GPU must process) and improves cache locality when multiple elements share the same atlas. Particle sprites, UI icons, and tilesets are ideal candidates for atlasing.
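For a simple grid atlas, the UV rectangle of each sprite can be computed from its index. This is a minimal sketch with hypothetical names; it assumes equally sized cells with no padding, whereas production atlases usually add padding between cells to avoid neighbouring sprites bleeding in at lower mip levels.

```typescript
interface UvRect {
  u0: number;
  v0: number;
  u1: number;
  v1: number;
}

// UV rectangle of cell `index` in a grid atlas, counting left to right and
// then row by row. Rows are counted from v = 0; whether that is the top or
// the bottom of the image depends on how the texture was uploaded.
function atlasCellUv(index: number, columns: number, rows: number): UvRect {
  if (index < 0 || index >= columns * rows) {
    throw new RangeError("cell index outside the atlas");
  }
  const col = index % columns;
  const row = Math.floor(index / columns);
  return {
    u0: col / columns,
    v0: row / rows,
    u1: (col + 1) / columns,
    v1: (row + 1) / rows,
  };
}

// Cell 5 of a 4x4 atlas sits in column 1, row 1.
const cellUv = atlasCellUv(5, 4, 4);
console.log(cellUv); // { u0: 0.25, v0: 0.25, u1: 0.5, v1: 0.5 }
```

All sprites that share the atlas can then be drawn with one texture bound, varying only the UV rectangle per sprite.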
Channel packing stores multiple data values in a single RGBA texture. An "ORM" texture stores ambient occlusion in the red channel, roughness in green, and metallic in blue. Sampling this once gives you three material properties for the cost of one texture read. The alpha channel can store yet another value (height, emission mask, or a custom parameter). This technique can cut the number of texture samples needed for complex materials to a half or a third.
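Packing can be done offline in an asset pipeline or at load time. The sketch below interleaves three single-channel 8-bit maps into one RGBA buffer using the ORM layout described above; the function name is hypothetical and the alpha channel is simply filled with 255.

```typescript
// Interleave three single-channel 8-bit maps into one RGBA buffer:
// R = occlusion, G = roughness, B = metallic, A = 255 (unused here).
function packOrm(
  occlusion: Uint8Array,
  roughness: Uint8Array,
  metallic: Uint8Array,
): Uint8Array {
  const texels = occlusion.length;
  if (roughness.length !== texels || metallic.length !== texels) {
    throw new Error("all maps must have the same texel count");
  }
  const rgba = new Uint8Array(texels * 4);
  for (let i = 0; i < texels; i++) {
    rgba[i * 4] = occlusion[i];
    rgba[i * 4 + 1] = roughness[i];
    rgba[i * 4 + 2] = metallic[i];
    rgba[i * 4 + 3] = 255;
  }
  return rgba;
}

// Two texels' worth of made-up data.
const packedOrm = packOrm(
  new Uint8Array([255, 128]),
  new Uint8Array([64, 200]),
  new Uint8Array([0, 255]),
);
console.log(Array.from(packedOrm)); // [255, 64, 0, 255, 128, 200, 255, 255]
```

The resulting buffer could be uploaded as an ordinary RGBA texture, after which a single sample in the fragment shader yields occlusion, roughness, and metallic in its r, g, and b components.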
Dependent texture reads (where the UV coordinates used for one texture sample come from the result of another texture sample) can be particularly expensive on mobile GPUs, especially older ones, because they prevent the GPU from prefetching texture data. Where possible, sample with UVs that come directly from vertex shader outputs rather than from intermediate texture lookups; this can improve mobile performance significantly.
Eliminate Expensive Branching Patterns
GPUs execute shader instances in groups called warps (NVIDIA), wavefronts (AMD), or execution groups. All instances in a group run the same instruction at the same time. When a branch (if/else statement) causes different instances in the group to take different paths, the GPU must execute both paths for the entire group, masking off the results that do not apply to each instance. This "warp divergence" means that a branch where different pixels take different sides is nearly as expensive as running both sides unconditionally.
Branchless math replaces conditional logic with arithmetic that produces the same result. step(edge, x) returns 0.0 or 1.0 without a branch, acting as a hard threshold. mix(a, b, step(edge, x)) selects between two values without an if-statement. max(dot(N, L), 0.0) clamps negative lighting without a conditional. smoothstep(edge0, edge1, x) creates a soft transition that would otherwise require an if-else chain.
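To make the equivalence concrete, the sketch below re-implements step, mix, and smoothstep on the CPU following their GLSL definitions and compares a branching selection with its branchless form. The glsl-prefixed names and the two select functions are illustrative only; in a real shader you would call the built-ins directly.

```typescript
// CPU-side models of three GLSL built-ins, for illustration only.
// (The ternary in glslStep is just how the model is written here; in a
// shader, step is a built-in function.)
function glslStep(edge: number, x: number): number {
  return x < edge ? 0 : 1;
}

function glslMix(a: number, b: number, t: number): number {
  return a * (1 - t) + b * t;
}

function glslSmoothstep(edge0: number, edge1: number, x: number): number {
  const t = Math.min(Math.max((x - edge0) / (edge1 - edge0), 0), 1);
  return t * t * (3 - 2 * t);
}

// Branching version: choose `lit` once x reaches the threshold.
function selectWithBranch(x: number, threshold: number, dark: number, lit: number): number {
  if (x >= threshold) {
    return lit;
  }
  return dark;
}

// Branchless version: the same choice expressed as arithmetic.
function selectBranchless(x: number, threshold: number, dark: number, lit: number): number {
  return glslMix(dark, lit, glslStep(threshold, x));
}

for (const x of [0.1, 0.5, 0.9]) {
  console.log(x, selectWithBranch(x, 0.5, 0.2, 0.8), selectBranchless(x, 0.5, 0.2, 0.8));
}
console.log(glslSmoothstep(0, 1, 0.5)); // 0.5
```

Both selection functions return 0.2 below the threshold and 0.8 at or above it. Whether the branchless form is actually faster on a given GPU and compiler is something to confirm by profiling.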
Some branches are safe. A branch that all pixels in a group take the same way (a "uniform branch" that depends on a uniform value rather than per-pixel data) costs almost nothing because there is no divergence. Branching on a uniform like if (enableNormalMap) is effectively free, since either all pixels use the normal map or none do. The compiler may even optimize it away at compile time.
Early exit branches can be beneficial despite their cost. If a complex effect applies to only a small portion of the screen (like a spotlight), a branch that skips the expensive calculation for pixels outside the light's range can save more work than the divergence costs. The tradeoff depends on how much work the branch skips versus how many pixels straddle the branch boundary. Profiling is the only reliable way to determine whether a specific early-exit branch helps or hurts.
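The tradeoff can be explored with a toy cost model in which a group of pixels pays for a code path whenever at least one of its pixels takes that path. The group size, the cost numbers, and the function names below are all assumptions chosen for illustration; real hardware scheduling is considerably more involved.

```typescript
// Toy model: a group pays for a path if at least one of its pixels takes it.
// `insideFlags[i]` is true when pixel i is inside the light's range.
function groupCost(insideFlags: boolean[], skipCost: number, effectCost: number): number {
  const anyInside = insideFlags.some((inside) => inside);
  const anyOutside = insideFlags.some((inside) => !inside);
  return (anyInside ? effectCost : 0) + (anyOutside ? skipCost : 0);
}

function totalCost(groups: boolean[][], skipCost: number, effectCost: number): number {
  return groups.reduce((sum, g) => sum + groupCost(g, skipCost, effectCost), 0);
}

// Four groups of four pixels; only one group straddles the light's edge.
const mostlyCoherent = [
  [false, false, false, false],
  [false, false, false, false],
  [false, true, true, false],
  [true, true, true, true],
];

// Four groups that all straddle the edge.
const allStraddling = [
  [true, false, false, false],
  [true, false, false, false],
  [true, false, false, false],
  [true, false, false, false],
];

const skipCost = 1; // cost of the early-exit path
const effectCost = 20; // cost of the full effect
const unconditionalCost = mostlyCoherent.length * effectCost; // 80

console.log(totalCost(mostlyCoherent, skipCost, effectCost)); // 43
console.log(totalCost(allStraddling, skipCost, effectCost)); // 84
console.log(unconditionalCost); // 80
```

In this model the early exit roughly halves the cost when most groups are coherent, but costs slightly more than running the effect unconditionally when every group diverges.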
Minimize Overdraw and Fill Rate Pressure
Overdraw occurs when multiple triangles cover the same pixel, each running the fragment shader before the depth test determines which one is actually visible. For opaque geometry, rendering front-to-back allows the depth test to reject hidden fragments before the shader runs, eliminating wasted work. Three.js sorts opaque objects front to back by default, and other engines such as Babylon.js offer configurable render ordering, so check which order your engine actually uses. Even with sorting, very complex scenes with many overlapping or interpenetrating objects still incur some overdraw, because sorting works per object rather than per pixel.
Transparent objects are the worst offenders for overdraw because they must be rendered back-to-front (for correct visual blending) and cannot benefit from early depth rejection. Each transparent layer runs its fragment shader fully, and scenes with many overlapping transparent surfaces (fog layers, glass windows, particle effects, translucent UI panels) can easily run the fragment shader five or ten times per pixel. Minimizing the screen coverage of transparent surfaces and reducing the complexity of their shaders are the most effective optimizations.
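The two sort orders described above can be sketched as a single function that builds a draw list. The Drawable type, its field names, and the sample scene are hypothetical; engines typically do this internally using the distance from the camera to each object.

```typescript
interface Drawable {
  label: string;
  distance: number; // distance from the camera
  transparent: boolean;
}

// Opaque objects first, nearest to farthest, so the depth test can reject
// hidden fragments; then transparent objects, farthest to nearest, so
// blending composites correctly.
function buildDrawOrder(objects: Drawable[]): Drawable[] {
  const opaque = objects
    .filter((o) => !o.transparent)
    .sort((a, b) => a.distance - b.distance);
  const transparent = objects
    .filter((o) => o.transparent)
    .sort((a, b) => b.distance - a.distance);
  return [...opaque, ...transparent];
}

const sceneObjects: Drawable[] = [
  { label: "wall", distance: 10, transparent: false },
  { label: "glass", distance: 5, transparent: true },
  { label: "crate", distance: 3, transparent: false },
  { label: "smoke", distance: 8, transparent: true },
];

const drawOrder = buildDrawOrder(sceneObjects).map((o) => o.label);
console.log(drawOrder); // ["crate", "wall", "smoke", "glass"]
```

Because filter returns new arrays, the input list is left in its original order.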
Fill rate is the GPU's capacity to process fragment shader invocations per second. It is determined by the number of shader cores, clock speed, and the complexity of the shader program. When the fill rate is exhausted, the GPU cannot process fragments fast enough to maintain the target frame rate, and frame time increases. Reducing shader complexity (fewer texture samples, simpler math, lower precision) directly reduces fill rate pressure.
Post-processing passes are full-screen operations that each run a fragment shader on every pixel. A 1080p display has about 2.07 million pixels, so four post-processing passes mean roughly 8.3 million additional fragment shader invocations per frame. Combining multiple simple effects into a single shader pass (for example, a single pass that applies vignette, color grading, and film grain together) reduces the overhead compared to running each effect as a separate pass. This trade-off between shader complexity and pass count is one of the key architectural decisions in post-processing pipeline design.
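The cost of extra passes can be estimated with simple arithmetic. The sketch below counts fragment shader invocations and, assuming each pass reads one full-screen RGBA8 texture (4 bytes per pixel) and writes one, the framebuffer traffic per frame. The function names and the RGBA8 assumption are illustrative; real passes may use other formats or sample more than once.

```typescript
// One fragment shader invocation per pixel per full-screen pass.
function fullScreenInvocations(width: number, height: number, passes: number): number {
  return width * height * passes;
}

// Assumes each pass reads one full-screen texture and writes one,
// both at `bytesPerPixel` (4 for RGBA8).
function passTrafficBytes(
  width: number,
  height: number,
  passes: number,
  bytesPerPixel: number = 4,
): number {
  return width * height * bytesPerPixel * 2 * passes;
}

// Four separate passes at 1080p.
const separatePasses = fullScreenInvocations(1920, 1080, 4);
// The same effects with three of the four merged into one pass.
const mergedPasses = fullScreenInvocations(1920, 1080, 2);

console.log(separatePasses); // 8294400
console.log(mergedPasses); // 4147200
console.log(passTrafficBytes(1920, 1080, 4)); // 66355200 bytes, about 66 MB per frame
```

Merging passes saves both the invocations and the intermediate read and write of the full framebuffer, at the price of a more complex single shader.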
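On mobile, tile-based rendering architectures (used by all major mobile GPUs) interact differently with overdraw and framebuffer operations. Mobile GPUs divide the screen into tiles and render each tile entirely in fast on-chip memory before writing the result to system memory. This makes blending and depth testing within a tile cheap, but operations that force tile contents out to system memory in the middle of a frame (such as switching render targets between passes) are comparatively expensive. Separately, a shader that needs more registers than the GPU can allocate per invocation causes "register spilling" to slower memory and reduces how many invocations can run concurrently. Keeping shaders simple and limiting the number of live temporary values helps mobile GPUs stay on their fast path.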
Shader performance optimization starts with profiling, not guessing. Use browser tools and engine profilers to identify actual bottlenecks, then apply targeted techniques: mediump precision on mobile, fewer texture samples, branchless math, and reduced overdraw. The goal is consistent frame rates across the full range of devices your players use.