Compute Shaders for Game Effects in WebGPU
Before WebGPU, the only way to run GPU computations in a browser was to disguise them as rendering operations, encoding data into textures and processing it with fragment shaders. This GPGPU workaround was limited to operations that fit the texture read/write model, which made many game simulations awkward or impractical to implement. WebGPU's compute pipeline removes that constraint by exposing general-purpose compute directly.
Understand the Compute Pipeline Model
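A compute shader is a standalone GPU program that processes data in parallel without any connection to vertex processing, rasterization, or pixel output. You dispatch a compute shader by specifying how many workgroups to launch. Each workgroup contains a fixed number of invocations defined by the @workgroup_size attribute in the shader. If you set @workgroup_size(64) and dispatch 1000 workgroups, the GPU launches 64,000 shader invocations and schedules them across its compute units, running as many concurrently as the hardware allows. WebGPU's default limits cap a workgroup at 256 invocations, so larger problems are covered by dispatching more workgroups rather than by making each one bigger.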
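The dispatch arithmetic can be sketched in a few lines. The helper names below are hypothetical, and the 100,000-item figure is only an example; the point is that the workgroup count is a ceiling division, so the last workgroup may contain invocations with nothing to do.

```typescript
// Hypothetical helper: number of workgroups needed to cover `itemCount` items
// when each workgroup runs `workgroupSize` invocations.
function workgroupCount(itemCount: number, workgroupSize: number): number {
  return Math.ceil(itemCount / workgroupSize);
}

// Total invocations launched by a dispatch. This can exceed the item count
// when the count is not a multiple of the workgroup size, so the shader
// needs a bounds check such as `if (gid.x >= itemCount) { return; }`.
function totalInvocations(groups: number, workgroupSize: number): number {
  return groups * workgroupSize;
}

const groups = workgroupCount(100_000, 64);      // workgroups for 100,000 items
const launched = totalInvocations(groups, 64);   // a few more than 100,000
console.log(groups, launched);
```

The extra invocations in the final workgroup are why most compute shaders begin with an early return for out-of-range indices.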
Each invocation knows its position in the dispatch through built-in variables. The @builtin(global_invocation_id) gives the invocation's unique index across the entire dispatch, @builtin(local_invocation_id) gives its index within its workgroup, and @builtin(workgroup_id) identifies which workgroup it belongs to. These indices are vec3u values because dispatches can be organized in three dimensions, though one-dimensional dispatches (with y and z set to 1) are the most common for simple data processing.
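The relationship between the three built-ins can be written out on the CPU. This is a sketch for illustration only: the function names are hypothetical, and on the GPU these values are supplied by the hardware rather than computed by your code.

```typescript
type Vec3u = [number, number, number];

// Per axis: global_invocation_id = workgroup_id * workgroup_size + local_invocation_id.
function globalInvocationId(workgroupId: Vec3u, workgroupSize: Vec3u, localId: Vec3u): Vec3u {
  return [
    workgroupId[0] * workgroupSize[0] + localId[0],
    workgroupId[1] * workgroupSize[1] + localId[1],
    workgroupId[2] * workgroupSize[2] + localId[2],
  ];
}

// Flatten a 2D global id into an index into a row-major buffer `width` elements wide.
function flatIndex(gid: Vec3u, width: number): number {
  return gid[1] * width + gid[0];
}

// Invocation 5 of workgroup 3 in a one-dimensional dispatch with @workgroup_size(64).
const gid = globalInvocationId([3, 0, 0], [64, 1, 1], [5, 0, 0]);
console.log(gid, flatIndex(gid, 1024));
```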
Invocations within the same workgroup can share data through workgroup-scoped memory, declared with var<workgroup> in WGSL. This shared memory is typically much faster than global storage buffer access because on most GPUs it resides in on-chip memory close to the compute units. The workgroupBarrier() function synchronizes all invocations in a workgroup: no invocation proceeds past the barrier until every invocation has reached it, and writes to shared memory made before the barrier are visible afterward. This shared memory and barrier model is essential for algorithms that need neighbor access, such as blur filters, prefix sums, and spatial queries.
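A minimal sketch of the load, barrier, compute pattern is shown below as a WGSL string held in TypeScript, the form in which it would be handed to createShaderModule(). The three-tap blur, the binding numbers, and the names are assumptions chosen for illustration, and the shader assumes the input holds at least one element. A CPU reference function is included so GPU output could be checked against it.

```typescript
// WGSL source for a 3-tap blur that stages data in workgroup memory.
const blurShader = /* wgsl */ `
@group(0) @binding(0) var<storage, read> src: array<f32>;
@group(0) @binding(1) var<storage, read_write> dst: array<f32>;

// 64 elements for the workgroup plus one halo element on each side.
var<workgroup> tile: array<f32, 66>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3u,
        @builtin(local_invocation_id) lid: vec3u) {
  let n = arrayLength(&src);            // assumed to be at least 1
  let i = min(gid.x, n - 1u);           // clamp so every invocation loads a value
  tile[lid.x + 1u] = src[i];
  if (lid.x == 0u)  { tile[0]  = src[max(i, 1u) - 1u]; }
  if (lid.x == 63u) { tile[65] = src[min(i + 1u, n - 1u)]; }

  workgroupBarrier();                   // tile writes are visible past this point

  if (gid.x < n) {
    dst[gid.x] = (tile[lid.x] + tile[lid.x + 1u] + tile[lid.x + 2u]) / 3.0;
  }
}`;

// CPU reference with the same clamped-edge behaviour as the shader above.
function blur3Reference(src: Float32Array): Float32Array {
  const n = src.length;
  const dst = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    const left = src[Math.max(i, 1) - 1];
    const right = src[Math.min(i + 1, n - 1)];
    dst[i] = (left + src[i] + right) / 3;
  }
  return dst;
}
```

Note that the barrier sits outside every conditional: WGSL requires workgroupBarrier() to be reached by all invocations of the workgroup, so out-of-range invocations clamp their loads instead of returning early.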
Create a Compute Pipeline and Storage Buffers
Setting up a compute pipeline requires a GPUShaderModule with a @compute entry point, a bind group layout describing the resources the shader accesses, and the pipeline object itself. The pipeline is created with device.createComputePipeline(), passing the shader module and entry point name. Unlike render pipelines, compute pipelines have no vertex layout, fragment state, or color attachment configuration because they do not interact with the rendering system.
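The setup can be sketched as follows. The interface is a hypothetical structural stand-in for the two GPUDevice methods used, so the example type-checks without WebGPU typings; in a browser you would pass the real device. This sketch uses layout: "auto", which asks WebGPU to derive the bind group layout from the shader instead of supplying an explicit one.

```typescript
// Hypothetical stand-in for the parts of GPUDevice this sketch touches.
interface ComputeDeviceLike {
  createShaderModule(desc: { code: string }): unknown;
  createComputePipeline(desc: {
    layout: "auto";
    compute: { module: unknown; entryPoint: string };
  }): unknown;
}

// Compile WGSL and build a compute pipeline around its @compute entry point.
function buildComputePipeline(
  device: ComputeDeviceLike,
  wgsl: string,
  entryPoint = "main",
): unknown {
  const shaderModule = device.createShaderModule({ code: wgsl });
  return device.createComputePipeline({
    layout: "auto",
    compute: { module: shaderModule, entryPoint },
  });
}
```

An explicit pipeline layout is the better choice when several pipelines need to share the same bind groups, since layouts derived with "auto" are specific to one pipeline.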
Storage buffers are the primary data interface for compute shaders. You create a GPUBuffer with the STORAGE usage flag, adding COPY_SRC if you need to read results back to the CPU and COPY_DST if you need to upload data from the CPU. The buffer contains your simulation data, typically a flat array of structs. A particle buffer, for example, stores position (vec3f), velocity (vec3f), lifetime (f32), and any other per-particle data in a repeating struct pattern. Because vec3f is aligned to 16 bytes in WGSL, that struct occupies 32 bytes per particle rather than 28, and CPU-side data must include the padding.
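The following sketch packs that particle struct on the CPU. The type and function names are hypothetical. The usage constants repeat the numeric values of GPUBufferUsage from the WebGPU specification so the example runs outside a browser; in real code you would use GPUBufferUsage directly.

```typescript
// Matches this WGSL struct (vec3f is 16-byte aligned, so velocity starts at byte 16):
//   struct Particle { position: vec3f, velocity: vec3f, lifetime: f32 }
// Float layout per particle: px py pz (pad) vx vy vz lifetime
const FLOATS_PER_PARTICLE = 8;
const PARTICLE_STRIDE_BYTES = FLOATS_PER_PARTICLE * 4; // 32 bytes

// Numeric values of the GPUBufferUsage flags used here.
const BufferUsage = { COPY_SRC: 0x04, COPY_DST: 0x08, STORAGE: 0x80 } as const;

interface Particle {
  position: [number, number, number];
  velocity: [number, number, number];
  lifetime: number;
}

function packParticles(particles: Particle[]): Float32Array {
  const data = new Float32Array(particles.length * FLOATS_PER_PARTICLE);
  particles.forEach((p, i) => {
    const base = i * FLOATS_PER_PARTICLE;
    data.set(p.position, base);       // floats 0..2, float 3 is padding
    data.set(p.velocity, base + 4);   // floats 4..6
    data[base + 7] = p.lifetime;      // float 7
  });
  return data;
}

// Descriptor for device.createBuffer(): storage access plus CPU upload.
function particleBufferDescriptor(count: number): { size: number; usage: number } {
  return {
    size: count * PARTICLE_STRIDE_BYTES,
    usage: BufferUsage.STORAGE | BufferUsage.COPY_DST,
  };
}
```

The packed array would then be uploaded with device.queue.writeBuffer(), which is what the COPY_DST flag permits.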
The bind group connects your buffers to the shader's @group/@binding declarations. Create a bind group layout that specifies each binding's resource type (a buffer of type storage, read-only-storage, or uniform) and its visibility (the compute stage), then create a bind group that pairs each layout entry with an actual buffer. When you record compute commands, you bind this group and the pipeline, then dispatch the workgroups.
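The descriptors involved are plain objects, which makes them easy to sketch. The binding numbers and the three-buffer arrangement below are assumptions for illustration, and the constant repeats the numeric value of GPUShaderStage.COMPUTE so the code runs outside a browser.

```typescript
const SHADER_STAGE_COMPUTE = 0x4; // numeric value of GPUShaderStage.COMPUTE

type BufferBindingType = "uniform" | "storage" | "read-only-storage";

interface BufferLayoutEntry {
  binding: number;
  visibility: number;
  buffer: { type: BufferBindingType };
}

// One layout entry: which binding slot, which stage sees it, what kind of buffer.
function computeBufferEntry(binding: number, type: BufferBindingType): BufferLayoutEntry {
  return { binding, visibility: SHADER_STAGE_COMPUTE, buffer: { type } };
}

// Descriptor for device.createBindGroupLayout(): input, output, parameters.
const particleLayoutDescriptor = {
  entries: [
    computeBufferEntry(0, "read-only-storage"), // particles read this frame
    computeBufferEntry(1, "storage"),           // particles written this frame
    computeBufferEntry(2, "uniform"),           // simulation parameters
  ],
};

// Descriptor for device.createBindGroup(): buffer i is attached to binding i.
function bindGroupDescriptor(layout: unknown, buffers: unknown[]) {
  return {
    layout,
    entries: buffers.map((buffer, binding) => ({ binding, resource: { buffer } })),
  };
}
```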
A compute pass is recorded with encoder.beginComputePass(), which is simpler than a render pass because there are no attachments or clear operations. Inside the pass, you set the pipeline, set bind groups, and call dispatchWorkgroups(x, y, z) to launch the computation. The GPU processes all workgroups and writes results to the storage buffers, which can then be read by a subsequent render pass or another compute pass in the same command buffer.
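The recording sequence described above might look like the sketch below. The interfaces are hypothetical stand-ins for the WebGPU objects involved, limited to the methods actually called, so the function can be exercised without a GPU; a real GPUDevice provides the same methods.

```typescript
interface ComputePassLike {
  setPipeline(pipeline: unknown): void;
  setBindGroup(index: number, bindGroup: unknown): void;
  dispatchWorkgroups(x: number, y?: number, z?: number): void;
  end(): void;
}
interface CommandEncoderLike {
  beginComputePass(): ComputePassLike;
  finish(): unknown;
}
interface QueueDeviceLike {
  createCommandEncoder(): CommandEncoderLike;
  queue: { submit(commandBuffers: unknown[]): void };
}

// Record and submit one compute pass covering `itemCount` items.
// Returns the number of workgroups dispatched.
function runComputePass(
  device: QueueDeviceLike,
  pipeline: unknown,
  bindGroup: unknown,
  itemCount: number,
  workgroupSize = 64, // must match @workgroup_size in the shader
): number {
  const groups = Math.ceil(itemCount / workgroupSize);
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(groups);
  pass.end();
  device.queue.submit([encoder.finish()]);
  return groups;
}
```

In a frame loop you would usually record the render pass on the same encoder before calling finish(), so the simulation and the draw are submitted together.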
Build a GPU Particle System
A GPU particle system moves both the simulation update and the data used for rendering onto the GPU. The compute shader reads each particle's current state from one buffer, applies physics (gravity, wind, drag, collision), updates the position and velocity, decrements the lifetime, and writes the new state to an output buffer. A ping-pong pattern alternates between two buffers each frame: the compute shader reads from buffer A and writes to buffer B, then the next frame reads from buffer B and writes to buffer A.
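A CPU reference for the update step and the buffer swap is sketched below. It assumes an 8-float particle layout (px, py, pz, padding, vx, vy, vz, lifetime) and applies only gravity; the names and the semi-implicit Euler step are illustrative choices, not a prescribed integrator. On the GPU, the loop body would be one shader invocation.

```typescript
const STRIDE = 8; // assumed layout: px py pz pad vx vy vz lifetime

// One simulation step: read every particle from `src`, write its new state to `dst`.
function updateParticles(src: Float32Array, dst: Float32Array, dt: number, gravity = -9.8): void {
  for (let base = 0; base < src.length; base += STRIDE) {
    const vx = src[base + 4];
    const vy = src[base + 5] + gravity * dt; // update velocity first
    const vz = src[base + 6];
    dst[base] = src[base] + vx * dt;         // then move with the new velocity
    dst[base + 1] = src[base + 1] + vy * dt;
    dst[base + 2] = src[base + 2] + vz * dt;
    dst[base + 3] = 0;                       // padding
    dst[base + 4] = vx;
    dst[base + 5] = vy;
    dst[base + 6] = vz;
    dst[base + 7] = src[base + 7] - dt;      // lifetime counts down
  }
}

// Which buffer to read and which to write on a given frame.
function pingPong(frame: number): { read: 0 | 1; write: 0 | 1 } {
  const read = (frame % 2) as 0 | 1;
  return { read, write: (1 - read) as 0 | 1 };
}

// Usage: two buffers, swapped every frame.
const buffers = [new Float32Array([0, 10, 0, 0, 1, 0, 0, 2]), new Float32Array(STRIDE)];
for (let frame = 0; frame < 2; frame++) {
  const { read, write } = pingPong(frame);
  updateParticles(buffers[read], buffers[write], 0.5, -10);
}
```

With WebGPU, the swap is usually implemented by creating two bind groups up front, one for each read/write direction, and selecting between them by frame parity.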
The render pass draws particles directly from the same buffer the compute shader just wrote to. Each particle becomes a billboard quad (two triangles facing the camera), with the vertex shader reading position and size from the storage buffer using the instance index. WebGPU's point-list topology draws one-pixel points with no size control, so sized point sprites are generally built from quads as well. This eliminates the CPU-to-GPU data transfer that makes CPU particle systems expensive. The particle data never leaves GPU memory.
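The billboard expansion a vertex shader performs can be sketched on the CPU. The corner table and function name are hypothetical; the idea is that a draw of 6 vertices per instance uses the vertex index to pick a quad corner and the instance index to pick the particle, then offsets the particle center along the camera's right and up vectors.

```typescript
type Vec3 = [number, number, number];

// Corner offsets for the 6 vertices of two triangles covering a unit quad.
const QUAD_CORNERS: ReadonlyArray<[number, number]> = [
  [-0.5, -0.5], [0.5, -0.5], [-0.5, 0.5],
  [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5],
];

// World-space position of one billboard vertex. `camRight` and `camUp` are the
// camera's right and up axes, assumed to be unit length.
function billboardVertex(
  center: Vec3,
  size: number,
  camRight: Vec3,
  camUp: Vec3,
  vertexIndex: number,
): Vec3 {
  const [cx, cy] = QUAD_CORNERS[vertexIndex % 6];
  return [
    center[0] + (camRight[0] * cx + camUp[0] * cy) * size,
    center[1] + (camRight[1] * cx + camUp[1] * cy) * size,
    center[2] + (camRight[2] * cx + camUp[2] * cy) * size,
  ];
}
```

The matching draw call would be pass.draw(6, particleCount), with no vertex buffer bound at all.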
Particle emission is handled by maintaining a counter in the buffer or a separate atomic counter buffer. When new particles need to spawn, the compute shader finds dead particles (lifetime at or below zero) and reinitializes them with new positions, velocities, and lifetimes. The emission rate, spawn position, velocity distribution, and other parameters come from a uniform buffer that the CPU updates each frame.
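The respawn logic can be sketched as a CPU loop. The 8-float layout (px, py, pz, padding, vx, vy, vz, lifetime), the names, and the fixed spawn values are assumptions; a real emitter would randomize velocities per particle. The plain counter here stands in for the atomic counter a shader would need, since many invocations would claim spawn slots at once.

```typescript
const PARTICLE_FLOATS = 8; // assumed layout: px py pz pad vx vy vz lifetime

interface EmitterParams {
  position: [number, number, number];
  velocity: [number, number, number];
  lifetime: number;
}

// Reinitialize up to `spawnBudget` dead particles and return how many were spawned.
// On the GPU, `remaining` would be an atomic<u32> in a storage buffer.
function emitParticles(particles: Float32Array, spawnBudget: number, params: EmitterParams): number {
  let remaining = spawnBudget;
  for (let base = 0; base < particles.length && remaining > 0; base += PARTICLE_FLOATS) {
    if (particles[base + 7] > 0) continue;      // still alive, leave it alone
    particles.set(params.position, base);
    particles.set(params.velocity, base + 4);
    particles[base + 7] = params.lifetime;
    remaining--;
  }
  return spawnBudget - remaining;
}
```

The spawn budget for a frame is typically the emission rate multiplied by the frame time, carried over fractionally so low rates still emit.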
This architecture scales to millions of particles. A CPU particle system updating 100,000 particles at 60 FPS consumes substantial CPU time and generates a large data upload every frame. A GPU particle system handles the same count with negligible CPU cost, and the rendering step reads directly from GPU memory. At 1,000,000 or more particles, the limiting factor is often fill rate and overdraw in the fragment stage rather than simulation cost.
Implement GPU Physics with Shared Memory
Physics simulations that involve interactions between nearby elements, such as cloth, soft bodies, and fluid particles, benefit from workgroup shared memory. The basic pattern loads a tile of elements into shared memory, synchronizes with workgroupBarrier(), processes each element using neighbor data from shared memory, and writes results to the output buffer.
Cloth simulation demonstrates this well. A cloth mesh is a grid of particles connected by springs. Each particle's new position depends on the forces from its four (or eight) neighbors. The compute shader assigns one workgroup to a tile of the grid, loads the tile plus a border of neighboring particles into shared memory, and each invocation computes the spring forces for its particle using the fast shared memory for neighbor lookups.
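The per-particle work can be sketched with Hooke's law on a CPU-side grid. The function names, the four-neighbor stencil, and the single stiffness value are assumptions for illustration; in the shader, the neighbor reads would come from the tile staged in workgroup memory rather than from a plain array.

```typescript
type Vec3 = [number, number, number];

// Force on particle `p` from a spring connecting it to `neighbor`.
// Positive extension pulls p toward the neighbor.
function springForce(p: Vec3, neighbor: Vec3, restLength: number, stiffness: number): Vec3 {
  const dx = neighbor[0] - p[0];
  const dy = neighbor[1] - p[1];
  const dz = neighbor[2] - p[2];
  const len = Math.hypot(dx, dy, dz);
  if (len === 0) return [0, 0, 0];
  const scale = (stiffness * (len - restLength)) / len;
  return [dx * scale, dy * scale, dz * scale];
}

// Sum of spring forces from the four grid neighbors of particle (x, y).
function netSpringForce(
  grid: Vec3[], width: number, height: number,
  x: number, y: number, restLength: number, stiffness: number,
): Vec3 {
  const total: Vec3 = [0, 0, 0];
  const offsets: ReadonlyArray<[number, number]> = [[1, 0], [-1, 0], [0, 1], [0, -1]];
  for (const [ox, oy] of offsets) {
    const nx = x + ox;
    const ny = y + oy;
    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue; // cloth edge
    const f = springForce(grid[y * width + x], grid[ny * width + nx], restLength, stiffness);
    total[0] += f[0];
    total[1] += f[1];
    total[2] += f[2];
  }
  return total;
}
```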
Spatial hashing enables collision detection for unstructured particle sets. A compute pass assigns each particle to a grid cell based on its position. A second pass sorts particles by cell using a GPU sort algorithm (bitonic sort or radix sort implemented as compute shaders). A third pass checks each particle against its neighbors in the same and adjacent grid cells. All three passes run entirely on the GPU, with the sorted cell data enabling efficient neighbor queries.
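The three passes can be sketched on the CPU in two dimensions. Everything here is illustrative: the names are hypothetical, positions are assumed to be non-negative and inside the grid, and Array.prototype.sort stands in for the bitonic or radix sort a GPU implementation would use.

```typescript
interface SpatialGrid {
  cellSize: number;
  gridWidth: number;
  cells: number[];                // pass 1: cell index per particle
  sorted: number[];               // pass 2: particle indices ordered by cell
  cellStart: Map<number, number>; // first slot in `sorted` for each occupied cell
}

// `pos` holds x, y pairs.
function buildGrid(pos: Float32Array, cellSize: number, gridWidth: number): SpatialGrid {
  const count = pos.length / 2;
  const cells: number[] = [];
  for (let i = 0; i < count; i++) {
    cells.push(Math.floor(pos[2 * i + 1] / cellSize) * gridWidth + Math.floor(pos[2 * i] / cellSize));
  }
  const sorted = Array.from({ length: count }, (_, i) => i).sort((a, b) => cells[a] - cells[b]);
  const cellStart = new Map<number, number>();
  sorted.forEach((p, slot) => {
    if (!cellStart.has(cells[p])) cellStart.set(cells[p], slot);
  });
  return { cellSize, gridWidth, cells, sorted, cellStart };
}

// Pass 3: test each particle against particles in its own and adjacent cells.
function findCollisions(pos: Float32Array, grid: SpatialGrid, radius: number): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  for (let i = 0; i < grid.cells.length; i++) {
    const cx = Math.floor(pos[2 * i] / grid.cellSize);
    const cy = Math.floor(pos[2 * i + 1] / grid.cellSize);
    for (let dy = -1; dy <= 1; dy++) {
      for (let dx = -1; dx <= 1; dx++) {
        const nx = cx + dx;
        const ny = cy + dy;
        if (nx < 0 || ny < 0 || nx >= grid.gridWidth) continue;
        const cell = ny * grid.gridWidth + nx;
        const start = grid.cellStart.get(cell);
        if (start === undefined) continue; // empty cell
        for (let s = start; s < grid.sorted.length && grid.cells[grid.sorted[s]] === cell; s++) {
          const j = grid.sorted[s];
          if (j <= i) continue;            // report each pair once
          const d = Math.hypot(pos[2 * j] - pos[2 * i], pos[2 * j + 1] - pos[2 * i + 1]);
          if (d < radius) pairs.push([i, j]);
        }
      }
    }
  }
  return pairs;
}
```

The cell size should be at least the interaction radius; otherwise colliding particles can sit more than one cell apart and be missed.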
Verlet integration is a popular choice for game physics on the GPU because it is stable, simple, and position-based. Each particle stores its current and previous position. The update step computes the new position from the current position, the displacement from the previous position (which encodes velocity implicitly), and the accumulated forces. Constraint satisfaction, such as keeping cloth particles at fixed distances, runs as a separate pass that iteratively moves particles to satisfy distance constraints.
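Both steps are short enough to sketch directly. The function names are hypothetical, positions are stored as x, y, z triples, and the constraint moves both endpoints by half the error, which assumes equal masses and no pinned particles.

```typescript
// Position Verlet: x_next = x + (x - x_prev) + a * dt^2.
// `curr` and `prev` hold x, y, z triples and are updated in place.
function verletStep(curr: Float32Array, prev: Float32Array, accel: [number, number, number], dt: number): void {
  for (let i = 0; i < curr.length; i += 3) {
    for (let c = 0; c < 3; c++) {
      const x = curr[i + c];
      const next = x + (x - prev[i + c]) + accel[c] * dt * dt;
      prev[i + c] = x;     // current position becomes the previous one
      curr[i + c] = next;
    }
  }
}

// Move particles a and b so their distance approaches `rest`.
// Each endpoint absorbs half of the correction.
function satisfyDistance(curr: Float32Array, a: number, b: number, rest: number): void {
  const dx = curr[3 * b] - curr[3 * a];
  const dy = curr[3 * b + 1] - curr[3 * a + 1];
  const dz = curr[3 * b + 2] - curr[3 * a + 2];
  const len = Math.hypot(dx, dy, dz);
  if (len === 0) return;
  const corr = (0.5 * (len - rest)) / len;
  curr[3 * a] += dx * corr;
  curr[3 * a + 1] += dy * corr;
  curr[3 * a + 2] += dz * corr;
  curr[3 * b] -= dx * corr;
  curr[3 * b + 1] -= dy * corr;
  curr[3 * b + 2] -= dz * corr;
}
```

Running the constraint pass several times per frame trades cost for stiffness: more iterations make the cloth stretch less.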
Generate Procedural Terrain on the GPU
Terrain generation involves computing a heightmap from noise functions, which is embarrassingly parallel since each grid point's height depends only on its coordinates. A compute shader evaluates multiple octaves of Perlin or simplex noise at each grid position, combines them with fractal Brownian motion (fBm) weighting, and writes the resulting height to a storage buffer. For a 1024x1024 heightmap with @workgroup_size(16, 16), the shader dispatches 64x64 workgroups of 256 invocations each, which stays within WebGPU's default limit of 256 invocations per workgroup, and typically completes in milliseconds.
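The per-point computation can be sketched on the CPU. To stay short, this example swaps in a hash-based value noise rather than Perlin or simplex noise; the fBm accumulation around it is the same either way. All names, the hash constants, and the default octave settings are assumptions, and the dispatch helper assumes a 16x16 workgroup.

```typescript
// Deterministic pseudo-random value in [0, 1) for an integer lattice point.
function hash2(ix: number, iy: number): number {
  let h = (Math.imul(ix, 374761393) + Math.imul(iy, 668265263)) | 0;
  h = Math.imul(h ^ (h >>> 13), 1274126177);
  return ((h ^ (h >>> 16)) >>> 0) / 4294967296;
}

const lerp = (a: number, b: number, t: number): number => a + (b - a) * t;

// Value noise: smoothly interpolated lattice hashes (stand-in for Perlin/simplex).
function valueNoise(x: number, y: number): number {
  const x0 = Math.floor(x);
  const y0 = Math.floor(y);
  const fx = x - x0;
  const fy = y - y0;
  const u = fx * fx * (3 - 2 * fx); // smoothstep weights
  const v = fy * fy * (3 - 2 * fy);
  const top = lerp(hash2(x0, y0), hash2(x0 + 1, y0), u);
  const bottom = lerp(hash2(x0, y0 + 1), hash2(x0 + 1, y0 + 1), u);
  return lerp(top, bottom, v);
}

// fBm: each octave doubles the frequency and halves the amplitude by default.
function fbm(x: number, y: number, octaves = 5, lacunarity = 2, gain = 0.5): number {
  let sum = 0, total = 0, amplitude = 1, frequency = 1;
  for (let o = 0; o < octaves; o++) {
    sum += amplitude * valueNoise(x * frequency, y * frequency);
    total += amplitude;
    amplitude *= gain;
    frequency *= lacunarity;
  }
  return sum / total; // normalized back to roughly [0, 1]
}

// What each invocation would do for its own (x, y).
function generateHeightmap(size: number, scale: number): Float32Array {
  const heights = new Float32Array(size * size);
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) heights[y * size + x] = fbm(x * scale, y * scale);
  }
  return heights;
}

// Workgroups per axis for a square heightmap with @workgroup_size(tile, tile).
function heightmapDispatch(size: number, tile = 16): [number, number] {
  const n = Math.ceil(size / tile);
  return [n, n];
}
```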
Hydraulic erosion refines the heightmap with a physically inspired water flow simulation. Each compute invocation simulates a water droplet that follows the terrain gradient, picks up sediment from steep slopes, and deposits it in flat areas. Running thousands of droplets in parallel across many iterations produces natural-looking valleys, river beds, and erosion patterns. The iterative nature means multiple dispatch calls, but each dispatch is fast because the heightmap stays in GPU memory throughout.
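A heavily simplified droplet is sketched below to show the shape of the loop. It is not a full erosion model: the droplet steps between grid cells by steepest descent instead of following a continuous gradient, there is no water volume, speed, or evaporation, and all sediment is dropped where the droplet comes to rest. The names and the erosion rate are assumptions.

```typescript
const NEIGHBOR_OFFSETS: ReadonlyArray<[number, number]> = [[1, 0], [-1, 0], [0, 1], [0, -1]];

// Simulate one droplet on a square heightmap, modifying `height` in place.
// Returns the index of the cell where the droplet stopped.
function simulateDroplet(
  height: Float32Array, size: number,
  startX: number, startY: number,
  maxSteps = 30, erodeRate = 0.3,
): number {
  let x = startX;
  let y = startY;
  let sediment = 0;
  for (let step = 0; step < maxSteps; step++) {
    const here = y * size + x;
    // Find the lowest of the four neighbors.
    let bestX = x, bestY = y, bestH = height[here];
    for (const [dx, dy] of NEIGHBOR_OFFSETS) {
      const nx = x + dx;
      const ny = y + dy;
      if (nx < 0 || ny < 0 || nx >= size || ny >= size) continue;
      const h = height[ny * size + nx];
      if (h < bestH) { bestH = h; bestX = nx; bestY = ny; }
    }
    if (bestX === x && bestY === y) break;   // flat area or pit: stop here
    const eroded = (height[here] - bestH) * erodeRate; // steeper drop, more pickup
    height[here] -= eroded;
    sediment += eroded;
    x = bestX;
    y = bestY;
  }
  height[y * size + x] += sediment;          // deposit what the droplet carries
  return y * size + x;
}
```

When many droplets run in parallel on the GPU, two of them can modify the same cell in the same dispatch, so a real implementation has to either tolerate those races or accumulate changes with atomics.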
Vegetation placement uses the generated heightmap and additional noise layers to determine where to place trees, grass, and rocks. A compute shader evaluates biome rules (altitude, slope, moisture) at each grid point and writes instance data (position, rotation, scale, type) to a storage buffer that the render pass uses directly for instanced drawing. The CPU never processes the placement data.
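The rule evaluation can be sketched as follows. The thresholds, type codes, and six-float instance layout are all assumptions for illustration, and inputs are taken to be normalized to the 0 to 1 range. On the GPU, the append into the instance buffer would use an atomic counter to claim an output slot.

```typescript
const GRASS = 0, TREE = 1, ROCK = 2, NONE = -1;
const FLOATS_PER_INSTANCE = 6; // x, y, z, rotation, scale, type

// Biome rules with illustrative thresholds.
function classifyBiome(altitude: number, slope: number, moisture: number): number {
  if (altitude > 0.85) return NONE; // bare peaks
  if (slope > 0.6) return ROCK;     // too steep for plants
  if (moisture > 0.5) return TREE;
  return GRASS;
}

// Cheap deterministic value in [0, 1) used to vary rotation and scale.
function pseudoRandom(i: number): number {
  const s = Math.sin(i * 12.9898) * 43758.5453;
  return s - Math.floor(s);
}

// Evaluate every grid point and emit instance data for the ones that get an object.
function placeVegetation(
  height: Float32Array, slope: Float32Array, moisture: Float32Array, size: number,
): Float32Array {
  const out: number[] = [];
  for (let i = 0; i < size * size; i++) {
    const type = classifyBiome(height[i], slope[i], moisture[i]);
    if (type === NONE) continue;
    const r = pseudoRandom(i);
    out.push(i % size, height[i], Math.floor(i / size), r * 2 * Math.PI, 0.75 + 0.5 * r, type);
  }
  return new Float32Array(out);
}
```

The instance count produced this way can be written to an indirect draw buffer so the CPU never needs to know how many objects were placed.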
Normal map generation is a follow-up compute pass that reads the heightmap and computes surface normals using finite differences. Each invocation samples the height at its position and its four neighbors, computes the cross product of the horizontal and vertical tangent vectors, and writes the resulting normal vector to a normal map buffer. This normal map feeds into the terrain shader for lighting calculations.
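The finite-difference computation is sketched below with the heightmap treated as y-up. The function name is hypothetical. With tangents (2c, dx, 0) and (0, dz, 2c), where c is the grid spacing and dx, dz are central height differences, the cross product reduces to a normal proportional to (-dx, 2c, -dz).

```typescript
// Compute per-texel normals (x, y, z triples, y up) from a square heightmap.
function heightmapNormals(height: Float32Array, size: number, cellSize = 1): Float32Array {
  const normals = new Float32Array(size * size * 3);
  const clamp = (v: number): number => Math.min(Math.max(v, 0), size - 1);
  // Edge texels reuse their own height for the missing neighbor,
  // which slightly flattens normals along the border.
  const at = (x: number, y: number): number => height[clamp(y) * size + clamp(x)];

  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      const dx = at(x + 1, y) - at(x - 1, y); // height change across 2 cells in x
      const dz = at(x, y + 1) - at(x, y - 1); // height change across 2 cells in z
      const nx = -dx;
      const ny = 2 * cellSize;
      const nz = -dz;
      const len = Math.hypot(nx, ny, nz);
      const o = (y * size + x) * 3;
      normals[o] = nx / len;
      normals[o + 1] = ny / len;
      normals[o + 2] = nz / len;
    }
  }
  return normals;
}
```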
Apply Compute-Based Post-Processing
Screen-space post-processing effects operate on the rendered image after the main render pass. While these effects can be implemented with fragment shaders and full-screen quads, compute shaders offer better control over memory access patterns and enable optimizations that fragment shaders cannot express.
Bloom is a multi-pass effect that brightens areas of the image that exceed a luminance threshold. The compute implementation downsamples the bright regions through a chain of progressively smaller textures, applies a blur at each level, then upsamples and accumulates the blurred results. Each level of the chain is a compute dispatch that reads from one texture and writes to another. Compute shaders can use shared memory to perform the blur with fewer texture reads than a fragment shader approach.
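The bookkeeping around the bloom chain can be sketched on the CPU. The names, the halving per level, the 8x8 workgroup, and the hard threshold are assumptions; production bloom often uses a soft knee instead of a hard cutoff. The luminance weights are the standard Rec. 709 coefficients.

```typescript
interface MipSize { width: number; height: number; }

// Sizes of the downsample chain: each level is half the previous one, never below 1.
function bloomMipChain(width: number, height: number, levels: number): MipSize[] {
  const chain: MipSize[] = [];
  let w = width;
  let h = height;
  for (let i = 0; i < levels; i++) {
    w = Math.max(1, Math.floor(w / 2));
    h = Math.max(1, Math.floor(h / 2));
    chain.push({ width: w, height: h });
  }
  return chain;
}

// Bright pass: keep the color only if its luminance exceeds the threshold.
function brightPass(rgb: [number, number, number], threshold: number): [number, number, number] {
  const luminance = 0.2126 * rgb[0] + 0.7152 * rgb[1] + 0.0722 * rgb[2];
  return luminance > threshold ? rgb : [0, 0, 0];
}

// Workgroups needed to cover one level with @workgroup_size(tile, tile).
function dispatchForLevel(level: MipSize, tile = 8): [number, number] {
  return [Math.ceil(level.width / tile), Math.ceil(level.height / tile)];
}

// One dispatch per level on the way down; the upsample pass walks the chain in reverse.
const chain = bloomMipChain(1920, 1080, 5);
const dispatches = chain.map((level) => dispatchForLevel(level));
```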
Screen-space ambient occlusion (SSAO) samples the depth buffer around each pixel to estimate how occluded it is by surrounding geometry. The compute version loads a tile of depth values into shared memory, then each invocation samples from shared memory instead of performing redundant global texture reads. For a 4x4 tile with a 32-sample SSAO kernel, sampling the texture directly issues 16 x 32 = 512 depth fetches, most of which hit the same small neighborhood, so shared memory eliminates hundreds of redundant texture fetches per tile.
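The saving can be estimated with a little arithmetic. This sketch assumes every sample falls within a fixed pixel radius of its center pixel, so the whole tile can be served from one load of the tile plus a border of that radius; real SSAO kernels often scale their radius with depth, which weakens this assumption. The function name and the example radius are hypothetical.

```typescript
interface FetchCounts {
  direct: number; // global fetches when every pixel samples the texture itself
  shared: number; // global fetches to fill the shared tile once
  saved: number;
}

// Compare global depth fetches per tile with and without a shared-memory tile.
function ssaoFetchCounts(tile: number, samples: number, radius: number): FetchCounts {
  const direct = tile * tile * samples;
  const edge = tile + 2 * radius; // tile plus a border on each side
  const shared = edge * edge;
  return { direct, shared, saved: direct - shared };
}

console.log(ssaoFetchCounts(4, 32, 2));  // small tile from the text
console.log(ssaoFetchCounts(16, 32, 4)); // a larger tile amortizes the border better
```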
Temporal anti-aliasing (TAA) blends the current frame with previous frames using motion vectors and color clamping. The compute shader reads the current frame, the previous accumulated frame, and the motion vector buffer to find the corresponding pixel in the previous frame. It applies neighborhood clamping to prevent ghosting artifacts, blends the current and previous colors, and writes the result to the output buffer. The compute approach allows the shader to read neighboring pixels from shared memory for the clamping step, which requires access to a local window of color values.
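The resolve step for a single pixel can be sketched on grayscale images. The names are hypothetical, the motion vector is assumed to be in pixels and to point from the previous position to the current one, the history lookup is a nearest-texel read rather than a filtered sample, and the 3x3 min/max clamp is one common form of neighborhood clamping.

```typescript
// Resolve one pixel: reproject, clamp history to the current 3x3 range, blend.
function taaResolvePixel(
  current: Float32Array, history: Float32Array,
  width: number, height: number,
  x: number, y: number,
  motionX: number, motionY: number,
  blend = 0.1, // weight of the current frame
): number {
  const clampTo = (v: number, max: number): number => Math.min(Math.max(v, 0), max);

  // Range of the current frame in the 3x3 neighborhood.
  let lo = Infinity;
  let hi = -Infinity;
  for (let dy = -1; dy <= 1; dy++) {
    for (let dx = -1; dx <= 1; dx++) {
      const c = current[clampTo(y + dy, height - 1) * width + clampTo(x + dx, width - 1)];
      lo = Math.min(lo, c);
      hi = Math.max(hi, c);
    }
  }

  // Where this pixel was in the previous frame.
  const px = clampTo(Math.round(x - motionX), width - 1);
  const py = clampTo(Math.round(y - motionY), height - 1);

  // History outside the neighborhood range is treated as stale and pulled back in.
  const hist = Math.min(Math.max(history[py * width + px], lo), hi);
  const cur = current[y * width + x];
  return hist + (cur - hist) * blend;
}
```

In the compute version, the nine neighborhood reads are the part that benefits from a shared-memory tile, since adjacent pixels reuse almost all of the same values.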
Compute shaders move game simulations and effects processing from the CPU to the GPU, where highly parallel workloads can run far faster than an equivalent CPU loop. The pattern is consistent: create storage buffers, dispatch compute workgroups, and let the render pass consume the results directly from GPU memory.