WebAssembly Performance for Games
The Boundary Is the Bottleneck
The most common reason a wasm game runs no faster than its JavaScript version is a chatty boundary. Every call from JavaScript into the module, and every call back out, has a small fixed cost. That cost is negligible at a few calls per frame and can dominate the frame at thousands. A module that exposes a function to update one entity, called once per entity per frame, can spend more time crossing the boundary than it saves with faster math, and ends up slower overall. The wasm code is fast, but the game is bottlenecked on the traffic between the two sides.
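The arithmetic behind that claim can be sketched with a toy cost model. Every figure here (6 ms of JavaScript math, a 3x faster module, 200 ns per crossing, 20,000 crossings per frame) is a hypothetical number chosen for illustration, not a measurement, and `frameCost` is a made-up helper.

```typescript
// All figures below are hypothetical, chosen only to make the arithmetic visible.
interface FrameCost {
  computeMs: number;  // time spent doing the actual work
  boundaryMs: number; // time spent crossing between JavaScript and the module
  totalMs: number;
}

// Models a frame as compute time plus a fixed cost per boundary crossing.
function frameCost(computeMs: number, callsPerFrame: number, perCallNs: number): FrameCost {
  const boundaryMs = (callsPerFrame * perCallNs) / 1000000;
  return { computeMs, boundaryMs, totalMs: computeMs + boundaryMs };
}

const jsOnlyFrame = frameCost(6, 0, 0);          // 6 ms of math in plain JavaScript
const chattyWasmFrame = frameCost(2, 20000, 200); // math is 3x faster, 20,000 crossings
const coarseWasmFrame = frameCost(2, 1, 200);     // same math, a single crossing

console.log(jsOnlyFrame.totalMs);     // 6
console.log(chattyWasmFrame.totalMs); // 6 (2 ms compute + 4 ms boundary)
console.log(coarseWasmFrame.totalMs); // just over 2
```

Under these assumed numbers the chatty module gives back its entire 4 ms gain at the boundary, while the same code behind a single call keeps nearly all of it.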
The fix is to make the boundary coarse. Instead of many small calls, make one large call that does a batch of work. Step the entire physics simulation in a single call rather than one body at a time. Generate a whole chunk of procedural terrain per call rather than one tile. The goal is for each boundary crossing to carry a lot of work, so the fixed cost is amortized across thousands of operations inside the module. When people report that wasm gave them a big speedup, they almost always have a coarse interface like this, and when they report disappointment, a fine-grained one is usually why.
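A minimal sketch of the two interface shapes, assuming hypothetical export names `updateEntity` and `stepWorld`. A real module defines its own exports, and the stubs below are plain TypeScript stand-ins so the sketch runs without a compiled module.

```typescript
// Hypothetical export shapes; a real module defines its own names and signatures.
interface ChattyExports {
  updateEntity(index: number, dt: number): void; // one crossing per entity
}
interface CoarseExports {
  stepWorld(entityCount: number, dt: number): void; // one crossing per frame
}

// Fine-grained: the number of crossings grows with the entity count.
function frameChatty(wasm: ChattyExports, entityCount: number, dt: number): number {
  let crossings = 0;
  for (let i = 0; i < entityCount; i++) {
    wasm.updateEntity(i, dt);
    crossings++;
  }
  return crossings;
}

// Coarse: the loop over entities runs inside the module, so there is one crossing.
function frameCoarse(wasm: CoarseExports, entityCount: number, dt: number): number {
  wasm.stepWorld(entityCount, dt);
  return 1;
}

// Plain TypeScript stand-ins for the module, sharing one set of arrays.
const stubPositions = new Float32Array(1000);
const stubVelocities = new Float32Array(1000).fill(2);
const chattyStub: ChattyExports = {
  updateEntity(i, dt) {
    stubPositions[i] += stubVelocities[i] * dt;
  },
};
const coarseStub: CoarseExports = {
  stepWorld(n, dt) {
    for (let i = 0; i < n; i++) stubPositions[i] += stubVelocities[i] * dt;
  },
};

const chattyCrossings = frameChatty(chattyStub, 1000, 0.5);
const coarseCrossings = frameCoarse(coarseStub, 1000, 0.5);
console.log(chattyCrossings, coarseCrossings); // 1000 1
```

Both paths do the same arithmetic. The only difference is which side of the boundary owns the loop, and that is the design choice that decides how many times the fixed cost is paid.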
Share Memory, Do Not Copy It
The second performance lever is how data moves across the boundary. The slow way is to copy: serialize data on one side, pass it across, deserialize on the other. The fast way is to share. The module's linear memory is just an ArrayBuffer that JavaScript can view directly through typed arrays such as Float32Array. So instead of the module returning a list of positions that JavaScript then copies, the module writes positions into a region of linear memory, and JavaScript reads that same region with no copy at all. Large amounts of data move for the cost of agreeing on an offset.
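As a rough sketch of reading in place: the offset `POSITIONS_OFFSET`, the entity count, and the x,y float layout are assumptions made for this example, and a plain ArrayBuffer stands in for the module's memory so the snippet runs on its own. With a real instance, the buffer would come from the module's exported memory.

```typescript
// Anything with a `buffer` works here; a real WebAssembly.Memory has this shape.
interface LinearMemory {
  readonly buffer: ArrayBuffer;
}

// Assumed layout: x,y pairs stored as 32-bit floats starting at POSITIONS_OFFSET.
// Both sides must agree on the offset and the count.
const POSITIONS_OFFSET = 1024; // bytes; must be a multiple of 4 for Float32Array
const ENTITY_COUNT = 3;

// Creates a view over the existing bytes. No data is copied.
function viewPositions(memory: LinearMemory): Float32Array {
  return new Float32Array(memory.buffer, POSITIONS_OFFSET, ENTITY_COUNT * 2);
}

// Stand-in for the module's memory: one 64 KiB page.
const demoMemory: LinearMemory = { buffer: new ArrayBuffer(64 * 1024) };

// Stand-in for the module writing its output in place.
const moduleSide = new Float32Array(demoMemory.buffer, POSITIONS_OFFSET, ENTITY_COUNT * 2);
moduleSide.set([1, 2, 3, 4, 5, 6]);

// JavaScript side: reads the same bytes through its own view.
const jsSide = viewPositions(demoMemory);
console.log(Array.from(jsSide)); // [1, 2, 3, 4, 5, 6]

// A later write by the "module" is visible immediately, with no copy step.
moduleSide[0] = 10;
console.log(jsSide[0]); // 10
```

The two views are separate objects over the same bytes, which is why the second write shows up on the JavaScript side without any transfer.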
This zero-copy sharing is the technical heart of fast wasm games, and it is why thinking about memory layout matters. You decide where in linear memory each buffer lives, the module writes there, and the renderer or the JavaScript logic reads there. The loading article shows the mechanics of creating these typed-array views over the module's memory. The performance principle is simple: data that lives in shared memory and is read in place costs almost nothing to hand across, while data that is copied or serialized every frame can quietly consume the budget you were trying to save.
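One way to think about layout is as a small table of byte offsets that both sides use. The sketch below is only an illustration: `planLayout`, the buffer names, and the base offset of 1024 are hypothetical, and in practice many modules allocate their buffers themselves and export the resulting pointers instead. What matters is that both sides use the same numbers and that each offset is aligned to the element size of its typed array.

```typescript
// Hypothetical buffer descriptions for illustration.
interface BufferSpec {
  name: string;
  bytesPerElement: number; // 4 for Float32Array, 1 for Uint8Array, and so on
  length: number;          // number of elements
}
interface PlacedBuffer extends BufferSpec {
  byteOffset: number;
}

// Rounds n up to the next multiple of alignment (alignment must be a power of two).
function alignUp(n: number, alignment: number): number {
  return (n + alignment - 1) & ~(alignment - 1);
}

// Places buffers one after another from `base`, aligning each to its element size.
function planLayout(base: number, specs: BufferSpec[]): PlacedBuffer[] {
  let cursor = base;
  return specs.map((spec) => {
    const byteOffset = alignUp(cursor, spec.bytesPerElement);
    cursor = byteOffset + spec.bytesPerElement * spec.length;
    return { ...spec, byteOffset };
  });
}

const plannedLayout = planLayout(1024, [
  { name: "positions", bytesPerElement: 4, length: 6 },  // Float32Array
  { name: "alive", bytesPerElement: 1, length: 3 },      // Uint8Array
  { name: "velocities", bytesPerElement: 4, length: 6 }, // Float32Array
]);

console.log(plannedLayout.map((b) => `${b.name}@${b.byteOffset}`).join(", "));
// positions@1024, alive@1048, velocities@1052
```

The one byte of padding before `velocities` is there because a Float32Array view must start at a byte offset that is a multiple of 4.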
Wasm performance is won or lost at the boundary. Make few, large calls and share data through linear memory in place. A fine-grained, copy-heavy interface erases the speedup no matter how fast the module's code is.
Build Flags Matter More Than You Expect
A wasm module compiled without optimization can be several times slower than the same module compiled with optimization turned on, so the build flags are not an afterthought. With Emscripten, an optimization level of -O2 or -O3 enables the compiler optimizations that make the generated code fast, and further flags control how aggressively functions are inlined and how the module is structured. The gap between a quick debug build and a tuned release build is large enough that benchmarking a debug build and concluding that wasm is slow is a common mistake.
Memory settings also affect both performance and reliability. The module's linear memory can be fixed or allowed to grow, and growable memory has a small cost but avoids out-of-memory failures when the module needs more than you reserved. With Emscripten, growth is opt-in through the ALLOW_MEMORY_GROWTH setting. Choosing an initial size that covers normal play and allowing growth for spikes is a sensible default. On the optimization side, the Rust equivalent of -O2 or -O3 is building in release mode rather than debug, which similarly turns on the optimizations that matter, and tools like wasm-opt can shrink and further optimize the binary after compilation. Whatever the toolchain, profile the release build, never the debug one.
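Growable memory has one consequence on the JavaScript side: when a WebAssembly.Memory grows, its `buffer` property becomes a new ArrayBuffer and typed-array views created over the old one stop being usable, so views need to be rebuilt after growth. The sketch below shows one way to handle that. `FloatView` and `StandInMemory` are hypothetical names, and the stand-in only mimics the buffer swap so the example runs without a module.

```typescript
// Minimal shape shared by WebAssembly.Memory and the stand-in below.
interface GrowableMemory {
  readonly buffer: ArrayBuffer;
  grow(pages: number): number;
}

const PAGE_BYTES = 65536; // size of one WebAssembly page

// Stand-in that mimics one behavior of WebAssembly.Memory: after grow(), `buffer`
// is a different ArrayBuffer object. (A real memory also detaches the old one.)
class StandInMemory implements GrowableMemory {
  buffer: ArrayBuffer;
  constructor(initialPages: number) {
    this.buffer = new ArrayBuffer(initialPages * PAGE_BYTES);
  }
  grow(pages: number): number {
    const previousPages = this.buffer.byteLength / PAGE_BYTES;
    const next = new ArrayBuffer(this.buffer.byteLength + pages * PAGE_BYTES);
    new Uint8Array(next).set(new Uint8Array(this.buffer)); // keep existing contents
    this.buffer = next;
    return previousPages; // the real API also returns the previous size in pages
  }
}

// Caches a Float32Array view and rebuilds it whenever the underlying buffer changes.
class FloatView {
  private cached: Float32Array | null = null;
  private readonly memory: GrowableMemory;
  private readonly byteOffset: number;
  private readonly length: number;
  constructor(memory: GrowableMemory, byteOffset: number, length: number) {
    this.memory = memory;
    this.byteOffset = byteOffset;
    this.length = length;
  }
  get(): Float32Array {
    if (this.cached === null || this.cached.buffer !== this.memory.buffer) {
      this.cached = new Float32Array(this.memory.buffer, this.byteOffset, this.length);
    }
    return this.cached;
  }
}

const growableDemo = new StandInMemory(1);
const scoresView = new FloatView(growableDemo, 256, 4);
scoresView.get().set([1, 2, 3, 4]);

const viewBeforeGrow = scoresView.get();
growableDemo.grow(1);
const viewAfterGrow = scoresView.get();

console.log(viewBeforeGrow === viewAfterGrow); // false: the view was rebuilt
console.log(Array.from(viewAfterGrow));        // [1, 2, 3, 4]
```

Calling `get()` at the point of use, rather than holding a view across frames, keeps the check cheap and avoids reading through a stale view after a growth spike.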
SIMD and Threading Where They Fit
Two newer capabilities can push performance further when the workload suits them. SIMD, which stands for single instruction multiple data, lets one instruction operate on several numbers at once, which accelerates the vector math that fills physics, particle systems, and audio mixing. When your hot loop does the same arithmetic across arrays of floats, enabling SIMD in the build can deliver a meaningful additional speedup on top of regular wasm, because WebAssembly's SIMD instructions work on 128-bit vectors and so process four 32-bit floats per step instead of one.
Threading brings true parallelism through Web Workers and shared memory, letting a heavy simulation use more than one CPU core. It is powerful but comes with real friction: it requires specific HTTP response headers to enable shared memory in the browser for security reasons, and it demands careful design to coordinate work across threads safely. For many games the single-threaded module with a coarse boundary and SIMD is already fast enough, and threading is the lever you pull only when one core genuinely cannot keep up. Knowing both exist lets you reach for them deliberately rather than assuming a single-threaded module is the ceiling.
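The headers in question are Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy, which together make a page cross-origin isolated. The sketch below is a small, hypothetical helper (`missingIsolationHeaders` is not part of any library) that checks a set of response headers against the most common pair of values. It is deliberately simplified: it treats `require-corp` as the only accepted embedder policy, although browsers also accept `credentialless`.

```typescript
// The response headers that make a page cross-origin isolated, which browsers
// require before exposing SharedArrayBuffer to the page.
const ISOLATION_HEADERS: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Returns the names of isolation headers that are absent or have a different value.
// Header names are compared case-insensitively, as HTTP requires.
function missingIsolationHeaders(responseHeaders: Record<string, string>): string[] {
  const normalized = new Map(
    Object.entries(responseHeaders).map(([name, value]): [string, string] => [
      name.toLowerCase(),
      value.trim().toLowerCase(),
    ]),
  );
  return Object.keys(ISOLATION_HEADERS).filter(
    (name) => normalized.get(name.toLowerCase()) !== ISOLATION_HEADERS[name],
  );
}

const plainServer = missingIsolationHeaders({ "content-type": "text/html" });
const isolatedServer = missingIsolationHeaders({
  "cross-origin-opener-policy": "same-origin",
  "cross-origin-embedder-policy": "require-corp",
});

console.log(plainServer);    // ["Cross-Origin-Opener-Policy", "Cross-Origin-Embedder-Policy"]
console.log(isolatedServer); // []
```

In the page itself, the global `crossOriginIsolated` reports whether the headers took effect, which is a quick check to run before starting any worker that expects shared memory.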
Measure the Real Win
The discipline that ties all of this together is measurement, and it is the same discipline that governs all web game performance work. Before moving a system into wasm, profile it in JavaScript to confirm it is actually the bottleneck. After moving it, profile again to confirm the speedup is real and large enough to justify the boundary and the build step. Use the browser's profiler to see where frame time goes, and watch specifically for time spent crossing the boundary, which shows up as many small calls rather than a few large ones.
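Alongside the browser's profiler, a small timing harness can give a before-and-after number for one system. This is a minimal sketch, not a rigorous benchmark: `medianMs` is a hypothetical helper, the run counts are arbitrary defaults, and the summing loop is only a placeholder for the system being measured.

```typescript
// Runs fn several times and returns the median duration in milliseconds.
// The median is less sensitive than the mean to a single slow run.
function medianMs(fn: () => void, runs = 15, warmup = 3): number {
  for (let i = 0; i < warmup; i++) fn(); // let the engine compile and optimize first
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Placeholder workload standing in for the system you are considering moving.
const workloadData = new Float32Array(100000).fill(1.5);
let workloadSink = 0; // keeps the result live so the loop is not optimized away
function sumInJs(): void {
  let total = 0;
  for (let i = 0; i < workloadData.length; i++) total += workloadData[i];
  workloadSink = total;
}

const baselineMs = medianMs(sumInJs);
console.log(`JavaScript baseline median: ${baselineMs.toFixed(3)} ms`);
```

Running the same harness around the single coarse call into the module, on the same device and a release build, gives a number that can be compared directly with the baseline.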
The reason to insist on measurement is that intuition about wasm is often wrong. The compiled code can be three times faster while the game is no faster overall because the boundary ate the difference. Or a system you assumed was heavy turns out to be trivial, and the real cost is rendering, which wasm cannot help. Only the profiler tells you the truth for your specific game on real hardware, including the budget phones where the gap between a tuned wasm core and a struggling JavaScript one can matter most. Build the fast core, share memory, set the flags, and then prove the win with numbers rather than assuming it.