Handling Latency and Cost for LLM NPCs
Latency and cost are the two practical constraints that determine whether an LLM NPC system is viable in production. Players expect conversational response speeds, ideally seeing the first words appear within a second of their input. Developers need per-interaction costs that scale sustainably with their player base. Both constraints have proven engineering solutions.
The challenge is that these two goals often conflict. The most capable language models produce the best dialogue but are also the slowest and most expensive. Optimization is about finding the combination of techniques that delivers acceptable quality at acceptable speed and cost for your specific game. The strategies below are ordered roughly by impact, starting with the techniques that deliver the largest improvements.
Step 1: Implement Response Streaming
Response streaming is the single most impactful latency optimization. Without streaming, the player submits input and waits while the model generates the entire response, which can take 2 to 5 seconds or more depending on response length and model speed. With streaming, the game displays text as it arrives from the model, token by token, so the player starts reading almost immediately.
Most cloud APIs and local inference engines support streaming natively through server-sent events or chunked HTTP responses. The technical implementation is straightforward: instead of waiting for the complete response, the game processes each token as it arrives and appends it to the display. The result feels like watching someone type in real time, which is a natural conversational experience that players accept even when the full response takes several seconds to complete.
The perceived latency with streaming is roughly the time to first token, which is the delay between the API request and the arrival of the first output token. For cloud APIs, this is typically 200 to 800 milliseconds. For local models on a GPU, it can be as low as 50 to 150 milliseconds. Even the longer cloud delays are acceptable when the player sees text beginning to appear rather than staring at a blank response area.
Streaming does require adjustments to the response processing pipeline. Safety filtering must work on partial text or be applied sentence by sentence as complete sentences emerge from the stream, rather than waiting for the full response. The response parser must handle incomplete JSON if the model is producing structured output. These are solvable engineering challenges, and the latency improvement is substantial enough to justify the added complexity.
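As a minimal sketch of the sentence-by-sentence approach, the generator below buffers streamed text chunks and releases each sentence once it is complete, so a filter can inspect it before display. The function names, the `is_safe` callback, and the `"..."` fallback are illustrative assumptions rather than part of any particular API, and the sentence splitter is deliberately naive (it will mis-split abbreviations such as "Mr.").

```python
import re
from typing import Callable, Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (naive heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they can be cut from the stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next chunk
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


def filtered_stream(
    chunks: Iterable[str],
    is_safe: Callable[[str], bool],  # hypothetical safety check
    fallback: str = "...",
) -> Iterator[str]:
    """Pass safe sentences through and replace unsafe ones with a fallback."""
    for sentence in stream_sentences(chunks):
        yield sentence if is_safe(sentence) else fallback
```

The trade-off in this sketch is that text is displayed one sentence at a time rather than one token at a time, which adds a small delay in exchange for never showing unfiltered text.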
Step 2: Use Model Tiering
Not every NPC interaction needs the most powerful model available. Model tiering assigns different models to different interaction types based on their narrative importance, complexity, and the player's expectations for quality.
A three-tier system works well for most games. The top tier uses the best available model for story-critical NPCs and pivotal conversation moments: quest-giving dialogue, character revelations, emotionally charged exchanges. The middle tier uses a capable but smaller and faster model for general NPC conversations: questions about the world, casual interactions, merchant transactions. The bottom tier uses the smallest available model, cached templates, or rule-based responses for ambient dialogue: background chatter, simple greetings, generic reactions.
The cost difference between tiers is dramatic. A top-tier cloud model might cost 10 to 50 times more per token than a bottom-tier small model, and the bottom tier might be free if running locally. If 70% of NPC interactions fall into the bottom two tiers, the average cost per interaction drops substantially, and any quality difference is confined to the lower-stakes interactions where players are least likely to pay close attention.
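To make the arithmetic concrete, here is a small worked example. The relative costs and the traffic split are invented for illustration (real prices depend on the provider and model); the only figures taken from the text are the 10x to 50x range and the 70% share for the bottom two tiers.

```python
# Hypothetical relative cost per interaction (bottom tier = 1 unit).
TIER_COST = {"top": 30, "middle": 5, "bottom": 1}

# Hypothetical traffic split in percent: 70% lands in the bottom two tiers.
TIER_SHARE_PERCENT = {"top": 30, "middle": 40, "bottom": 30}


def blended_cost(costs, shares_percent):
    """Average cost per interaction, weighted by each tier's traffic share."""
    return sum(costs[tier] * shares_percent[tier] for tier in costs) / 100


def savings_vs_top_only(costs, shares_percent):
    """Fraction saved compared with routing every interaction to the top tier."""
    return 1 - blended_cost(costs, shares_percent) / costs["top"]
```

With these made-up numbers the blended cost is 11.3 units per interaction against 30 units for a top-tier-only setup, a reduction of roughly 62%.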
The routing logic that assigns interactions to tiers can be based on the NPC's role (main quest characters vs. background townspeople), the conversation context (first meeting vs. tenth casual greeting), or dynamic factors like whether the player is actively engaged in a quest that involves this NPC. The key is that the tier assignment feels invisible to the player, who should experience consistent quality in every interaction that matters.
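One possible shape for this routing logic is sketched below. The role labels, the field names, and the greeting threshold are all assumptions made for the example; a real game would derive them from its own NPC and quest data.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    TOP = "top"
    MIDDLE = "middle"
    BOTTOM = "bottom"


@dataclass
class Interaction:
    npc_role: str            # hypothetical labels: "main_quest", "merchant", "townsperson"
    prior_greetings: int     # how many casual greetings this NPC has already exchanged
    in_active_quest: bool    # True if the player's active quest involves this NPC
    is_ambient: bool = False # background chatter the player did not initiate


def choose_tier(ix: Interaction) -> Tier:
    """Pick a model tier from the NPC's role and the conversation context."""
    if ix.is_ambient:
        return Tier.BOTTOM
    if ix.npc_role == "main_quest" or ix.in_active_quest:
        return Tier.TOP
    # Repeated casual greetings from background characters rarely need a model.
    if ix.npc_role == "townsperson" and ix.prior_greetings >= 3:
        return Tier.BOTTOM
    return Tier.MIDDLE
```

Checking the quest flag before the greeting count means a background townsperson is promoted to the top tier as soon as an active quest involves them, which is the kind of dynamic factor described above.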
Step 3: Set Token Budgets
Token budgets cap the amount of text consumed and produced in each NPC interaction. Without budgets, a single interaction can become unexpectedly expensive if the conversation history grows large or the model generates an unusually long response. Budgets make costs predictable and prevent edge cases from blowing through spending limits.
Set a maximum input token count that includes the system prompt, conversation history, game context, and retrieved memories. When the total exceeds this limit, compress the conversation history by summarizing older messages rather than including them verbatim. A summary like "Earlier, the player asked about local herbs and you recommended visiting the forest clearing" takes far fewer tokens than the original exchange while preserving the essential information.
Set a maximum output token count to prevent the model from generating excessively long responses. For NPC dialogue, 150 to 300 tokens (roughly 100 to 200 words) is usually appropriate for a single response. Longer responses feel unnatural in conversation and cost more without adding proportional value. Most APIs accept a max_tokens parameter that enforces this limit at the model level.
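A budget check can be as simple as the sketch below. The default limits are placeholders, and the four-characters-per-token estimate is only a rough heuristic for English text; in practice the count should come from the tokenizer of the model actually in use.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TokenBudget:
    max_input_tokens: int = 2000   # hypothetical limit
    max_output_tokens: int = 300   # upper end of the range suggested above


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not exact)."""
    return 0 if not text else max(1, len(text) // 4)


def input_overflow(parts: Dict[str, str], budget: TokenBudget) -> int:
    """Return how many estimated tokens the prompt exceeds the input budget by.

    `parts` maps a label (system prompt, history, game context, memories)
    to its text. A result of 0 means the prompt fits.
    """
    total = sum(estimate_tokens(text) for text in parts.values())
    return max(0, total - budget.max_input_tokens)
```

A positive overflow is the signal to compress the conversation history before sending the request, and `max_output_tokens` is the value to pass as the API's output cap.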
Track token usage across your player base to understand cost patterns. The average cost per interaction, cost per session, and cost per player per day are the metrics that matter for budgeting. Monitor for outliers (individual NPCs or conversation patterns that consume disproportionate resources) and adjust their prompts or tier assignments accordingly.
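A minimal in-memory tracker might look like the following. The class name, the per-token price arguments, and the "three times the overall average" outlier rule are assumptions for illustration. The sketch covers only per-interaction and per-NPC figures; per-session and per-player-per-day metrics would need session and player identifiers recorded alongside each cost.

```python
from collections import defaultdict
from statistics import mean
from typing import List


class UsageTracker:
    """Records the cost of each interaction, grouped by NPC."""

    def __init__(self) -> None:
        self._costs_by_npc = defaultdict(list)

    def record(self, npc_id: str, input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
        """Store and return the cost of one interaction (prices are per token)."""
        cost = input_tokens * input_price + output_tokens * output_price
        self._costs_by_npc[npc_id].append(cost)
        return cost

    def average_cost_per_interaction(self) -> float:
        all_costs = [c for costs in self._costs_by_npc.values() for c in costs]
        return mean(all_costs) if all_costs else 0.0

    def outliers(self, factor: float = 3.0) -> List[str]:
        """NPCs whose mean cost exceeds `factor` times the overall average."""
        overall = self.average_cost_per_interaction()
        return sorted(
            npc for npc, costs in self._costs_by_npc.items()
            if overall > 0 and mean(costs) > factor * overall
        )
```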
Step 4: Implement Response Caching
Many NPC interactions follow predictable patterns. The first time a player approaches a character, the greeting is essentially the same regardless of the player's specific input. Common questions like "What do you sell?" or "Where is the inn?" have responses that vary only slightly between players. Caching these responses eliminates the need for model inference entirely, serving them instantly at essentially no additional inference cost.
The simplest caching approach pre-generates responses for known interaction patterns and stores them keyed by NPC identifier and interaction type. When a player triggers a cached interaction, the response is served from the cache without any model call. As the conversation moves beyond cached patterns into open-ended territory, the system switches to real-time generation.
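The keyed cache described here can be sketched in a few lines. `ResponseCache` and its `generate` callback are hypothetical names; the callback stands in for whatever function performs real-time model generation.

```python
from typing import Callable, Dict, Tuple


class ResponseCache:
    """Serves pre-generated responses keyed by (NPC identifier, interaction type)."""

    def __init__(self, generate: Callable[[str, str], str]) -> None:
        self._generate = generate  # hypothetical stand-in for the model call
        self._store: Dict[Tuple[str, str], str] = {}

    def preload(self, npc_id: str, interaction_type: str, response: str) -> None:
        """Store a pre-generated response for a known interaction pattern."""
        self._store[(npc_id, interaction_type)] = response

    def respond(self, npc_id: str, interaction_type: str, player_input: str) -> str:
        cached = self._store.get((npc_id, interaction_type))
        if cached is not None:
            return cached  # cache hit: no model call
        # Cache miss: fall back to real-time generation.
        return self._generate(npc_id, player_input)
```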
More sophisticated caching uses semantic similarity to determine whether a player's input is close enough to a cached query to reuse the cached response. If a player asks "where can I buy weapons?" and the cache contains a response for "do you sell swords?", the semantic similarity is high enough to serve the cached response with minor adaptation. This requires an embedding model to compute similarity scores, but embedding models are small, fast, and cheap compared to full text generation.
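The lookup side of a semantic cache can be sketched as follows. The embeddings are assumed to come from an embedding model that is not shown here, and the 0.85 threshold is an arbitrary starting point that would need tuning against real player inputs. A linear scan is fine for a handful of entries per NPC; larger caches would want a vector index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold  # assumed value; tune per game
        self._entries: List[Tuple[Sequence[float], str]] = []

    def add(self, embedding: Sequence[float], response: str) -> None:
        """Store a response alongside the embedding of the query it answers."""
        self._entries.append((embedding, response))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching cached response, or None if nothing is close enough."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```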
Step 5: Compress Prompts
Every token in the input prompt costs money on cloud APIs and consumes context window space that could be used for conversation history or model output. Prompt compression reduces input tokens without sacrificing the information the model needs to generate quality responses.
Start with the system prompt. Many character prompts contain redundant phrasing, overly verbose descriptions, or instructions that can be stated more concisely. Review your system prompts specifically for token efficiency. "You are a blacksmith. You speak plainly. You distrust magic users because your brother was cursed by one" communicates the same information as a much longer paragraph that elaborates on each point with additional context.
Conversation history compression is the other major opportunity. Instead of keeping the full text of every past message, summarize blocks of older conversation into brief recaps. A 20-message history that consumes 2,000 tokens can often be compressed to a 200-token summary that captures the key topics, decisions, and emotional beats of the conversation. The most recent 3 to 5 messages should remain in full text for immediate conversational continuity, while everything older gets summarized.
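The keep-recent, summarize-older pattern might be sketched like this. The `summarize` callback is a placeholder for whatever produces the recap (typically a call to a small model), and the summary prefix is an arbitrary choice.

```python
from typing import Callable, List


def compress_history(
    messages: List[str],
    summarize: Callable[[List[str]], str],  # hypothetical summarizer
    keep_recent: int = 4,
) -> List[str]:
    """Replace all but the most recent messages with a single summary line."""
    if len(messages) <= keep_recent:
        return list(messages)  # nothing old enough to summarize
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = "Summary of earlier conversation: " + summarize(older)
    return [summary] + recent
```

Because summarizing costs a model call of its own, it is usually worth caching the summary and extending it only when enough new messages have aged out of the recent window.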
Game context injection should also be lean. Rather than dumping the full game state into every prompt, include only the context that is relevant to the current NPC and conversation. A blacksmith does not need to know the player's diplomatic reputation with a distant faction. Selective context injection reduces tokens and can actually improve response quality by reducing the amount of irrelevant information the model has to process.
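One simple way to keep context lean is an allowlist of game-state keys per NPC role, as in the sketch below. The roles and key names are invented for the example and would come from the game's own data model.

```python
from typing import Any, Dict, Set

# Hypothetical mapping from NPC role to the game-state keys worth sending.
RELEVANT_KEYS: Dict[str, Set[str]] = {
    "blacksmith": {"player_gold", "equipped_weapon", "active_quest"},
    "diplomat": {"faction_reputation", "active_quest"},
}

# Keys every NPC receives regardless of role (also an assumption).
DEFAULT_KEYS: Set[str] = {"time_of_day", "location"}


def select_context(game_state: Dict[str, Any], npc_role: str) -> Dict[str, Any]:
    """Keep only the parts of the game state relevant to this NPC's role."""
    wanted = RELEVANT_KEYS.get(npc_role, set()) | DEFAULT_KEYS
    return {key: value for key, value in game_state.items() if key in wanted}
```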
Step 6: Pre-generate Predictable Responses
Pre-generation extends caching by proactively generating responses for interactions the game can anticipate. When the player enters a new area, the system can generate greeting responses for nearby NPCs in the background before the player initiates conversation. When a quest state changes, the system can generate reactions from relevant NPCs before the player visits them.
The key is prediction accuracy. Pre-generating responses that the player never triggers wastes resources. Focus pre-generation on high-probability interactions: NPCs the player is walking toward, characters involved in active quests, and merchants in areas the player is exploring. The game's pathfinding and quest tracking systems can inform which NPCs are likely to be engaged soon.
Pre-generated responses should be treated as warm cache entries rather than permanent ones. If the game state changes between pre-generation and the actual interaction (for example, because the player completed a quest objective between approaching an NPC and talking to them), the pre-generated response may be stale. Include a validity check that compares the game state at generation time with the current state, and regenerate in real time if the context has changed significantly.
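A validity check along these lines could be sketched as follows. The names are hypothetical, and the sketch makes one simplifying assumption: any change to the tracked state keys counts as significant. A real system might tolerate some differences.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Mapping


@dataclass
class PreGenerated:
    response: str
    state_snapshot: Dict[str, Any]  # relevant game state at generation time


def snapshot(game_state: Mapping[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Capture only the state keys that the response depends on."""
    return {key: game_state.get(key) for key in keys}


def get_response(
    entry: PreGenerated,
    game_state: Mapping[str, Any],
    regenerate: Callable[[Mapping[str, Any]], str],  # hypothetical real-time call
) -> str:
    """Serve the pre-generated response only if its snapshot is still current."""
    current = snapshot(game_state, entry.state_snapshot.keys())
    if current == entry.state_snapshot:
        return entry.response
    return regenerate(game_state)  # stale entry: fall back to live generation
```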
Response streaming reduces perceived latency dramatically by showing text as it generates. Model tiering keeps average costs low by reserving expensive models for important interactions. Token budgets make costs predictable. Caching eliminates model calls for predictable interactions, pre-generation moves the remaining predictable calls off the critical path, and prompt compression trims the tokens spent on every call that is still made. Used together, these strategies make LLM NPC dialogue fast enough and affordable enough for production games at scale.