The Multi-Agent Cost Playbook: Five Levers to Cut AI Agent Spend

Caching, batching, model tiering, context discipline, local-first — the five levers that cut a multi-agent fleet's bill, and how they compound.

TL;DR

Cost is the quiet tax on multi-agent systems: one agent's token rate × N agents, each carrying its own context, becomes a real bill fast. The good news is the biggest levers are mechanical and compound. Five of them: (1) prompt caching (reuse the stable prefix at roughly a tenth of the input price), (2) batching (~50% off non-urgent work), (3) model tiering (cheap workers, premium orchestrator), (4) context discipline (don't pay to re-read everything, N times), and (5) local-first (no platform tax on top of tokens). Pull all five and a fleet's bill can drop by more than half without losing quality.

Two earlier posts covered single levers: routing the right task to the right model and not paying twice for the same tokens. This is the unified playbook — the five levers that actually move a multi-agent fleet’s bill, why they matter more when agents run in parallel, and how they stack.

Why multi-agent makes cost urgent: a single agent’s token rate, multiplied by N agents running at once, each carrying its own context window, adds up fast. The flip side is that the levers are well-documented and composable — so the same parallelism that grows the bill is what makes disciplined optimization pay off N times over.

Lever 1 — Prompt caching: reuse the stable context

The highest-leverage move for most agents, because an agent re-sends a large stable prefix every turn: its system prompt, tool definitions, and project/codebase context. Without caching you re-pay for all of it on every call.

The mechanics are provider-documented: a cache write costs a small premium over base input (on Anthropic, ~1.25× for the short TTL), and every cache read costs roughly a tenth of the normal input price, about a 90% discount on the cached portion. Put the unchanging prefix first, cache it, and only the per-turn delta pays full rate. For a long-running agent, the stable prefix still costs something on every turn, but at about a tenth of the slope, which is the difference between linear and near-flat context cost. One caveat worth knowing: caches have a short time-to-live (five minutes by default on Anthropic, refreshed each time the cache is read), so a bursty or idle agent can lose the cache between turns and pay the write premium again. Pacing matters.
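To make the arithmetic concrete, here is a minimal cost sketch. It assumes a fixed-size stable prefix, a constant per-turn delta, and a cache hit on every turn after the first (no TTL expiry). The type and function names are hypothetical, the price is a placeholder, and the two multipliers are the figures quoted above, so check them against current provider pricing.

```typescript
// Multipliers relative to the base input price, taken from the figures above.
const CACHE_WRITE_MULTIPLIER = 1.25; // short-TTL cache write
const CACHE_READ_MULTIPLIER = 0.1;   // cache read

// Hypothetical description of one agent's run.
interface AgentRun {
  prefixTokens: number; // stable prefix: system prompt, tools, project context
  deltaTokens: number;  // new input tokens per turn (assumed constant)
  turns: number;
  pricePerMTok: number; // base input price per million tokens (placeholder)
}

// Every turn re-sends the full prefix at the base rate.
function inputCostWithoutCache(run: AgentRun): number {
  const tokens = (run.prefixTokens + run.deltaTokens) * run.turns;
  return (tokens / 1_000_000) * run.pricePerMTok;
}

// Turn 1 writes the prefix to the cache; later turns read it at a discount.
// Assumes the cache never expires between turns.
function inputCostWithCache(run: AgentRun): number {
  if (run.turns <= 0) return 0;
  const write = run.prefixTokens * CACHE_WRITE_MULTIPLIER;
  const reads = run.prefixTokens * CACHE_READ_MULTIPLIER * (run.turns - 1);
  const deltas = run.deltaTokens * run.turns;
  return ((write + reads + deltas) / 1_000_000) * run.pricePerMTok;
}
```

With a 20,000-token prefix, 500 new tokens per turn, 50 turns, and a placeholder price of $3 per million tokens, the sketch gives roughly $3.08 uncached versus roughly $0.44 cached. A real agent's history grows each turn and caches sometimes expire, so treat this as a bound on the best case rather than a forecast.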

Lever 2 — Batching: trade latency for ~50%

Provider batch APIs run requests asynchronously (results within ~24h) for about 50% off input and output, with no quality difference, only timing. The agent insight is that a lot of fleet work isn’t latency-sensitive at all: overnight reviews, bulk file analysis, scheduled audits. That work is batchable by design. Caching and batching also stack: on Anthropic the two discounts can be combined on eligible workloads, although cache hits inside a batch are best-effort rather than guaranteed.
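One way to act on this is to sort fleet work by deadline before dispatching it. The sketch below is illustrative only: the task shape, the 24-hour turnaround constant, and the function names are assumptions, and it does not call any provider's batch API.

```typescript
// Hypothetical task shape: each task knows the latest time it can finish.
interface FleetTask {
  id: string;
  deadlineMs: number; // epoch milliseconds
}

// Assumed worst-case batch turnaround (~24h, per the figure above).
const BATCH_TURNAROUND_MS = 24 * 60 * 60 * 1000;

// Tasks that can wait a full batch turnaround go to the batch queue;
// everything else stays on the normal real-time path.
function partitionByUrgency(tasks: FleetTask[], nowMs: number) {
  const batch: FleetTask[] = [];
  const realtime: FleetTask[] = [];
  for (const task of tasks) {
    if (task.deadlineMs - nowMs >= BATCH_TURNAROUND_MS) {
      batch.push(task);
    } else {
      realtime.push(task);
    }
  }
  return { batch, realtime };
}

// Estimated spend once the batch-eligible portion gets its discount.
// The 0.5 default mirrors the ~50% figure above.
function estimateSpend(
  realtimeCost: number,
  batchEligibleCost: number,
  batchDiscount = 0.5,
): number {
  return realtimeCost + batchEligibleCost * (1 - batchDiscount);
}
```

The design choice is to key on the deadline rather than the task type: an "overnight review" with a one-hour deadline still belongs on the real-time path.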

Lever 3 — Model tiering: cheap workers, premium orchestrator

This is the structural multi-agent win. The price spread between a small model and a frontier model is enormous, on the order of 10–100× per token, so running every agent on the biggest model is pure waste. The pattern: route the routine majority to a cheap model and reserve the frontier model for the hard minority (and for the orchestrator that has to reason about the whole job). Most agent work is routine, so this lever alone can cut spend by more than half without hurting quality, provided the hard tasks really do get escalated. The earlier post on routing the right task to the right model makes the full argument.

This is a concrete feature in Munder Difflin, not just advice: the harness’s per-agent model selection (HarnessConfig.defaultModel in src/main/config.ts) lets you assign a Haiku-class model to worker agents and an Opus-class model to the GOD orchestrator today — the cheap-workers/premium-lead pattern configured directly.
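The routing rule and its effect on the bill can be sketched in a few lines. Everything here is hypothetical and for illustration: the `Tier` and `AgentTask` types, the `hard` flag, and both functions are assumptions, and none of it is the actual shape of HarnessConfig.

```typescript
type Tier = "small" | "frontier";

// Hypothetical task description; `hard` stands in for whatever
// difficulty signal a real router would use.
interface AgentTask {
  role: "worker" | "orchestrator";
  hard: boolean;
}

// Default lean: workers get the small model unless the task is flagged hard.
// The orchestrator always gets the frontier model.
function pickTier(task: AgentTask): Tier {
  if (task.role === "orchestrator") return "frontier";
  return task.hard ? "frontier" : "small";
}

// Spend relative to running everything on the frontier model.
// routineShare: fraction of tokens handled by the small model (0..1).
// priceRatio: frontier price divided by small-model price (e.g. 10).
function blendedCostRatio(routineShare: number, priceRatio: number): number {
  return routineShare / priceRatio + (1 - routineShare);
}
```

As a worked case under these assumptions: if 80% of tokens are routine and the frontier model costs 10× the small one, `blendedCostRatio(0.8, 10)` is about 0.28, meaning the fleet pays a little over a quarter of the all-frontier bill.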

Lever 4 — Context discipline: don’t pay to re-read everything, N times

The multiplier nobody budgets for: every agent’s context window is a recurring per-turn cost, and in a fleet you pay it N times over. Keeping each agent’s context scoped to only what its task needs, summarizing or compacting long histories, and not broadcasting full shared state into every agent’s window are direct cost levers — not just hygiene. Cheaper recall is another reason a hive leans on compact, markdown-first memory and semantic recall instead of dragging everything back into context.
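A minimal sketch of history compaction, assuming a caller-supplied `summarize` function (in practice probably a call to a cheap model) and a crude characters-per-token estimate. The names and the four-characters-per-token heuristic are assumptions, not a real tokenizer or any particular harness's API.

```typescript
interface Message {
  role: "user" | "assistant";
  text: string;
}

// Rough heuristic only: about four characters per token. Not a real tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Total estimated tokens an agent would re-send for this history.
function historyTokens(history: Message[]): number {
  return history.reduce((sum, m) => sum + estimateTokens(m.text), 0);
}

// Keep the most recent turns verbatim and replace everything older
// with a single summary message.
function compactHistory(
  history: Message[],
  keepRecent: number,
  summarize: (older: Message[]) => string,
): Message[] {
  const keep = Math.max(0, keepRecent);
  if (history.length <= keep) return history;
  const cut = history.length - keep;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  const summary: Message = {
    role: "user",
    text: `Summary of earlier turns: ${summarize(older)}`,
  };
  return [summary, ...recent];
}
```

Comparing `historyTokens` before and after compaction gives a quick read on what each agent stops re-paying per turn, which is then multiplied by the number of agents in the fleet.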

Lever 5 — Local-first: remove the platform tax

The structural one. Cloud agent platforms often charge per-seat or orchestration fees on top of the model token cost. A local-first hive pays only for model tokens — the orchestration, the files, the audit log all run on your machine for free. Pair that with an in-app usage view (observability) and you can watch the bill climb in real time and apply levers 1–4 exactly where they bite. Predictable, attributable cost is part of the local-first case.
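Attribution is the part of a usage view that makes the other levers targetable. As a rough sketch, per-agent cost can be rolled up from usage events like this. The event and price shapes are hypothetical, and this is not a description of how any particular usage view is implemented.

```typescript
// Hypothetical usage record emitted after each model call.
interface UsageEvent {
  agentId: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices per million tokens.
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Sum the cost of every event under the agent that incurred it.
function costByAgent(events: UsageEvent[], price: Price): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const e of events) {
    const cost =
      (e.inputTokens * price.inputPerMTok + e.outputTokens * price.outputPerMTok) /
      1_000_000;
    totals[e.agentId] = (totals[e.agentId] ?? 0) + cost;
  }
  return totals;
}
```

Sorting the result by cost shows which agents dominate the bill, and therefore where caching, tiering, or tighter context is likely to pay off first.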

They compound — that’s the whole point

These aren’t five alternatives; they’re five multipliers you apply at once. Cache the stable prefix, batch the non-urgent work, tier the models, scope the context, and skip the platform tax. Each one bends the curve, and together they turn “we can’t afford a fleet” into “the fleet costs less than the one over-powered agent it replaced.” And remember the real denominator: not cost per token but cost per completed task — because a “cheap per token” agent that needs three retries to finish isn’t cheap at all.
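Both ideas in this section reduce to short formulas. The sketch below assumes each lever can be expressed as the fraction of spend that remains after applying it, and that retries are independent with a fixed success rate, so the expected number of attempts is 1 / successRate. The function names and example numbers are illustrative.

```typescript
// Each factor is the fraction of spend remaining after one lever
// (1 means no saving, 0.5 means the lever halves the bill).
function combinedFactor(factors: number[]): number {
  return factors.reduce((acc, f) => acc * f, 1);
}

// Expected cost to get one finished task, assuming independent attempts
// that each succeed with probability successRate.
function costPerCompletedTask(costPerAttempt: number, successRate: number): number {
  if (successRate <= 0 || successRate > 1) {
    throw new RangeError("successRate must be in (0, 1]");
  }
  return costPerAttempt / successRate;
}
```

For example, an agent that costs $0.10 per attempt but succeeds one time in three costs about $0.30 per completed task, which is more than an agent that costs $0.25 per attempt and succeeds every time. One caution on `combinedFactor`: headline discounts apply to different slices of the bill (caching only to the cached prefix, batching only to eligible work), so the factors fed in should be measured whole-bill effects, not the advertised percentages.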

FAQ

What’s the single biggest lever? Prompt caching, for most agents — the stable prefix (system prompt, tools, context) gets written once and read at ~a tenth the price, so only the per-turn delta pays full rate.

Do these levers stack? Yes — they’re multiplicative. Caching cuts the repeated prefix, batching halves eligible non-urgent work, tiering moves the routine majority to a cheap model, context discipline shrinks each window, and local-first removes the platform fee. A fleet applies several at once.

Does cheaper hurt quality? Only if you route the wrong tasks to a small model. Default lean, escalate the genuinely hard minority — match difficulty to tier rather than flattening everything onto one model.

The bottom line

A multi-agent fleet doesn’t have to be expensive; it has to be engineered. The five levers are mechanical, documented, and compounding: caching, batching, tiering, context discipline, and local-first. Pull them together and you spend less and stop paying frontier prices for routine work.

Munder Difflin is built for this: per-agent model selection, local-first execution, and a usage view you can actually watch. Download Munder Difflin to run a fleet that doesn’t bankrupt you — it’s free and open source.

Sources: Anthropic — Prompt caching; Anthropic — Pricing (batch & caching).

FAQ

What's the single biggest lever for cutting AI agent costs?

Prompt caching, for most agents. An agent re-sends the same large stable prefix — system prompt, tool definitions, project context — on every turn. With caching, that prefix is written once and then read at roughly a tenth of the normal input price, so only the per-turn delta pays full rate. For a long-running agent that's the difference between linear and near-flat context cost.

Do these levers stack?

Yes, and that's the point. Caching cuts the cost of the repeated prefix, batching halves eligible non-urgent work, model tiering moves the routine majority to a cheap model, context discipline shrinks what each agent carries, and local-first removes the platform fee on top of tokens. They're multiplicative, not either-or — a fleet applies several at once.

Does using cheaper models hurt quality?

Only if you route the wrong tasks to them. Most agent work is routine and a small model handles it fine; you reserve the frontier model for the genuinely hard minority. The skill is matching task difficulty to model tier — default lean, escalate on signal — not flattening everything onto one model.

How does a multi-agent harness help with cost specifically?

It gives you the structural levers in one place: per-agent model selection (cheap workers, premium orchestrator), a way to batch non-urgent fleet work, and — if it's local-first — no per-seat platform tax on top of model tokens. You pay for tokens, not for orchestration.