# Prompt Caching: Cut Agent Cost Without Breaking Quality

Why did the same agent get more expensive after a harmless prompt edit?

The tempting response is to shop for a cheaper model before checking whether a context-layout change broke the cache. A cheaper model may lower the bill, but it does not explain why the cost of an unchanged workload moved. Prompt caching reuses stable prompt prefixes so repeated agent turns can avoid reprocessing the same context, which makes the order of that context an operational concern rather than a pricing detail.

```query
prompt caching
```

![Generated hand-drawn illustration of stable prompt context separated from volatile agent turn data.](/assets/agent-harness-architecture-15-jobs/02-cache-layout.png)

## Direct answer

Prompt caching reuses stable prompt prefixes so repeated agent turns can avoid reprocessing the same context. For agents, it is not just a pricing trick. It is a context-layout discipline: stable rules and tools stay early, volatile turn data stays late, and evals protect quality.

## When this matters

- Agents carry long system instructions, tool definitions, memory, policies, or source packs.
- Cost per turn rises after prompt, tool, or memory changes.
- You need to reduce cost without changing task success or safety results.

## Failure modes to catch

- Timestamps or scratch notes enter the stable prefix and break cache reuse.
- Tool definitions reorder every turn.
- A cache improvement hides a quality regression.
- Teams optimize cost per turn without tracking eval pass rate.
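
The first two failure modes can be guarded against in the context builder itself. The sketch below is one possible layout, not a provider API: `build_stable_prefix`, `build_volatile_suffix`, and `prefix_fingerprint` are hypothetical helper names, and the tool dictionaries are made-up examples. It assumes the tools can be safely sorted by name, which is worth confirming against your eval suite.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_stable_prefix(system_rules: str, tools: list[dict]) -> str:
    """Serialize rules and tools the same way on every turn."""
    # Sort tools by name so upstream ordering changes do not alter the prefix.
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    # sort_keys and fixed separators keep the JSON text stable across runs.
    tools_json = json.dumps(ordered_tools, sort_keys=True, separators=(",", ":"))
    return f"{system_rules}\n\nTOOLS:\n{tools_json}"


def build_volatile_suffix(user_request: str, now: datetime) -> str:
    """Keep the timestamp and the current request out of the prefix."""
    return f"Current time: {now.isoformat()}\nUser request: {user_request}"


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix so a monitor can spot unintended changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


# Hypothetical rules and tools, used only for illustration.
rules = "You are a coding agent. Follow the output contract."
tools_a = [
    {"name": "run_tests", "description": "Run the test suite"},
    {"name": "read_file", "description": "Read a file"},
]
tools_b = list(reversed(tools_a))  # same tools, different order

turn_1 = build_stable_prefix(rules, tools_a)
turn_2 = build_stable_prefix(rules, tools_b)
suffix = build_volatile_suffix("Fix the failing test", datetime.now(timezone.utc))

# The fingerprints match even though the tool order differed.
print(prefix_fingerprint(turn_1) == prefix_fingerprint(turn_2))
```

Logging the fingerprint with each turn gives the monitor a cheap signal: if it changes between turns that should share a prefix, something volatile has leaked into the stable section.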

## Prompt-cache decision table

| Gate | Signal | Action |
|---|---|---|
| Stable prefix | system rules, policy, tool definitions | Keep deterministic |
| Volatile suffix | current user turn, live data, timestamps | Move late |
| Cache metrics | read tokens, write tokens, miss rate | Track per turn |
| Quality guardrail | eval pass rate | Do not count a cost win if quality drops |
| Regression check | prompt diff and cache read drop | Bisect context builder |
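
As a rough illustration of the "Cache metrics" row, the sketch below computes a per-turn miss rate from token counts. The `TurnUsage` fields are hypothetical names; map them to whatever your provider reports. Defining miss rate as the share of input tokens not read from cache is an assumption, and so is the 0.2 alert threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnUsage:
    """Token counts for one agent turn (hypothetical field names)."""
    cache_read_tokens: int
    cache_write_tokens: int
    uncached_input_tokens: int

    @property
    def total_input_tokens(self) -> int:
        return (
            self.cache_read_tokens
            + self.cache_write_tokens
            + self.uncached_input_tokens
        )

    @property
    def miss_rate(self) -> float:
        """Share of input tokens that were not served from cache."""
        total = self.total_input_tokens
        if total == 0:
            return 0.0
        return 1.0 - self.cache_read_tokens / total


def cache_read_dropped(
    previous: TurnUsage, current: TurnUsage, max_increase: float = 0.2
) -> bool:
    """Flag a turn whose miss rate rose by more than max_increase."""
    return current.miss_rate - previous.miss_rate > max_increase


# Made-up numbers: a warm turn followed by a turn that rewrote the cache.
warm = TurnUsage(cache_read_tokens=9000, cache_write_tokens=0, uncached_input_tokens=1000)
cold = TurnUsage(cache_read_tokens=0, cache_write_tokens=9000, uncached_input_tokens=1000)

print(cache_read_dropped(warm, cold))
```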

```compare
put in stable prefix:
  tool definitions, durable policy, output contract, long-lived project context

put in volatile suffix:
  timestamps, current user request, fresh tool results, scratch notes, temporary facts

ship only if:
  cost_per_turn_down = true
  eval_pass_rate_not_down = true
  cache_read_tokens_up_or_stable = true
```
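
The three ship conditions can be expressed as a small gate function. This is a minimal sketch under stated assumptions: `RunSummary` and `should_ship` are hypothetical names, the numbers are invented, and a real gate would probably allow a noise tolerance on each comparison rather than strict inequalities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregates for one prompt version over the same eval workload."""
    cost_per_turn: float
    eval_pass_rate: float
    cache_read_tokens: int


def should_ship(baseline: RunSummary, candidate: RunSummary) -> bool:
    """Ship only if cost fell while quality and cache reads held."""
    cost_down = candidate.cost_per_turn < baseline.cost_per_turn
    quality_held = candidate.eval_pass_rate >= baseline.eval_pass_rate
    cache_held = candidate.cache_read_tokens >= baseline.cache_read_tokens
    return cost_down and quality_held and cache_held


# Illustrative values only.
baseline = RunSummary(cost_per_turn=0.040, eval_pass_rate=0.92, cache_read_tokens=8000)
cheaper_and_safe = RunSummary(cost_per_turn=0.025, eval_pass_rate=0.92, cache_read_tokens=9000)
cheaper_but_worse = RunSummary(cost_per_turn=0.020, eval_pass_rate=0.85, cache_read_tokens=9500)

print(should_ship(baseline, cheaper_and_safe))
print(should_ship(baseline, cheaper_but_worse))
```

The second candidate is cheaper and caches better, yet the gate rejects it because the eval pass rate fell, which is the case the quality guardrail exists to catch.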

## Running example

A coding agent's cost jumps after a prompt cleanup. The monitor shows that cache-read tokens dropped. The diff reveals a timestamp added near the top of the system block, which changes the prefix on every turn. Moving the timestamp to the turn message restores cache reuse, and the eval suite passes at the same rate as before.
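
One way to locate this kind of regression is to compare the ordered blocks of the old and new prompts and report the first one that differs, since everything from that point on can no longer match the cached prefix. The sketch below uses a simple linear scan. `first_divergent_block` and the `(name, text)` block layout are assumptions for illustration, and the timestamp is modeled as its own block for simplicity.

```python
from typing import Optional

Block = tuple[str, str]  # (block name, rendered text)


def first_divergent_block(
    old: list[Block], new: list[Block]
) -> Optional[tuple[int, str]]:
    """Return (index, name) of the first differing block, or None if identical."""
    for index, (old_block, new_block) in enumerate(zip(old, new)):
        if old_block != new_block:
            return index, new_block[0]
    if len(old) != len(new):
        # One prompt is a strict prefix of the other; the extra block diverges.
        longer = old if len(old) > len(new) else new
        index = min(len(old), len(new))
        return index, longer[index][0]
    return None


# Invented block contents that mirror the running example.
before = [
    ("system_rules", "Follow the output contract."),
    ("tools", "[read_file, run_tests]"),
    ("user_turn", "Fix the failing test"),
]
after_cleanup = [
    ("system_rules", "Follow the output contract."),
    ("timestamp", "Current time: 2024-05-01T12:00:00Z"),
    ("tools", "[read_file, run_tests]"),
    ("user_turn", "Fix the failing test"),
]
after_fix = [
    ("system_rules", "Follow the output contract."),
    ("tools", "[read_file, run_tests]"),
    ("user_turn", "Current time: 2024-05-01T12:00:00Z\nFix the failing test"),
]

print(first_divergent_block(before, after_cleanup))
print(first_divergent_block(before, after_fix))
```

After the cleanup, divergence starts at the timestamp block, ahead of the tool definitions. After the fix, it starts only at the user turn, which is volatile anyway.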

## Put it to work

Use the prompt-cache decision table above as the first version of your production gate. Replace its generic signals with your own prefix contents, metric names, and thresholds, and decide who approves a change that fails a gate. Then wire it into traces, monitoring, security review, evaluation, and human approval so it changes runtime behavior instead of sitting in a doc.

## Related control gates

- [AI Agent Control Gates: Stop Bad Agents Before They Act](/agent-control-gates/)
- [AI Agent Monitoring: Metrics, Logs, and Stop Conditions](/agent-control-gates/ai-agent-monitoring/)
- [Agent Tracing: A Practical Schema for Tool-Using AI](/agent-control-gates/agent-tracing/)
- [Agent Observability: Trace What Agents Decide and Do](/agent-control-gates/agent-observability/)
- [AI Agent Evaluation: Gates That Catch Bad Behavior](/agent-control-gates/ai-agent-evaluation/)

## Frequently Asked Questions

### What is prompt caching?

Prompt caching lets repeated requests reuse stable prompt content instead of reprocessing the same tokens every turn. It is most useful when long instructions, tools, or context stay consistent.

### How does prompt caching help agents?

Agents often repeat large policies, tool definitions, and project context. Caching those stable sections can reduce cost and latency, but only if volatile data does not break the prefix.

### What is the quality risk?

A cheaper cached prompt can still be worse if context layout hides fresh evidence or changes tool behavior. Treat eval pass rate as the guardrail metric.

## The Takeaway

Prompt caching belongs in the control layer because cost reduction only counts when correctness stays intact.

## Sources

- [Google Cloud prompt caching for Claude models](https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/partner-models/claude/prompt-caching)
- [OpenAI Agents SDK tracing](https://openai.github.io/openai-agents-python/tracing/)
- [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/)