# AI Agent Monitoring: Metrics, Logs, and Stop Conditions

Why did the agent fail if every API call returned 200?

The tempting answer is to monitor uptime, latency, and error rate like any other backend service. Those signals are still worth keeping, but they are too coarse to operate an agent: they report whether requests succeeded, not whether the agent did what it was asked to do.

![Generated hand-drawn illustration of agent session state, turn logs, checkpoints, and approval paths.](/assets/agent-harness-architecture-15-jobs/04-session-state.png)

## Direct answer

AI agent monitoring is the practice of tracking agent turns, tool calls, model latency, token cost, retries, loops, policy decisions, and final outcomes. It matters because agent failures often look like successful requests unless the monitor knows what the agent was trying to do.

## When this matters

- A workflow can complete with the wrong output and no exception.
- The agent uses tools repeatedly, retries silently, or streams partial progress to users.
- Cost, latency, approval, and quality need to be managed per turn instead of per endpoint.

## Failure modes to catch

- Looping: the agent calls the same tool until budget is exhausted.
- Silent drift: answer quality drops while uptime stays green.
- Tool mismatch: the agent uses a safe tool for the wrong job.
- Cache regression: stable context moves and cost rises without a product change.
- Approval escape: risky work completes without hitting the human gate.
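Approval escape is one of the easier failure modes above to check mechanically. The following is a minimal Python sketch, assuming each tool action is logged with a risk flag and the policy decision it received. `ToolAction`, its fields, `approval_escapes`, and the tool names are hypothetical and not part of any SDK.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolAction:
    tool: str
    risky: bool                        # assumed to be set by your own policy layer
    policy_decision: Optional[str]     # "allow", "deny", "approval", or None if never evaluated
    approved_by: Optional[str] = None  # reviewer id recorded at the human gate
    completed: bool = False


def approval_escapes(actions: List[ToolAction]) -> List[ToolAction]:
    """Return risky actions that completed without a recorded human approval."""
    return [
        a for a in actions
        if a.risky
        and a.completed
        and not (a.policy_decision == "approval" and a.approved_by)
    ]


actions = [
    ToolAction("search_docs", risky=False, policy_decision="allow", completed=True),
    ToolAction("send_refund", risky=True, policy_decision="approval",
               approved_by="ops-1", completed=True),
    ToolAction("delete_record", risky=True, policy_decision="allow", completed=True),
]
escaped = approval_escapes(actions)
print([a.tool for a in escaped])
```

A non-empty result is a reasonable candidate for a paging alert, since the risky work has already happened by the time the check runs.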

## Monitoring runbook

| Gate | Signal | Action |
|---|---|---|
| Turn status | done, failed, paused, budget_stopped | Alert on unknown or stale states |
| Tool-call rate | calls per turn and repeat calls | Stop the turn when repeat calls exceed a threshold |
| Cost meter | input, output, cache-read, and cache-write tokens | Alert on cost-per-turn spikes |
| Policy result | allow, deny, approval | Block actions that have no policy decision |
| Outcome signal | eval pass, user accept, publish verify | Fail closed when the outcome signal is missing |
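The cost meter row can be sketched as a per-turn cost calculation plus a spike check against recent turns. This is an illustration under stated assumptions: the prices in `PRICE_PER_MTOK` are placeholders rather than any provider's real rates, and the usage keys, the `factor` of 3, and the median baseline are hypothetical choices.

```python
from statistics import median

# Placeholder prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {
    "input": 3.00,
    "output": 15.00,
    "cache_read": 0.30,
    "cache_write": 3.75,
}


def turn_cost(usage: dict) -> float:
    """Cost of one turn in USD from its four token counters."""
    total = sum(PRICE_PER_MTOK[kind] * usage.get(kind, 0) for kind in PRICE_PER_MTOK)
    return total / 1_000_000


def is_cost_spike(cost: float, recent_costs: list, factor: float = 3.0,
                  min_history: int = 5) -> bool:
    """True when this turn costs more than `factor` times the recent median."""
    if len(recent_costs) < min_history:
        return False  # not enough history to define a baseline
    return cost > factor * median(recent_costs)


usage = {"input": 40_000, "output": 2_000, "cache_read": 0, "cache_write": 0}
cost = turn_cost(usage)
spike = is_cost_spike(cost, recent_costs=[0.02, 0.03, 0.02, 0.025, 0.02])
```

Tracking the four counters separately is what makes a cache regression visible: the same prompt shows up as input tokens instead of cache-read tokens, and cost per turn rises even though the product did not change.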

```schema
monitoring_event:
  turn_id: required
  agent_id: required
  state: queued | running | paused | done | failed | budget_stopped
  tool_calls: count
  repeated_tool_calls: count
  total_tokens: number
  cache_read_tokens: number
  policy_decisions: allow | deny | approval
  eval_result: pass | fail | not_run
  stop_condition: none | loop | budget | policy | eval | timeout
```
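One way to make the schema enforceable is to mirror it in code and check each event before it is accepted. The sketch below is an assumption-laden illustration: the field names follow the schema above, while `MonitoringEvent`, `gate_violations`, and the choice to treat an empty `policy_decisions` as "no decision recorded" are hypothetical. For brevity it uses only `eval_result` as the outcome signal.

```python
from dataclasses import dataclass
from typing import List

STATES = {"queued", "running", "paused", "done", "failed", "budget_stopped"}
POLICY_DECISIONS = {"allow", "deny", "approval"}


@dataclass
class MonitoringEvent:
    turn_id: str
    agent_id: str
    state: str
    tool_calls: int = 0
    repeated_tool_calls: int = 0
    total_tokens: int = 0
    cache_read_tokens: int = 0
    policy_decisions: str = ""   # empty means no decision was recorded
    eval_result: str = "not_run"
    stop_condition: str = "none"


def gate_violations(event: MonitoringEvent) -> List[str]:
    """List the reasons this event should block or alert; empty means it passes."""
    problems = []
    if not event.turn_id or not event.agent_id:
        problems.append("missing required id")
    if event.state not in STATES:
        problems.append("unknown state")
    if event.policy_decisions not in POLICY_DECISIONS:
        problems.append("missing policy decision")
    # Fail closed: a finished turn without a passing eval is not a success.
    if event.state == "done" and event.eval_result != "pass":
        problems.append("outcome not verified")
    return problems


ok = MonitoringEvent("t1", "a1", "done", policy_decisions="allow", eval_result="pass")
bad = MonitoringEvent("t2", "a1", "done")
print(gate_violations(ok), gate_violations(bad))
```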

## Running example

The monitor sees a turn with 14 repeated search calls, rising token cost, and no new evidence objects. It flags the turn as a loop, stops the run, preserves the trace, and asks for a narrower query instead of letting the agent spend another ten minutes.
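The decision in this example can be sketched as a check on the trailing run of calls that added nothing. This is a rough illustration, assuming each step records how many new evidence objects it produced; `Step`, `new_evidence`, and the `max_unproductive` threshold of 5 are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    tool: str
    new_evidence: int  # evidence objects this call added (assumed to be logged)
    tokens: int


def should_stop_for_loop(steps: List[Step], max_unproductive: int = 5) -> bool:
    """True when the most recent calls to one tool produced no new evidence."""
    if not steps:
        return False
    last_tool = steps[-1].tool
    run = 0
    # Walk backwards until the tool changes or a call yields something new.
    for step in reversed(steps):
        if step.tool != last_tool or step.new_evidence > 0:
            break
        run += 1
    return run >= max_unproductive


# One useful search followed by 14 searches that add nothing.
steps = [Step("search", new_evidence=2, tokens=900)]
steps += [Step("search", new_evidence=0, tokens=1200) for _ in range(14)]
stop = should_stop_for_loop(steps)
```

Counting unproductive calls rather than raw repeats avoids stopping an agent that is searching many times but still finding new material.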

## Put it to work

Use the monitoring runbook above as the first version of your production gate. Fill in your own agent names, tools, risk classes, thresholds, and approval rules. Then wire it into tracing, alerting, security review, evaluation, and human approval so it changes runtime behavior instead of sitting in a doc.

## Related control gates

- [AI Agent Control Gates: Stop Bad Agents Before They Act](/agent-control-gates/)
- [Agent Observability: Trace What Agents Decide and Do](/agent-control-gates/agent-observability/)
- [Agent Tracing: A Practical Schema for Tool-Using AI](/agent-control-gates/agent-tracing/)
- [AI Agent Evaluation: Gates That Catch Bad Behavior](/agent-control-gates/ai-agent-evaluation/)
- [Prompt Caching: Cut Agent Cost Without Breaking Quality](/agent-control-gates/prompt-caching/)

## Frequently Asked Questions

### What should AI agent monitoring include?

AI agent monitoring should include turn state, tool calls, model latency, token cost, cache usage, policy decisions, retry behavior, eval results, and final outcome verification.

### How is monitoring different from observability?

Monitoring tells you when a signal crossed a threshold. Observability gives you enough trace detail to explain why the threshold was crossed and what the agent did next.

### What is the first stop condition to add?

Add loop and budget stops first. They are easy to measure and prevent agents from turning a small ambiguity into repeated tool calls and uncontrolled cost.
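A budget stop can be as small as a counter that the agent loop charges on every model or tool call. The sketch below assumes the loop is yours to modify; `TurnBudget`, `BudgetExceeded`, and the limits shown are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a turn crosses its token or tool-call budget."""


class TurnBudget:
    def __init__(self, max_tokens: int, max_tool_calls: int) -> None:
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens = 0
        self.tool_calls = 0

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record usage, then stop the turn if either limit is exceeded."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exceeded")


budget = TurnBudget(max_tokens=50_000, max_tool_calls=20)
state = "running"
try:
    for _ in range(30):  # stand-in for the agent loop
        budget.charge(tokens=1_500, tool_calls=1)
except BudgetExceeded:
    state = "budget_stopped"  # record the stop instead of failing silently
```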

## The Takeaway

Monitoring is not the dashboard. Monitoring is the set of signals that can stop the agent before a normal-looking request becomes an expensive wrong answer.

## Sources

- [OpenAI Agents SDK tracing](https://openai.github.io/openai-agents-python/tracing/)
- [OpenAI Agents SDK guardrails](https://openai.github.io/openai-agents-python/guardrails/)
- [OpenTelemetry AI agent observability](https://opentelemetry.io/blog/2025/ai-agent-observability/)