LLM Observability: When Basic Telemetry Stops Working

Why did the LLM look healthy while the agent did the wrong work?

The tempting answer is to treat the model call as the whole system boundary. That view is not useless, but it is too narrow to operate an agent on: a model call can succeed while the step around it fails.

[Figure: hand-drawn illustration of an agent harness split into replaceable control jobs around a model loop.]

Direct answer

LLM observability tracks model behavior such as prompts, completions, latency, token usage, errors, and quality checks. It becomes incomplete for agents when the important failure is not the model call itself, but the tool, permission, retrieval, or approval decision around it.

When this matters

  • You already log prompts and completions but still cannot explain failed workflows.
  • The application has multi-step tool use, not just one model response.
  • You need a migration path from model telemetry to agent-level traces.

Failure modes agent-level traces should catch

  • Model call passed, tool selection failed.
  • Token cost is visible, but cache misses are not tied to context layout.
  • Prompt and output are logged, but retrieved evidence and policy decisions are missing.
  • Quality checks run after the final answer, not before the risky action.
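The last failure mode, checks that run only after the final answer, can be illustrated with a gate that runs before the risky action instead. The following is a minimal sketch, not a definitive implementation: `ActionGate`, `GateResult`, `amount_under_limit`, the `issue_refund` tool name, and the limit of 100 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    allowed: bool
    reason: str


@dataclass
class ActionGate:
    """Runs checks before a risky tool call and records every decision."""
    checks: list[Callable[[dict], GateResult]]
    log: list[dict] = field(default_factory=list)

    def run(self, tool: str, args: dict, action: Callable[[dict], object]):
        for check in self.checks:
            result = check(args)
            # Record the decision whether or not the action proceeds.
            self.log.append({
                "tool": tool,
                "check": check.__name__,
                "allowed": result.allowed,
                "reason": result.reason,
            })
            if not result.allowed:
                return None  # blocked before the action ran
        return action(args)


def amount_under_limit(args: dict) -> GateResult:
    limit = 100  # hypothetical threshold, replace with your own policy
    ok = args.get("amount", 0) <= limit
    return GateResult(ok, "within limit" if ok else f"amount exceeds {limit}")


gate = ActionGate(checks=[amount_under_limit])
blocked = gate.run("issue_refund", {"amount": 250}, lambda a: "refunded")
allowed = gate.run("issue_refund", {"amount": 40}, lambda a: "refunded")
```

Because the decision is logged even when the action is blocked, the trace can explain why a tool call did not happen, not only that a final answer was produced.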

LLM vs agent observability matrix

Gate            Signal                             Action
LLM span        prompt, model, latency, tokens     Keep as base telemetry
Retrieval span  documents, scores, filters         Attach to the model call
Tool span       tool, input, output, permission    Trace every external action
Eval span       criteria, result, failure reason   Gate publish or retry
Approval span   human, decision, rationale         Preserve audit trail
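One way to make the matrix enforceable is to encode each row as a span kind with a required set of signals, then flag spans that arrive without them. This is a rough sketch under stated assumptions: `SpanKind`, `Span`, `REQUIRED_FIELDS`, and the attribute keys (for example `latency_ms` and `failure_reason`) are illustrative names derived from the matrix, not the schema of any particular tracing library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SpanKind(Enum):
    LLM = "llm"
    RETRIEVAL = "retrieval"
    TOOL = "tool"
    EVAL = "eval"
    APPROVAL = "approval"


# Signals per matrix row. The key names are assumptions for this sketch.
REQUIRED_FIELDS = {
    SpanKind.LLM: {"prompt", "model", "latency_ms", "tokens"},
    SpanKind.RETRIEVAL: {"documents", "scores", "filters"},
    SpanKind.TOOL: {"tool", "input", "output", "permission"},
    SpanKind.EVAL: {"criteria", "result", "failure_reason"},
    SpanKind.APPROVAL: {"human", "decision", "rationale"},
}


@dataclass
class Span:
    span_id: str
    kind: SpanKind
    attributes: dict
    parent_id: Optional[str] = None  # links the span to the step that caused it

    def missing_fields(self) -> set:
        """Return the required signals this span did not record."""
        return REQUIRED_FIELDS[self.kind] - set(self.attributes)


# A tool span that logged the call but not the permission decision.
tool_span = Span(
    span_id="s2",
    kind=SpanKind.TOOL,
    attributes={"tool": "issue_refund", "input": {"order": "A1"}, "output": "ok"},
    parent_id="s1",
)
missing = sorted(tool_span.missing_fields())
```

Here `missing` contains only `permission`, which is the kind of gap that prompt and completion logging alone would not reveal.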

Running example

A support agent uses the right model and returns a fluent answer, but it called the refund tool before checking customer eligibility. LLM observability says the generation was normal. Agent observability shows the missing policy gate.
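The refund example can be expressed as an ordering check over a trace: a risky tool call counts as a violation unless a passing policy check precedes it. The sketch below assumes a trace is a simple list of event dictionaries; the event shape, the tool names `issue_refund` and `check_eligibility`, and the function `find_ungated_actions` are hypothetical.

```python
def find_ungated_actions(trace, risky_tool, required_gate):
    """Return indexes of risky tool calls not preceded by a passing gate."""
    gate_passed = False
    violations = []
    for i, event in enumerate(trace):
        if event["name"] == required_gate and event.get("result") == "pass":
            gate_passed = True
        elif event["name"] == risky_tool:
            if not gate_passed:
                violations.append(i)
            gate_passed = False  # each risky call needs its own check
    return violations


# The failing run: the refund happens before eligibility is checked.
bad_trace = [
    {"kind": "llm", "name": "plan_reply"},
    {"kind": "tool", "name": "issue_refund"},
    {"kind": "tool", "name": "check_eligibility", "result": "pass"},
    {"kind": "llm", "name": "draft_reply"},
]

# The intended run: eligibility is checked first.
good_trace = [
    {"kind": "llm", "name": "plan_reply"},
    {"kind": "tool", "name": "check_eligibility", "result": "pass"},
    {"kind": "tool", "name": "issue_refund"},
    {"kind": "llm", "name": "draft_reply"},
]

bad = find_ungated_actions(bad_trace, "issue_refund", "check_eligibility")
good = find_ungated_actions(good_trace, "issue_refund", "check_eligibility")
```

Both traces contain the same LLM spans, so model-level telemetry looks identical for them; only the ordering of tool events distinguishes the run with the missing gate.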

Copy the working template

Use the LLM vs agent observability matrix above as a starting template. Replace the generic entries with your own agent names, tools, risk classes, and thresholds, then link the result to your monitoring, tracing, security, and evaluation gates.

Frequently Asked Questions

What is LLM observability?

LLM observability is visibility into prompts, outputs, token usage, latency, errors, and quality signals for model calls.

Why is agent observability different?

Agents introduce decisions outside the model call: tool selection, retrieval, permissions, approvals, retries, and workflow state. Those need their own trace fields.

Should teams build both?

Yes. Treat LLM observability as the base layer and add agent spans for tools, policy, evals, and approvals when the application becomes agentic.

The Takeaway

LLM observability tells you how the model behaved. Agent observability tells you how the system acted.
