Agent Testing: Test Plans for Tool-Using AI Systems

What do you test when the agent can choose its own steps?

The tempting answer is to test a few happy-path prompts and ship once the answers look right. That answer is not useless, but it checks only the final text, while most production risk sits in tool calls, context, and approvals.

Direct answer

Agent testing is the practice of exercising the workflow under happy paths, edge cases, adversarial prompts, broken tools, missing context, cost limits, and approval boundaries. It tests the system that surrounds the model, not just the model output.

The common mistake

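The common mistake is a test suite that checks only the final text when the production risk is tool behavior. The sharper operating question is:

Which condition proves this agent should continue, stop, retry, or ask for approval?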

Where this gate sits

Testing sits before deployment and inside the regression loop after incidents. It gives evaluation, monitoring, and governance concrete cases to enforce.

Signals to capture

Signal	What to inspect	Gate action
Happy path	Normal task with expected tools	Confirm baseline
Broken tool	Timeout, bad response, partial result	Fallback or stop
Bad context	Stale docs, malicious page, missing file	Deny or revise
Risky action	External send, deploy, delete, payment	Require approval
Cost pressure	Long context, repeated calls, wrong model	Route or cache

Running example

A publishing agent should not publish when source slots are unresolved. The test gives it a strong draft, one missing citation, and a working publish tool. The pass condition is not a polished article; it is a blocked publish with an evidence packet.
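
The running example can be written as a test. What follows is only a minimal sketch: `run_publishing_agent`, `FakePublishTool`, and the `[CITATION NEEDED]` marker are hypothetical names standing in for the real agent runtime, a recording test double, and an unresolved source slot.

```python
from dataclasses import dataclass, field

UNRESOLVED_MARKER = "[CITATION NEEDED]"  # assumed marker for an unresolved source slot


@dataclass
class FakePublishTool:
    """Working publish tool that records calls instead of publishing."""
    calls: list = field(default_factory=list)

    def publish(self, draft: str) -> str:
        self.calls.append(draft)
        return "published"


def run_publishing_agent(draft: str, publish_tool: FakePublishTool) -> dict:
    """Hypothetical stand-in for the agent runtime under test."""
    missing = draft.count(UNRESOLVED_MARKER)
    if missing:
        # Gate: block the publish and return an evidence packet.
        return {"gate": "blocked", "evidence": {"unresolved_source_slots": missing}}
    publish_tool.publish(draft)
    return {"gate": "published", "evidence": {}}


def test_blocks_publish_with_missing_citation() -> None:
    tool = FakePublishTool()
    draft = "A strong draft with one unsupported claim " + UNRESOLVED_MARKER + "."
    result = run_publishing_agent(draft, tool)
    assert result["gate"] == "blocked"
    assert result["evidence"]["unresolved_source_slots"] == 1
    assert tool.calls == []  # the working publish tool was never called


test_blocks_publish_with_missing_citation()
```

The assertions target the gate decision and the tool call log, not the quality of the draft, which is the point of the example.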

Implementation checklist

Create fixtures for tools, context, user identity, and permissions.
Test missing, stale, conflicting, and malicious context.
Test duplicate side effects when a tool fails after partial success.
Test approval screens with incomplete evidence.
Promote production failures into regression cases immediately.
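
The duplicate side effect item is the easiest one to skip, so here is a minimal sketch of it. It assumes the tool accepts an idempotency key. `FlakySendTool` and `send_with_retry` are hypothetical names, and the fake tool simulates a failure that occurs after the side effect has already been applied.

```python
class FlakySendTool:
    """Fake tool: the first call applies the side effect, then times out."""

    def __init__(self) -> None:
        self.sent = {}  # idempotency key -> message
        self.attempts = 0

    def send(self, key: str, message: str) -> str:
        self.attempts += 1
        if key not in self.sent:
            self.sent[key] = message  # side effect happens at most once per key
        if self.attempts == 1:
            raise TimeoutError("response lost after partial success")
        return "ok"


def send_with_retry(tool: FlakySendTool, key: str, message: str, max_attempts: int = 3) -> str:
    """Retry on timeout, reusing the same idempotency key each time."""
    for _ in range(max_attempts):
        try:
            return tool.send(key, message)
        except TimeoutError:
            continue
    return "stopped"


tool = FlakySendTool()
status = send_with_retry(tool, key="msg-001", message="hello")
assert status == "ok"
assert tool.attempts == 2   # one failure, one successful retry
assert len(tool.sent) == 1  # the side effect was applied only once
```

If the real tool has no idempotency mechanism, the same fixture would show two sends, and the expected gate becomes "stop" rather than "retry".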

What changes in production

In a demo, agent testing can look like a reviewer preference. In production, it has to become a branch in the agent runtime.

The branch is simple: if the system sees a normal task with the expected tools, it should confirm the baseline. If it sees a timeout, a bad response, or a partial result, it should fall back or stop. If it hits an escalation condition, such as a test that only checks final text when the production risk is tool behavior, the run should not continue as if nothing happened.

That is the difference between a content checklist and a control gate. The gate changes the next action while the run is still alive.
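
One way to picture that branch is a lookup from signal to gate action, using the rows of the signals table. This is a sketch under stated assumptions: the `Signal` enum, the `next_action` function, the priority order when several signals fire at once, and the default of stopping on no signal are all illustrative choices, not something the table prescribes.

```python
from enum import Enum


class Signal(Enum):
    HAPPY_PATH = "happy_path"
    BROKEN_TOOL = "broken_tool"
    BAD_CONTEXT = "bad_context"
    RISKY_ACTION = "risky_action"
    COST_PRESSURE = "cost_pressure"


# Gate actions taken from the "Signals to capture" table.
GATE_ACTIONS = {
    Signal.HAPPY_PATH: "confirm_baseline",
    Signal.BROKEN_TOOL: "fallback_or_stop",
    Signal.BAD_CONTEXT: "deny_or_revise",
    Signal.RISKY_ACTION: "require_approval",
    Signal.COST_PRESSURE: "route_or_cache",
}

# Assumed priority when several signals fire on the same step.
PRIORITY = [
    Signal.RISKY_ACTION,
    Signal.BAD_CONTEXT,
    Signal.BROKEN_TOOL,
    Signal.COST_PRESSURE,
    Signal.HAPPY_PATH,
]


def next_action(signals: set) -> str:
    """Return the gate action for the highest-priority signal present."""
    for signal in PRIORITY:
        if signal in signals:
            return GATE_ACTIONS[signal]
    return "stop"  # assumption: no recognized signal means the run stops


assert next_action({Signal.HAPPY_PATH}) == "confirm_baseline"
assert next_action({Signal.HAPPY_PATH, Signal.BROKEN_TOOL}) == "fallback_or_stop"
```

A test suite can then assert on the returned action for each fixture instead of reading the final text.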

What to log in the trace

test_case_id
fixture_set
expected_tool_calls
forbidden_tool_calls
expected_gate
actual_gate
assertion_result
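
As a rough sketch, these fields can be carried in one record per test run and written out as a JSON line. The field names come from the list above. The `TraceRecord` class, the `evaluate` helper, and the pass rule (gates match, every expected call appears, no forbidden call appears) are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TraceRecord:
    test_case_id: str
    fixture_set: str
    expected_tool_calls: list
    forbidden_tool_calls: list
    expected_gate: str
    actual_gate: str
    assertion_result: str = "pending"


def evaluate(record: TraceRecord, actual_tool_calls: list) -> TraceRecord:
    """Set assertion_result from the gate decision and the observed tool calls."""
    gate_ok = record.actual_gate == record.expected_gate
    expected_ok = all(call in actual_tool_calls for call in record.expected_tool_calls)
    forbidden_ok = not any(call in actual_tool_calls for call in record.forbidden_tool_calls)
    record.assertion_result = "pass" if gate_ok and expected_ok and forbidden_ok else "fail"
    return record


record = evaluate(
    TraceRecord(
        test_case_id="publish-missing-citation",
        fixture_set="strong-draft-one-missing-citation",
        expected_tool_calls=["fetch_sources"],
        forbidden_tool_calls=["publish"],
        expected_gate="blocked",
        actual_gate="blocked",
    ),
    actual_tool_calls=["fetch_sources"],
)
line = json.dumps(asdict(record))  # one JSON line per test run
assert record.assertion_result == "pass"
```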

Review packet

A reviewer, on-call owner, or future incident review should be able to answer three questions from the packet:

What evidence triggered this gate?
What action did the gate allow, deny, retry, or escalate?
What would have happened if the gate had been absent?

The packet should point directly at the trace fields above and the specific signal row that caused the decision. If the packet only says “agent requested approval” or “policy failed,” it is not yet operational evidence.

When to escalate

A test only checks final text but the production risk is tool behavior.
The agent can call live tools during tests without a sandbox.
A failed tool can create duplicate side effects on retry.
No regression case exists for a real incident.
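
The sandbox condition can be enforced in the test harness rather than left to convention. This minimal sketch assumes a hypothetical `ToolRegistry` in which each tool is registered with an `is_live` flag. The names are illustrative and do not come from a specific framework.

```python
class LiveToolBlocked(RuntimeError):
    """Raised when a test run tries to reach a live tool."""


class ToolRegistry:
    def __init__(self, sandbox: bool) -> None:
        self.sandbox = sandbox
        self._tools = {}  # name -> (callable, is_live)

    def register(self, name, fn, is_live: bool) -> None:
        self._tools[name] = (fn, is_live)

    def call(self, name, *args):
        fn, is_live = self._tools[name]
        if self.sandbox and is_live:
            raise LiveToolBlocked(f"{name} is live and cannot run in a sandboxed test")
        return fn(*args)


registry = ToolRegistry(sandbox=True)
registry.register("search_fixture", lambda query: ["doc-1"], is_live=False)
registry.register("send_email", lambda to: "sent", is_live=True)

assert registry.call("search_fixture", "agent testing") == ["doc-1"]
try:
    registry.call("send_email", "user@example.com")
    blocked = False
except LiveToolBlocked:
    blocked = True
assert blocked  # the live tool was refused inside the sandbox
```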

Frequently Asked Questions

What is agent testing?

Agent testing exercises an agent workflow across normal, edge, adversarial, and failure cases. It checks tools, context, permissions, approvals, retries, and outputs together.

How is agent testing different from prompt testing?

Prompt testing focuses on model behavior for inputs. Agent testing covers the full runtime: tools, state, retrieval, policies, side effects, approvals, costs, and traces.

What should be in a first test suite?

Start with five groups: happy path, broken tool, bad context, risky action, and cost/loop limit. Each case should define allowed tools, forbidden tools, expected gate, and final assertion.
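
A first suite can be checked for coverage before any case runs. In this sketch the five group names and the four per-case fields come from the answer above. The `AgentTestCase` class, the `missing_groups` helper, and the example tool names are hypothetical.

```python
from dataclasses import dataclass

GROUPS = ("happy_path", "broken_tool", "bad_context", "risky_action", "cost_loop_limit")


@dataclass(frozen=True)
class AgentTestCase:
    group: str
    allowed_tools: tuple
    forbidden_tools: tuple
    expected_gate: str
    final_assertion: str


def missing_groups(suite: list) -> list:
    """Return the groups that no case in the suite covers yet."""
    covered = {case.group for case in suite}
    return [group for group in GROUPS if group not in covered]


suite = [
    AgentTestCase(
        group="happy_path",
        allowed_tools=("fetch_sources", "publish"),
        forbidden_tools=(),
        expected_gate="confirm_baseline",
        final_assertion="article published once",
    ),
    AgentTestCase(
        group="risky_action",
        allowed_tools=("fetch_sources",),
        forbidden_tools=("publish",),
        expected_gate="require_approval",
        final_assertion="publish blocked with evidence packet",
    ),
]

assert missing_groups(suite) == ["broken_tool", "bad_context", "cost_loop_limit"]
```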

The Takeaway

Agent tests should fail before the customer does. Test the decisions and side effects, not just the prose.
