AI Agent Backend Testing: The Pyramid We Use at Toucan


We broke our first AI agent backend in production on a Tuesday.

Not because the model gave a wrong answer. Because the system around it wasn't testable. A prompt change we considered minor cascaded through three tool calls, bypassed a guardrail, and returned corrupted output to a live user. We had no test to catch it, and no trace to explain it.

That's when we stopped treating our agents as a special case and started treating them like any other critical backend service: with contracts, layered tests, and proper observability.

In Toucan's AI layer, each user question triggers a chain of tool calls across our semantic layer, metric library, and data sources. One request can spawn multiple sub-agent hops before returning a visualization. At that level of complexity, "run it and see" isn't a testing strategy -- it's a liability.

Here's the 3-level testing pyramid we now use in production -- and how we'd structure it if we were starting from scratch today.

What is an AI agent testing pyramid?

An AI agent testing pyramid is a layered testing strategy that separates deterministic backend logic from non-deterministic model outputs. It typically has three levels: unit and contract tests for routing, state, and tool handlers; integration tests that drive the orchestrator with fake model outputs; and scenario replays that re-run recorded real conversations against new code or prompts. The goal is to make everything around the model testable and predictable, even if the model itself is not.

Why AI Agent Testing Is Different From Testing Normal Code

Traditional services are built on deterministic code paths. AI agents aren’t:

  • Model outputs are non‑deterministic – The same prompt can yield different answers.

  • Behavior is emergent – Small prompt changes can have big downstream effects.

  • Workflows are long‑lived – One user message may trigger multiple tool calls and sub‑agent hops.

If you try to test this with only end‑to‑end “does it answer correctly?” checks, you end up with:

  • Flaky tests tied to one specific model provider and version.

  • Very slow feedback loops.

  • No clear idea where a failure actually comes from.

We’ve had better results by breaking the problem down: make everything around the model boring and testable, and only treat the model itself as “fuzzy”.

The 3-Level Testing Pyramid for AI Agent Backends

We use a simple mental model for where tests should live:

  • Unit / contract tests – Pure code, no model calls. Schemas, routing logic, reducers, helpers, tool handlers.
  • Integration tests – Orchestrator + real tools + fake model outputs.
  • Scenario replays – Recorded real conversations, re‑run offline against your system.

 

The higher you go, the fewer tests you have—but the more realistic they are.

Unit Tests for AI Agents: Make Your Backend Deterministic

The first step is to isolate as much non‑AI logic as possible:

  • Context and state reducers – How you update internal state when a tool succeeds or fails.

  • Routing and classification logic – How you map user intents to flows or agents.

  • Tool handlers – How you translate typed inputs into queries, API calls, or domain operations.

  • Schemas and validation – How you validate model/tool I/O before it touches business logic.


These pieces should be 100% deterministic, which means they can be covered by normal unit tests:

  • Given input X and state S, the reducer produces state S'.

  • Given intent Y, the router selects flow F.

  • Given tool input I, the handler calls underlying service Z with parameters P.
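The contracts above can be sketched as ordinary unit tests. A minimal sketch in Python, where `apply_tool_result` and `route_intent` are hypothetical stand-ins for your own reducer and router (not Toucan's actual code): both take plain data and return plain data, so plain asserts cover them.

```python
def apply_tool_result(state: dict, result: dict) -> dict:
    """Reducer: fold a tool outcome into conversation state."""
    new_state = dict(state)
    if result["ok"]:
        new_state["last_tool"] = result["tool"]
        new_state["failures"] = 0
    else:
        new_state["failures"] = state.get("failures", 0) + 1
    return new_state

def route_intent(intent: str) -> str:
    """Router: map a classified intent to a flow name."""
    routes = {
        "chart_request": "visualization_flow",
        "metric_lookup": "semantic_layer_flow",
    }
    return routes.get(intent, "fallback_flow")

# Given input X and state S, the reducer produces state S'.
state = apply_tool_result({"failures": 1}, {"ok": True, "tool": "run_query"})
assert state == {"failures": 0, "last_tool": "run_query"}

# Given intent Y, the router selects flow F.
assert route_intent("chart_request") == "visualization_flow"
assert route_intent("unknown") == "fallback_flow"
```

No model, no network, no flakiness: these run in milliseconds and pin down the exact behavior a prompt change cannot silently alter.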

From a CTO perspective, this is what keeps your “AI” system from being a black box: most of it behaves like ordinary, testable code.

Integration Testing AI Agents With Fake Model Outputs

You don’t want every test run to hit a real LLM. It’s slow, expensive, and makes tests flaky.
Instead, for integration tests we:

  • Stub the model – Replace real model calls with a fake that returns predetermined, typed outputs. If you're using LangChain, FakeChatModel handles this natively. Otherwise, a mock on your SDK client (unittest.mock in Python, vi.fn() in TypeScript) is enough; no external platform is needed at this level.
  • Keep the orchestrator and tools real – The same code you run in production, but driven by controlled “model decisions”.

This lets you test questions like:

  • “If the model says the user wants a chart, do we call the right tools in the right order?”
  • “If a tool fails with a structured error, does the orchestrator fall back correctly?”
  • “Do we respect limits on tool calls and retries?”

You’re not checking whether the answer is perfect English; you’re checking whether the system reacts properly to a given sequence of model decisions.
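Conceptually, such a test looks like the sketch below. `FakeModel`, `orchestrate`, and the tool registry are illustrative assumptions about the shape of such a system, not a specific framework's API: the fake replays scripted "model decisions" while the orchestrator and tools run for real.

```python
class FakeModel:
    """Stub model: replays a scripted sequence of decisions."""
    def __init__(self, decisions):
        self._decisions = iter(decisions)

    def decide(self, _context):
        return next(self._decisions)

def orchestrate(model, tools, user_message, max_steps=5):
    """Production-shaped loop: ask the model, run tools, stop on 'finish'."""
    calls = []
    context = {"message": user_message}
    for _ in range(max_steps):
        decision = model.decide(context)
        if decision["action"] == "finish":
            return {"answer": decision["text"], "tool_calls": calls}
        context[decision["tool"]] = tools[decision["tool"]](decision["args"])
        calls.append(decision["tool"])
    raise RuntimeError("tool-call budget exceeded")

# Real tools would hit the semantic layer; stubs keep the test hermetic.
tools = {
    "resolve_metric": lambda args: {"metric_id": 42},
    "run_query": lambda args: {"rows": [["2024", 1200]]},
}

# "If the model says the user wants a chart, do we call the right
# tools in the right order?"
fake = FakeModel([
    {"action": "tool", "tool": "resolve_metric", "args": {"name": "revenue"}},
    {"action": "tool", "tool": "run_query", "args": {"metric_id": 42}},
    {"action": "finish", "text": "Here is your chart."},
])
outcome = orchestrate(fake, tools, "Show revenue by year")
assert outcome["tool_calls"] == ["resolve_metric", "run_query"]
```

The assertion is on the tool sequence, not the answer text, which is exactly the separation this level of the pyramid is for.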

Scenario Replays: Test AI Agents Against Real User Conversations

Unit and integration tests get you far, but real users will always find novel paths through your system.
That’s where scenario replays come in:

  • You record real interactions: user messages, context, tool invocations, and outcomes (often using your existing checkpointer / tracing layer).
  • You store them as anonymized “scenarios” that can be replayed later in non‑production.
  • You re‑run those scenarios against:
    • New versions of your backend logic.
    • New prompts or model providers.
    • New tools or safety policies.

In practice, tools like LangSmith, Braintrust, or Langfuse handle this out of the box — conversation recording, dataset management, and offline replay. If you're starting from scratch, adopt one of these before building your own infrastructure.


Over time, this gives you a catalog of “golden paths” and “hard cases” you can regression‑test.
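If you do roll your own, a scenario can be as simple as a serializable record plus a replay helper. This is a sketch under assumptions: the `Scenario` fields, `load_scenarios`, and `replay` are hypothetical names, and the regression check here compares tool paths rather than exact answer wording.

```python
import json
from dataclasses import dataclass, field

@dataclass
class Scenario:
    user_message: str
    expected_tools: list          # tool names observed in the recording
    context: dict = field(default_factory=dict)

def load_scenarios(path):
    """Load anonymized recorded conversations from a JSON file."""
    with open(path) as f:
        return [Scenario(**record) for record in json.load(f)]

def replay(scenario, run_backend):
    """run_backend(message, context) -> {"tool_calls": [...], "answer": str}

    Re-runs one recorded conversation against the current backend and
    asserts the new code still walks the same tool path.
    """
    outcome = run_backend(scenario.user_message, scenario.context)
    assert outcome["tool_calls"] == scenario.expected_tools, (
        f"tool path drifted: {outcome['tool_calls']}"
    )
    return outcome
```

Run the whole catalog in CI against a candidate prompt or backend version, and a drifted tool path fails loudly before it reaches a user.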

From a risk perspective, this is powerful:

  • You don’t just test synthetic examples; you test the exact shapes of requests your users actually send.
  • You can ask, “does this change make anything worse for real users?” instead of relying only on synthetic benchmarks.

How to Test AI Agent Guardrails (Loops, Permissions, Limits)

Guardrails (limits on loops, retries, permissions, and data access) are some of the most important behaviors to test.

Rather than hoping the model “behaves”, we:

  • Treat guardrails as normal code with clear conditions and side effects.
  • Write focused tests for:
    • Maximum number of tool calls per request.
    • How we react after repeated failures for the same step.
    • What happens when a tool is forbidden or a resource is out of scope.

 

For example: the max tool call limit is a counter in your orchestrator loop, not a prompt instruction. If toolCallCount >= MAX_TOOL_CALLS, the orchestrator raises a structured exception. A unit test drives it past that limit and asserts it stops cleanly.
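A minimal sketch of that guardrail and its test, in Python; the loop shape and names are illustrative, but the principle is the one above: the limit is enforced by code, and the test drives a never-finishing "model" past it.

```python
MAX_TOOL_CALLS = 3

class ToolCallLimitExceeded(Exception):
    """Structured error the orchestrator raises instead of looping forever."""

def run_agent_loop(next_decision, tools):
    tool_call_count = 0
    while True:
        decision = next_decision()
        if decision["action"] == "finish":
            return decision["text"]
        if tool_call_count >= MAX_TOOL_CALLS:
            raise ToolCallLimitExceeded(f"limit of {MAX_TOOL_CALLS} reached")
        tools[decision["tool"]](decision["args"])
        tool_call_count += 1

# Drive the loop past the limit with a "model" that never finishes,
# and assert it stops cleanly.
looping = lambda: {"action": "tool", "tool": "noop", "args": {}}
try:
    run_agent_loop(looping, {"noop": lambda args: None})
    raise AssertionError("guardrail did not fire")
except ToolCallLimitExceeded:
    pass  # expected: the counter, not the model, ended the loop
```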

This mindset flips guardrails from “prompt hints” to enforceable policies, which is much easier to reason about as an engineering team.

Why AI Agent Observability and Testing Must Work Together

Testing and observability feed each other:

  • Tests verify that the right events are emitted at the right time (phase changes, tool calls, errors).
  • Observability helps you discover where to add tests (e.g. flows that fail often or are unusually slow).

 

Practically, this looks like:

  • Making sure every request has a trace identifier.
  • Emitting structured events (not free‑text logs) for major decisions and errors.
  • Asserting in tests that these events exist when they should.
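A sketch of what "asserting that events exist" can look like: events as plain dicts carrying a trace_id, captured by an in-memory sink during the test. The `EventSink` and `handle_request` names are hypothetical; in production the sink would forward to your tracing backend.

```python
import uuid

class EventSink:
    """In-memory stand-in for a tracing backend, used in tests."""
    def __init__(self):
        self.events = []

    def emit(self, event_type, trace_id, **fields):
        self.events.append({"type": event_type, "trace_id": trace_id, **fields})

def handle_request(sink, message):
    trace_id = str(uuid.uuid4())            # every request gets a trace id
    sink.emit("request.received", trace_id, message=message)
    sink.emit("tool.called", trace_id, tool="run_query")
    sink.emit("request.completed", trace_id)
    return trace_id

sink = EventSink()
trace_id = handle_request(sink, "Show revenue by year")

# Assert the right events exist, in order, all on one trace.
types = [e["type"] for e in sink.events]
assert types == ["request.received", "tool.called", "request.completed"]
assert all(e["trace_id"] == trace_id for e in sink.events)
```

Because the events are structured rather than free-text logs, the same assertions work in tests and the same shapes power dashboards and alerting in production.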

 

Langfuse and LangSmith provide this out of the box: trace IDs, structured event capture, and LLM-specific dashboards. Both integrate with OpenTelemetry if you already have APM tooling.

You end up with a system where you can:

  • Look at a test failure and immediately see the associated trace.
  • Look at a production incident and quickly add a new scenario replay based on the exact trace that failed.

Practical Lessons We’d Reuse

If we were advising a team building or hardening their first AI agent backend, we’d suggest:

  • Make most of the system non‑AI. Move routing, state, validation, and tool logic into deterministic code that you can unit‑test normally.
  • Use fake models in integration tests. Drive your orchestrator and tools with predefined “model outputs” to test flows without cost or flakiness.
  • Invest early in scenario replays. Start capturing real conversations and tool traces as soon as possible; they'll become your most valuable regression suite. Tools like LangSmith or Langfuse make this straightforward without custom infrastructure.
  • Treat guardrails as code, not vibes. Implement limits, permissions, and anti‑loop logic in normal code and test them directly.
  • Tie tests to observability. Use trace IDs and structured events in both tests and production so you can move quickly between “what failed?” and “how do we reproduce it?”. Langfuse and LangSmith are purpose-built for this in LLM systems.

 

You can’t make AI agents perfectly predictable—but you can make the system around them predictable, debuggable, and safe to evolve. That’s the part testing can (and should) give you.

Building AI-powered analytics into your product?

Toucan embeds a governed, testable AI analytics layer directly into your SaaS product. Your users ask questions in natural language. Your team ships reliable features without maintaining a custom AI backend.

See how Toucan handles AI agent reliability in production or read our guide on embedding AI analytics in your product.