How to Evaluate AI Features Before You Ship (2026) | Toucan

Most SaaS teams ship AI features the way they've always shipped everything else. Build it, test with a few customers, iterate, release. That worked fine for deterministic software for years. It does not work for AI, and a lot of teams find that out the hard way.

This article walks through what a real AI evaluation process looks like, why skipping it quietly kills user trust, and how to build something that catches problems before customers do. The lessons come from Valentin Huang, CEO of Harvestr, and Charles Miglietti, CEO of Toucan, who talked through this openly during a live webinar on AI product pivots. They are paired with current research on why eval is harder than most teams expect going in.

Both founders made real mistakes here. Neither would do it the same way twice.

TL;DR

  • Shipping AI features without an eval system leads to inconsistent outputs, eroded trust, and features customers quietly stop using.
  • Standard QA doesn't apply to AI. Outputs are probabilistic, not deterministic, so you need a different way of testing them.
  • Start eval before you start building. Decide what "good output" looks like before you write a single prompt.
  • No off-the-shelf tool will do exactly what you need. Plan to build something custom, even if it's simple at first.
  • If your AI feature involves multiple steps (an agent calling tools, chaining reasoning), scoring only the final answer misses much of what actually breaks. A 2026 comparison of agent evaluation platforms found that agents checked only on their final output passed 20 to 40% more test cases than the same agents checked at each step along the way.
  • Eval isn't a one-time gate before launch. It becomes a system your product and engineering teams keep using for every AI feature that follows.

Why AI features fail differently than regular features

When you ship a broken filter in your product, people notice fast. The output is wrong in an obvious, reproducible way. You fix it, redeploy, and move on.

AI features fail in a quieter way. The output might be right 80% of the time. Or it's solid on simple inputs but falls apart on edge cases that only show up with specific customer data. Or it slowly gets worse as the model behind it changes, without anyone flagging it. Users don't always know when the AI got something wrong. They just stop trusting it, and they don't usually tell you why.

That quiet loss of trust is the real risk. Once someone decides an AI feature is unreliable, they stop coming back to it. They work around it instead. And no amount of quality improvement after the fact tends to bring that usage back, because the impression has already set in.

Valentin Huang put it simply during the webinar: "If you don't have the right quality, you have a problem of trust and transparency. Some users just abandon your AI features entirely."

The mistake Harvestr made early: shipping before the eval system existed

In 2023, Harvestr was in a spot a lot of SaaS teams will recognize. A new technology with real potential, a competitive window that was closing fast, and pressure from every direction to move quickly. They built AI feedback summaries and categorization features. They tested with a handful of customers. People liked what they saw. So they shipped.

The problem showed up afterward. "We didn't have the right evaluation system in place. And without it, you don't know what's working and what's not. You don't know what to improve next." The team was essentially flying blind. They couldn't tell which changes to the model actually helped, which ones hurt, or where the real failure points were hiding.

There's a second part to this that's worth calling out. Harvestr did look for dedicated eval tools at the time. A few existed. None did exactly what they needed for their use case and their data. They spent real time trying to bend existing tools into shape before finally accepting they'd have to build their own.

That decision, building a custom eval setup, ended up being the right one. It just came later than it should have.

What a real AI eval system actually looks like

Harvestr's answer wasn't a framework lifted from a tutorial or a generic LLM evaluation library. It was something built specifically around their features, their customer data, and their own definition of what counts as a good output.

The core loop is straightforward:

  • Pull a set of real customer inputs, not made-up examples.
  • Run the feature against those inputs and save the outputs.
  • Score the outputs, by hand for nuanced cases and at scale for broader coverage.
  • Track those scores over time so changes to a model or a prompt become measurable instead of a guessing game.

The word doing the most work there is "real." Synthetic test cases give you coverage for scenarios you already thought of. Real customer data gives you coverage for the scenarios you didn't, and those are usually the ones that bite you in production.
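The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Harvestr's implementation: `EvalCase`, `run_eval`, `exact_match`, the `toy_categorizer` stand-in, and the `prompt-v1` version label are all hypothetical names invented for this example, and the three inputs stand in for cases you would pull from real usage.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    customer_input: str  # pulled from real usage, not invented
    expected: str        # what a human reviewer marked as correct


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(cases: list[EvalCase],
             feature: Callable[[str], str],
             scorer: Callable[[str, str], float]) -> dict[str, float]:
    """Run the feature on every case and return a score per case id."""
    return {c.case_id: scorer(feature(c.customer_input), c.expected)
            for c in cases}


# Hypothetical stand-in for the AI feature under test.
def toy_categorizer(text: str) -> str:
    return "bug" if "crash" in text.lower() else "feature request"


cases = [
    EvalCase("c1", "The app crashes when I export", "bug"),
    EvalCase("c2", "Please add dark mode", "feature request"),
    EvalCase("c3", "Export is really slow", "performance"),
]

# One entry per eval run, so a prompt or model change becomes comparable.
history = []
scores = run_eval(cases, toy_categorizer, exact_match)
history.append({
    "run_date": date.today().isoformat(),
    "version": "prompt-v1",  # assumed label for the prompt/model under test
    "mean_score": mean(scores.values()),
})
```

The third case is the interesting one: the stand-in feature has no "performance" category, so the case fails. A real input exposed a gap that a hand-written test set could easily have missed.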

Charles Miglietti flagged something similar from his side at Toucan: "Building AI agents is not just about prompting an LLM. You need to rethink how you evaluate not just the model but the full workflow, how you orchestrate agents, how you provide feedback loops, how you work on reliability and determinism." Evaluation isn't a single check on whether the model is good. It's something that has to run across the whole AI workflow, continuously.

This connects to something the broader eval research community has converged on independently. A 2026 comparison of agent evaluation platforms found that agents checked only on their final output passed 20 to 40% more test cases than the same agents checked at the step level, meaning a sizable share of real failures stayed invisible to teams that only looked at the end result. For any AI feature that involves more than one step (an agent calling a tool, a workflow chaining two prompts together), checking just the final answer hides exactly the kind of failures Charles is describing.

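Figure: final-output check versus step-level check. In the final-output check, Steps 1 to 3 run and the result looks correct. In the step-level check, Step 1 passes, Step 2 fails, and the run is flagged for review. Caption: 20 to 40% more failures caught when scoring each step instead of only the final answer.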
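To make the difference between the two checks concrete, here is a small sketch. It assumes each run of a multi-step feature is recorded as a trace with a pass/fail verdict per step. The `Trace` and `StepResult` types and the step names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    name: str
    passed: bool


@dataclass
class Trace:
    steps: list[StepResult]
    final_answer_ok: bool


def final_output_check(trace: Trace) -> bool:
    """Only looks at the end result."""
    return trace.final_answer_ok


def step_level_check(trace: Trace) -> bool:
    """Passes only if the final answer and every intermediate step pass."""
    return trace.final_answer_ok and all(s.passed for s in trace.steps)


def first_failed_step(trace: Trace) -> Optional[str]:
    """Name of the first failing step, or None if every step passed."""
    return next((s.name for s in trace.steps if not s.passed), None)


def make_trace(tool_ok: bool, final_ok: bool) -> Trace:
    return Trace(
        steps=[
            StepResult("parse_question", True),
            StepResult("call_tool", tool_ok),
            StepResult("summarize", True),
        ],
        final_answer_ok=final_ok,
    )


traces = [
    make_trace(tool_ok=True, final_ok=True),    # genuinely fine
    make_trace(tool_ok=False, final_ok=True),   # looks fine, tool call was wrong
    make_trace(tool_ok=True, final_ok=False),   # visibly wrong
]

final_pass = sum(final_output_check(t) for t in traces)
step_pass = sum(step_level_check(t) for t in traces)

# Runs that a final-output check would wave through but a step check flags.
hidden_failures = [
    first_failed_step(t)
    for t in traces
    if final_output_check(t) and not step_level_check(t)
]
```

In this toy data the final-output check passes two of three runs while the step-level check passes one. The gap is the run where the tool call went wrong but the answer happened to look acceptable.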

What "good output" means, and why you have to define it first

The most common reason eval systems fail before they even start is that nobody agreed on what success looks like. A team builds a feature, runs it against test cases, and then argues about whether the output is good enough. That argument should have happened before anyone wrote a prompt.

For AI features in a SaaS product, defining a good output usually comes down to three questions.

Is it accurate? Does the output match what's actually correct given the input? For something like AI-driven feedback categorization, that means checking whether the category matches what a human expert would pick. You need labeled examples to answer this honestly. If you don't have any yet, build that labeling process before you build the feature itself.

Is it consistent? Does the same input give you roughly the same answer across runs? This matters more than people expect going in. Users notice when they ask the same question twice and get different answers back. Perfect determinism isn't always realistic with LLMs, but the variance needs limits, and you need to be measuring it.

Where's the trust line? At what point would a user stop relying on this feature? That's a product call, not an engineering one, and it requires talking to customers before launch rather than after. The answer changes depending on what's at stake. A feedback summary that's right 80% of the time might be fine. A financial number that's right 80% of the time is not.
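The consistency question is the easiest of the three to turn into a number. The sketch below assumes a categorical output, where exact agreement is a fair test; free-text outputs would need a looser comparison. `flaky_categorizer` is a hypothetical stand-in that replays a fixed sequence so the example is reproducible, and the 0.9 limit is an assumed value, not a recommendation.

```python
from collections import Counter
from itertools import cycle


def consistency_rate(outputs):
    """Share of runs that agree with the most common output (1.0 = identical)."""
    if not outputs:
        raise ValueError("need at least one output")
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)


def measure_consistency(feature, customer_input, runs=5):
    """Call the feature repeatedly on one input and report agreement."""
    outputs = [feature(customer_input) for _ in range(runs)]
    return consistency_rate(outputs)


# Hypothetical stand-in for a non-deterministic feature: it replays a fixed
# sequence of answers instead of calling a model.
_replay = cycle(["bug", "bug", "feature request", "bug", "bug"])


def flaky_categorizer(_text):
    return next(_replay)


MIN_CONSISTENCY = 0.9  # assumed limit; the real one is a product decision

rate = measure_consistency(flaky_categorizer, "The export button crashes", runs=5)
within_limit = rate >= MIN_CONSISTENCY
```

Here four of the five runs agree, so the rate is 0.8 and falls under the assumed limit. Where that limit sits is the "trust line" question, which is why it belongs to product rather than engineering.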

Scoring at scale: where "LLM as judge" actually helps, and where it doesn't

Once you have more than a handful of test cases, scoring every output by hand stops being realistic. This is where most teams reach for an LLM to do the scoring for them, sometimes called "LLM as judge." It's worth understanding what that approach actually buys you and where its limits are, because the framing of "just have AI check AI" undersells how much rigor the good versions of this require.

The technique has real research behind it. A 2023 Microsoft Research paper introducing the G-Eval method tested whether a language model, prompted to reason step by step before scoring, could match human judgment on tasks like text summarization. It found this approach reached a Spearman correlation of 0.514 with human ratings on summarization quality, well ahead of older automated metrics that compare word overlap, and consistently stronger on tasks involving nuance and coherence.

The detail that matters for SaaS teams isn't the specific score. It's the method: G-Eval works because the judge model is forced to write out its reasoning before giving a score, not because asking a model "is this good, yes or no" is reliable on its own. Skipping straight to a single score without that structured reasoning step is a common shortcut, and it's the version that tends to produce inconsistent, hard-to-trust judgments.

Practically, that means an effective LLM-as-judge setup for your own eval system needs a written rubric the judge model walks through, not a vague prompt asking it to rate quality. And it still needs human review on a sample of outputs to catch cases where the judge itself drifts, since judge models carry their own biases and blind spots.
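As a rough sketch of what "a written rubric the judge model walks through" can look like, the code below builds a judge prompt from explicit criteria and rejects any reply that gives a score without reasoning first. It is an illustration in the spirit of the reasoning-before-score idea, not the G-Eval implementation. The rubric text, the `SCORE:` line format, and both function names are assumptions made for this example, and the call to the judge model is deliberately left out because it depends on your provider.

```python
import re

# Hypothetical rubric for a feedback-summary feature.
RUBRIC = [
    "Does the summary state only facts present in the source feedback?",
    "Does it cover the main request or complaint?",
    "Is it short enough to scan in a few seconds?",
]

SCORE_LINE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def build_judge_prompt(source, output, rubric):
    """Ask the judge to reason through each criterion before scoring."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, start=1))
    return (
        "You are grading an AI-generated summary.\n\n"
        f"Source:\n{source}\n\n"
        f"Summary:\n{output}\n\n"
        "Work through each criterion in order and write your reasoning first:\n"
        f"{criteria}\n\n"
        "After the reasoning, end with one line in the form 'SCORE: <1-5>'."
    )


def parse_judge_reply(reply):
    """Return (reasoning, score). Reject replies that skip the reasoning."""
    match = SCORE_LINE.search(reply)
    if match is None:
        raise ValueError("no SCORE line found")
    reasoning = reply[:match.start()].strip()
    if not reasoning:
        raise ValueError("judge gave a score without reasoning")
    return reasoning, int(match.group(1))


prompt = build_judge_prompt(
    source="Export to CSV fails on files over 10 MB.",
    output="Customer reports CSV export failing on large files.",
    rubric=RUBRIC,
)

# A reply written by hand to show the expected shape.
example_reply = (
    "1. All facts appear in the source.\n"
    "2. It covers the export complaint.\n"
    "3. One sentence, easy to scan.\n"
    "SCORE: 4"
)
reasoning, score = parse_judge_reply(example_reply)
```

Storing the reasoning next to the score is what makes the human review step practical: a reviewer sampling outputs can see why the judge scored something the way it did, not just the number.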

Why evaluating AI analytics features is its own challenge

For teams putting AI into analytics products, whether that's customer-facing embedded analytics or an internal BI tool, there's an extra layer of difficulty: the output is a chart or a number, not a paragraph of text.

A language model can write a fluent, confident-sounding answer that happens to be wrong. It can produce a chart that looks perfectly fine while pulling the wrong metric underneath. The user sees a clean visualization and assumes the data behind it is correct. Finding out later that it wasn't does more damage to trust than a feature that just obviously didn't work in the first place.

This is part of why Toucan built its AI layer on top of a governed semantic layer instead of letting a model interpret raw data on the fly. A natural language question gets translated into a deterministic query against a certified metric store. The AI handles the interface, and the semantic layer guarantees the number underneath it is right. That changes what eval actually needs to test: not whether the underlying number is correct (it always is, by design), but whether the query translation matched what the user actually asked.

If your stack doesn't work that way yet, the practical takeaway is simple. Always check AI analytics outputs against results from your existing, trusted reporting. If the AI says revenue was X and your standard report says Y, that's a failure, no matter how well the AI explained itself.
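A check like that can start as a very small script. The sketch below assumes you can get both the AI's answer and the trusted report's figure as plain numbers keyed by metric name. The metric names, the values, and the tolerance are invented for illustration.

```python
import math


def matches_trusted_report(ai_value, report_value, rel_tol=1e-6):
    """True when the AI's number agrees with the trusted report.

    The small relative tolerance only absorbs floating-point noise;
    it is an assumed value, not a business rule.
    """
    return math.isclose(ai_value, report_value, rel_tol=rel_tol)


def find_mismatches(ai_answers, trusted_report):
    """Return {metric: (ai_value, report_value)} for every disagreement."""
    return {
        metric: (ai_answers[metric], report_value)
        for metric, report_value in trusted_report.items()
        if metric in ai_answers
        and not matches_trusted_report(ai_answers[metric], report_value)
    }


# Hypothetical figures: what the standard report says vs. what the AI said.
trusted_report = {"revenue_q1": 125000.0, "active_users": 4310.0}
ai_answers = {"revenue_q1": 125000.0, "active_users": 4390.0}

mismatches = find_mismatches(ai_answers, trusted_report)
```

Any entry in `mismatches` counts as a failed eval case regardless of how plausible the AI's explanation was.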

How much rigor your eval process actually needs

Not every AI feature deserves the same depth of testing. The right amount of rigor depends on how bad it is when the feature gets something wrong.

Low stakes (summaries, suggestions, auto-generated labels): a manual review of a sample set before launch, plus spot checks once it's live. You don't need much infrastructure to start here.

Medium stakes (categorization, classification, recommendations): a labeled dataset of at least 200 representative examples, automated scoring against that ground truth, and regression testing every time the model or prompt changes. Track precision and recall, not just a simple accuracy number.

High stakes (anything touching financial numbers, compliance, or decisions customers act on): a full regression suite running on real production data, mandatory human review for known edge case categories, side by side comparison against your existing deterministic outputs, and a clear rollback plan if quality drops below your threshold.

Harvestr started with the lower stakes features and worked up to the riskier ones as their eval setup matured. That order wasn't an accident. You don't want to be learning how to run evals at the same time a feature is making decisions that actually matter to customers.

Figure: eval rigor by stakes.

  • Low stakes (summaries, suggestions): manual review of a sample plus spot checks in production.
  • Medium stakes (categorization, recommendations): 200+ labeled examples, automated scoring, regression tests on every change.
  • High stakes (financial, compliance): full regression suite, human review on edge cases, clear rollback plan.
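For the medium-stakes tier, the reason to track precision and recall rather than one accuracy number is easier to see with a worked example. The labels and predictions below are made up, and eight examples stand in for the 200 or more a real dataset would hold.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one category, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth (human labels) and feature predictions.
y_true = ["bug", "bug", "bug", "feature", "feature",
          "billing", "billing", "billing"]
y_pred = ["bug", "bug", "feature", "feature", "feature",
          "bug", "billing", "billing"]

overall = accuracy(y_true, y_pred)
per_label = {label: precision_recall(y_true, y_pred, label)
             for label in sorted(set(y_true))}
```

Overall accuracy here is 0.75, which hides an asymmetry. The "billing" category has perfect precision but recall of only two thirds, so one in three billing issues is being filed somewhere else. That is the kind of failure a single accuracy figure does not surface.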

Who actually owns eval on your team

Most teams never assign this clearly. It sits somewhere between product, engineering, and whoever's closest to the data, which usually means it gets done inconsistently or not at all.

What tends to work: one person on product owns the definition of good output, meaning the criteria and the labeled datasets, and one person on engineering owns the tooling and automation around it. They work together before every AI feature ships, not separately and not after the fact.

Charles's team learned this firsthand. Building AI agents means rethinking architecture, evaluation, orchestration, and feedback loops all at once, and there's no settled playbook for it yet. "The ecosystem is not very well documented. Best practices are still emerging. What's true today might not be true tomorrow." That instability is exactly why eval needs an owner who stays close to it, not a setup that gets configured once and forgotten after launch.

What to do this week if you don't have an eval system yet

You don't need to build all of this at once. Here's where to actually start.

Step 1. Pick one live AI feature that's felt inconsistent. Not your newest one, but the one where users have complained or stopped using it, or where the team genuinely hasn't been sure if recent changes helped or hurt.

Step 2. Pull 50 real inputs from production. Actual customer data, not examples you wrote yourself for the occasion. Mix in cases you know worked well with ones you suspect didn't.

Step 3. Define the correct output for each input by hand, with someone who actually understands the feature. That's your first real eval dataset, and it's more valuable than it sounds.

Step 4. Run the current feature against those 50 inputs and score it. How often does it match what you expected? Where does it fail? Do the failures cluster around anything specific? If the feature involves more than one step, score each step, not just the final answer.

Step 5. Write down the failure patterns and set a quality bar. Below it, you don't ship. Above it, you do. That's the start of a real process, not a checklist you run once.

This takes a few days, not weeks. And it will tell you more about your AI feature than months of scattered customer feedback ever will.
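Steps 4 and 5 can be sketched as a small script: score the cases, group the failures, and compare the pass rate against a quality bar. Everything named here is hypothetical, including the `ScoredCase` type, the reviewer tags, and the thresholds, and ten cases stand in for the fifty described above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ScoredCase:
    case_id: str
    passed: bool
    tag: str  # reviewer-assigned note, e.g. "long input" or "non-English"


def pass_rate(cases):
    return sum(c.passed for c in cases) / len(cases)


def failure_clusters(cases):
    """Count failures per tag to see whether they cluster somewhere."""
    return Counter(c.tag for c in cases if not c.passed)


def ship_decision(cases, quality_bar, baseline=None):
    """Hold if below the bar, or if worse than the previous run's pass rate."""
    rate = pass_rate(cases)
    if rate < quality_bar:
        return "hold: below quality bar"
    if baseline is not None and rate < baseline:
        return "hold: regression against last run"
    return "ship"


# Hypothetical results: eight passing cases and two failures with the same tag.
cases = [ScoredCase(f"c{i}", True, "short input") for i in range(8)]
cases += [
    ScoredCase("c8", False, "non-English"),
    ScoredCase("c9", False, "non-English"),
]

rate = pass_rate(cases)
clusters = failure_clusters(cases)
decision = ship_decision(cases, quality_bar=0.85)
```

With these numbers the pass rate is 0.8, both failures fall under the same tag, and the feature is held because it sits below the assumed 0.85 bar. Passing the previous run's rate as `baseline` turns the same function into a regression check for later prompt or model changes.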

Where Toucan fits into this

If you're an ISV or a SaaS product team trying to put AI in front of customers without gambling on whether the numbers are right, this is exactly the problem Toucan was built to take off your plate.

Toucan's AI layer sits on top of a governed semantic layer, not a model improvising over raw data. Every question a user asks gets translated into a query against a certified, version-controlled metric store. That means your eval work shifts from "is this number even correct" (it is, structurally) to "did we understand the question correctly," which is a much smaller and more tractable problem to solve.

For teams without a dedicated AI or data science org, that's the difference between spending months building eval infrastructure from scratch and getting a system where the hard part is already handled. You still own the product judgment. You just don't have to build the safety net underneath it yourself.

If your team is weighing whether to build this kind of AI analytics layer in house or bring in something production ready, book a demo and we'll walk you through exactly how the semantic layer keeps AI answers accurate, and what that means for your own eval process. Prefer to explore it yourself first? Start a free trial.

FAQ

What is AI feature evaluation in SaaS products?

AI feature evaluation is the process of systematically testing AI outputs against defined quality criteria, before launch and on an ongoing basis afterward. Unlike standard QA, which checks that software behaves the same way every time, AI eval measures output quality, consistency, and accuracy across a range of real world inputs. For SaaS products, this usually means a labeled dataset of representative customer inputs, a scoring method, and a quality threshold below which a feature shouldn't ship. The point is to catch failure modes before users run into them, instead of finding out through churn or support tickets.

When should you start building an eval system for AI features?

Before you build the feature, not after. The most common mistake SaaS teams make is treating eval like a QA step that happens once the build is done. That framing doesn't work for AI. Deciding what "good output" looks like is something you need to settle before writing prompts, picking a model, or designing the experience around it. If you can't describe correct output in concrete terms, you have no way to measure whether you've actually achieved it. Building your eval dataset alongside the feature also saves time later, since you'll have real test cases ready the moment the first version is built, instead of scrambling to create them under pressure to ship.

Do you need a dedicated tool for AI evaluation?

There are tools out there, LangSmith and Braintrust among the most established, but none of them will do exactly what you need out of the box for your specific use case and your specific data. Each tends to fit a different situation: LangSmith works best if your stack is already built on LangChain, Braintrust is geared toward enforcing quality gates before deployment, and tools like Langfuse target teams that want to self-host. The pattern from teams who've been through this is consistent: start with something simple and custom, a spreadsheet, a basic script, or a lightweight internal interface built around your actual features and real customer inputs. You can move to a more sophisticated tool later if you need to. What you can't outsource to any tool is the judgment call on what good output even means. That belongs to your product team, full stop.

How many test cases do you need for an AI eval dataset?

For most SaaS AI features, somewhere between 50 and 200 real customer inputs is enough to catch the common failure modes early on. Diversity matters more than volume: include simple cases, edge cases, inputs from different customer segments, and anything that's historically caused trouble. Fifty well chosen real inputs will teach you more than five hundred synthetic examples you wrote yourself. As the feature matures and new failure patterns show up, you expand the dataset. A good habit is adding 20 to 30 new cases every time you make a meaningful change to the model or the prompt.

What happens if you ship an AI feature without evaluation?

Short term, you get inconsistent outputs. Medium term, you lose user trust in the feature without necessarily noticing it happening. People who run into a wrong or confusing AI answer rarely file a support ticket. They just quietly stop using the feature and find another way to get what they need. Once that pattern sets in, usage numbers drop and tend to stay down even after quality improves, because the impression already formed. Your engineering team also loses the ability to improve with confidence, since without a baseline you can't tell if a change to the model actually helped or made things worse. Eval is what makes real iteration possible.

Is having an AI model score your AI feature's output a reliable approach?

It can be, but only when it's done with structure. Research on the technique, often called LLM-as-judge, found that simply asking a model to rate output quality on its own produces inconsistent results. The version that works reasonably well forces the judge model to reason through specific criteria step by step before producing a score, an approach with published research showing meaningfully better alignment with human ratings than older automated methods. Even then, it works best as a way to score at scale once you already have a smaller set of human-labeled examples to calibrate against, not as a replacement for human judgment on what good output means in the first place.

How is evaluating AI analytics features different from other AI features?

Analytics features produce numbers and charts, not just text, and that changes the risk profile. A language model can write a confident, well structured answer that's numerically wrong, and a clean looking chart tends to make people assume the underlying data is accurate even when it isn't. That gap between how trustworthy something looks and how trustworthy it actually is makes the damage from a wrong analytics output worse than a wrong answer in, say, a summary feature. Solid evaluation here means checking AI outputs against known correct results from your existing, verified reports. Any mismatch is a real failure, no matter how polished the explanation around it sounds.