How I Evaluate AI Agents (and Why Most Teams Get It Wrong)

Contents
  1. Why benchmarks lie to you
  2. The three mistakes everyone makes
  3. The framework I actually use
  4. Layer 1: Deterministic checks
  5. Layer 2: LLM-as-judge
  6. Layer 3: Human review on a sample
  7. What I measure (and why)
  8. The product decisions this enables
  9. What I'd do differently if I started over

Six months ago I deployed two AI agents at Voltade. Claudia handles customer conversations on WhatsApp. Clawrence scrapes government tenders and does internal ops. Between them, they process a few hundred tasks a day.

I thought the hard part was building them. It wasn't. The hard part is knowing whether they're actually doing a good job.

Most teams I talk to evaluate agents the same way: they eyeball a few outputs, decide it "looks good," and move on. That works for a demo. It falls apart the moment an agent is talking to real customers or spending real money. I know because I burnt $300 learning that lesson.

Here's what I've learnt about agent evaluation from running these systems in production. None of this is theoretical. It's all stuff that broke and forced me to build something better.

#Why benchmarks lie to you

Before I get into my framework, let me explain why I don't rely on public benchmarks to evaluate agents.

The best coding agent in the world scores 74.4% on SWE-bench Verified (Claude Opus). The best web agent scores 61.7% on WebArena. These numbers are impressive. They're also mostly useless for deciding whether your agent works in production.

Benchmarks measure capability on curated tasks. Production measures reliability on messy, unpredictable inputs with real consequences. An agent that scores 90% on a benchmark but silently drops 5% of customer messages is a liability. Amazon's internal research found that 40% of multi-agent pilots fail within six months, not because the models weren't capable, but because nobody built the evaluation infrastructure to catch failures before they compounded.

The gap between benchmark performance and production reliability is where most agent projects die. Bridging that gap is what this post is about.

#The three mistakes everyone makes

Mistake 1: Treating agents like APIs. Traditional software either works or doesn't. A function returns the right value or throws an error. Agents aren't like that. Claudia can send a response that's technically correct but completely misses the customer's tone. Clawrence can flag a tender that matches every keyword but has nothing to do with our actual capabilities. The output looks right. It isn't.

I ran the numbers on this. Of Claudia's failures over a three-month period, only 23% were hard failures (API errors, timeouts, malformed output). The other 77% were what I call "soft" or "silent" failures: she completed the task, returned valid JSON, but the output was wrong in ways that a simple pass/fail check would never catch. If you're only checking "did it complete without errors," you're catching less than a quarter of actual failures.

Mistake 2: Evaluating once, at launch. You test the agent, it passes, you deploy. Three weeks later, Anthropic pushes a model update. Your prompts interact slightly differently with the new weights. Accuracy drifts. Nobody notices because nobody's measuring anymore.

I track this with weekly canary evaluations. Same 50 test inputs, same rubrics, every Monday. After one model update, Claudia's average empathy score dropped from 4.2 to 3.4 overnight. No code changed. No prompts changed. The model just responded slightly differently to the same instructions. Without the canary suite, I would have noticed when a customer complained, maybe weeks later.

Mistake 3: No concept of "expected behaviour." This is the one that cost me $300. I had no definition of what Clawrence's daily spend should look like. No baseline for how many tokens a competitor analysis should consume. So when the cron ran hourly instead of daily, burning 1.2M tokens per day, there was nothing to compare it against. You can't detect anomalies without a baseline.

#The framework I actually use

I've settled on a layered approach. The core idea comes from something Anthropic calls "eval-driven development": build your evaluation suite before you build your agent, the same way TDD makes you write the test before the code. I didn't start that way, but I've retrofitted it and the difference is night and day.

Each layer is designed to catch a specific class of failure at the lowest possible cost.

The three evaluation layers: deterministic code checks (free, every output), LLM-as-judge ($0.002/eval), and human review (5-10% sample for calibration).

#Layer 1: Deterministic checks

Instant, free, runs on every output. These are the mechanical checks.

For Claudia, that means: did the WhatsApp API call succeed? Is the response under 4096 characters (WhatsApp's limit)? Did she create the Notion ticket when the conversation required one? Does the ticket have all required fields populated?

For Clawrence: did the scrape return valid data? Are the tender dates in the future? Does the extracted company name match a known entity? Is the cost estimate within the expected range for this task type?

I log every check with a structured schema:

{
  "task_id": "cl-20260312-0847",
  "agent": "claudia",
  "task_type": "customer_reply",
  "checks": {
    "api_success": true,
    "response_length_valid": true,
    "notion_ticket_created": true,
    "required_fields_present": false
  },
  "failed_checks": ["required_fields_present"],
  "timestamp": "2026-03-12T08:47:23Z"
}

These checks catch about 23% of all failures. Not glamorous, but they're free and fast. You'd be surprised how many teams skip them entirely.
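To make this concrete, here's a minimal Python sketch of how a check runner like this could be wired. The check names mirror the schema above; the predicates and result fields (`status`, `ticket_id`, and so on) are stand-ins for whatever your task results actually contain.

```python
from datetime import datetime, timezone

# Each deterministic check is a named predicate over the task result.
# These predicates are illustrative, not Voltade's actual logic.
CHECKS = {
    "api_success": lambda r: r.get("status") == "sent",
    "response_length_valid": lambda r: len(r.get("text", "")) <= 4096,
    "notion_ticket_created": lambda r: r.get("ticket_id") is not None,
    "required_fields_present": lambda r: all(
        r.get("ticket", {}).get(f) for f in ("customer", "summary", "priority")
    ),
}

def run_checks(task_id: str, agent: str, task_type: str, result: dict) -> dict:
    """Run every deterministic check and emit a structured log record."""
    checks = {name: bool(fn(result)) for name, fn in CHECKS.items()}
    return {
        "task_id": task_id,
        "agent": agent,
        "task_type": task_type,
        "checks": checks,
        "failed_checks": [n for n, ok in checks.items() if not ok],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

The point is that the whole layer is a dictionary of cheap predicates: adding a new check is one line, and the log record falls out for free.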

#Layer 2: LLM-as-judge

This is the layer that changed everything. You take the agent's output, write a rubric describing what "good" looks like, and ask a separate model to grade it.

Claudia's customer responses get graded on three dimensions:

  • Accuracy (1-5): Did she answer the actual question? Not a related question, the actual one.
  • Empathy (1-5): Does the tone match the customer's emotional state? A frustrated customer getting a cheerful response is a failure.
  • Completeness (1-5): Did she address all parts of the query, or just the first one?

Here's the thing nobody tells you about LLM-as-judge: it has real, measurable biases.

Position bias. When you ask a model to compare two outputs (A vs B), it favours whichever one you show first. Research from multiple labs puts this at 10-30% depending on the task. I avoid pairwise comparisons entirely. Single-output rubric grading eliminates this.

Self-preference bias. Models rate outputs from their own model family 5-7% higher than equivalent outputs from other models. This is why I always grade with a different model from the one that generated the output. Claudia and Clawrence both run on Claude, so their outputs get graded by GPT-4o. Never let the student mark their own homework.

Verbosity bias. Longer responses get rated ~15% higher regardless of quality. My rubrics explicitly instruct the judge to penalise unnecessary length: "A concise response that fully addresses the question should score higher than a verbose one that repeats itself."

Calibration drift. The grading model's standards shift over time too. I maintain a set of 20 "anchor" examples with known-good human scores. Every week I run them through the judge and check correlation. If Pearson r drops below 0.85, the rubric needs updating. Right now I'm at 0.91.
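The weekly calibration check itself is a few lines of plain Python, no stats library needed. A sketch, where the anchor scores and the 0.85 threshold are the only real inputs and the function names are mine:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between human anchor scores and judge scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def judge_needs_recalibration(human_scores, judge_scores, threshold=0.85):
    """True when the judge has drifted too far from the human anchors."""
    return pearson_r(human_scores, judge_scores) < threshold
```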

The grading prompt structure matters. I use chain-of-thought grading:

You are evaluating a customer service response.

Context: [customer message]
Agent response: [Claudia's response]

Think step by step:
1. What was the customer actually asking?
2. Did the response address that specific question?
3. Was the tone appropriate for the customer's emotional state?
4. Was anything missing or unnecessary?

Then score each dimension 1-5:
- Accuracy: [score] [one-line justification]
- Empathy: [score] [one-line justification]
- Completeness: [score] [one-line justification]

The chain-of-thought step isn't decorative. Without it, scores cluster around 3-4 regardless of quality. With it, the distribution spreads to actually differentiate good from bad outputs. Anthropic's own eval guidance recommends this, and my data confirms it: score variance increased 40% when I added the reasoning step.
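Assuming the judge sticks to the output format above, extracting the scores is a small parsing job. A minimal sketch, not a robust parser — a production version would want a retry when the judge drifts off-format:

```python
import re

DIMENSIONS = ("Accuracy", "Empathy", "Completeness")

def parse_judge_scores(judge_output: str) -> dict:
    """Pull the 1-5 scores out of the judge's free-text grading."""
    scores = {}
    for dim in DIMENSIONS:
        # Matches lines like "- Accuracy: 4 answered the actual question"
        m = re.search(rf"-\s*{dim}:\s*([1-5])\b", judge_output)
        if m is None:
            raise ValueError(f"judge output missing a {dim} score")
        scores[dim.lower()] = int(m.group(1))
    return scores
```

Raising on a missing score matters: a judge response you can't parse is itself a failure signal, and silently defaulting to 3 would poison your drift charts.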

#Layer 3: Human review on a sample

I review about 5-10% of outputs manually. Not because I don't trust the automated grades, but because I need to calibrate them. If my LLM-as-judge consistently rates something 4/5 that I'd rate 2/5, the rubric needs updating.

Human review is the calibration mechanism, not the primary grading system. This distinction matters because human review doesn't scale. At a few hundred tasks a day, reviewing 10% means reading 20-30 outputs. That's manageable. At a few thousand, it isn't. The automated layers need to be good enough that human review is a spot check, not a requirement.

I track inter-rater reliability between my scores and the LLM judge's scores. Currently sitting at Cohen's kappa of 0.78, which is "substantial agreement." When I started, it was 0.54. The difference was all rubric refinement.
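Cohen's kappa is simple enough to compute yourself. A sketch, treating the 1-5 scores as categorical labels:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two equal-length lists of categorical scores."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa rather than raw agreement because two raters who both hand out mostly 4s will "agree" a lot by chance alone; kappa subtracts that out.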

#What I measure (and why)

I've gone back and forth on metrics and landed on five that actually drive product decisions. The test for whether a metric is worth tracking: has it ever caused me to change something? If not, it's noise.

1. Task completion rate, multidimensional. A single pass/fail isn't enough. Each task gets scored on correctness (code-graded), relevance (LLM-graded), and for Claudia, tone (LLM-graded). A task can score high on correctness but low on tone. That's a different kind of problem than a task that fails outright, and it needs a different fix.

Current numbers for Claudia: 94% hard completion rate (task didn't error), but only 82% "fully successful" when you include quality scores above 3.5 on all three dimensions. That 12-point gap is where most of the product work happens.

2. Token efficiency ratio. tokens_consumed / task_quality_score. This catches a specific failure mode: the agent that loops, retries, or generates massive intermediate reasoning to produce a mediocre result.

Claudia's typical customer reply costs 800-1,200 tokens. When I see a reply that consumed 4,000+ tokens, something went wrong. Either she retried multiple times, or she generated a massive chain-of-thought for a simple question. I bucket these by task type and alert on anything above the 95th percentile.
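The bucketing-and-alerting logic is just a percentile over historical token counts per task type. A minimal sketch using a nearest-rank percentile (no interpolation, which is close enough for alerting):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

def token_efficiency(tokens_consumed, quality_score):
    """Tokens spent per unit of quality; higher means more waste."""
    return tokens_consumed / quality_score

def flag_token_outlier(history_tokens, new_tokens, pct=95):
    """Alert when a task's token count exceeds the historical p95 for its type."""
    return new_tokens > percentile(history_tokens, pct)
```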

After adding prompt caching for Claudia's system prompt and conversation history, her average cost per reply dropped from $0.03 to $0.004. That's an 87% reduction. Prompt caching is the single biggest cost lever for agents with stable system prompts.

3. Failure taxonomy. I classify failures into three buckets, inspired by Microsoft's research on multi-agent failure modes:

Breakdown of agent failures: 23% hard failures (crashes, timeouts), 41% soft failures (poor quality output), 36% silent failures (agent does nothing). 77% of failures look like successes to traditional monitoring.

  • Hard failures (23% of all failures): crashes, API errors, timeouts. Standard monitoring catches these.
  • Soft failures (41%): task completes but quality is poor. Claudia answers the wrong question. Clawrence flags an irrelevant tender. The output is valid, just wrong.
  • Silent failures (36%): the agent does nothing when it should have done something. A customer message goes unanswered because Claudia didn't recognise it as a question. A tender expires without being flagged because Clawrence's scraper missed the listing.

Silent failures are the worst because they're invisible. You only catch them by defining expected behaviour upfront and monitoring for its absence. I run a daily check: how many incoming messages did Claudia receive vs how many did she respond to? If the ratio drops below 95%, something's being missed.
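The daily check itself is one line of arithmetic; a sketch:

```python
def response_ratio_ok(messages_received: int, messages_answered: int,
                      threshold: float = 0.95) -> bool:
    """Daily silent-failure check: were at least 95% of messages answered?"""
    if messages_received == 0:
        return True  # nothing to answer, nothing missed
    return messages_answered / messages_received >= threshold
```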

4. Drift detection. Weekly canary evaluations. Same 50 test inputs, same rubrics. I plot the scores over time. Any week-over-week change of more than 0.3 points on any dimension triggers an investigation.

Claudia's empathy scores over 12 weeks. Stable around 4.2, then a model update (no code changes) drops it to 3.4. Prompt fix deployed, scores recover to 4.0.

This has caught three regressions so far. Two were model updates (scores recovered after I adjusted prompts). One was a context window issue: Claudia's conversation history was growing so large that the system prompt was getting pushed out of the model's attention window. Average quality score had drifted from 4.1 to 3.6 over two weeks. I added conversation summarisation and it bounced back to 4.0.
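The week-over-week comparison behind those catches can be sketched as a pass over the canary history. The dictionary shape here is an assumption about how you'd store each week's mean scores:

```python
def drift_alerts(weekly_scores, threshold=0.3):
    """Flag dimensions whose mean canary score moved more than `threshold`
    week-over-week. `weekly_scores` is a list of {dimension: mean} dicts,
    oldest week first."""
    alerts = []
    for prev, curr in zip(weekly_scores, weekly_scores[1:]):
        for dim, score in curr.items():
            if dim in prev and abs(score - prev[dim]) > threshold:
                alerts.append((dim, prev[dim], score))
    return alerts
```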

5. Cost per task type. The $300 lesson. Every task type has an expected cost range. Clawrence's daily tender summary costs about $0.50. His competitor analysis costs about $2.00. I don't need a complex anomaly detection system. I just need the baseline and an alert when something exceeds 3x the rolling average.
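A sketch of that alert, assuming you keep a short history of per-task-type costs:

```python
def cost_anomaly(recent_costs, new_cost, multiplier=3.0, window=7):
    """Alert when a task's cost exceeds `multiplier` times the rolling
    average of the last `window` runs for its task type."""
    recent = recent_costs[-window:]
    baseline = sum(recent) / len(recent)
    return new_cost > multiplier * baseline
```

With a $0.50 baseline for the tender summary, anything over $1.50 fires. That alone would have caught the hourly-cron incident on day one.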

#The product decisions this enables

Raw metrics are pointless unless they change what you build. Here's how I translate eval data into product decisions.

Metric: Claudia's empathy score drops below 3.5 on frustrated customer messages. Product decision: I added an emotion-detection pre-step. Before Claudia drafts her response, she classifies the customer's emotional state. If it's "frustrated" or "angry," she gets an augmented system prompt that prioritises acknowledgement before problem-solving. Empathy scores on frustrated messages went from 3.2 to 4.1.

Metric: Clawrence's tender relevance score is bimodal (lots of 5s and lots of 1s, few in between). Product decision: The all-or-nothing distribution told me the matching was keyword-based at heart. A tender either hit the keywords perfectly or completely missed. I replaced keyword matching with a two-stage approach: broad semantic search first, then LLM-graded relevance. The distribution shifted to a normal curve centred around 3.8, and the 1-rated matches dropped by 70%.

Metric: Cost per task spiking on competitor analyses. Product decision: Clawrence was re-scraping websites he'd already scraped because he had no memory between runs. I added a simple cache with a 24-hour TTL. Cost per analysis dropped from $2.00 to $0.60.

These aren't hypothetical. Every one of these came directly from looking at eval data and asking "what's the simplest change that would move this number?"

#What I'd do differently if I started over

If I were building a new agent today, I'd do eval-driven development from day one. Define the rubrics before writing the first prompt. Build the grading pipeline before the agent pipeline. It sounds backwards, but it's exactly the same logic as TDD: you need to know what "correct" looks like before you can build toward it.

The specific steps:

  1. Define task types and expected outputs. What are the 5-10 things your agent will do? What does a "perfect" output look like for each?
  2. Write grading rubrics. For each task type, what dimensions matter? What's a 5 vs a 3 vs a 1?
  3. Create a golden set. 20-50 input-output pairs that you've manually graded. These become your calibration anchors.
  4. Build the grading pipeline. Automated scoring on every output, before you've even built the agent.
  5. Then build the agent. Now you have a feedback loop from day one.
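Steps 3 and 4 can be sketched together: the golden set as hand-graded examples, plus a calibration function that measures how far the judge sits from the human scores. The example entry and the `judge_fn` signature are illustrative, not a fixed API:

```python
# Step 3 (hypothetical entry): a golden set of hand-graded pairs.
GOLDEN_SET = [
    {
        "input": "Hi, my order hasn't arrived and I'm getting worried.",
        "expected_output": "Acknowledge the concern, check status, give an ETA.",
        "human_scores": {"accuracy": 5, "empathy": 5, "completeness": 4},
    },
    # ... 20-50 of these, graded by hand
]

def calibrate(judge_fn, golden_set):
    """Step 4: run the judge over the golden set and report its mean
    absolute disagreement with the human scores, per dimension."""
    gaps = {}
    for ex in golden_set:
        judged = judge_fn(ex["input"], ex["expected_output"])
        for dim, human in ex["human_scores"].items():
            gaps.setdefault(dim, []).append(abs(judged[dim] - human))
    return {dim: sum(v) / len(v) for dim, v in gaps.items()}
```

Run this before the agent exists and you know whether your rubrics are even gradeable; run it weekly afterwards and it doubles as the anchor-set drift check.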

The harder question is what happens when agents get more autonomous. Right now Claudia and Clawrence do relatively bounded tasks. But the trend is toward agents that plan, use tools, and make multi-step decisions. Evaluating a single response is manageable. Evaluating a ten-step plan where each step depends on the last is a different problem entirely.

I think trajectory-level evaluation is the right direction. Instead of grading each step independently, you grade the whole sequence: did the agent achieve the goal? Was the path efficient? Did it recover from errors or just cascade them? I've started prototyping this for Clawrence's multi-step research tasks, grading the final output and the tool-call sequence separately. Early results suggest that agents can produce correct final outputs through wildly inefficient paths, and you'd never know without looking at the trajectory.

If you're running agents in production and have strong opinions about how to evaluate them, I'd genuinely like to hear about it. This is the part of AI product development that nobody's figured out yet, and I think comparing notes is the fastest way to make progress.