Failure museum

Every guardrail I run in production came out of something breaking. These are the breaks.

A failure taxonomy is only honest if you can name the failures. This page is the named ones: what happened, what it cost, and the guardrail each one left behind. The pattern underneath them is the loop from the guardrails post: ship, watch it fail, name the failure, write the guardrail, add the eval that proves the guardrail works.

The $300 cron

Runaway cost

An OpenClaw agent’s competitor-analysis cron ran hourly instead of daily. Roughly 50K tokens a run, 1.2M tokens a day, $300 on the bill before anyone noticed. There was no baseline for what the agent’s spend should look like, so there was nothing to alarm against.

The guardrail it spawned: WIMAUT, then per-agent budget caps that live in the runtime, not the prompt. Cost baselines per task type with an alert at 3x the rolling average.
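A minimal sketch of those two guardrails, under stated assumptions: the names AgentBudget, BudgetExceeded and is_cost_anomaly are hypothetical, the baseline is taken over daily totals per task type, and the window of previous days is whatever the caller passes in. It is an illustration of the idea, not the code that runs in production.

```python
from statistics import mean


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push an agent past its cap."""


class AgentBudget:
    """Hard spending cap enforced by the runtime, outside the prompt (hypothetical)."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        # Refuse the charge before recording it, so the cap is never crossed.
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(f"cap of ${self.cap_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd


def is_cost_anomaly(previous_daily_costs, today_cost, multiple=3.0):
    """True when today's spend for a task type exceeds `multiple` times the rolling average."""
    if not previous_daily_costs:
        # No baseline yet, so there is nothing to alarm against.
        return False
    return today_cost > multiple * mean(previous_daily_costs)
```

A cron that runs hourly instead of daily spends about 24 times its usual daily total, which clears a 3x threshold on the first day, provided a baseline already exists.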

WIMAUT: what is my agent up to?

The $180 cake quote

Stale recitation

The agent quoted a customer $180 for a cake the site priced at $250. Right cake, right size, right delivery zone, wrong price: a stale entry in working memory from a week earlier. The owner found out two days later when the customer asked why.

The guardrail it spawned: Tool-first, prose-second. Prices moved out of agent memory into a live database lookup. The agent routes; it doesn’t know.
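A rough sketch of what tool-first can look like, assuming a SQLite table named products with id and price_usd columns. The table, the product id and the function names are all hypothetical; the point is only that the price in the reply comes from a query made at reply time, never from anything the agent remembers.

```python
import sqlite3


def demo_db():
    """Build an in-memory stand-in for the live price database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id TEXT PRIMARY KEY, price_usd REAL)")
    conn.execute("INSERT INTO products VALUES (?, ?)", ("celebration-cake", 250.0))
    conn.commit()
    return conn


def get_price(conn, product_id):
    """Tool the agent must call; the price is read fresh on every call."""
    row = conn.execute(
        "SELECT price_usd FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    if row is None:
        # Fail loudly rather than let the model fill in a number.
        raise LookupError(f"no price on record for {product_id!r}")
    return row[0]


def quote_reply(conn, product_id):
    """Compose the quote from the looked-up price, not from conversation memory."""
    return f"That cake is ${get_price(conn, product_id):.2f}."
```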

How I keep production agents on the rails

The empathy cliff

Drift

A model update dropped Claudia’s average empathy score from 4.2 to 3.4 overnight. No code changed. No prompts changed. The new weights just responded differently to the same instructions.

The guardrail it spawned: Weekly canary evals: the same 50 inputs, the same rubrics, every Monday. A prompt fix recovered the score to 4.0; the canaries are why the gap lasted days, not months.

How I evaluate AI agents

The reply that never sent

Silent failure

Sonnet generated a perfectly good 389-character reply, then didn’t call the send_reply tool. The system prompt said, in bold, to always call it. The model didn’t. The customer got silence.

The guardrail it spawned: A deterministic safety net underneath the model. Structural requirements live in code that checks the model did the thing, never in prose that asks it to.
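One way such a safety net could look, as a sketch. ModelTurn and ensure_reply_sent are hypothetical names, and the fallback chosen here (sending the generated text from code) is an assumption; escalating to a human would be an equally valid response to the same check.

```python
from dataclasses import dataclass, field


@dataclass
class ModelTurn:
    """Hypothetical record of one model turn: its text and the tools it called."""
    text: str
    tool_calls: list = field(default_factory=list)


def ensure_reply_sent(turn, send_reply):
    """Check in code that the reply went out; send it ourselves if the model did not."""
    if "send_reply" in turn.tool_calls:
        return "sent_by_model"
    if turn.text.strip():
        # The model wrote a reply but skipped the tool call: deliver it anyway.
        send_reply(turn.text)
        return "sent_by_safety_net"
    return "nothing_to_send"
```

The return value is worth logging: a rising count of sent_by_safety_net is itself a signal that the model has stopped following the instruction.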

The 20% that is the business

Context-window rot

Drift

Conversation history grew until the system prompt fell out of the model’s effective attention. Quality drifted from 4.1 to 3.6 over two weeks, slowly enough that no single conversation looked broken.

The guardrail it spawned: Conversation summarisation plus drift detection on weekly trends. Any 0.3-point move on any dimension triggers an investigation.
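The drift check itself can be small. A sketch, assuming weekly average scores are stored as a mapping from dimension name to score; the function name and the dimension names are hypothetical.

```python
def drifted_dimensions(baseline, current, threshold=0.3):
    """Return {dimension: delta} for each dimension that moved by at least `threshold`."""
    flagged = {}
    for dimension, base_score in baseline.items():
        if dimension not in current:
            continue
        # Round so float noise does not hide a move of exactly 0.3.
        delta = round(current[dimension] - base_score, 2)
        if abs(delta) >= threshold:
            flagged[dimension] = delta
    return flagged


# The context-window rot above, next to a dimension that barely moved.
flags = drifted_dimensions(
    {"quality": 4.1, "empathy": 4.2},
    {"quality": 3.6, "empathy": 4.1},
)
```

Comparing against a fixed baseline rather than the previous week matters here: a slide of 0.25 per week never trips a week-over-week check, but it does trip this one in the second week.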

How I evaluate AI agents

The re-scraping bill

Runaway cost

Clawrence re-scraped websites he’d already scraped because nothing remembered between runs. $2.00 per competitor analysis, most of it spent re-reading pages that hadn’t changed.

The guardrail it spawned: A cache with a 24-hour TTL. Cost per analysis dropped to $0.60. Not every fix is clever.
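A sketch of a cache of that shape, assuming an in-process dictionary keyed by URL. TTLCache and its injected clock are hypothetical, and a real deployment would need storage that survives between runs, since the original problem was that nothing did.

```python
import time


class TTLCache:
    """Cache fetched pages and reuse them until they are older than the TTL."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}  # url -> (fetched_at, content)

    def get_or_fetch(self, url, fetch):
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: skip the scrape
        content = fetch(url)
        self._entries[url] = (now, content)
        return content
```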

How I evaluate AI agents

Studio

Product-sized

Our no-code builder died of the cold-start problem: SME owners can’t describe an app into existence from a blank canvas. The biggest exhibit in the museum is a whole product.

The guardrail it spawned: The lesson became Vobase (we build it for them), and then Volty (they shape something that already works). Pointing at running software and saying “this should work differently” is a different cognitive act from describing what should exist.

Why Studio didn’t work

The museum grows. That isn’t an admission, it’s the operating model: an agent system that hasn’t failed yet is an agent system that hasn’t been watched closely enough.