Failure museum
Every guardrail I run in production came out of something breaking. These are the breaks.
A failure taxonomy is only honest if you can name the failures. This page is the named ones: what happened, what it cost, and the guardrail each one left behind. The pattern underneath them is the loop from the guardrails post: ship, watch it fail, name the failure, write the guardrail, add the eval that proves the guardrail works.
The $300 cron
Runaway cost
An OpenClaw agentβs competitor-analysis cron ran hourly instead of daily. Roughly 50K tokens a run, 1.2M tokens a day, $300 on the bill before anyone noticed. There was no baseline for what the agentβs spend should look like, so there was nothing to alarm against.
The guardrail it spawned: WIMAUT, then per-agent budget caps that live in the runtime, not the prompt. Cost baselines per task type with an alert at 3x the rolling average.
WIMAUT: what is my agent up to?
The $180 cake quote
Stale recitation
The agent quoted a customer $180 for a cake the site priced at $250. Right cake, right size, right delivery zone, wrong price: a stale entry in working memory from a week earlier. The owner found out two days later when the customer asked why.
The guardrail it spawned: Tool-first, prose-second. Prices moved out of agent memory into a live database lookup. The agent routes; it doesnβt know.
How I keep production agents on the rails
The empathy cliff
Drift
A model update dropped Claudiaβs average empathy score from 4.2 to 3.4 overnight. No code changed. No prompts changed. The new weights just responded differently to the same instructions.
The guardrail it spawned: Weekly canary evals: the same 50 inputs, the same rubrics, every Monday. A prompt fix recovered the score to 4.0; the canaries are why the gap lasted days, not months.
How I evaluate AI agents
The reply that never sent
Silent failure
Sonnet generated a perfectly good 389-character reply, then didnβt call the send_reply tool. The system prompt said, in bold, to always call it. The model didnβt. The customer got silence.
The guardrail it spawned: A deterministic safety net underneath the model. Structural requirements live in code that checks the model did the thing, never in prose that asks it to.
The 20% that is the business
Context-window rot
Drift
Conversation history grew until the system prompt fell out of the modelβs effective attention. Quality drifted from 4.1 to 3.6 over two weeks, slowly enough that no single conversation looked broken.
The guardrail it spawned: Conversation summarisation plus drift detection on weekly trends. Any 0.3-point move on any dimension triggers an investigation.
How I evaluate AI agents
The re-scraping bill
Runaway cost
Clawrence re-scraped websites heβd already scraped because nothing remembered between runs. $2.00 per competitor analysis, most of it spent re-reading pages that hadnβt changed.
The guardrail it spawned: A cache with a 24-hour TTL. Cost per analysis dropped to $0.60. Not every fix is clever.
How I evaluate AI agents
Studio
Product-sized
Our no-code builder died of the cold-start problem: SME owners canβt describe an app into existence from a blank canvas. The biggest exhibit in the museum is a whole product.
The guardrail it spawned: The lesson became Vobase (we build it for them), and then Volty (they shape something that already works). Pointing at running software and saying βthis should work differentlyβ is a different cognitive act from describing what should exist.
Why Studio didnβt work
The museum grows. That isnβt an admission, itβs the operating model: an agent system that hasnβt failed yet is an agent system that hasnβt been watched closely enough.