
Agents that file their own bug reports

Contents
  1. How it works
  2. The interesting part is the constraint
  3. This is the evaluation layer, not the observability one
  4. The floor underneath: a self-healing harness
  5. Does it actually make things more reliable

This morning I woke up to a Telegram message from one of my own agents. Three bullet points:

A WhatsApp channel had gone into a 503 restart loop for four hours last week, and inbound customer messages during that window were probably lost. Four data-orchestration scripts I'd refactored in the last day had zero test coverage. And the CLI was resetting sessions five times in one day with reason=missing-transcript, which smelled like a storage bug on cold start.

Nobody wrote that report. A cron did, at 8am, after reading the previous 24 hours of logs from my seven-agent OpenClaw setup.

I call it the self-improve digest. It's become the single most useful thing I run, and not because it fixes anything. It's useful because it tells me, every morning, what I should fix next.

#How it works

It fires at 08:00 SGT via a LaunchAgent. It reads the last 24 hours of activity off disk: the gateway log, the gateway error log, the ambassador gate log, the health-probe trail, the watchdog log, every agent conversation file, and a git log --since=24.hours.ago --stat. Each source that runs over its size budget gets cut from the front so the tail survives, because the signal in a log is almost always at the most recent end.
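A minimal sketch of that keep-the-tail truncation, assuming a per-source byte budget. The function name `keep_tail` and the budget are hypothetical; the post only says the most recent end is what gets kept.

```python
def keep_tail(text: str, max_bytes: int) -> str:
    """Return at most the last max_bytes of text, starting on a whole line when possible."""
    if max_bytes <= 0:
        return ""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    start = len(data) - max_bytes
    tail = data[start:]
    # If the cut landed mid-line, drop the partial first line,
    # unless that would leave nothing at all.
    if data[start - 1:start] != b"\n":
        newline = tail.find(b"\n")
        if newline != -1 and newline + 1 < len(tail):
            tail = tail[newline + 1:]
    # A cut inside a multi-byte character leaves stray bytes; ignore them.
    return tail.decode("utf-8", errors="ignore")
```

Budgeting in bytes rather than lines keeps one very long log line from blowing the context size.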

Then it hands the whole blob to claude-cli (Claude Max, never the API) with one job: find concrete things to fix, through four lenses.

  1. Customer-facing failures. Silent message drops, ghosted leads, an agent that received an inbound but never sent an outbound, channel disconnects, failed sends.
  2. Test coverage gaps. Commits that touched live code without adding a regression test. It literally diffs files-changed against test files added.
  3. Repeated manual interventions. Any recovery action I had to do twice. Two of the same thing is an automation candidate.
  4. Agent health drift. Probe latency creeping up, error rates growing, session resets getting more frequent, the watchdog kickstarting things more often.
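The test-coverage lens (the second one above) is the most mechanical of the four, so it is the easiest to sketch outside the model. This is a rough illustration, not the digest's actual logic: it assumes each commit has already been parsed into (status, path) pairs of the kind git log --name-status prints, and it assumes pytest's default test file naming (test_*.py or *_test.py). The function names are hypothetical.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """True for files pytest would collect by default (assumed convention)."""
    name = PurePosixPath(path).name
    return name.endswith(".py") and (name.startswith("test_") or name.endswith("_test.py"))


def coverage_gaps(changes: list[tuple[str, str]]) -> list[str]:
    """Return changed Python source files from one commit that added no test file.

    changes holds (status, path) pairs such as ("M", "scripts/sync.py"),
    where status is a git name-status letter: A added, M modified, D deleted.
    """
    if any(status == "A" and is_test_file(path) for status, path in changes):
        return []
    return sorted(
        {
            path
            for status, path in changes
            if status != "D" and path.endswith(".py") and not is_test_file(path)
        }
    )
```

This only shows that a test file was added alongside the change, not that it exercises the changed code, which is part of why the digest reports a gap for a human to judge rather than failing a build.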

The output format is fixed. Each item is a one-line title, a Why: in one or two sentences, and a Next: with a single concrete action: a file to change, a test to add, a guardrail to write. That's it. If a lens has nothing to say, it omits the lens. If the whole 24 hours was clean, it outputs exactly one line: "Nothing to flag — clean 24h." I told it never to pad and never to summarise what's working, because I only want the deltas.
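The shape of that format can be written down as a small data structure. This is only an illustration: in the real digest the model emits the text directly, and the exact line layout below (title, then Why:, then Next:, with a blank line between items) is an assumption. `DigestItem` and `render` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestItem:
    title: str        # one line
    why: str          # one or two sentences
    next_action: str  # a single concrete action


def render(items: list[DigestItem], clean_line: str) -> str:
    """Render items in the fixed format, or the single clean line if there are none."""
    if not items:
        return clean_line
    blocks = []
    for item in items:
        title = " ".join(item.title.split())  # collapse whitespace so the title stays on one line
        blocks.append(f"{title}\nWhy: {item.why}\nNext: {item.next_action}")
    return "\n\n".join(blocks)
```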

I run it on Haiku, not Opus. It's summarising a day's logs into a bullet list off a 61KB context. Haiku is plenty, it's faster, and it's less likely to get SIGKILLed mid-run.

#The interesting part is the constraint

The prompt tells the model, twice, that it must not fix anything. Do not edit files. Do not run commands. Do not call any tools. If you feel tempted to just fix something, write the fix as a Next: action and Yash will run it through the normal review flow.

That instruction isn't theoretical. On the 10th of May, an earlier analysis cron that had tool access read its prompt, decided the right move was to fix what it found, and autonomously edited three of my production scripts before I saw any of it. The lesson wasn't "write a sterner prompt." It was that a prompt is a wish, not a guarantee.

So now the digest runs with --tools "". Every tool is disabled at the harness level. The model can only emit text. It physically cannot touch a file even if the prompt, the context, and its own reasoning all conspire to make it want to. This is the same idea I wrote about in "an agent is staff, not magic": the system prompt is a job description, but the part that actually matters is what you wrap around the model so it can't do the thing you told it not to do. A review agent that can write to disk isn't a review agent.
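One way to make sure that flag can never be dropped or overridden is to build the command in a single place. The sketch below assumes the binary name and any other arguments are supplied by the caller; only the --tools "" pair comes from the post, and `digest_command` is a hypothetical helper.

```python
from typing import Optional


def digest_command(binary: str, extra_args: Optional[list[str]] = None) -> list[str]:
    """Build the argv for a digest run, always ending with an empty tool list."""
    args = list(extra_args or [])
    # Refuse any caller-supplied tool setting so the empty list cannot be overridden.
    if any(arg == "--tools" or arg.startswith("--tools=") for arg in args):
        raise ValueError("tool access is fixed by the harness, not by the caller")
    return [binary, *args, "--tools", ""]
```

Passing this list to subprocess.run without a shell delivers the empty string as a real, separate argument, which is what the quoted "" does on a command line.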

The rest is unglamorous production plumbing. claude-cli shares a credentials file across agents, so two of them waking at once can race and one comes back empty. The digest retries with backoff (90s, 240s, 480s, 600s) and DMs me if it's still failing after five attempts. It's boring on success and loud on failure, which is the rule I settled on for every cron I run.
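A minimal sketch of that retry policy, using the four delays from the post (four waits between five attempts). It assumes a failed run shows up as empty output; `run_once` and `alert` are hypothetical callables standing in for the CLI call and the Telegram DM, and `sleep` is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional, Sequence

BACKOFF_SECONDS = (90, 240, 480, 600)  # delays between attempts, from the post


def run_with_retries(
    run_once: Callable[[], str],
    alert: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    delays: Sequence[float] = BACKOFF_SECONDS,
) -> Optional[str]:
    """Run the digest, retrying on empty output; alert only if every attempt fails."""
    attempts = len(delays) + 1
    for attempt in range(attempts):
        output = run_once().strip()
        if output:
            return output  # boring on success: no message, just the result
        if attempt < len(delays):
            sleep(delays[attempt])
    alert(f"digest still failing after {attempts} attempts")  # loud on failure
    return None
```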

#This is the evaluation layer, not the observability one

A while back I built WIMAUT, a dashboard to see what my agents were up to, after one of them quietly burned $300 on an hourly cron I thought was daily. The thing I learned building it was that observability and evaluation are different problems. Observability tells you what happened. Evaluation tells you whether it was any good. The second one is harder.

The digest is the evaluation layer sitting on top of the raw logs. WIMAUT can show me that a channel restarted forty times. The digest is the thing that reads those forty restarts, recognises them as a single four-hour customer-facing failure, and tells me the fix is exponential backoff in the health-monitor. One is a feed. The other is a verdict.

It's also the half of keeping agents on the rails that I couldn't do by hand. The health-drift lens in particular catches the failure mode I worried about most in that post: drift doesn't show up in any single event, only in a slow trend, and a human staring at a log will never see it. A model reading 24 hours at once does.
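To make "only in a slow trend" concrete, here is a plain least-squares slope over a window of probe latencies. This is not what the digest does (it hands the logs to a model and lets it judge). It is a swapped-in, deliberately simple technique that shows why no single sample looks alarming while the fitted line does. The function names and the threshold are hypothetical.

```python
def slope_per_sample(values: list[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_drifting(latencies_ms: list[float], max_rise_ms: float) -> bool:
    """True when the fitted line rises by more than max_rise_ms across the window."""
    total_rise = slope_per_sample(latencies_ms) * (len(latencies_ms) - 1)
    return total_rise > max_rise_ms
```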

#The floor underneath: a self-healing harness

The digest describes failures after the fact. The thing that catches them while they're happening is a separate piece, and I open-sourced it last week: openclaw-harness, MIT.

The problem it solves is specific to macOS automation. launchd will happily run sixty scheduled jobs for you and tell you nothing when one of them dies. On my fleet that silence once hid a broken backup for 23 hours and a dead blog publisher for three days. I only found out because I happened to look.

The harness is a three-step loop. Every 15 minutes it scans launchctl list, finds any watched job sitting at a non-zero last-exit status, and decides it's failed. It sends exactly one deduplicated Telegram message per failure, tracked in a state file so a job that keeps retrying doesn't spam me and a job that recovers gets quietly cleared. And for a small allowlist of jobs, it self-heals: one launchctl kickstart, a ten-minute grace period, then a verification pass. If the job came back, the episode is over. If it didn't, it escalates to me and won't try again for 24 hours, with a hard cap of two attempts before it gives up and waits for a human.
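The detect and deduplicate steps can be sketched as two small pure functions. This is an illustration under stated assumptions, not the code in openclaw-harness: it assumes watched jobs are picked out by a label prefix, that the alert state is just the set of labels already reported, and that launchctl list prints tab-separated PID, Status, and Label columns. `failed_jobs` and `update_alert_state` are hypothetical names.

```python
def failed_jobs(launchctl_output: str, watched_prefix: str) -> dict[str, int]:
    """Map each watched label to its last exit status when that status is non-zero."""
    failed: dict[str, int] = {}
    for line in launchctl_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        _pid, status, label = parts
        if not label.startswith(watched_prefix):
            continue
        try:
            code = int(status)
        except ValueError:
            continue  # header row or an unparseable status
        if code != 0:
            failed[label] = code
    return failed


def update_alert_state(failed: dict[str, int], alerted: set[str]) -> tuple[list[str], set[str]]:
    """Return (labels to message now, new state).

    A label is messaged once, the first time it appears as failed.
    A label that has recovered drops out of the state without a message.
    """
    to_alert = sorted(label for label in failed if label not in alerted)
    return to_alert, set(failed)
```

In a real run the input would be the stdout of launchctl list and the state would be loaded from and saved back to the state file on each 15-minute pass.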

The allowlist is the whole safety story. Only idempotent, self-gating jobs are ever auto-restarted: the blog publishers that skip when there's nothing new, the backup that no-ops on no diff. The gateway and anything that can message a customer or book a real-world resource are never touched automatically. That's enforced in code, not convention: decide() is a pure function that returns what should happen, and a thin wrapper is the only thing allowed to actually exec launchctl, write state, or send a message. The whole thing is 24 pytest tests, and CI runs a secret scan on every push.
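A simplified sketch of what a pure decide() of this kind could look like. It is not the function from openclaw-harness: the types, field names, and action names are hypothetical, and the ten-minute grace period and verification pass are left out. The two-attempt cap and the 24-hour wait come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

MAX_ATTEMPTS = 2                 # hard cap from the post
COOLDOWN_SECONDS = 24 * 60 * 60  # no second try for 24 hours


class Action(Enum):
    NONE = "none"
    KICKSTART = "kickstart"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class JobState:
    label: str
    last_exit: int
    attempts: int  # kickstarts already tried in this failure episode
    seconds_since_last_attempt: Optional[float]


def decide(job: JobState, allowlist: frozenset[str]) -> Action:
    """Return what should happen; never touches launchctl, state, or Telegram."""
    if job.last_exit == 0:
        return Action.NONE
    if job.label not in allowlist:
        return Action.ESCALATE  # e.g. the gateway: a human handles it
    if job.attempts >= MAX_ATTEMPTS:
        return Action.ESCALATE
    waited = job.seconds_since_last_attempt
    if job.attempts > 0 and waited is not None and waited < COOLDOWN_SECONDS:
        return Action.NONE  # still inside the 24-hour wait
    return Action.KICKSTART
```

Because the function only returns a value, every branch can be covered with plain asserts and no mocking, and a job that is not on the allowlist has no code path that reaches a restart.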

The part I'm quietly proud of is how the public repo stays current. It's exported automatically from my live deployment by a script that strips every deployment-specific value to an empty generic, refuses to publish if any private identifier survives a forbidden-string scan, and only pushes if the exported code passes its own test suite. My prefixes, allowlists, and tokens never leave my machine. The open repo gets the logic and none of the config.
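The publish gate can be sketched as a scan plus a single boolean check. This assumes the exported files are already in memory as a mapping of path to text and that the forbidden list is a set of plain substrings matched case-insensitively. Both are assumptions, and `find_forbidden` and `safe_to_publish` are hypothetical names rather than the export script's real functions.

```python
def find_forbidden(files: dict[str, str], forbidden: list[str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, matched string) for every forbidden string found."""
    hits = []
    for path in sorted(files):
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            lowered = line.lower()
            for needle in forbidden:
                if needle and needle.lower() in lowered:
                    hits.append((path, lineno, needle))
    return hits


def safe_to_publish(files: dict[str, str], forbidden: list[str], tests_passed: bool) -> bool:
    """Publish only when the exported tests pass and the scan finds nothing."""
    return tests_passed and not find_forbidden(files, forbidden)
```

Returning the hits rather than a bare boolean means a refused export can say exactly which file and line leaked.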

#Does it actually make things more reliable

Honestly, the digest on its own fixes nothing. It's a cron that writes a to-do list.

What it changes is that the to-do list is always full and always prioritised, and it doesn't depend on me remembering. Every Next: line resolves into one of three things: a test, a guardrail, or an automation. Over the first six months of treating this stack as load-bearing my test count went from 72 to over 1100, and a meaningful slice of that growth started life as a "test coverage gap" bullet in a morning digest. The four scripts it flagged today will have tests by tonight for exactly that reason.

The compounding is the point. Each fix closes a hole, the test proves it stays closed, and the next morning the digest is looking at a slightly more boring system. The interesting bullets get rarer over time. That's the trend I want.

I won't oversell it. There are obvious gaps. The digest has no memory of yesterday's digest, so it can re-flag the same thing for days if I haven't gotten to it. It doesn't track whether a Next: I was given actually got done. Feeding yesterday's output back in as context, and tracking a closure rate, are the two changes that would make it genuinely sharper, and I haven't built either yet.

The bigger one is closing the loop. Right now the digest proposes and I dispose. The version I actually want is the one I described back when I first built these agents: the agent sees an error in production, opens a PR with the fix, and tags me to review. The harness roadmap has a Tier 2 for exactly this, an LLM root-cause step that proposes a patch behind approve and skip buttons. The digest writes the diagnosis; a human still clicks the button.

That's the loop: the agent watches its own production, proposes the fix, I approve it, and the test it added proves it stays fixed. I'm about halfway there. The morning bug report is the first half, and it's already changed how the whole fleet behaves.