OpenClaw, six months later: what production actually demanded

Contents
  1. What changed
  2. Deterministic flows beat LLM-as-judge
  3. Atomic state writes, every time
  4. Single source of truth, ruthlessly
  5. DM-on-failure for every cron
  6. Tests mandatory, no exceptions
  7. Bot identity boundaries that never bend
  8. Secret refs, not plaintext
  9. Scope discipline with AI collaborators
  10. What doesn't work
  11. What's still rough

I did a full stability audit of OpenClaw today. The results were a mix of stuff I knew was bad, stuff I'd convinced myself was fine, and a few false alarms that wasted half an hour. The audit forced me to articulate something I'd been doing on instinct: what changes when an AI agent stack goes from personal toy to actual production.

When I first wrote about OpenClaw in December, I had two agents and a vague sense that this could be useful. Six months later there are seven, with about thirty cron jobs hanging off them, and the stack is now load-bearing.

Claudia handles all our customer-facing WhatsApp groups. She runs the Xero invoice approval flow. She sends the daily morning briefs and the AR brief on Fridays. If she misroutes a customer, it lands in front of a real person.

Clawrence runs the internal stuff. Laser-chase at 11am on weekdays (snarky pings to anyone who hasn't filed their daily update). The volts leaderboard at noon (our internal kudos system). Talenox team-pulse at 9am (birthdays, anniversaries, who's-out-today, pending-leave nags, work-pass expiry alerts). Claims used to flow through him too, before we moved that back to Aspire direct.

Then gym-booker quietly does its thing every morning at 8:57, booking my CrossFit slots two days out, with fallbacks if the 6:30am session is full. Happy is my personal PA, mail and calendar and notes, plus the morning brief and the daily health probe. Lawrence is the gym-only agent for me, Dani, and Mer.

That's seven agents, all running on a single Mac Mini in our office. None of them are "trial deployments". They all do work that someone would otherwise have to remember to do.

#What changed

The thing that changed was scope of consequence. When OpenClaw was a personal toy, a silent crash meant I'd notice in a few days, restart it, and move on. Now a silent crash means a customer doesn't get the invoice that was meant to go out, or Leonard doesn't see a pending leave application, or a birthday gets missed.

The audit today found four kinds of issue:

  1. Things I'd half-fixed and forgotten about
  2. Things I never noticed because they failed silently
  3. Things that looked fine but had subtle race conditions or atomicity holes
  4. Things the audit agent was wrong about (took me an hour to verify those false positives)

The first three are the interesting ones. The fourth is a story for another post.

#Deterministic flows beat LLM-as-judge

The single biggest pattern shift was moving deterministic work out of LLM hands.

When I first set up the AR brief cron, it was an LLM prompt: "look at the unpaid invoices, write a friendly nudge for each customer, send to thread 5025." It worked, technically. But the LLM would sometimes hallucinate a different invoice amount, or skip a customer it thought was friendly, or use slightly different wording each week.

The replacement is boring and correct: a Python script pulls invoices from Xero, formats them with _fmt_money, posts a deterministic template via tg_send_doc.sh. No LLM in the loop. The script does what it says, every Friday at 9:30am. If the auth expires, it DMs me. If Xero returns a 500, it DMs me. If everything works, it posts the brief and exits.
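A minimal sketch of what a deterministic brief can look like. This is not the real script: the invoice fields (`customer`, `number`, `amount_due`), the `fmt_money` stand-in for `_fmt_money`, and the default currency are all assumptions made for illustration. The point is that the same input always renders the same text.

```python
from decimal import Decimal


def fmt_money(amount: Decimal, currency: str = "SGD") -> str:
    """Format an amount with thousands separators and two decimals.

    The default currency is an assumption for this sketch.
    """
    return f"{currency} {amount:,.2f}"


def render_ar_brief(invoices: list[dict]) -> str:
    """Render unpaid invoices into a fixed template.

    Rows are sorted by customer, then invoice number, so the output does
    not depend on the order the API happened to return them in.
    """
    if not invoices:
        return "AR brief: no unpaid invoices."
    rows = sorted(invoices, key=lambda inv: (inv["customer"], inv["number"]))
    lines = [f"AR brief: {len(rows)} unpaid invoice(s)"]
    for inv in rows:
        lines.append(
            f"- {inv['customer']}: {inv['number']} {fmt_money(inv['amount_due'])}"
        )
    total = sum((inv["amount_due"] for inv in rows), Decimal("0"))
    lines.append(f"Total outstanding: {fmt_money(total)}")
    return "\n".join(lines)
```

Using `Decimal` rather than `float` keeps the totals exact, which matters when the number ends up in front of a customer.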

The model is good at fuzzy stuff (intake, classification, normalisation). It's bad at being a load-bearing component of a deterministic pipeline. I now treat "do you actually need an LLM here?" as the first question for any new cron. Most of the time, no.

This is the same pattern I wrote about in how I evaluate AI agents. The benchmark for production agent work isn't "does the LLM do the thing correctly most of the time". It's "is the failure mode visible, and is the deterministic backbone solid enough that a single LLM hiccup doesn't break the user's day".

#Atomic state writes, every time

This one took an audit to surface. The Xero invoice approval queue at ~/.openclaw/scripts/xero-pending-approvals.json was doing this:

def _write_pending(data: dict) -> None:
    PENDING_APPROVALS_PATH.write_text(json.dumps(data, indent=2))

A naïve overwrite. If two CLI invocations land at roughly the same time (Claudia creating an invoice in thread X while I'm dispatching an approval reply in thread Y), they both read the file, both modify their copy, both write back. Last-write-wins. One pending invoice silently disappears.

The fix is the pattern I now use for every state file:

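import json
import os

def _write_pending(data: dict) -> None:
    # Write to a sibling temp file, then swap it in with an atomic rename.
    tmp = PENDING_APPROVALS_PATH.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        os.chmod(tmp, 0o600)  # restrict permissions before any data is written
        f.write(json.dumps(data, indent=2))
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk before the rename
    # Atomic on POSIX: readers see the old file or the new one, never a partial write.
    # The renamed file keeps the temp file's 0600 mode.
    os.replace(tmp, PENDING_APPROVALS_PATH)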

Plus an fcntl.flock around the read-modify-write so two processes can't race. Plus tests that pin the atomic pattern.
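The locking half can be sketched as a context manager that holds an exclusive `fcntl.flock` for the whole read-modify-write. The `locked_state` name and the sidecar `.lock` file are assumptions for this sketch, not the real helper, and `fcntl` is Unix-only.

```python
import fcntl
import json
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked_state(path: Path):
    """Yield the state dict under an exclusive lock, then write it back atomically.

    If the body raises, nothing is written and the lock is still released.
    """
    lock_path = path.with_name(path.name + ".lock")  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            data = json.loads(path.read_text()) if path.exists() else {}
            yield data  # caller mutates the dict in place
            tmp = path.with_name(path.name + ".tmp")
            # Create the temp file as 0600 so it is never briefly world-readable.
            fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            with os.fdopen(fd, "w") as f:
                json.dump(data, f, indent=2)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A caller would write `with locked_state(path) as state: state["INV-1"] = "pending"`. Because the read happens inside the lock, a second process cannot start from a stale copy.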

That's 25 lines instead of 1. The 1-line version had been fine for six months. The 25-line version is fine forever.

#Single source of truth, ruthlessly

The audit found that the Voltade team roster was hardcoded in at least four places:

  • TG_HANDLES dict inside talenox-team-pulse.py
  • "Voltade team members" list in Claudia's system.md
  • ~/voltade-team.md reference doc
  • dri-routing.md for customer assignments

Every time someone joined or left, that was four places to update. We onboarded two people today (welcome Raphael and Jing Huan) and I realised mid-onboarding that I'd already drifted between those files.

The fix is voltade-threads.json, a single canonical JSON file with the team chat ID, every topic thread, and every team member with their Telegram handle and short name. A helper module exposes lookups: vt.handle_by_talenox_name("Raphael Xujie Yip") returns "@yippeh". vt.is_team_member("@leoksloo") returns True. The Talenox cron reads from it. Future agents will read from it.
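A rough sketch of a lookup helper in that spirit. The JSON layout here (a top-level `members` list with `talenox_name` and `handle` keys) and the `TeamDirectory` class are assumptions. Only the two method names come from the description above.

```python
import json
from pathlib import Path
from typing import Optional


class TeamDirectory:
    """Read-only lookups over a single canonical roster file."""

    def __init__(self, data: dict):
        members = data.get("members", [])  # assumed key name
        self._by_name = {m["talenox_name"]: m["handle"] for m in members}
        # Telegram handles are case-insensitive, so compare in lowercase.
        self._handles = {m["handle"].lower() for m in members}

    @classmethod
    def load(cls, path: Path) -> "TeamDirectory":
        return cls(json.loads(path.read_text()))

    def handle_by_talenox_name(self, name: str) -> Optional[str]:
        """Return the Telegram handle for a Talenox display name, or None."""
        return self._by_name.get(name)

    def is_team_member(self, handle: str) -> bool:
        return handle.lower() in self._handles
```

Every cron that needs a handle calls `TeamDirectory.load(...)` instead of carrying its own dict, so onboarding someone is a one-file change.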

The principle: if a piece of data is referenced by two crons, it lives in a shared file. Two crons each carrying their own copy is technical debt that compounds.

#DM-on-failure for every cron

This is the unglamorous one but it's the one that catches everything else.

If a cron fails silently, you don't find out until a customer complains, or a teammate asks "didn't Claudia used to post X?". By then it's been broken for days.

The rule I now apply: any new cron has to DM me on failure via the Lawrence bot. Not log to a file (I won't read it). Not send to a thread (gets ignored). A direct message to my Telegram, the same channel I see WhatsApp on.

This is the WIMAUT discipline applied at the cron-script level. Most of my crons are now boring on success and loud on failure. The loud part is the load-bearing one.
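One way to get "boring on success, loud on failure" is a thin wrapper around each job. This is a sketch under assumptions: `run_cron` and `telegram_dm` are hypothetical names, and the notifier is passed in so the wrapper can be tested without a network. The only external call is the Telegram Bot API `sendMessage` method.

```python
import sys
import traceback
import urllib.parse
import urllib.request
from typing import Callable


def telegram_dm(token: str, chat_id: str, text: str) -> None:
    """Send a plain-text message via the Telegram Bot API sendMessage method."""
    # Telegram caps message text at 4096 characters; truncate to stay under it.
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text[:4000]}).encode()
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    with urllib.request.urlopen(url, data=body, timeout=10) as resp:
        resp.read()


def run_cron(name: str, job: Callable[[], None], notify: Callable[[str], None]) -> int:
    """Run a job; stay silent on success, notify and return 1 on any exception."""
    try:
        job()
        return 0
    except Exception as exc:
        detail = traceback.format_exc(limit=5)
        try:
            notify(f"[cron failed] {name}: {exc!r}\n{detail}")
        except Exception:
            # The notifier itself broke: fall back to stderr so the failure
            # still leaves a trace somewhere.
            print(f"{name} failed and the failure DM could not be sent", file=sys.stderr)
        return 1
```

A script's entry point then becomes `sys.exit(run_cron("ar-brief", main, lambda text: telegram_dm(token, chat_id, text)))`, with `token` and `chat_id` loaded from wherever the stack keeps them.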

#Tests mandatory, no exceptions

The OpenClaw test scaffold lives at ~/openclaw/tests/unit/py/. When I started, it had about 72 tests across the agent harness. Today it's 1124.

Every new behaviour goes in with tests in the same commit. Not "I'll add tests later" (I won't). The auto-sync cron actually refuses to push if tests fail, which has saved me twice from shipping broken code that the auto-sync would have otherwise quietly committed.
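The gate itself is small. A sketch, assuming the test and push commands are passed in as argument lists. The `push_if_green` name is hypothetical and the real auto-sync cron may be structured differently.

```python
import subprocess
from typing import Callable, Sequence


def push_if_green(
    test_cmd: Sequence[str],
    push_cmd: Sequence[str],
    run: Callable = subprocess.run,
) -> bool:
    """Run the test command; run the push command only if the tests exit 0.

    Returns True if the push ran, False if the tests failed and it was skipped.
    """
    if run(test_cmd).returncode != 0:
        return False
    run(push_cmd, check=True)  # raise loudly if the push itself fails
    return True
```

In practice `test_cmd` might be something like `["pytest", "-q"]` and `push_cmd` `["git", "push"]`. Both are examples, not the actual commands.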

The tests aren't just guards against regression. They're how I think through edge cases before deploying. The cake-reminder cron I built today (a DM to me, Varick, Leonard, and Carl three days before any team birthday) has 11 tests covering the trigger, the date arithmetic, the multi-recipient dispatch, the DM-failure fallback path, the HTML escaping. I'd never write all those branches correctly on the first try. The tests force me to.
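The date arithmetic is a good example of branches that are easy to get wrong. A sketch of the "three days before" check, with hypothetical function names. Treating a 29 February birthday as 28 February in non-leap years is one possible choice, not necessarily what the real cron does.

```python
from datetime import date, timedelta


def next_birthday(born: date, today: date) -> date:
    """Return the next occurrence of the birthday on or after `today`."""
    for year in (today.year, today.year + 1):
        try:
            candidate = born.replace(year=year)
        except ValueError:
            # 29 Feb in a non-leap year: fall back to 28 Feb (an assumption).
            candidate = date(year, 2, 28)
        if candidate >= today:
            return candidate
    raise AssertionError("unreachable: next year's birthday is always in the future")


def should_remind(born: date, today: date, lead_days: int = 3) -> bool:
    """True exactly `lead_days` days before the next birthday."""
    return next_birthday(born, today) - today == timedelta(days=lead_days)
```

The year-boundary case (a 2 January birthday checked on 30 December) and the leap-day case are the two that a quick manual check usually misses.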

#Bot identity boundaries that never bend

This came out of an actual incident: Claudia once posted a confirmation message that should have come from the claims bot. Wrong identity, customer-visible. Not "wrong tone", "wrong bot".

The hard rule now: every bot uses exactly one Telegram token, no exceptions, no "I'll just borrow this one for testing". The token lives at a single canonical path, referenced via secrets.providers.local (not plaintext in config), and read by the bot's scripts via a shared helper.

If a task requires posting as a bot that isn't yours, you stop and tell the user "I can't post as that bot". You don't try to find a workaround. The audit caught a hardcoded token in pakgu-pei-call-reminder.py, lines 22 and 23, which is the kind of thing that ends with the wrong bot announcing the wrong thing at the wrong time.

The reason this matters is that confirmations posted under the wrong identity are confusing to the team ("why is the claims bot announcing an invoice?") and, more importantly, they're a real authentication-boundary violation. If you care about security at all, you should care about this.

#Secret refs, not plaintext

Until today, all seven Telegram bot tokens were stored in plaintext in ~/.openclaw/openclaw.json. They'd also leaked into five rolling backup files, one of which was world-readable on this Mac. Local exposure only, but still: not okay.

The fix is the secrets.providers.local pattern. Tokens live in ~/.openclaw/credentials/secrets.json (0600). The main config holds references:

"botToken": {
  "source": "file",
  "provider": "local",
  "id": "/telegram_botToken_clawrence"
}

Migrating took a for loop calling openclaw config set seven times, plus a daemon restart, plus a /getMe sweep to verify all seven bots came back. Took 15 minutes. Should have done it a month ago.
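For scripts that need a token outside the daemon, a shared helper can resolve the same reference shape. This is a sketch and not OpenClaw's own resolver. It assumes the `id` is a slash-separated path into the secrets JSON, and the `resolve_secret` name is hypothetical. It also refuses to read a secrets file that group or other users can access.

```python
import json
import os
import stat
from pathlib import Path


def resolve_secret(ref: dict, secrets_path: Path) -> str:
    """Resolve a {"source": "file", "id": "/key"} reference to its secret value."""
    if ref.get("source") != "file":
        raise ValueError(f"unsupported secret source: {ref.get('source')!r}")
    mode = stat.S_IMODE(os.stat(secrets_path).st_mode)
    if mode & 0o077:
        # Anything looser than owner-only defeats the point of the secrets file.
        raise PermissionError(f"{secrets_path} has mode {oct(mode)}, expected 0o600")
    node = json.loads(secrets_path.read_text())
    # Assumption: "/a/b" means secrets["a"]["b"]; a single segment is a top-level key.
    for part in ref["id"].strip("/").split("/"):
        node = node[part]
    if not isinstance(node, str):
        raise TypeError(f"secret at {ref['id']!r} is not a string")
    return node
```

The permission check turns a misconfigured secrets file into an immediate failure instead of a quiet exposure.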

The principle: no production secret should be one cat away from being in a screenshot.

#Scope discipline with AI collaborators

This is meta, but it's the thing that lets me move fast without things breaking.

Most of the work in this audit was actually done by me directing Claude Code, not me writing code directly. The audit itself was four parallel research agents reporting back into a synthesis. The fixes were sequenced commits, each with its own task ID, each tested before the next started.

What makes that work is being explicit about scope. Before any change that touches more than two files, I restate back what I'm about to do: goal, branch, files, what "done" looks like. The agent waits for confirmation. This sounds slow. It's actually faster than letting an agent freelance and then untangling what it did.

This is to code what teaching Claude how to write like me is to writing: a discipline that makes the agent useful rather than dangerous. For code, the rules are simpler. Verify before claiming. Tests are mandatory. Don't pivot. Don't add features I didn't ask for. DM me on failure.

The CLAUDE.md file in this repo runs to about 200 lines of these rules. It's the most leveraged file in the stack.

#What doesn't work

A few patterns I've abandoned:

LLM-as-judge for anything deterministic. Yes, you can prompt an LLM to "look at the leave records and tell me who's out today". Or you can parse the JSON. The parse always wins.

Hardcoded everything. Hardcoded chat IDs, thread IDs, user IDs, paths, tokens. Each one is a future bug. The audit found drift in places I didn't know had been touched.

Silent failure recovery. "Just catch the exception and continue" feels safe. It's not. It hides bugs until they compound. A bare except: pass is the developer telling future-developer "you'll figure it out". Future-developer will not figure it out at 11pm on a Friday.

Memory files as a substitute for tests. I tried writing "remember to do X" as a memory entry instead of writing a test for X. The memory entry is a vibe. The test is enforcement.
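To make the first of those patterns concrete, here is what "parse the JSON" can look like for the who's-out-today question. The record shape (`name`, `status`, and ISO-format `start` and `end` dates) is an assumption for this sketch, not the actual Talenox payload.

```python
from datetime import date


def out_today(leave_records: list[dict], today: date) -> list[str]:
    """Return the sorted names of people with approved leave covering `today`."""
    names = {
        rec["name"]
        for rec in leave_records
        if rec.get("status") == "approved"
        and date.fromisoformat(rec["start"]) <= today <= date.fromisoformat(rec["end"])
    }
    return sorted(names)
```

There is nothing for a model to get creative with here: the answer is a filter and a sort.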

#What's still rough

We're still early on a lot of things. The ambassador qualifier has been live for two weeks and is still learning what counts as a real lead. The Xero auto-rotate isn't there yet (the refresh token rolls every 60 days and I get to it manually). The WhatsApp socket goes stale every 35 minutes and the health-monitor auto-recovers; I haven't traced why.

Token rotation post-audit is also still a manual job. After today's audit, seven Telegram bot tokens technically need rotating via BotFather since they sat in a world-readable backup for some unknown duration. The exposure was local-only but the textbook move is to rotate. That's a tomorrow problem.

What this audit did, more than fix anything specific, was force me to make the implicit rules explicit. The rules existed. They were just in my head, applied inconsistently. Writing them down as a project_openclaw_production_status.md memory file means the next time Claude Code touches this stack, it sees them first. The agent collaboration gets sharper because the agent now knows what production-grade looks like for this specific stack.

If you're running an OpenClaw-shaped stack and you're past the demo phase, the cheapest stability investment isn't a new monitoring tool. It's an honest audit, a memory file describing what production-grade means for you, and a few atomic-write helpers in your state-touching scripts. The rest follows.