
Seven agents on a Mac Mini: four months of breaking my OpenClaw harness

Contents
  1. Scope discipline: one agent, one job
  2. Deterministic flows: tool first, prose second
  3. What broke until I learnt this
  4. Why scope discipline makes deterministic possible
  5. What is still hard
  6. Where this goes next

There is a Mac Mini in the corner of my flat in Singapore. It is plugged into the wall, on the home Wi-Fi, with a Tailscale tunnel for remote access. On it I run OpenClaw, and inside OpenClaw I run seven personal AI agents.

The roster, as of today:

| Agent | Role | Scope |
| --- | --- | --- |
| Sir Lawrence 🫖 | Gym booker | Bookings at Ragtag CrossFit for me and two friends |
| Happy 🎯 | Personal PA | Email, calendar, Apple Notes, morning brief, trading bot |
| Vibby | Vibe Makers ops | Outreach, marketing copy, lead handling for the academy |
| Pakgu | Dialogic Academy ops | Debate / public-speaking enquiries, partner schools |
| Clawrence ⚡ | Voltade internal ops | Strategy, builds, analyst work |
| Claudia 🦞 | Voltade customer-facing | Monitors customer groups, helps the team |
| Ambassador | Voltade outbound | Currently unfilled, scaffold only |

I started this around January. Lawrence (workspace bootstrapped 2026-03-13) was the first. By mid-March I had a multi-tool agent that could book my gym classes, summarise my inbox, read my calendar, send me a morning brief, and ping me when a node went down. It was magical for about two weeks. After that I spent four months breaking and rebuilding it.

I think I have finally cracked it. The two insights that fixed everything are intertwined: scope discipline (one agent, one job) and deterministic flows (tool first, prose second). Both took months to absorb, and the second only works because of the first.

This is a long post about what each agent does, what broke, and the patterns that finally stuck. If you are building a personal AI harness, my guess is you will repeat my mistakes, but at least you can make them faster.

# Scope discipline: one agent, one job

For the first six weeks, Lawrence did everything. The morning brief, the gym booking, email triage, calendar lookups, the trading bot status, the agent-health probe, and the node-disconnection alerts. The system prompt was 2,400 words. The tool list had eleven entries. It worked, until it didn't.

The way it failed was subtle. Individual tasks worked. The agent could book me into a 7:30am FIT class on Monday cleanly. It could summarise my inbox cleanly. But when I asked it to do anything that touched two domains, the responses got worse. Not catastrophic, just worse. It would give me a morning brief and lose track of which Coinbase position I was asking about. It would confidently recite the Ragtag class lineup from memory, sometimes right, sometimes wrong.

I had a theory about why, but I kept patching it. More system-prompt instructions, more "always check this, never assume that," more decision trees. By April the system prompt was 4,100 words and the agent felt slower and less reliable, not more.

The break came on 13 May. Someone asked Lawrence what classes were on at the gym on Saturday. The agent listed the lineup from memory. The list happened to match reality. I would not have noticed except I knew the schedule had quietly shifted a few weeks earlier and Lawrence had not seen the update. I went to look at the system prompt and realised I had two contradictions and three out-of-date defaults, all of which had been valid two months ago.

The honest read was that the agent had drifted because the surface area was too large. A 4,000-word system prompt is not where you check whether something is fresh. There were too many domains for me to keep current, and the agent had no way to know which parts of its own memory were stale.

The fix landed on 2026-05-10, three days before the lineup incident. I split Lawrence. A new agent, Happy (named after Happy Hogan: casual-but-capable PA energy), took over inbox, calendar, notes, trading-bot status, the agent-health probe, and the morning brief. Lawrence was narrowed to one thing: gym bookings for three users. Roughly:

## Scope (as of 2026-05-10): GYM ONLY

You are the gym booker agent. Your one job is bookings at Ragtag
CrossFit for Yash and two friends.

You do NOT handle:
- Email, that's Happy now
- Calendar, Happy
- Apple Notes, Happy
- Trading bot status, Happy
- Daily morning brief, Happy delivers it now
- Voltade work, that's Main / Clawrence / Ambassador

If a user asks for any of the above, decline politely and point
them to the right agent. Don't apologize, don't explain the
architecture, don't re-explain every time. Just:
"Not me, Happy handles that for Yash" or "Not my lane."

Lawrence's system prompt dropped from 4,100 words to about 1,800. Happy's prompt is about the same size, but covers a different domain.

The result was immediate. Both agents got more reliable inside their scope. Cross-domain handoffs went from "agent guesses" to "agent says 'not my lane' and points." More importantly, the human cost of running the fleet went down. When I update Lawrence's gym defaults, I do not have to remember whether the change also lands somewhere in the email scope. There is no email scope.

The lesson generalises. The agents that have stayed sharp (Vibby for the Vibe Makers academy, Pakgu for Dialogic Academy, Claudia for Voltade customer groups) are the ones with one domain each. The agents that gave me trouble were the ones I asked to span domains. Multi-purpose agents look efficient on the day you build them. They quietly become a maintenance liability the moment any domain shifts.

There is a real cost. More agents means more handoffs, more bot tokens to manage, more cron files. The fleet now has six Telegram bot tokens, four schedulers, and a shared routing file at ~/.openclaw/workspace-shared/telegram-routing.md that says who handles what. The architecture is a graph, not a tree. But each node is small enough that I can read its memory in one sitting and know what it does.
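
The routing file itself is nothing clever. Reconstructed from the roster above rather than copied from disk, it is roughly:

## Who handles what

- Gym bookings (Ragtag) → Lawrence
- Email / calendar / Apple Notes / morning brief → Happy
- Vibe Makers academy → Vibby
- Dialogic Academy → Pakgu
- Voltade internal → Clawrence
- Voltade customer groups → Claudia

One line per domain, one agent per line. The "not my lane" pointers route against it.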

# Deterministic flows: tool first, prose second

Scope discipline fixes the question of what each agent owns. The second insight is about how the agent makes decisions within its scope. The pattern that survived four months of debugging is: the LLM should never make a decision that a tool can make for it.

Concretely: when a user asks Lawrence "what classes are available on Saturday," there are two ways to handle this.

The bad way is for the agent to look at its own memory, find a list of classes it has cached, and recite them. The agent does this fluently. The user gets an answer in 200ms. The answer is wrong about 12% of the time because the gym shifts instructors and times more often than I update the prompt.

The good way is for the agent to call gym-actions.py classes --date 2026-05-17, get a fresh response from the Mindbody API, and quote it. Slower by a second or two. Always right. The class list never lives in the agent's prompt or memory, so there is nothing to drift.
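
For concreteness, here is the shape of that tool as a minimal Python sketch, not the real gym-actions.py: the endpoint URL and response fields are placeholders, since the actual Mindbody API differs, but the structure is the point. One subcommand, one fresh API call, no cache.

import argparse
import requests

# Placeholder endpoint -- not the real Mindbody API.
MINDBODY_URL = "https://api.example.com/classes"

def classes(date):
    """Fetch the live class list for one date. No cache, by design."""
    resp = requests.get(MINDBODY_URL, params={"date": date}, timeout=10)
    resp.raise_for_status()
    # Quote only what the user needs; field names here are assumptions.
    return [
        {"time": c["start_time"], "name": c["name"], "coach": c["staff"]}
        for c in resp.json()["classes"]
    ]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="gym-actions.py")
    sub = parser.add_subparsers(dest="command", required=True)
    p = sub.add_parser("classes")
    p.add_argument("--date", required=True, help="YYYY-MM-DD")
    args = parser.parse_args()
    for c in classes(args.date):
        print(f"{c['time']}  {c['name']}  ({c['coach']})")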

Lawrence's memory file now has this rule in capital letters, near the top:

## ⛔ NEVER fabricate the Ragtag schedule

When a user asks "what's available on Sat", "what classes run Mon
morning", "what time is FIT on Friday", ALWAYS call
`gym-actions.py classes --date <YYYY-MM-DD>` first. Do not list
instructors, times, or class types from memory, ever. The gym
changes their lineup. You guessing right once is luck; guessing
wrong once damages trust. Past incident (Yash, 2026-05-13):
listed Sat lineup from memory, got pulled up on it, even though
the list happened to match. Rule is: tool first, prose second.

That rule generalises to the whole harness. The shape is:

  1. Define a decision tree at the top of the agent's memory. "Classify the user's ask into one of: permanent-weekly / temporary-single-date / reschedule-booked / cancel / pause / list-classes."
  2. Map each branch to exactly one tool call. The tool name lives in TOOLS.md.
  3. The tool, not the agent, owns validation, state, and side effects.
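
In Lawrence's memory that block now looks roughly like this. The branch names are the real ones from step 1; the subcommands on the right are illustrative, since the actual mappings live in TOOLS.md:

## Routing

Classify the ask into exactly one branch, then call the mapped tool:

  permanent-weekly       → gym-override.py (edit defaults)
  temporary-single-date  → gym-actions.py book
  reschedule-booked      → gym-actions.py reschedule
  cancel                 → gym-actions.py cancel
  pause                  → gym-override.py pause
  list-classes           → gym-actions.py classes --date <YYYY-MM-DD>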

When I built Lawrence originally, the decision-tree branches were paragraphs of prose. "If the user wants a permanent change, edit the defaults in gym-override.py. If they want a one-off, do this thing. If the booking already exists, you'll need to..." The agent would re-derive the routing on every turn. Sometimes it routed correctly. Sometimes it pasted the user's natural-language input into a parameter that did not exist.

The fix was to push routing into the tool layer. gym-onboard.py is a state machine. It knows that a user is in awaiting_creds state, or awaiting_schedule, or complete. The agent does not get to choose which step is next. It calls gym-onboard.py status --caller <chat_id> and the tool tells it.

# What the tool returns, lightly summarised
{
  "state": "awaiting_creds",
  "next_action": "set-creds",
  "user_id": "166637821",
  "display_name": "..."
}

The agent parses this and routes accordingly. There is no LLM call where the model is asked to decide "is this user past the credentials step?" The tool answers that, deterministically, from the database.
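
The deterministic half is only a few lines. A sketch, assuming SQLite and table/column names I am inventing for illustration (the real gym-onboard.py takes --caller rather than a bare argument):

import json
import sqlite3
import sys

DB_PATH = "onboarding.db"  # assumed; the real path will differ

NEXT_ACTION = {
    "awaiting_creds": "set-creds",
    "awaiting_schedule": "set-schedule",
    "complete": None,
}

def status(caller):
    con = sqlite3.connect(DB_PATH)
    row = con.execute(
        "SELECT user_id, display_name, has_creds, has_schedule "
        "FROM onboarding WHERE chat_id = ?",
        (caller,),
    ).fetchone()
    if row is None:
        return {"state": "new", "next_action": "start"}
    user_id, display_name, has_creds, has_schedule = row
    # State is a pure function of what has been committed. The model
    # never guesses which step comes next; it reads this answer.
    if not has_creds:
        state = "awaiting_creds"
    elif not has_schedule:
        state = "awaiting_schedule"
    else:
        state = "complete"
    return {
        "state": state,
        "next_action": NEXT_ACTION[state],
        "user_id": str(user_id),
        "display_name": display_name,
    }

if __name__ == "__main__":
    print(json.dumps(status(sys.argv[1])))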

This is the boring half of the insight. The interesting half is that every part of the agent's decision path benefits from this treatment. Memory state, scheduling state, alert state, conversation state. Anywhere I had the LLM derive an answer from prose, I have now pushed it into a tool with a typed response.

# What broke until I learnt this

Three failure modes surfaced repeatedly. All are variants of "the LLM makes a decision a tool should make."

Parallelised tool calls breaking state machines. On 10 May, a new user messaged Lawrence to onboard. Lawrence ran three onboarding previews in parallel: start, set-creds, set-schedule. The state machine rejected the second and third because set-creds requires a user record (which start --yes had not committed yet) and set-schedule requires awaiting_schedule state (which set-creds --yes had not committed yet). Lawrence saw the error, told her "let me try again," and looped. The onboarding stalled for ten minutes until I noticed.

The fix went into the memory file the same evening: "NEVER parallelize the onboarding tool calls. Each step's --yes commit must land before previewing the next." The state machine enforces it; the rule in memory just tells the agent to stop trying.
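
The enforcement itself is just a precondition check before every commit. A sketch, with illustrative names:

REQUIRED_STATE = {
    "start": None,                    # only valid for a brand-new caller
    "set-creds": "awaiting_creds",
    "set-schedule": "awaiting_schedule",
}

def check_precondition(action, current_state):
    """Refuse out-of-order commits instead of corrupting the flow."""
    required = REQUIRED_STATE[action]
    if current_state != required:
        # Parallel previews trip exactly here: set-creds arrives while
        # the caller's state is still None, because start --yes has not
        # committed yet.
        raise RuntimeError(
            f"{action} requires state {required!r}, got {current_state!r}"
        )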

Split-turn commits losing data. Same week, a different new user sent her gym credentials. Lawrence previewed them ("I'm going to save these as your Mindbody login, ok?"). She replied "yes." Lawrence's session memory was fresh on her "yes" message and had nothing to commit. It asked her for the credentials again. She sent them again. Same loop. She sent them three times before I realised the session was not carrying preview state across turns.

The fix was to remove the human-in-the-loop "preview then commit" pattern entirely for onboarding. The new rule: "As soon as the user provides the data, parse it and call set-creds --yes directly with those values. One turn, one tool call." This is less safe in theory (no human confirmation) but durably correct, because the alternative (split confirmations) was always broken. Onboarding is also bounded: if the agent commits the wrong password, the next Mindbody booking fails fast and the user retries.

Recitation drift. The 13 May Ragtag incident. Already covered.

The pattern across all three: the LLM was being asked to hold state across a flow when a tool could have done it. Once I moved state into tools, the failures stopped.

# Why scope discipline makes deterministic possible

Here is the part that took me longest to see. The two insights are not independent. Deterministic flows only work when each agent's scope is small enough to fit a clean decision tree.

When Lawrence had eleven tools across seven domains, the decision tree at the top of memory was unwieldy. The agent could not route a user's free-text message to the right branch reliably, because the branches were too many and too varied. Half the tools shared the same first word ("gym"), which made them hard to tell apart, for the embeddings and for the model alike. The agent would call gym-override.py when it should have called gym-actions.py because both look like "gym stuff."

Once Lawrence's scope dropped to bookings only, the decision tree fit in fifteen lines. Six branches, each mapped to one tool. The agent stopped misrouting because there was less to misroute between.

In other words: deterministic flows are a pattern that scales inversely with scope size. The harder you push deterministic flows, the more pressure there is to keep each agent's scope small. The two insights co-evolved in the harness. I do not think you can have one without the other.

# What is still hard

Three things I have not solved.

Inter-agent state. When Happy delivers the morning brief, it includes a gym summary. That summary lives in Lawrence's data. Happy currently calls a script that hits Lawrence's database directly, which is fast but creates a coupling. The right fix is probably an HTTP boundary between agents, but I have not built it because the coupling has not bitten yet. It will.
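
The boundary would not need to be much. A sketch of one possible shape, hypothetical since I have not built it: Lawrence serves a read-only summary on loopback, and Happy fetches that instead of opening the database file.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def gym_summary():
    # In the real thing this would read Lawrence's own database.
    return {"booked": ["Mon 07:30 FIT"], "pending": []}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/gym/summary":
            self.send_error(404)
            return
        body = json.dumps(gym_summary()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Loopback only; the port is arbitrary.
    HTTPServer(("127.0.0.1", 8777), Handler).serve_forever()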

Onboarding new users at the agent level. Both new users above were onboarded by Lawrence. The state machine made it correct, but the experience was clunky. Future users get a four-step flow over WhatsApp messages, and each step has its own preview-and-approve cycle. I do not love it. The right shape is probably one ten-second voice message where the user describes their schedule and the agent fills it in, with a single confirmation at the end. That requires speech-to-text and a more sophisticated parser. Backlog.

Cost discipline across the fleet. Each agent runs on Claude Opus or Sonnet via my Claude Max subscription, so the marginal cost is low. But when something goes wrong (a webhook misfires, a cron retries), the cost can spike. I wrote about this with WIMAUT, where an OpenClaw cron silently burned $300 because no one was watching. The agent-health probe runs daily now and pages me on anomalies, but I still do not have per-agent cost attribution. If Vibby is suddenly using 10x the tokens, I find out at end-of-month. The pattern from the model-selection post probably belongs at the agent level too: triage / main / planning tiers per agent, with logging.

# Where this goes next

The fleet has stabilised enough that I do not check on it every day. I get a morning brief from Happy at 08:00 SGT (consolidated status across all agents). I get a failure alert if anything goes wrong. Most days, that is the entire interaction.

The next things on my list are voice (most of these flows want to be voice-first, not text) and a real evaluation harness for each agent (the adaptive-software and evaluation thinking from Voltade ported into the personal fleet). I want to know, mechanically, that Lawrence's success rate on gym bookings is 99.7% and not 92.3%. Right now I find out by feel.

If you are building a personal AI harness and want to copy the pattern: start with one agent, one job. Push every decision into a tool. Resist the urge to make any agent multi-purpose, no matter how clever the system prompt gets. The system prompt is the wrong place to put logic that needs to stay current.

And honestly: do not build seven agents. I have seven because Lawrence needed friends, but the right number for most people is two or three, well scoped, well instrumented. Mine has crept up because I kept finding small jobs worth automating, and OpenClaw made it cheap to add another. The harness rewards growth and punishes laziness. Mine has seen both, and now it is mostly the first.