Frameworks
I notice myself reapplying the same five or six patterns across Voltade, Vobase, and my personal agent fleet. Naming them helps me reuse them and helps a team adopt them without me explaining the rationale every time. They are not theory. Each one came out of something breaking in production.
1. Three-layer agent eval
Problem. A single eval number (LLM-as-judge, or "did the agent complete the task") collapses everything that matters into one signal and hides drift.
Pattern. Three layers, applied in order, with explicit cost and confidence trade-offs:
- Deterministic checks, the cheapest layer. Did the agent call the right tool? Did the schema validate? Did the output stay inside the allowed enum? Runs on every wake. Catches structural failures fast.
- LLM-as-judge with calibration, the middle layer. A small set of human-labelled examples calibrates the judge model, which then scores at scale. Used for things deterministic checks cannot read (tone, helpfulness, hallucination).
- Human sample, the slow layer. A weekly sample of N conversations gets human review. The signal that catches what the other two miss, and the source of truth when the judge is wrong.
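A minimal sketch of the three layers, assuming a hypothetical WakeResult shape; the tool name, status enum, and callJudge stub are illustrative, not a real API:

```typescript
type WakeResult = {
  toolCalls: { name: string; args: unknown }[];
  output: { status: string; reply: string };
};

const ALLOWED_STATUSES = ["resolved", "escalated", "pending"];

// Layer 1: deterministic checks. Cheapest, runs on every wake, fails fast.
function deterministicChecks(result: WakeResult): string[] {
  const failures: string[] = [];
  if (!result.toolCalls.some((c) => c.name === "lookup_order")) {
    failures.push("expected tool lookup_order was never called");
  }
  if (!ALLOWED_STATUSES.includes(result.output.status)) {
    failures.push(`status "${result.output.status}" is outside the allowed enum`);
  }
  return failures;
}

// Layer 2: calibrated LLM-as-judge. Only invoked on wakes that pass layer 1.
// callJudge stands in for a judge model prompted with human-labelled examples.
async function callJudge(input: { reply: string; rubric: string }): Promise<number> {
  return 0.5; // placeholder score in [0, 1]
}

async function judgeScore(result: WakeResult): Promise<number> {
  return callJudge({
    reply: result.output.reply,
    rubric: "tone, helpfulness, hallucination",
  });
}

// Layer 3: weekly human sample. Pick n wakes uniformly for manual review.
function sampleForHumanReview<T>(wakes: T[], n: number): T[] {
  const pool = [...wakes];
  const picked: T[] = [];
  while (picked.length < n && pool.length > 0) {
    picked.push(pool.splice(Math.floor(Math.random() * pool.length), 1)[0]);
  }
  return picked;
}
```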
Failure it fixed. Agents that scored 95% on a single metric were still embarrassing customers on the failures the metric did not see.
Read more. How I evaluate AI agents.
2. Behaviour spec (job description for an agent)
Problem. The system prompt is a 1,200-word wall that does not answer the questions an SME owner actually wants answered: what jobs has this agent been hired to do, what jobs has it been told to refuse, and who does it work for when those things conflict?
Pattern. A short, structured behaviour spec separate from the system prompt. Three sections:
- Hired to do. The two or three jobs this agent is responsible for, written like staff job-description bullets.
- Hired to refuse. The classes of request that route elsewhere (refunds over X, legal, anything outside scope). Each gets a routing target.
- Conflict rules. When the customer wants one thing and the owner wants another, what wins by default and what triggers an override.
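The spec is small enough to live as data. A sketch of one possible shape, with made-up field names and a made-up agent rather than a fixed schema:

```typescript
interface BehaviourSpec {
  hiredToDo: string[]; // staff job-description bullets
  hiredToRefuse: { request: string; routeTo: string }[]; // every refusal has a routing target
  conflictRules: { conflict: string; defaultWinner: string; override: string }[];
}

const frontDeskAgent: BehaviourSpec = {
  hiredToDo: [
    "Answer order-status questions from existing customers",
    "Book, move, and cancel appointments",
  ],
  hiredToRefuse: [
    { request: "Refunds over $100", routeTo: "owner approval queue" },
    { request: "Legal or medical questions", routeTo: "human staff" },
  ],
  conflictRules: [
    {
      conflict: "Customer wants a same-day slot the owner has blocked",
      defaultWinner: "owner",
      override: "Owner can whitelist a customer for blocked slots",
    },
  ],
};
```

An owner can review that object the way they would review a job ad; the system prompt could even be generated from it.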
Failure it fixed. Owners did not trust the agent because they could not articulate what it would and would not do. The behaviour spec made that an explicit, reviewable, editable artefact.
Read more. An agent is staff, not magic.
3. Scope-and-state
Problem. Multi-purpose agents drift. A 4,000-word system prompt with eleven tools across seven domains is unmaintainable. The agent makes the wrong decision because there is too much to keep coherent.
Pattern. Two rules, applied together:
- Scope discipline. One agent, one job. Cross-domain work means another agent and a routing layer, not a bigger prompt.
- Tool-first, prose-second. Every decision that a tool can make for the agent is pushed into a tool with a typed response. State machines live in tools, not in prose instructions.
The two rules depend on each other. Deterministic flows only work if the agent's scope is small enough that a decision tree fits in fifteen lines.
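What tool-first looks like in practice: a sketch of an onboarding tool whose typed response tells the agent exactly one next step, so the LLM never decides the flow. The states and validation here are hypothetical:

```typescript
type OnboardingState = "collect_name" | "collect_phone" | "confirm" | "done";

interface OnboardingStep {
  state: OnboardingState;
  nextPrompt: string; // the one thing the agent should do next
}

// The whole decision tree fits in a handful of lines because the scope is one job.
function advanceOnboarding(state: OnboardingState, answer: string): OnboardingStep {
  switch (state) {
    case "collect_name":
      return { state: "collect_phone", nextPrompt: "Ask for their phone number." };
    case "collect_phone":
      return /^\+?\d{8,15}$/.test(answer)
        ? { state: "confirm", nextPrompt: "Read the details back and ask them to confirm." }
        : { state: "collect_phone", nextPrompt: "That phone number looks invalid; ask again." };
    case "confirm":
      return { state: "done", nextPrompt: "Thank them and close the conversation." };
    case "done":
      return { state: "done", nextPrompt: "Nothing left to do." };
  }
}
```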
Failure it fixed. Three of them, all variants of "LLM made a decision a tool should have made": parallelised onboarding tool calls, split-turn commits that lost state, and recitation drift on data that changes.
Read more. Seven agents on a Mac Mini.
4. Self-learning loop
Problem. Adaptive software is just a pitch-deck slide until the agent stops repeating mistakes the staff have already corrected. Templates and AI editors alone do not get you there.
Pattern. Four stages, end to end:
- Wake events as the substrate. Every observable thing the agent does emits a typed event. Staff actions during the wake (notes, approvals, supervisor mentions) are on the same stream.
- Staff-signal detection, a pure function that scans the stream for teaching moments. Four shapes: supervisor mention, approval rejection, internal note during a wake, reassignment with a reason.
- Change proposals as the unit of learning. Each signal becomes a markdown patch against the agent's working memory or skills file. Same pipeline staff use to propose any other change.
- Applied skills. Some change types auto-apply with full audit; others require approval. Skills live in modules/agents/skills/ as markdown files with frontmatter.
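A sketch of the detection stage as a pure function over the wake-event stream; the event and signal types are illustrative, not Vobase's actual schema:

```typescript
type WakeEvent =
  | { kind: "supervisor_mention"; text: string }
  | { kind: "approval"; approved: boolean; reason?: string }
  | { kind: "internal_note"; text: string }
  | { kind: "reassignment"; reason?: string }
  | { kind: "tool_call"; name: string };

interface TeachingSignal {
  shape:
    | "supervisor_mention"
    | "approval_rejection"
    | "internal_note"
    | "reassignment_with_reason";
  evidence: string; // what the staff member actually said or did
}

// Pure function: no I/O, easy to test, easy to replay over historical wakes.
function detectStaffSignals(events: WakeEvent[]): TeachingSignal[] {
  const signals: TeachingSignal[] = [];
  for (const e of events) {
    if (e.kind === "supervisor_mention") {
      signals.push({ shape: "supervisor_mention", evidence: e.text });
    } else if (e.kind === "approval" && !e.approved) {
      signals.push({ shape: "approval_rejection", evidence: e.reason ?? "rejected without a reason" });
    } else if (e.kind === "internal_note") {
      signals.push({ shape: "internal_note", evidence: e.text });
    } else if (e.kind === "reassignment" && e.reason) {
      signals.push({ shape: "reassignment_with_reason", evidence: e.reason });
    }
  }
  return signals;
}
```

Each TeachingSignal then feeds the change-proposal stage as a candidate markdown patch.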
Failure it fixed. Agents that the owner had to correct on the same mistake every week. The loop converts those corrections into durable behaviour without anyone touching a system prompt.
Read more. How Vobase agents learn.
5. Cost tiering (triage / main / planning)
Problem. Defaulting to the smartest model for every task burns money on jobs that did not need to be smart. Defaulting to the cheapest model is worse, because the things that do need to be smart get done badly.
Pattern. Three model tiers, picked per job by the agent at routing time:
- Triage (Haiku-class). Classifying inbound, summarising, log-line work, anything where speed matters more than smarts.
- Main (Sonnet-class). Customer-facing inference. The right balance of latency, capability, and cost for live conversation.
- Planning (Opus-class). Multi-step reasoning, decision documents, ambiguous user intent. Used sparingly; the cost only pays back on hard jobs.
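Routing is a few lines once the tiers exist. A sketch, with placeholder model identifiers standing in for whatever Haiku-, Sonnet-, and Opus-class models you run:

```typescript
type Tier = "triage" | "main" | "planning";

const MODEL_FOR_TIER: Record<Tier, string> = {
  triage: "haiku-class-model",
  main: "sonnet-class-model",
  planning: "opus-class-model",
};

// Picked per job at routing time; the job flags here are illustrative.
function pickTier(job: { customerFacing: boolean; multiStep: boolean }): Tier {
  if (job.multiStep) return "planning"; // only pays back on hard jobs
  if (job.customerFacing) return "main"; // live conversation needs the balance
  return "triage"; // classification, summaries, log-line work
}

const model = MODEL_FOR_TIER[pickTier({ customerFacing: false, multiStep: false })];
```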
Failure it fixed. An OpenClaw cron silently burned $300 because I had lazily defaulted to the most expensive model and a retry loop multiplied the cost. Tiering by job is the antidote.
Read more. Cost is a product feature.
6. Adaptive software shape (template + adaptive layer + AI editor)
Problem. SaaS is too rigid; blank-canvas AI builders have a cold-start problem (most SME owners cannot describe their app from scratch). Both fail in the middle market.
Pattern. Three product layers:
- Vertical template, 80% ready on day one. Not generic SaaS; a product that already looks like a bakery, a clinic, or a childcare centre.
- Adaptive layer, the widget the owner talks to. "Add a column for deposit amount." "Flag orders over $200 for manager approval."
- AI editor, the part that actually edits the live app. Not a report skin, the real product.
The shape only feels adaptive when a fourth layer (the self-learning loop, framework 4 above) closes the cycle. Owners notice the agent stop making the mistake; that is the moment trust changes.
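To make the middle two layers concrete: a sketch of what the adaptive layer could hand the AI editor, using the two owner requests above. The edit shapes are illustrative, not the real product's schema:

```typescript
type AppEdit =
  | { op: "add_column"; table: string; column: string; type: "money" | "text" | "date" }
  | { op: "add_rule"; trigger: string; condition: string; action: string };

// "Add a column for deposit amount."
const addDeposit: AppEdit = {
  op: "add_column",
  table: "orders",
  column: "deposit_amount",
  type: "money",
};

// "Flag orders over $200 for manager approval."
const flagLargeOrders: AppEdit = {
  op: "add_rule",
  trigger: "order_created",
  condition: "total > 200",
  action: "require_manager_approval",
};
```

The point of the structure is that the editor applies a typed edit to the live app rather than regenerating it from a prompt.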
Failure it fixed. Studio (our Lovable-style blank-canvas attempt) did not get pickup from SMEs. Owners could not describe an app from a blank prompt; they could point at something running and say "this should work differently."
Read more. Adaptive software, The death of SaaS.
If you copy any of these, please do. None of them are original to me; I have only spent enough months breaking them in production to give them names. The stack I use to ship them is on the stack page.