The Trust Budget
The agent I trust most can see my entire net worth. Every account, every balance, every card transaction. It runs on a Mac Mini in my office and I let it read all of it without a second thought.
The agent I trust least manages $100 of crypto in a sealed box I check obsessively.
The difference between them isn't capability. They run on the same model. The difference is how much I let each one actually do, and I've started thinking about that as a budget I allocate rather than a setting I flip on.
#Autonomy is a function of two things
When I decide how much rope to give an agent, I'm really pricing two variables.
How reversible is the action? If the agent does the wrong thing, can I undo it cheaply, or is it gone? Reading data is perfectly reversible. Sending a message to a customer is half-reversible (you can follow up, but they've already read the first one). Moving money is not reversible at all.
How big is the blast radius? If this goes wrong, what's the worst case? A summary that's slightly off affects one person's afternoon. A runaway cron affects my bill. A bad write to a production database affects every customer at once.
Multiply those together and you get the cost of a mistake, and the trust budget for that action runs opposite to it. Cheap and reversible gets full autonomy. Expensive and irreversible gets a human in the loop, or doesn't get near the agent at all. Everything else sits somewhere in between.
The thing I keep relearning: this is set by the environment, not by the model.
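If I had to write the pricing down, it might look something like the sketch below. Everything in it is an assumption for illustration: the 0-to-1 scales, the two thresholds, and the scores I gave each example action are made up, and in practice I do this by feel rather than by formula.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    FULL = "full autonomy"
    GATED = "human in the loop"
    BLOCKED = "kept away from the agent"


@dataclass(frozen=True)
class Action:
    name: str
    irreversibility: float  # 0.0 = trivially undone, 1.0 = cannot be undone
    blast_radius: float     # 0.0 = affects nobody, 1.0 = affects everyone


def trust_budget(action: Action, gate_at: float = 0.1, block_at: float = 0.6) -> Autonomy:
    """Map the cost of a mistake to an autonomy level (thresholds are illustrative)."""
    cost_of_mistake = action.irreversibility * action.blast_radius
    if cost_of_mistake >= block_at:
        return Autonomy.BLOCKED
    if cost_of_mistake >= gate_at:
        return Autonomy.GATED
    return Autonomy.FULL


# Illustrative scores only; the point is the ordering, not the numbers.
read_balances = trust_budget(Action("read balances", 0.0, 0.3))
message_customer = trust_budget(Action("message a customer", 0.5, 0.4))
write_prod_db = trust_budget(Action("write to production database", 0.9, 1.0))
```

Note that the model never appears as an input. The only way to move an action into a cheaper tier is to change one of the two numbers, which means changing the environment.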
#The scariest-sounding agent is the safest
Happy, the agent that watches my finances, sounds like the most dangerous thing I run. It can see everything. If you'd described it to me two years ago I'd have assumed it needed heavy guardrails.
It needs almost none. Because it can read everything and write nothing. Its blast radius is bounded to information. The worst case is that it tells me something wrong about my own money, which I'll catch the next time I look. There's no irreversible action available to it, so I don't ration its autonomy at all.
That's the counterintuitive part. The sensitivity of what an agent can see tells you very little about how much you should trust it. What matters is what it can do, and whether you can undo it.
#You don't make the agent trustworthy. You make the mistake cheap.
The crypto bot is the opposite case. I wanted it fully autonomous: scan the market, size positions, place orders, no human in the loop. That's a lot of irreversible, real-money actions.
I didn't earn that autonomy by making the agent smarter. I earned it by shrinking the blast radius. Coinbase lets you create isolated portfolios with separate balances, so I made a "Trading Bot" portfolio, moved exactly $100 into it, and scoped the API key to that portfolio only. Every call passes the portfolio ID. Even with a bug, the worst case is losing $100. On top of that: a hard floor at $60 where it stops entirely, a daily drawdown limit, and server-side stop-losses that execute on Coinbase's machines even if my Mac is asleep.
Then it lost six trades in a row, and the total damage was $4.15.
That's the whole point. I gave a money-handling agent full autonomy not because I trusted its judgement (its judgement was wrong six times out of six) but because I'd made being wrong cheap. The guardrails are the trust. The agent is just the thing operating inside them.
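The client-side guardrails are simple enough to sketch. This is an illustration under assumptions, not the bot's real code: the $60 floor is the one described above, but the $10 daily drawdown figure is invented, and the class makes no calls to Coinbase at all. The portfolio scoping and server-side stop-losses live on the exchange and aren't modeled here.

```python
from dataclasses import dataclass


@dataclass
class RiskGuard:
    """Pre-trade checks that run before any order leaves the machine."""
    hard_floor: float = 60.0          # stop entirely at this balance
    max_daily_drawdown: float = 10.0  # assumption: illustrative figure only
    halted: bool = False

    def allow_order(self, balance: float, day_start_balance: float) -> bool:
        if self.halted:
            return False
        if balance <= self.hard_floor:
            self.halted = True  # stays halted until a human resets it
            return False
        if day_start_balance - balance >= self.max_daily_drawdown:
            return False  # done for today, but not halted for good
        return True


guard = RiskGuard()
after_small_losses = guard.allow_order(balance=95.85, day_start_balance=100.0)
after_bad_day = guard.allow_order(balance=89.0, day_start_balance=100.0)
at_floor = guard.allow_order(balance=59.5, day_start_balance=62.0)
after_halt = guard.allow_order(balance=100.0, day_start_balance=100.0)
```

Nothing in the guard asks whether the trade is a good idea. It only asks how much more being wrong is allowed to cost.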
#The middle of the grid is where the work is
The two examples above are easy because they sit at the extremes. Read-only is trivially safe. A sealed $100 box is trivially contained. Most real agents live in the messy middle.
Claudia handles customer conversations on WhatsApp at Voltade. Those actions are half-reversible and the blast radius is reputational, which is real but hard to price. So she doesn't get the crypto bot's full autonomy or Happy's. She drafts and sends within bounds, but the moment a conversation needs a commitment (a ticket, an escalation, anything that creates an obligation) there's a gate. I wrote separately about keeping agents like her on the rails, and most of that work is exactly this: deciding which specific actions cross from reversible to not.
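A gate like that can be very small. The sketch below assumes a hypothetical dispatcher sitting between the agent and its tools; the action names are illustrative and are not Voltade's real ones. Anything on the commitment list is parked for a human, and everything else goes straight through.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: illustrative action names for "anything that creates an obligation".
COMMITMENT_ACTIONS = {"create_ticket", "escalate"}


@dataclass
class GatedDispatcher:
    execute: Callable[[str, dict], str]          # the real side effect
    pending: list = field(default_factory=list)  # actions waiting for a human

    def dispatch(self, action: str, payload: dict) -> str:
        """Run reversible actions immediately; queue commitments for approval."""
        if action in COMMITMENT_ACTIONS:
            self.pending.append((action, payload))
            return "queued_for_human"
        return self.execute(action, payload)

    def approve(self, index: int) -> str:
        """Called by a human: release one queued action."""
        action, payload = self.pending.pop(index)
        return self.execute(action, payload)


executed = []


def fake_execute(action: str, payload: dict) -> str:
    # Stand-in for the real integration, so the sketch runs on its own.
    executed.append(action)
    return "done"


dispatcher = GatedDispatcher(fake_execute)
reply_result = dispatcher.dispatch("send_reply", {"text": "Checking on that now."})
ticket_result = dispatcher.dispatch("create_ticket", {"summary": "Billing issue"})
executed_before_approval = list(executed)
approved_result = dispatcher.approve(0)
```

The hard part isn't the code, it's deciding what belongs in the commitment set.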
The cautionary tale is Clawrence, an ops agent doing internal work. On paper it was the safe one. It scraped data and wrote summaries. No money, no customers. Then a cron meant to run daily ran hourly, and it burnt through $300 in tokens before I noticed. The action was reversible and the task was boring, but I'd handed it an unbounded resource with no baseline to compare against. The blast radius wasn't the task. It was the resource the task consumed. I'd budgeted trust for the work and forgotten to budget it for the bill.
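What I was missing is roughly the meter below: a hard cap on spend plus a baseline for how often the job should run. It's a sketch with made-up numbers, the class and its fields are hypothetical, and it leaves out the daily reset a real version would need.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the job has spent its daily allowance."""


@dataclass
class SpendMeter:
    daily_cap_usd: float        # hard stop
    baseline_runs_per_day: int  # what "normal" looks like
    spent_usd: float = 0.0
    runs: int = 0

    def record_run(self, cost_usd: float) -> list[str]:
        """Record one job run; return warnings, raise once the cap is hit."""
        self.runs += 1
        self.spent_usd += cost_usd
        warnings = []
        if self.runs > self.baseline_runs_per_day:
            warnings.append(
                f"run {self.runs} today, baseline is {self.baseline_runs_per_day}"
            )
        if self.spent_usd >= self.daily_cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.daily_cap_usd:.2f}"
            )
        return warnings


# A daily job that accidentally fires every hour (illustrative costs).
meter = SpendMeter(daily_cap_usd=5.0, baseline_runs_per_day=1)
first = meter.record_run(2.0)   # the expected run
second = meter.record_run(2.0)  # already off baseline
try:
    meter.record_run(2.0)
    stopped = False
except BudgetExceeded:
    stopped = True
```

The baseline warning fires on the second run, long before the cap does. Either signal alone would have caught the hourly cron on day one.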
#Verification scales with irreversibility
There's a corollary I've started applying deliberately. The less reversible an action, the more I verify it myself, regardless of how good the agent is.
When I built a recent app almost entirely by directing Claude, I let it own nearly everything. The two places I read the diff by hand were the database row-level security and the gate that controls who can see private content. Not because those were the hardest code. Because getting them wrong isn't a bug you patch in the next release, it's a breach that already happened. Irreversible plus large blast radius, so I spent my own attention there and nowhere else.
This is how I keep verification affordable. You can't hand-check everything an agent does, and you shouldn't try. You spend your attention where mistakes can't be taken back.
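One way to make that rule mechanical is to tag the paths where mistakes can't be taken back and route only those diffs to a human. The sketch below is an assumption about how that could look: the path patterns and file names are hypothetical, not the layout of the app I built.

```python
from fnmatch import fnmatch

# Hypothetical paths for the irreversible, wide-blast-radius code.
MUST_READ_BY_HAND = [
    "db/policies/*.sql",  # row-level security
    "app/access/*",       # the gate for private content
]


def needs_human_review(changed_files: list[str]) -> list[str]:
    """Return the changed files whose diffs a human should read line by line."""
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in MUST_READ_BY_HAND)
    ]


changed = [
    "app/ui/button.tsx",
    "db/policies/posts.sql",
    "app/access/private_gate.py",
    "README.md",
]
to_review = needs_human_review(changed)
```

Everything outside the list still gets tests and a skim. It just doesn't get my full attention.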
#What this means if you're building agent products
Most agent features are dressed up as capability decisions ("can the agent do X?") when they're actually autonomy-allocation decisions ("how reversible is X, and how big is the blast radius if it's wrong?").
That reframes where the product work is. It's not in the prompt. It's in the environment you put the agent in: the scopes, the sandboxes, the caps, the gates, the read-only boundaries. A more capable model dropped into an unbounded environment isn't safer, it's more dangerous, because it'll act more confidently on a wider range of things. Capability and autonomy are independent axes, and the second one is the one you actually control.
So when someone tells me the models are good enough now to just let them act, I don't disagree about the models. I ask what the blast radius is when they're wrong, because they will be, and whether I can undo it. If the answer is yes and cheap, give it the whole budget. If the answer is no, no amount of model quality buys back the autonomy. You earn it by changing the environment, not by waiting for a better model.
I'm still figuring out where the lines sit as models improve. They do move. An action I gate today might be safe to automate in a year. But the framework underneath hasn't moved at all, and I don't expect it to: trust is something you spend on actions, priced by what they cost when they go wrong.