Model safety ends where the customer's WhatsApp begins

Contents
  1. Safety is the resting state, not a feature you add
  2. Autonomy is earned, narrowly, and the approval queue is how you earn it
  3. A safety gate that doesn't survive a crash is theatre
  4. You can't claim reliability from a number you didn't verify
  5. What I haven't solved
  6. The work

I read a fair amount of AI safety research I have no part in. Jailbreaks, alignment, many-shot attacks, the work of making a model refuse the things it should refuse. It's some of the most important work happening in the field, and none of it is my job.

My job starts about one layer down. It starts the moment one of those carefully aligned models is pointed at a real person's WhatsApp, on a Sunday, on behalf of a small business that will lose a customer if it gets the next sentence wrong.

That's a different problem to the one the labs are solving, and I've come to think of it as the same discipline at a different altitude. The lab makes the model safe. I have to make the deployment safe. An agent can be perfectly aligned, harmless, and honest, and still confidently quote a customer the wrong price for six hours before anyone notices. I wrote at length about the failure modes that actually bite. This post is the layer above that: not the individual guardrails, but the posture they add up to.

#Safety is the resting state, not a feature you add

Every agent on Volty starts in shadow mode. It reads the conversation, drafts a reply, and stops. Nothing goes out until a human approves it from the inbox.

The important word is starts. Shadow isn't a setting someone remembers to switch on. It's the default the system boots into. An agent that does nothing to a customer is the resting state, and you have to deliberately, per agent, by name, opt it out of that state.

This sounds obvious written down. In practice almost everyone builds it the other way around. The agent ships in auto-reply mode and you bolt on a kill switch for when it misbehaves. That's the inverse of what you want. A kill switch assumes you're watching at the moment it goes wrong, and the whole problem with production agents is that you aren't. You find out two days later.

So the gate isn't a feature on top of the agent. It's the floor the agent stands on. Customer-facing actions don't run when the model decides to run them. They suspend, park as a proposal, and wait for a person. The model can be as confident as it likes. Confidence doesn't reach the customer.
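
A minimal sketch of what "shadow as the resting state" can look like in code. Every name here (`Agent`, `Gate`, `Proposal`, `go_live`) is hypothetical and chosen for illustration, not Volty's actual implementation. The point is only that the safe mode is the constructor default, that model output becomes a parked proposal, and that leaving shadow requires naming the agent explicitly.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"  # draft only; a human approves every send
    LIVE = "live"      # a deliberate, per-agent opt-out of the resting state


@dataclass
class Agent:
    name: str
    mode: Mode = Mode.SHADOW  # nobody has to remember to switch this on


@dataclass
class Proposal:
    agent: str
    conversation_id: str
    text: str
    status: str = "pending"  # pending -> approved


@dataclass
class Gate:
    queue: list[Proposal] = field(default_factory=list)
    sent: list[Proposal] = field(default_factory=list)

    def propose(self, agent: Agent, conversation_id: str, text: str) -> Proposal:
        """Park the model's output as a proposal; this path never sends.

        This sketch covers the shadow path only; it does not model live mode.
        """
        proposal = Proposal(agent.name, conversation_id, text)
        self.queue.append(proposal)
        return proposal

    def approve(self, proposal: Proposal) -> None:
        """Only a human approval moves a proposal out of the queue."""
        if proposal.status != "pending":
            raise ValueError(f"proposal already {proposal.status}")
        self.queue.remove(proposal)
        proposal.status = "approved"
        self.sent.append(proposal)


def go_live(agent: Agent, confirm_name: str) -> None:
    """Opting out of shadow must name the agent explicitly."""
    if confirm_name != agent.name:
        raise ValueError("opt-out must name the agent explicitly")
    agent.mode = Mode.LIVE
```

Nothing in `propose` consults the model's confidence, which is the design choice the section argues for: the only way a message reaches `sent` is through `approve`.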

#Autonomy is earned, narrowly, and the approval queue is how you earn it

Shadow mode is slower. It puts a human in the loop on every reply, which costs real staff time, and staff time at an SME is the scarcest thing there is. So the obvious next question is when an agent gets to stop asking permission.

The answer I keep coming back to: not globally, and not on a feeling. Per agent, per narrow task, on evidence.

The evidence is already sitting in the approval queue. Every time a staff member approves a draft unchanged, edits it before sending, or rejects it, that's a labelled judgement on the agent's output. The queue I built to keep the agent safe turns out to be the cleanest eval set I have. There's no LLM-as-judge in that loop, just humans deciding three seconds at a time whether the agent was right, and the rate at which they approve a given agent's drafts without touching them is the most honest reliability number I get.
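
As a rough illustration, the untouched-approval rate can be computed per agent and per task straight from the review log. The record shape, the outcome labels, and especially the thresholds in `eligible` are assumptions made for this sketch; the post does not state what "high enough" means in numbers.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Review(NamedTuple):
    agent: str
    task: str
    outcome: str  # "approved_unchanged" | "edited" | "rejected"


def approval_stats(reviews: Iterable[Review]) -> dict[tuple[str, str], tuple[float, int]]:
    """Return {(agent, task): (untouched_rate, review_count)}."""
    totals: Counter = Counter()
    untouched: Counter = Counter()
    for r in reviews:
        key = (r.agent, r.task)
        totals[key] += 1
        if r.outcome == "approved_unchanged":
            untouched[key] += 1
    return {key: (untouched[key] / n, n) for key, n in totals.items()}


def eligible(rate: float, n_reviews: int,
             min_rate: float = 0.98, min_reviews: int = 200) -> bool:
    """Hypothetical bar for autonomy: a high rate on enough human judgements.

    Both thresholds are placeholders, not values from the post.
    """
    return n_reviews >= min_reviews and rate >= min_rate
```

Keeping the review count next to the rate matters: a perfect rate over ten drafts is a feeling, not evidence.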

When that rate is high enough on one agent doing one kind of task, I'll grant it autonomy. But only on that surface. The live exception on Volty lets an agent auto-send to its own conversation, the one it's already in, and nothing else. Not other conversations, not bulk sends, not anything that reaches a customer it wasn't already talking to. The blast radius of a mistake stays at exactly one customer, one agent, one thread.

[Figure: Autonomy as a blast radius, earned one ring at a time. Centre: the agent in shadow default, where drafts stay at the centre and nothing leaves. Next ring: its own conversation, in live mode, once the queue earns it. Outer ring: other customers, never reached autonomously. The approval queue is the eval; widen the radius one ring only when it has been earned.]

That's the trade I'm actually making. I'm not deciding "is this agent safe." I'm deciding how much damage one wrong message can do before a human sees it, and I only widen that radius when the queue has earned it.
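
The "own conversation only" rule is small enough to write as a single predicate. This is a sketch under assumed names (`SendRequest`, `may_auto_send`, `live_agents`), not the production check: an auto-send is allowed only when the agent has been granted live mode and the sole target is the thread it is already in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SendRequest:
    agent: str
    origin_conversation: str               # the thread the agent is already in
    target_conversations: tuple[str, ...]  # where the message would go


def may_auto_send(req: SendRequest, live_agents: set[str]) -> bool:
    """True only for a live agent replying to exactly its own conversation.

    Anything else (other threads, bulk sends, shadow agents) falls back
    to the approval queue.
    """
    if req.agent not in live_agents:
        return False
    return req.target_conversations == (req.origin_conversation,)
```

Writing it as an allow-list of one shape, rather than a deny-list of risky ones, means a new kind of send is held by default until someone decides otherwise.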

#A safety gate that doesn't survive a crash is theatre

Here's the part that took the longest to get right, and the part nobody talks about.

An approval gate has to survive the system falling over. A staff member approves a draft, and in the half-second between the approval and the message actually sending, the process restarts. What happens?

If the answer is "the message sends twice," or "the message silently doesn't send," you haven't built a safety mechanism. You've built one that works in the demo. The held action has to replay exactly once after a crash: not zero times, not twice. On Volty every held operation carries an idempotency key tied to the proposal, so a retry after a crash re-runs the approved action and deduplicates on its own identity. Approve once, send once, even through a restart.
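
One way to sketch the replay behaviour, using SQLite as a stand-in for durable storage. The table layout, the `send:<proposal_id>` key format, and `FakeChannel` are all hypothetical. The sketch also assumes the downstream sender deduplicates on the key it is given; if it does not, the gap between sending and recording the send is still open and needs a different answer.

```python
import sqlite3


class FakeChannel:
    """Stand-in for the messaging provider, assumed to dedupe on the key."""

    def __init__(self) -> None:
        self.attempts = 0
        self.delivered: dict[str, str] = {}

    def send(self, idempotency_key: str, text: str) -> None:
        self.attempts += 1
        self.delivered.setdefault(idempotency_key, text)  # repeat key: no-op


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS held_ops ("
        "key TEXT PRIMARY KEY, text TEXT NOT NULL, done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def approve(db: sqlite3.Connection, proposal_id: str, text: str) -> str:
    """Record the approved action durably; a repeated approval is ignored."""
    key = f"send:{proposal_id}"  # identity comes from the proposal itself
    db.execute("INSERT OR IGNORE INTO held_ops (key, text) VALUES (?, ?)", (key, text))
    db.commit()
    return key


def run_pending(db: sqlite3.Connection, channel: FakeChannel,
                crash_before_mark: bool = False) -> None:
    """Send every approved-but-unfinished operation, then mark it done."""
    rows = db.execute("SELECT key, text FROM held_ops WHERE done = 0").fetchall()
    for key, text in rows:
        channel.send(key, text)
        if crash_before_mark:
            raise RuntimeError("simulated crash after send, before marking done")
        db.execute("UPDATE held_ops SET done = 1 WHERE key = ?", (key,))
        db.commit()
```

Running `run_pending` once with `crash_before_mark=True` and then again normally produces two send attempts and one delivered message, which is the "approve once, send once" property in miniature.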

There's a second clock that makes this worse. WhatsApp gives a business 24 hours from the customer's last message to reply freely, and after that the thread is effectively dead. So a drafted reply can sit in the queue while the window closes underneath it, which is a collision between the safety model and the channel that no helpdesk vendor I looked at solves natively. The reply snapshots the window when it's created, and if the window has closed by the time someone approves, the send is blocked rather than fired into a void. The safe failure is "nothing happens," every time.
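
The window check described above might look something like this. `Draft`, `create_draft`, and `send_on_approval` are illustrative names, and the string results stand in for whatever the real system records; the only facts taken from the text are the 24-hour window, the snapshot at creation, and the block at approval time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

REPLY_WINDOW = timedelta(hours=24)  # free-reply window after the customer's last message


@dataclass(frozen=True)
class Draft:
    text: str
    window_closes_at: datetime  # snapshot taken when the draft is created


def create_draft(text: str, last_customer_message_at: datetime) -> Draft:
    return Draft(text, last_customer_message_at + REPLY_WINDOW)


def send_on_approval(draft: Draft, now: datetime) -> str:
    """Return "sent" inside the window, "blocked" once it has closed."""
    if now >= draft.window_closes_at:
        return "blocked"  # safe failure: nothing goes out
    return "sent"


last_msg = datetime(2024, 1, 7, 9, 0, tzinfo=timezone.utc)
draft = create_draft("We open at 8 tomorrow.", last_msg)
print(send_on_approval(draft, last_msg + timedelta(hours=3)))   # sent
print(send_on_approval(draft, last_msg + timedelta(hours=25)))  # blocked
```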

None of this is glamorous. It's the difference between a gate that's real and a gate that's a screenshot.

#You can't claim reliability from a number you didn't verify

The two numbers on the front page of this site, 230K interactions a day and 99.65% task success, are only worth anything if I measured them honestly. The same goes for cost, which I treat as a product feature, not an afterthought.

For a while the gateway we route model calls through reported a cost number that double-counted cached tokens for one provider. If I'd trusted it, every cost claim I made would have been quietly wrong. So Volty meters cost from the raw token counts itself rather than the convenient number handed to it.
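
A sketch of metering from raw counts, with loud caveats. The rates below are made-up numbers, not any provider's pricing, and the sketch assumes the reported input count already includes cached tokens, which varies by provider and has to be checked per provider. Under that assumption, billing the full input count and then the cached tokens again is exactly the kind of double count described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int         # ASSUMPTION: includes cached tokens
    cached_input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Hypothetical values, for illustration only."""
    input: float
    cached_input: float
    output: float


def metered_cost(u: Usage, r: Rates) -> float:
    """Price each token exactly once: fresh input, cached input, output."""
    fresh = u.input_tokens - u.cached_input_tokens
    if fresh < 0:
        raise ValueError("cached tokens exceed input tokens; check what is being counted")
    total = (fresh * r.input
             + u.cached_input_tokens * r.cached_input
             + u.output_tokens * r.output)
    return total / 1_000_000


def disagrees(reported: float, metered: float, tolerance: float = 0.01) -> bool:
    """Flag a reported cost that differs from the metered one by more than 1%."""
    return abs(reported - metered) > tolerance * max(metered, 1e-12)
```

Comparing the gateway's number against `metered_cost` on every call, rather than replacing one with the other silently, keeps the discrepancy itself visible.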

That's a small engineering decision with a safety shape. Calibration isn't only a model-evaluation problem. At the deployment layer it's the discipline of not believing your own dashboard until you've checked what it's counting.

#What I haven't solved

I'd rather state the gaps than imply there aren't any, in the spirit of the failure museum.

There's no automated judge grading outputs in production yet. The human approval queue is the eval, which is honest but doesn't scale to the point where I'd want autonomy to be the default rather than the exception.

Prompt-injection scoring is designed and not yet enforced. Today the defence against a customer message trying to talk the agent into something is the approval gate catching the result, not the system catching the attempt.

And there's no proactive alerting. Failures are visible if you go and look. Nothing pages me when an agent starts drifting. The thing I learned the hard way is that "visible if you look" and "you'll find out in time" are not the same sentence.

#The work

I find this layer genuinely interesting, and slightly under-discussed. The model-safety conversation is rich and well-funded and mostly happening at the labs. The deployment-safety conversation is mostly happening in the heads of people shipping agents to real customers, one incident at a time, and not getting written down.

It's the same instinct at both altitudes. Default to the safe state. Earn capability rather than assuming it. Limit the blast radius. Don't trust a number you didn't check. The labs apply it to a model with frontier capabilities. I apply it to a CRM that lives on a baker's WhatsApp line. The stakes are different by orders of magnitude. The discipline is the same one, and I think you only really learn it downstream, where a wrong sentence has a name and a phone number attached.