← Blog

How an agent earns the right to change itself

Contents
  1. The agent is always taking notes
  2. A cheap model decides what is even worth thinking about
  3. Confidence times sensitivity, and why I add instead of multiply
  4. The queue is already the eval set
  5. What I haven't solved
  6. The work

The agents I run in production are not static. They start narrow and a little dumb, and they get better at a specific business over weeks, because they propose changes to their own behaviour and some of those changes stick.

That sentence should make you slightly nervous. An agent that can rewrite its own instructions is an agent that can rewrite them wrongly, and quietly, in a way that fires on every conversation afterwards. A bad reply embarrasses you once. A bad lesson embarrasses you every time it triggers.

I wrote about where deployment safety starts: default to the safe state, earn capability rather than assume it, keep the blast radius small. That post was about an action reaching a customer. This one is a layer further in. It is about the agent changing what it knows, which is the same problem at a lower altitude, and the design that lets me allow it without losing sleep.

#The agent is always taking notes

The loop starts with signals. Every conversation throws off evidence about whether the agent did its job, and the system captures five kinds of it:

  • a staff member silently taking over a conversation, which means the agent was not trusted to continue
  • a coaching note someone leaves for it on purpose
  • a rejected draft, where a human declined a reply or a change the agent proposed
  • agreement across channels that the agent had something right
  • the agent's own reflection that it could have done better

None of these is a lesson yet. They are raw observations, and most of them are noise. A staff takeover might mean the agent was wrong, or it might mean the customer asked for the owner by name. The job of everything downstream is to decide which signals are worth learning from, and then how much to trust the lesson that comes out.

#A cheap model decides what is even worth thinking about

Before a signal reaches the expensive part of the system, a small fast model triages it. It answers one question: is this worth a closer look, and how confident are we. Most signals die here, which is the point. Without a filter, a single busy afternoon would propose hundreds of changes, and a system that proposes everything has the same value as a system that proposes nothing.

There is a debounce on top of it: the same conversation and the same kind of signal cannot fire again for a few minutes. One customer correction should teach one lesson, not forty.

This is the unglamorous half of self-improvement. The interesting question is what the agent learns. The hard question is what it is allowed to even consider, and the answer is: much less than it generates.

#Confidence times sensitivity, and why I add instead of multiply

When a lesson survives triage and becomes a concrete proposal, one decision matters more than the rest. Does it apply on its own, or does it wait for a human.

Two numbers feed that call: the agent's confidence in the change, and the sensitivity of what the change touches. A note on a contact is low sensitivity. An email address, a price, a billing field is high. The obvious move is to multiply the two. I add them.

The auto-apply bar is a base of 0.7, plus a margin that scales with sensitivity, up to a ceiling where nothing applies without a person. A low-sensitivity note auto-applies once the agent is reasonably sure. A high-sensitivity change has to clear a much higher bar, and the most sensitive changes never clear it at all.

Multiplying breaks this. If you score a proposal as confidence times one-minus-sensitivity, a confident, correct change to a sensitive field collapses to near zero, and you have conflated two things that are not the same. "I am not sure" and "this is dangerous" are different axes. One is about the agent's belief, the other about the cost of being wrong. Gating them separately, additively, keeps a confident agent useful on the safe surfaces while still making it ask permission where a mistake is expensive.

Sensitivity is resolved per field, not per change. A single proposal can touch several fields at once, each carrying its own sensitivity, and the most restrictive one wins. So the agent can quietly fix a typo in a note while the address change in the same breath waits for a human. The safe part of a proposal does not have to wait for the dangerous part.

One category is exempt from all of this. A durable, reusable skill, the kind of procedural knowledge the agent would apply across many conversations, always goes to a person regardless of confidence. A one-off memory is cheap to get wrong and easy to walk back. A skill that fires everywhere is neither.

#The queue is already the eval set

Every human decision in this loop is a label. Approve, edit, reject, take over. I built the approval queue to keep the agent safe, and it turns out to be the cleanest evaluation data I have, because it is humans deciding three seconds at a time whether the agent was right, with no model grading itself. The rate at which a given agent's proposals are accepted untouched is the most honest reliability number I get, and it is the same number I use to decide when to widen its autonomy.

#What I haven't solved

In the spirit of the failure museum, the gaps.

There is no automated judge grading the lessons themselves in production. The human decisions are the eval, which is honest and does not scale.

The cheap triage model has its own false negatives, and they are invisible by construction. A real lesson it dismisses is simply never proposed, and nothing tells me it happened. The filter that keeps the system sane is also the one place a good idea can disappear without a trace.

And sensitivity is hand-assigned per resource. It should probably be learned, like everything else in this loop, and it isn't yet. I am hand-fitting the one dial that governs how much the agent is trusted, which is exactly the kind of manual step this whole system exists to remove.

#The work

It is the same instinct as everywhere else. Default to the safe state. Make good changes easy and reversible. Make bad changes wait for someone who will have to answer for them. An agent that can improve itself is only worth building if it cannot quietly make itself worse, and almost all of the design is in service of that one asymmetry. The learning is the easy part. Deciding what the agent is allowed to learn without asking is the product.