Cost is a product feature
I default to Opus 4.7 for everything I do personally. Claude Max is already paid for, the requests are flat-rate, the model is the smartest, so why not.
That default is also why most teams get model selection wrong.
When you're paying per token, the calculus changes immediately. I noticed it the first time my Anthropic bill jumped a few hundred dollars in a week (the WIMAUT post covers that one in detail). It wasn't because the agent was doing more. It was because I'd defaulted to the smartest model for a job that didn't need to be smart, and a cron multiplied the laziness.
That's the thing I keep saying to engineers on the team. Cost is a product feature. You're shipping it whether you measure it or not.
#What "the smartest model" actually buys you
The smartest model gets you a few things and sells you a few others.
It gets you: better reasoning on novel tasks, better instruction-following on long prompts, fewer "the model just decided to skip a step" moments. (Sonnet still does this. The 20% post had a real example where Sonnet generated a perfectly good 389-character reply and then forgot to call the send_reply tool. The system prompt told it in bold to always call it. The model didn't. I had to write a safety net underneath it.)
It sells you: latency you can't recover, a per-call cost you can't afford once it's spread across every request, and a strange complacency where every problem gets thrown at the biggest model and nobody asks whether the problem actually needed it.
The trap is that the smartest model is also the most forgiving. You can write a sloppy system prompt and Opus will probably still figure out what you meant. Haiku won't. And that pressure, the pressure to write a tighter prompt, is good for the product. You learn what your agent's job actually is when you can't paper over the prompt with raw model intelligence.
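That safety net, for what it's worth, is about ten lines. Here's a sketch of the shape, assuming the Anthropic Python SDK; the tool schema and retry policy are illustrative, not Envoy's actual code, and the model pin is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

SEND_REPLY_TOOL = {
    "name": "send_reply",
    "description": "Send the drafted reply to the customer over WhatsApp.",
    "input_schema": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}

def reply_with_safety_net(system_prompt: str, messages: list) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder pin
        max_tokens=1024,
        system=system_prompt,
        tools=[SEND_REPLY_TOOL],
        messages=messages,
    )
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    if tool_calls:
        return tool_calls[0].input

    # The model wrote a reply but never called the tool. Retry once,
    # with tool_choice forcing it to call send_reply this time.
    retry = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_prompt,
        tools=[SEND_REPLY_TOOL],
        tool_choice={"type": "tool", "name": "send_reply"},
        messages=messages,
    )
    return next(b for b in retry.content if b.type == "tool_use").input
```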
#A rough framework
I think about it as three jobs.
Triage. Things you do thousands of times a day, where the answer space is small and the latency has to feel like nothing. Intent classification. Routing. Sentiment. Did the customer pay or are they asking a question. These should run on the cheapest model that gets the answer right, and you should benchmark that explicitly. Haiku, today, in 2026, is genuinely good at this if your prompt is tight. I budget under 500ms.
The main job. The thing the agent actually exists to do. For Envoy that's reading a WhatsApp conversation, deciding what the customer wants, calling the right tool, sending a reply. For Clawrence it's reading a tender notice and deciding if it's relevant. This needs to be smart enough to be useful but predictable enough that you can reason about its behaviour. I run Sonnet here. I budget under 3 seconds for chat-reply latency, because past that customers drop the thread.
Planning. Any step where the agent has to decide what to do next out of more than five options. Multi-step orchestration, choosing which sub-agent to call, decomposing an ambiguous request. This is where the smartest model earns its money. Opus, sparingly. Latency budget is bigger here (eight seconds is fine, no one expects an instant answer when they've asked for something complicated) but cost matters because each call is multiplying through the rest of the chain.
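On paper the three jobs collapse into a small table. A sketch of how I'd encode it, where the model IDs are placeholders and the budgets are the numbers from this post, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    model: str               # which model runs this job
    latency_budget_s: float  # what "fast enough" means for the step

TIERS = {
    "triage":   Tier("claude-haiku-4-5",  0.5),  # thousands of calls a day, small answer space
    "main":     Tier("claude-sonnet-4-5", 3.0),  # the job the agent exists to do
    "planning": Tier("claude-opus-4-1",   8.0),  # more than five options, used sparingly
}
```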
That's not a clean three-tier framework. The boundaries are fuzzier than that in practice. Sometimes the main job is so simple Haiku is enough. Sometimes the planning step is so constrained you don't need Opus.
But the question I ask myself before writing any new agent now is: which of those three jobs is this, and why am I picking the model I'm picking.
#When to switch tiers
The thing I got wrong for a long time was treating model selection as a one-time decision. It isn't. It's a knob.
I had a classifier on Envoy that decided whether an inbound WhatsApp message was a new order, an existing-order update, or a general question. I wrote it on Sonnet, because I'd written everything else on Sonnet and it was easy. The classifier worked fine. Took about a second and a half per call. Cost more than I'd like.
When I rewrote it on Haiku I had to tighten the prompt: drop the chain-of-thought reasoning, add more concrete examples, and list the categories explicitly with a one-sentence rule each. The new version returns in a few hundred milliseconds and costs about a fifth as much. The accuracy on a hand-labelled eval set was the same.
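The tightening matters more than the swap. Here's roughly what "tight" means, with the Envoy categories kept and everything else (wording, examples, the eval helper) illustrative:

```python
TRIAGE_PROMPT = """Classify the customer's WhatsApp message into one category.

Categories:
- new_order: the customer wants to order something they haven't ordered yet.
- order_update: the message refers to an existing order (status, changes, delivery).
- general_question: anything else (hours, prices, small talk).

Examples:
"Hi, can I get a chocolate cake for Saturday?" -> new_order
"Is my cake ready?" -> order_update
"Do you do gluten free?" -> general_question

Reply with exactly one category name and nothing else."""

def eval_accuracy(classify, labelled):
    """classify: message -> category. labelled: list of (message, category) pairs."""
    return sum(classify(m) == want for m, want in labelled) / len(labelled)
```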
The other direction also matters. We had a tool-selection step in the cake-intake agent that was on Sonnet and kept choosing the wrong tool when the customer's message was ambiguous. (Sonnet would gamble. The product needed it to either pick the right tool or hand off to a human.) I tried Opus on that one step. The wrong-tool rate dropped sharply. I left it there. The step runs once per conversation, not once per message, so the cost is bounded.
The mental move is the same in both cases. Pick the cheapest model that does the job. Promote it when accuracy isn't enough. Demote it when accuracy is fine and the cost is hurting.
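One way to keep it a knob rather than a decision: put the model pin in config, not code. A sketch with hypothetical names; the env-var convention is mine, not anything the SDK gives you:

```python
import os

DEFAULT_MODELS = {
    "triage": "claude-haiku-4-5",
    "tool_select": "claude-sonnet-4-5",
}

def model_for(step: str) -> str:
    # MODEL_TOOL_SELECT=claude-opus-4-1 promotes one step with no code change;
    # deleting the override demotes it again.
    return os.environ.get(f"MODEL_{step.upper()}", DEFAULT_MODELS[step])
```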
#What I think most teams get wrong
Two things, mostly.
The first is treating model selection as an engineering decision. It isn't. It's a product decision wearing engineering clothes. Whether your support agent feels instant or sluggish, whether your platform's per-conversation cost lets you sustain free tiers, whether your pricing model survives a 10x volume month. Those are PM questions. They should not be delegated to whichever engineer wrote the agent first and picked Sonnet because it was the default.
The second is not measuring it. Most teams running AI features today couldn't tell you their cost per resolved conversation, their p95 latency per agent step, or which model is doing which job. The money disappears into the API line item on the bill. The latency disappears into "yeah it feels fast enough." That's how you wake up at a few hundred dollars over budget without doing anything different.
I track both now. WIMAUT does the cost side. There's a separate eval harness that times steps per agent per model. It isn't pretty. It's enough to make the decision the next time the question comes up.
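The shape of that harness is simple, even if mine isn't pretty. A sketch, assuming the Anthropic SDK's usage fields (input_tokens and output_tokens are real; the prices are placeholders you should check against current pricing):

```python
import time
from collections import defaultdict

# USD per million tokens, (input, output). Placeholders: check current pricing.
PRICE_PER_MTOK = {
    "claude-haiku-4-5":  (1.00, 5.00),
    "claude-sonnet-4-5": (3.00, 15.00),
}

stats = defaultdict(list)  # (agent, step, model) -> [(latency_s, cost_usd)]

def record(agent, step, model, call):
    """Wrap a single model call; `call` is a zero-arg function that runs it."""
    t0 = time.monotonic()
    response = call()
    latency = time.monotonic() - t0
    in_price, out_price = PRICE_PER_MTOK[model]
    cost = (response.usage.input_tokens * in_price
            + response.usage.output_tokens * out_price) / 1_000_000
    stats[(agent, step, model)].append((latency, cost))
    return response
```

p95 per step is then a percentile over those latencies; cost per resolved conversation is the same rows grouped by conversation instead of step.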
#The actual hot take
The smartest model is rarely the right one.
Most of the time you're picking it because you haven't done the work to find out which of the three jobs you're doing, and what the cheapest model that does that job actually is. Doing that work is unglamorous. It looks like writing 100-row eval sets and timing things. It looks like rewriting a prompt for a smaller model and arguing with yourself about whether the accuracy difference matters.
But it's the most leveraged work I've done on agent products this year. A 5x cost reduction on a hot path is more valuable than a feature. Getting latency below the threshold where users notice changes whether the product feels real.
Cost is a product feature. So is latency. So is the model you picked. Treat them like product features and the rest follows.