The system behind 230K interactions a day
There are two numbers on my CV and the front page of this site: 230K+ AI interactions a day, 99.65% task success. I've been asked what they actually mean enough times that the honest answer deserves a post.
A number without the system behind it is marketing. This is the system.
#What counts as an interaction
230K a day does not mean 230K customer conversations. It means 230K model-backed steps across the platform.
When a WhatsApp message lands in Envoy or a Vobase deployment, it doesn't trigger one model call. It triggers a pipeline: a triage step classifies the intent, the main agent reads the conversation and decides what to do, tools get called, and a judge may grade the output after the fact. One inbound message is routinely four or five interactions. Add the scheduled jobs, the internal ops agents, and the drafts staff never see because a deterministic check killed them first, and the volume adds up quickly across 100+ accounts.
I count it this way deliberately. The platform's unit of work is the step, not the conversation, because the step is where cost, latency, and failure live. Cost is a product feature, and you can't meaningfully attribute cost to a conversation as a whole. You attribute it to the steps inside it: the classification that ran on Haiku, the reply that ran on Sonnet, the planning step that ran on Opus.
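A minimal sketch of step-level cost attribution, assuming each step logs its model tier and token counts. The `Step` fields and the per-million-token prices are placeholders for illustration, not the platform's schema or real pricing.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder USD prices per million tokens; assumptions, not real pricing.
PRICE_PER_MTOK = {
    "haiku": {"input": 1.0, "output": 5.0},
    "sonnet": {"input": 3.0, "output": 15.0},
    "opus": {"input": 15.0, "output": 75.0},
}


@dataclass(frozen=True)
class Step:
    conversation_id: str
    task_type: str  # e.g. "triage", "reply", "plan" (hypothetical labels)
    model: str
    input_tokens: int
    output_tokens: int


def step_cost(step: Step) -> float:
    """Cost of a single model-backed step in USD."""
    price = PRICE_PER_MTOK[step.model]
    total = step.input_tokens * price["input"] + step.output_tokens * price["output"]
    return total / 1_000_000


def cost_by_task_type(steps):
    """Sum step costs per task type; the step is the unit, not the conversation."""
    totals = defaultdict(float)
    for step in steps:
        totals[step.task_type] += step_cost(step)
    return dict(totals)
```

Because the step is the row, grouping by `conversation_id` instead of `task_type` is the same loop with a different key.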
#What 99.65% counts, and what it doesn't
The success rate is measured at the deterministic layer: did the step complete, and did it pass its checks. The API call succeeded, the reply fit under WhatsApp's 4,096-character limit, the ticket got created with its required fields, the quoted price matches the live database. Mechanical, cheap, runs on every step. It's the bottom layer of the three-layer eval stack.
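As a rough sketch, the deterministic layer can be a handful of pure functions over a step's result. The `StepResult` shape and the required ticket fields below are assumptions made for the example; only the 4,096-character limit comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional

WHATSAPP_MAX_CHARS = 4096  # limit quoted above
REQUIRED_TICKET_FIELDS = ("customer_id", "subject", "priority")  # hypothetical


@dataclass
class StepResult:
    api_ok: bool
    reply_text: str = ""
    ticket: Optional[dict] = None
    quoted_price_cents: Optional[int] = None
    live_price_cents: Optional[int] = None


def failed_checks(result: StepResult) -> list:
    """Return the names of the deterministic checks this step failed."""
    failures = []
    if not result.api_ok:
        failures.append("api_call")
    if len(result.reply_text) > WHATSAPP_MAX_CHARS:
        failures.append("reply_length")
    if result.ticket is not None:
        if any(not result.ticket.get(f) for f in REQUIRED_TICKET_FIELDS):
            failures.append("ticket_fields")
    if result.quoted_price_cents is not None:
        if result.quoted_price_cents != result.live_price_cents:
            failures.append("price_mismatch")
    return failures


def success_rate(results) -> float:
    """Share of steps that completed and passed every check."""
    passed = sum(1 for r in results if not failed_checks(r))
    return passed / len(results)
```

Prices are compared as integer cents so the check is an exact match rather than a float comparison.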
What it doesn't count is quality. A reply can pass every deterministic check and still miss the customer's tone, answer the wrong question, or bury the one thing they asked. When I measured this properly on one agent, the gap was twelve points: 94% hard completion, 82% fully successful once LLM-graded quality scores were included. The 99.65% is the first kind of number, not the second. If someone quotes you a 99%+ agent success rate and can't tell you which layer it's measured at, assume the flattering one.
The other honest caveat: a completion rate can't see silent failures, the cases where the agent should have acted and didn't. They never enter the denominator. That isn't a footnote: it's 36% of the failures I've measured. Catching them takes expected-behaviour checks (how many messages came in versus how many got answered), which is its own layer of the stack.
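A toy illustration of why the denominator matters, assuming inbound messages and step outcomes can be joined on a message id (the ids and data shapes are made up for the example):

```python
def completion_rate(outcomes):
    """Share of attempted steps that passed; unattempted messages are invisible here."""
    return sum(outcomes.values()) / len(outcomes)


def silent_failures(inbound_ids, outcomes):
    """Inbound messages that never produced a step, so never entered the denominator."""
    return [m for m in inbound_ids if m not in outcomes]


inbound = [f"msg-{i}" for i in range(10)]
outcomes = {m: True for m in inbound[:8]}  # two messages never triggered the agent

print(completion_rate(outcomes))           # 1.0
print(silent_failures(inbound, outcomes))  # ['msg-8', 'msg-9']
```

The completion rate reads as perfect while two of ten messages went unanswered, which is exactly the case an expected-behaviour check exists to surface.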
So what does the 0.35% look like? Mostly hard failures: API timeouts, a tool returning malformed data, WhatsApp rejecting a send. Each one retries, escalates to a human, or pages me, depending on the failure class. The number I actually manage isn't the 0.35%. It's the trend line.
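One way to express "depending on the failure class" is a small policy table. The class names, retry counts, and the choice of which class maps to which action are all hypothetical here; the text above only says that each failure retries, escalates, or pages.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate_to_human"
    PAGE = "page_on_call"


# Hypothetical policy: failure class -> (first response, max retries).
POLICY = {
    "api_timeout": (Action.RETRY, 3),
    "malformed_tool_output": (Action.ESCALATE, 0),
    "whatsapp_send_rejected": (Action.RETRY, 1),
}


def next_action(failure_class: str, attempt: int) -> Action:
    """Pick the response to a failure; attempt counts tries already made (1 = first failure)."""
    if failure_class not in POLICY:
        return Action.PAGE  # assumption: unknown failure classes go straight to a person
    action, max_retries = POLICY[failure_class]
    if action is Action.RETRY:
        # Retry while under the cap, then hand over to a human.
        return Action.RETRY if attempt <= max_retries else Action.ESCALATE
    return action
```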
#The subsystem that keeps it affordable
Three jobs, three tiers: triage on Haiku under 500ms, the main job on Sonnet under 3 seconds, planning on Opus, sparingly. At 230K steps a day the tiering isn't an optimisation. It's the difference between viable unit economics and a science project.
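The tiering above fits in a lookup table. This is a sketch, not the runtime's actual config: the `Tier` type and job keys are invented for the example, and the model names are tier labels rather than API model identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    model: str
    latency_budget_s: Optional[float]  # None means no budget is stated for this tier


TIERS = {
    "triage": Tier("haiku", 0.5),
    "main": Tier("sonnet", 3.0),
    "planning": Tier("opus", None),
}


def tier_for(job: str) -> Tier:
    """Look up which model tier handles a job type."""
    return TIERS[job]


def over_budget(job: str, elapsed_s: float) -> bool:
    """True when a step took longer than its tier's latency budget."""
    budget = TIERS[job].latency_budget_s
    return budget is not None and elapsed_s > budget
```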
Prompt caching does the other half of the work. Stable system prompts and conversation history get cached, and the average cost of a customer reply dropped 87% when I added it, from $0.03 to $0.004.
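The 87% follows directly from the two per-reply figures, as a quick check shows:

```python
before, after = 0.03, 0.004  # average USD per customer reply, from the text above
reduction = (before - after) / before
print(f"{reduction:.1%}")  # 86.7%
```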
Every task type has a cost baseline and an alert at 3x the rolling average. Per-agent budget caps sit underneath: burn too much in an hour and I get paged, burn too much in a day and the runtime pauses the agent. The caps live in code, not prompts. I learnt that one for $300.
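A sketch of both mechanisms in code, assuming per-step costs are already being recorded. The window size, cap values, and return labels are illustrative; only the 3x multiplier and the page-then-pause behaviour come from the paragraph above.

```python
from collections import deque


class CostBaseline:
    """Rolling average of recent step costs for one task type."""

    def __init__(self, window: int = 100, multiplier: float = 3.0):
        self.costs = deque(maxlen=window)
        self.multiplier = multiplier

    def observe(self, cost: float) -> bool:
        """Record a cost; return True if it exceeds multiplier x the prior rolling average."""
        alert = False
        if self.costs:
            average = sum(self.costs) / len(self.costs)
            alert = cost > self.multiplier * average
        self.costs.append(cost)
        return alert


def cap_action(spent_last_hour: float, spent_today: float,
               hourly_cap: float, daily_cap: float) -> str:
    """Enforce per-agent budget caps in code: the daily cap pauses, the hourly cap pages."""
    if spent_today >= daily_cap:
        return "pause_agent"
    if spent_last_hour >= hourly_cap:
        return "page"
    return "ok"
```

The point of keeping `cap_action` outside the prompt is that the model cannot talk its way past it: the runtime checks the spend before the next step is allowed to run.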
#The subsystem that keeps it safe
The short version of the guardrails post: tool-first so the agent routes instead of recites, approval gates so anything irreversible gets a human, scope walls so one agent owns one domain, PII redaction at both the prompt layer and the ingestion layer, and escalation rules that name a person instead of saying "escalate".
Underneath all of it, tenant isolation is enforced in Postgres row-level security, not in application code. Creating a tenant is an INSERT; isolating one is the database's job. Context bleed between tenants is the failure mode with the biggest blast radius, so it gets the strongest enforcement.
#The subsystem that keeps the numbers honest
None of the above matters if you can't tell whether it's working. Every step gets the deterministic checks. A sample gets LLM-as-judge grading against rubrics, with the judge calibrated against human-labelled anchors (currently a Pearson r of 0.91, and a Cohen's kappa of 0.78 between my scores and the judge's). Twenty conversations a week get read by a human.
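For reference, both calibration statistics can be computed in a few lines of plain Python. This sketch assumes paired human and judge scores on the same anchors and uses unweighted Cohen's kappa; whether the real pipeline uses a weighted variant for ordinal scores is not stated above.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

The two numbers answer different questions: correlation says the judge moves in the same direction as the human, kappa says how often they land on the same score beyond what chance would give.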
Weekly canaries run the same 50 inputs through the same rubrics every Monday; a 0.3-point move on any dimension triggers an investigation. That's caught three regressions so far, two from model updates and one from context-window rot. New failure modes get a name in the taxonomy and a deterministic check the following week. The loop is the product.
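The canary comparison can be sketched as a per-dimension diff of mean rubric scores against a stored baseline. The dimension names and data shapes are assumptions; the 0.3-point threshold is the one quoted above.

```python
from statistics import mean


def dimension_means(scores):
    """Average each rubric dimension over one canary run (a list of {dimension: score})."""
    return {dim: mean(s[dim] for s in scores) for dim in scores[0]}


def drifted(baseline, current, threshold=0.3):
    """Dimensions whose mean moved by at least the threshold, with the signed change."""
    moved = {}
    for dim, base in baseline.items():
        # Round first so float noise (4.2 - 3.9 != 0.3 exactly) doesn't hide a real move.
        delta = round(current[dim] - base, 2)
        if abs(delta) >= threshold:
            moved[dim] = delta
    return moved


baseline = {"accuracy": 4.5, "tone": 4.2}
this_week = {"accuracy": 4.4, "tone": 3.9}
print(drifted(baseline, this_week))  # {'tone': -0.3}
```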
#What the numbers don't tell you
They don't tell you the quality-weighted success rate, which is lower and tracked separately. They don't tell you how much of the volume is triage (a lot). And they don't tell you whether the system holds at 10x, which is the question Volty exists to answer: pooled multi-tenancy, one deploy, isolation in the database, the same eval stack watching all of it.
The honest summary is that the two CV numbers are the cheapest layer of a more expensive truth. The deterministic layer says 99.65%. The judge layer says less. The human layer is twenty conversations a week of finding out what both of them missed. Running agents in production is the business of closing the gap between those three numbers, and I don't think that work is ever finished.