Runs agents at production scale
230K+ AI interactions a day across 100+ SME deployments, 99.65% task success. Real customers on real WhatsApp lines, not benchmarks.
Hiring screens ask for demonstrated capability. This page is the map: each claim I make, and the published work behind it. Every number comes from one production system I am accountable for, not a benchmark.
Three-layer evaluation: deterministic checks on every wake, calibrated LLM-as-judge at scale, weekly human review as the source of truth.
Model-layer safety is the lab's job. Deployment-layer safety, pointing a safe model at a real business without getting burned, is mine.
The hard part of a deployment is rarely the model. It is the approval queue that is slower than the messaging window, and the trust the owner has not extended yet.
A deployment counts when the business runs on it without me. Adoption, self-learning in place, and reliability once I am gone.
Model tiering, cost attribution per interaction, and the observability to catch a runaway cron before the bill does.
Product judgement includes the post-mortem. One product retired in public, and the architecture lesson that came out of it.
If you are screening me and have ninety seconds: the failure museum, then the eval framework. Or curl yashgadodia.com/cv.