WIMAUT: Because Your Agents Won't Tell You They're Burning $300
A few weeks ago I noticed my Anthropic bill was $300 higher than expected. Took me a while to figure out why. Turns out one of my OpenClaw agents had a cron job running every hour, making API calls, burning tokens. I had no idea it was happening. No alert, no dashboard, no way to know until the bill arrived.
That's the thing about agents. They're useful precisely because they run without you. But that means they also fail without you, waste money without you, and loop endlessly without you. You're flying blind.
That experience is why I built WIMAUT.
#The vibe coding problem
Everyone's running agents now. Claude Code sessions in tmux. Codex tasks in the background. CI pipelines with AI steps. OpenClaw agents on crons. The tooling has gotten good enough that spinning up an agent is trivial. Watching what it does afterwards? Nobody's solved that.
This is especially bad for non-technical users. The whole promise of vibe coding is that you describe what you want and agents build it. But if you don't understand what's happening underneath, you have no way to know if the agent is making progress, going in circles, or quietly burning through your API budget. You're trusting vibes, literally.
I kept seeing the same pattern at Voltade. Someone kicks off a task, walks away, comes back an hour later. "Is it done?" "I don't know, let me check." Then you look at the logs and realise the agent hit an error on iteration three and spent the remaining 57 minutes retrying the same failing approach.
#What WIMAUT does
WIMAUT stands for "What Is My Agent Up To." It's an observability dashboard for AI agents. I built it at the Codex hackathon in Singapore in 2026, entirely on Codex during the event.
The dashboard gives you a live view of your running agents. Think of it like a process monitor, but for AI tasks instead of system processes. You can see:
- Active task runs with real-time status (running, completed, failed, stuck)
- Token usage per agent, per task, per time window
- Cost tracking with alerts when spend crosses thresholds
- Execution logs so you can see what the agent is actually doing at each step
- Failure patterns surfaced automatically, not buried in logs
The visual style is inspired by pixel-agents. Each agent gets a visual representation on the dashboard, and you can see at a glance which ones are active, idle, or stuck.
It's internal-only right now. I use it to monitor our OpenClaw agents (Clawrence and Claudia) and any Claude Code sessions running on the Mac Minis.
#The $300 lesson, technically
Let me break down what actually happened with that runaway cron, because the failure mode is instructive.
Clawrence, my internal ops agent, has several cron jobs. The GeBIZ scraper, the daily Claude features update, the team summary. I'd set up a new cron for competitor analysis. It was supposed to run daily. I accidentally configured it to run hourly.
Each run made several API calls to Claude, including a long system prompt with context about our competitors, our product, and what to look for. Each run consumed roughly 50K tokens. At hourly intervals, that's 1.2M tokens per day. Over a week, it added up fast.
The problem wasn't the cron configuration. That's just a typo. The problem was that I had no visibility into cumulative token spend per agent. No way to set a budget. No alert when an agent exceeded expected usage. The agent was doing exactly what I told it to do. I just didn't realise how often.
With WIMAUT, that would have been a five-minute catch. The cost dashboard would show a spike. The token usage graph would show an hourly pattern that doesn't match the expected daily schedule. I'd get an alert, fix the cron, move on.
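To make that catch concrete, here's a minimal sketch of the kind of schedule check a cost dashboard could run. Everything here is illustrative (function names, the tolerance factor); the idea is simply to compare a cron's observed run cadence against its declared schedule.

```python
from datetime import datetime, timedelta

# Hypothetical spike check: flag a cron whose observed run frequency
# exceeds its declared schedule. Names and thresholds are illustrative.
def runs_exceed_schedule(run_times, expected_per_day, tolerance=1.5):
    """Return True if the observed daily run count exceeds the
    expected schedule by more than `tolerance`x."""
    if not run_times:
        return False
    span_days = max(
        (max(run_times) - min(run_times)) / timedelta(days=1), 1.0
    )
    observed_per_day = len(run_times) / span_days
    return observed_per_day > expected_per_day * tolerance

# A cron meant to run daily, but misconfigured to run hourly:
start = datetime(2026, 1, 1)
hourly_runs = [start + timedelta(hours=h) for h in range(7 * 24)]
print(runs_exceed_schedule(hourly_runs, expected_per_day=1))  # True
```

A check like this is cheap enough to run on every ingested log batch, which is what turns a week-long silent burn into a same-day alert.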
#How I think about agent evaluation
WIMAUT doesn't have evaluation built in yet. But this is where I want to take it, and it's the part I think about most.
Observability tells you what happened. Evaluation tells you whether it was good. They're related but different problems.
For agents, evaluation is harder than it looks. A traditional software system either works or it doesn't. An API returns 200 or 500. But an agent can return a plausible-looking result that's subtly wrong. It can complete a task in a way that technically satisfies the prompt but misses the intent. It can hallucinate confidently. Anthropic's own research on this is instructive: they found that even simple multiple-choice benchmarks are unreliable because formatting changes alone can shift accuracy by 5%. If benchmarks are fragile, production agent evaluation is an order of magnitude harder.
Here's the framework I'm working toward:
#The three grading layers
Anthropic's eval documentation lays out a hierarchy that I keep coming back to. There are three ways to grade agent output, and you should use the fastest one that works:
Code-based grading. Exact match, string match, regex, JSON schema validation. Fast, deterministic, scalable. If Claudia is supposed to create a Notion ticket with specific fields, I can check whether the ticket exists and whether the fields are populated. No LLM needed.
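A code-based grader for the ticket example might look like this. The field names are hypothetical, not Voltade's actual Notion schema; the point is that the whole check is deterministic and needs no LLM.

```python
import json

# Minimal code-based grader: deterministic checks on an agent's raw
# output. REQUIRED_FIELDS is an illustrative schema, not a real one.
REQUIRED_FIELDS = {"title", "status", "assignee"}

def grade_ticket(raw_output: str) -> bool:
    """Pass iff the output is valid JSON and every required field
    is present and non-empty."""
    try:
        ticket = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return all(ticket.get(field) for field in REQUIRED_FIELDS)

print(grade_ticket('{"title": "Fix login", "status": "open", "assignee": "mei"}'))  # True
print(grade_ticket('{"title": "Fix login"}'))  # False
```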
LLM-as-judge. Use a different model to evaluate the output. This is the breakthrough technique that makes soft failure detection possible at scale. The idea: take the agent's output, a rubric describing what "good" looks like, and ask a separate model to grade it. Anthropic recommends using chain-of-thought in the grading prompt ("think before you judge") because it significantly improves evaluation accuracy. OpenAI's cookbook says the same thing: always use a different model for grading than the one that generated the output, and always validate against human judgement before trusting it at scale.
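The mechanics are simple enough to sketch without tying it to any one SDK. The judge-model call itself is left out here (swap in your provider's client); what the sketch shows is the prompt shape, the chain-of-thought-then-verdict structure, and how you'd parse the score out of the judge's reply. The rubric text and `SCORE:` convention are assumptions, not a standard.

```python
# Sketch of the LLM-as-judge plumbing. The actual model call is a
# placeholder; the rubric wording and "SCORE: n" convention are
# illustrative choices, not a standard format.
RUBRIC = """Grade the response 1-5 for whether it answers the
customer's actual question. Think step by step, then end with a
final line 'SCORE: <n>'."""

def build_judge_prompt(task, output, rubric=RUBRIC):
    """Assemble the grading prompt from the task, the agent's
    output, and a rubric that asks for reasoning before a verdict."""
    return (
        f"Task given to the agent:\n{task}\n\n"
        f"Agent's output:\n{output}\n\n"
        f"{rubric}"
    )

def parse_score(judge_reply):
    """Pull the final 'SCORE: n' line out of the judge's reply."""
    for line in reversed(judge_reply.strip().splitlines()):
        if line.startswith("SCORE:"):
            return int(line.split(":")[1])
    return None

# With a stubbed judge reply instead of a live API call:
print(parse_score("The answer addresses the refund policy.\nSCORE: 4"))  # 4
```

The separate-model rule matters because a model grading its own output tends to share its own blind spots; parsing a structured final line keeps the chain-of-thought from polluting the score.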
Human review. The gold standard, but expensive and slow. Use it to calibrate your automated evals, not as your primary grading mechanism. Sample 5-10% of outputs for human review, use those to validate that your LLM-as-judge is actually catching what matters.
The practical approach is layered. Code-based checks catch the obvious failures (did the API call succeed? is the output valid JSON?). LLM-as-judge catches the subtle ones (is the tone right? did the agent actually answer the question?). Human review validates that the whole pipeline is working.
#Task completion rate
The most basic metric, but surprisingly hard to measure well. A task isn't just "did the agent produce output." It's "did the agent produce output that actually solved the problem." For a customer support agent like Claudia, that means: did the customer's issue get resolved? Did they come back with the same question? For a code agent, did the code compile? Did the tests pass? Did it introduce regressions?
Anthropic's framework calls for success criteria that are specific, measurable, achievable, and relevant. Not "the agent should respond well" but "the agent should achieve an 85% resolution rate on tier 1 support queries, with less than 2% of responses flagged for tone by the LLM grader." That level of specificity is what turns evaluation from a vague aspiration into something you can actually track.
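Criteria that specific can live as data rather than prose. A sketch, reusing the numbers from the example above (the metric names and structure are my own invention):

```python
# Hypothetical success criteria as data, mirroring the "85% resolution
# rate, <2% tone flags" example. Metric names are illustrative.
CRITERIA = {
    "resolution_rate": {"min": 0.85},
    "tone_flag_rate": {"max": 0.02},
}

def failed_criteria(metrics, criteria=CRITERIA):
    """Return the names of criteria the current metrics violate."""
    failures = []
    for name, bounds in criteria.items():
        value = metrics[name]
        if "min" in bounds and value < bounds["min"]:
            failures.append(name)
        if "max" in bounds and value > bounds["max"]:
            failures.append(name)
    return failures

print(failed_criteria({"resolution_rate": 0.88, "tone_flag_rate": 0.05}))
# ['tone_flag_rate']
```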
For WIMAUT, I want to track task completion as a multidimensional score. A single "pass/fail" isn't enough. Each agent task gets graded on:
- Correctness. Did the output match the expected result? (Code-graded where possible.)
- Relevance. Did the agent actually address the user's intent, or did it answer a slightly different question? (LLM-graded.)
- Tone and style. For customer-facing agents like Claudia, this matters as much as correctness. Anthropic suggests using a Likert scale (1-5) graded by an LLM, with a detailed rubric. "Rate this response for empathy on a scale of 1-5, where 1 means dismissive and 5 means genuinely understanding."
- Latency. Did the agent respond within acceptable time bounds? (Code-graded, trivial.)
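The four dimensions above can be sketched as a single grade record. The thresholds and field layout are assumptions; the point is that dimensions stay separate for diagnosis, with pass/fail derived at the end rather than stored.

```python
from dataclasses import dataclass

# Hypothetical multidimensional grade. Correctness and latency are
# code-graded booleans; relevance and tone are 1-5 LLM-judge scores.
@dataclass
class TaskGrade:
    correctness: bool
    relevance: int   # 1-5, LLM-graded
    tone: int        # 1-5, LLM-graded
    latency_ok: bool

    def passed(self, min_llm_score=3):
        """Derived pass/fail: every dimension must clear its bar."""
        return (self.correctness and self.latency_ok
                and self.relevance >= min_llm_score
                and self.tone >= min_llm_score)

grade = TaskGrade(correctness=True, relevance=4, tone=2, latency_ok=True)
print(grade.passed())  # False: tone below threshold
```

Keeping the dimensions separate is what makes failures diagnosable: a failed task with `tone=2` points at a very different fix than one with `correctness=False`.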
#Token efficiency
Not just "how many tokens did this cost" but "how many tokens per successful outcome." An agent that uses 100K tokens to accomplish a task that another agent does in 10K isn't just more expensive. It's probably doing something wrong: looping, retrying, generating verbose intermediate reasoning that doesn't contribute to the result.
Token efficiency as a metric forces you to think about agent architecture. Is the system prompt too long? Is the agent retrieving too much context? Is it reasoning through steps it should already know from memory?
I track this as a ratio: tokens_consumed / task_quality_score. A high ratio means the agent is working hard for mediocre results. A suddenly increasing ratio across the same task type means something has changed, probably a regression in the prompt or a model update that made the agent more verbose.
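A sketch of that ratio and the regression check, with illustrative window and factor values:

```python
# Tokens spent per unit of quality; higher is worse. A rising ratio
# across runs of the same task type is the regression signal.
def efficiency_ratio(tokens_consumed, quality_score):
    return tokens_consumed / quality_score

def ratio_regressed(history, window=5, factor=1.5):
    """Flag if the mean of the most recent `window` ratios exceeds
    the earliest baseline window by more than `factor`x."""
    if len(history) < 2 * window:
        return False  # not enough data to compare
    baseline = sum(history[:window]) / window
    recent = sum(history[-window:]) / window
    return recent > baseline * factor

# Five healthy runs, then five where the agent got 6x more verbose
# for slightly worse quality:
ratios = [efficiency_ratio(t, q) for t, q in
          [(10_000, 4.0)] * 5 + [(60_000, 3.5)] * 5]
print(ratio_regressed(ratios))  # True
```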
#Failure taxonomy
Not all failures are equal. I think about three categories:
Hard failures are the easy ones. The agent crashes. The API returns an error. The task times out. These are standard monitoring territory. Code-graded, deterministic, alerts fire immediately.
Soft failures are trickier. The agent completes the task but the quality is poor. Claudia sends a customer response that's technically accurate but misses the tone. Clawrence flags a tender that doesn't actually match our capabilities. The agent "succeeded" by its own measure but failed by ours. This is where LLM-as-judge becomes essential. You write a rubric for each task type, run every output through a grading model, and flag anything that scores below threshold.
Silent failures are the worst. The agent does nothing when it should have done something. A customer message goes unanswered because the agent didn't recognise it as a question. A cost threshold gets crossed because the agent didn't flag it. You only discover these by noticing what didn't happen, which is much harder than noticing what did. The only way to catch silent failures is to define expected behaviour proactively: "Claudia should respond to every customer message within 5 minutes. If she doesn't, that's a failure." Then you monitor for the absence of the expected event.
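Monitoring for absence can be sketched as a scan for expected events that never arrived. The 5-minute SLA comes from the Claudia example above; the data shapes are my own invention.

```python
from datetime import datetime, timedelta

# Sketch of absence monitoring: a silent failure is the *missing*
# event, so we alert on expected responses that never arrived.
def unanswered_messages(messages, responses, sla=timedelta(minutes=5),
                        now=None):
    """Return ids of messages past SLA with no matching response.
    `messages` maps id -> received time; `responses` is a set of
    message ids that have been answered."""
    now = now or datetime.utcnow()
    return [mid for mid, received in messages.items()
            if mid not in responses and now - received > sla]

now = datetime(2026, 1, 1, 12, 0)
msgs = {"m1": now - timedelta(minutes=10),
        "m2": now - timedelta(minutes=2)}
print(unanswered_messages(msgs, responses=set(), now=now))  # ['m1']
```

Note that this only works because the expectation ("every message gets a response within 5 minutes") was written down first; without that, there is no event whose absence you can detect.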
WIMAUT currently handles hard failures well. Soft and silent failures are where the evaluation framework needs to go next.
#Drift detection
Agents degrade over time in ways that aren't obvious. A model update changes the behaviour slightly. The context window fills up and older instructions get dropped. A skill that used to work reliably starts failing because the external API it depends on changed its response format.
Anthropic's research flagged this explicitly: even small changes in prompt formatting can shift model accuracy by 5%. For production agents, that kind of drift is constant. Every model update, every context change, every API version bump is a potential regression.
Drift detection means continuously comparing current agent performance against a baseline. If Claudia's average tone score drops from 4.2 to 3.6 over two weeks, something changed. Maybe the model, maybe the prompt, maybe the data. But without tracking the baseline, you'd never notice until a customer complains.
The implementation I'm building toward: run a fixed set of "canary" evaluations on a schedule. Same inputs, same rubrics, tracked over time. If the scores diverge from baseline beyond a threshold, alert. It's the agent equivalent of a regression test suite, except instead of pass/fail it's a continuous quality signal.
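The comparison step of that canary suite is simple to sketch. The threshold is an illustrative choice; tuning it against your score variance is the real work.

```python
# Sketch of canary-based drift detection: fixed inputs, scores
# tracked over time, alert when the recent mean diverges from the
# baseline beyond a threshold. Threshold value is illustrative.
def drifted(baseline_scores, recent_scores, threshold=0.5):
    """True if the recent mean score has moved more than
    `threshold` away from the baseline mean, in either direction."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    recent = sum(recent_scores) / len(recent_scores)
    return abs(recent - baseline) > threshold

# Tone scores drifting from ~4.2 to ~3.6, as in the example above:
print(drifted([4.2, 4.3, 4.1, 4.2], [3.7, 3.6, 3.5, 3.6]))  # True
```

Alerting on movement in either direction is deliberate: a score that suddenly jumps up can also mean the judge or the inputs changed, not that the agent got better.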
#Cost attribution
This is the one that would have saved me $300. Every agent run should have a cost budget. Not just a total spend limit, but an expected cost per task type. If a daily summary usually costs $0.50 and today it cost $5, that's a signal. Either the task was harder than usual (fine) or the agent went off the rails (not fine).
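The per-task-type budget check is a few lines. The task names, expected costs, and the 3x factor here are all hypothetical:

```python
# Hypothetical per-task-type cost expectations, following the
# "$0.50 daily summary that suddenly costs $5" example above.
EXPECTED_COST = {"daily_summary": 0.50, "competitor_analysis": 1.20}

def cost_anomalies(run_costs, factor=3.0, expected=EXPECTED_COST):
    """Return (task_type, cost) pairs exceeding the expected cost
    by more than `factor`x. Unknown task types are never flagged."""
    return [(task, cost) for task, cost in run_costs
            if cost > expected.get(task, float("inf")) * factor]

print(cost_anomalies([("daily_summary", 5.00),
                      ("competitor_analysis", 1.10)]))
# [('daily_summary', 5.0)]
```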
Cost attribution also matters for multi-agent systems. When you have five agents running concurrently, you need to know which one is responsible for the bill spike. WIMAUT breaks down cost by agent, by task type, and by time window.
#Evals in CI/CD
One thing both OpenAI and Anthropic agree on: evaluations should run as part of your deployment pipeline, not as an afterthought. Before you push a prompt change or model update to production, your eval suite should run and gate the deployment. If the new prompt scores lower on your rubrics than the current one, the deploy should fail.
For agent systems, this means treating prompt changes the same way you treat code changes. Write the eval first. Define what "good" looks like. Make the change. Run the suite. If it passes, deploy. If it doesn't, iterate. It's test-driven development applied to AI systems.
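The gate itself can be sketched as a comparison step that a CI job runs and fails on. The tolerance value and score format are assumptions; in practice the scores would come from running the eval suite against the candidate and production prompts.

```python
# Sketch of an eval gate for CI: compare the candidate prompt's eval
# scores against the production baseline and block on regression.
# Scores here are stand-ins for real eval-suite runs.
def gate(candidate_scores, baseline_scores, tolerance=0.05):
    """Return 1 (block deploy) if the candidate's mean score
    regresses by more than `tolerance` versus baseline, else 0."""
    candidate = sum(candidate_scores) / len(candidate_scores)
    baseline = sum(baseline_scores) / len(baseline_scores)
    if candidate < baseline - tolerance:
        print(f"FAIL: candidate {candidate:.2f} vs baseline {baseline:.2f}")
        return 1
    print(f"PASS: candidate {candidate:.2f} vs baseline {baseline:.2f}")
    return 0

print(gate([0.78, 0.80], [0.90, 0.88]))  # 1: regression, block the deploy
```

A CI job would feed the return value to its exit code, so a regressed prompt change fails the pipeline the same way a failing unit test does.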
I haven't built this into WIMAUT yet, but it's the natural extension. Right now evals would run post-deployment and catch regressions reactively. The goal is to run them pre-deployment and prevent regressions entirely.
#Where this goes
Right now WIMAUT is a monitoring dashboard. Useful, but reactive. You see problems after they happen.
The next version needs to be an agent orchestration platform. Not just "what is my agent up to" but "how do I manage a fleet of agents." That means:
- Budget enforcement. Kill or pause an agent when it exceeds cost thresholds. Don't just alert. Act.
- Automated evaluation. Run quality checks on agent outputs before they reach customers or production systems.
- Agent lifecycle management. Start, stop, restart, update agents from a single interface. Currently I manage Clawrence and Claudia through separate terminal sessions.
- Cross-agent coordination. When multiple agents work on related tasks, they need awareness of each other. WIMAUT could be the coordination layer.
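The budget-enforcement idea in that list, "don't just alert, act", can be sketched as a spend tracker with a pause trigger. The class shape, agent names, and the idea of pausing via a flag (rather than an actual process kill) are all illustrative.

```python
# Sketch of budget enforcement: per-agent spend caps that pause the
# agent once crossed, instead of only alerting. Names and the pause
# mechanism are illustrative.
class BudgetEnforcer:
    def __init__(self, budgets):
        self.budgets = budgets                    # agent -> USD cap
        self.spend = {agent: 0.0 for agent in budgets}
        self.paused = set()

    def record(self, agent, cost):
        """Record spend; pause the agent if it exceeds its cap.
        Returns True if the agent is now paused."""
        self.spend[agent] += cost
        if self.spend[agent] > self.budgets[agent]:
            self.paused.add(agent)
        return agent in self.paused

enforcer = BudgetEnforcer({"clawrence": 10.0})
for _ in range(5):
    enforcer.record("clawrence", 3.0)  # five $3 runs against a $10 cap
print(enforcer.paused)  # {'clawrence'}
```

In a real orchestrator the `paused` set would gate task dispatch, so a runaway cron like the $300 one stops itself instead of waiting for a human to read the bill.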
The broader problem is that we're building increasingly complex agent systems with tools designed for single-model interactions. It's like running a fleet of delivery trucks but only having a speedometer for each one. You need a dispatch centre.
#The hackathon
I built WIMAUT at the Codex hackathon in Singapore. The entire thing was built on Codex during the event. No pre-written code, no templates. Just a problem I'd been thinking about since the $300 incident and a weekend to build a solution.
The irony of building an agent observability tool using an AI agent wasn't lost on me. At one point during the hackathon, Codex got stuck in a loop on a UI component and I had no way to see that it was stuck except by watching the terminal. Which is exactly the problem WIMAUT solves.
#Why this matters for AI products
If you're building anything with agents in production, you need observability. Full stop. The "deploy and hope" approach works for demos. It doesn't work when agents are talking to your customers, spending your money, and making decisions on your behalf.
The evaluation problem is even more important, and much less solved. We have decades of tooling for monitoring traditional software. We have almost nothing for evaluating whether an AI agent is doing its job well. That's the gap I'm trying to fill.
The agents are getting more capable every month. The tooling for managing them is years behind. Someone needs to build the Datadog for agents. WIMAUT is my attempt at figuring out what that looks like.