Skip to content
← All posts

Breaking Viola Out of Her Box

I build Viola, a voice assistant, mostly by orchestrating fleets of AI coding agents. The funniest thing I learned doing it is also the most fitting: the AIs building an AI do not, by default, trust AI. Ask an agent to build something that leans on the model's judgment and it quietly turns it into a pile of if-statements. Ask it why something broke and it blames the model first. That one reflex, more than any single bug, was the thing holding Viola back.

Making Viola agentic takes two things. She has to be trusted to decide for herself, and she has to have hands and eyes to act in the world. This is mostly the story of the first, because it was the harder half, and it was hard for a reason I did not see coming.

Viola's whole idea is simple. You say something, the model reads exactly what you said, and it decides what to do. No script, no menu, no magic words. That only works if the code underneath hands the model the real situation and gets out of the way. And over and over, the agents building that code could not help themselves. They put her in a box.

The determinism reflex

A "box" is any layer that tries to outguess the model. A keyword matcher that decides what you meant before the model sees it. A rule that reads the model's own reply and branches on whether it contains the word "file." A little hint slipped into the model's context to nudge it toward an answer. Every one looks helpful in the moment. Every one is a trap, because the second a real person says something outside the word list the author imagined, the model silently gets the wrong treatment.

The concrete ones make the point better than the theory:

  • A regex that grabbed "play X" before the model ever ran. Ask for "play Drake" and it played the top general video result, a reaction clip, because the model was skipped entirely.
  • Speech-cleanup code that edited your words before the model read them. It turned "play something I like" into "play something I."

The pattern was relentless. I would delete a box, and a week later a fresh one appeared to make some test turn green. To an agent, a regex feels safer than trusting the model. It almost never is, and every time, Viola got a little narrower, in ways that only surfaced when a real person said something a little unusual.

The blame-the-model reflex

Same instinct, other direction. When something broke, the first theory was almost always "the model hallucinated." It almost never had.

  • Phone calls were coming through as garbled text. The theory was that the language model was hallucinating. The real cause was the speech-to-text model, too small for phone-quality audio: when it got unsure, it climbed an internal "get more creative" ladder until it invented words, differently each time. Move to a stronger speech-to-text model and the garble vanished, the same words every run.
  • A reply that played twice got blamed on the model doubling itself. It was two message blocks inside one answer that we were simply not filtering.

One pattern is worth saying plainly. My coding agents are mostly Claude, and Viola runs on an OpenAI model by default. When Claude realized ChatGPT was under the hood, it got quicker to blame that model when something broke. It was the easy explanation, and usually the wrong one. The bug was almost always in our own code.

What actually fixed it

Not a clever new layer. A rule, and a guard that enforces it.

The rule is one sentence: the runtime hands the model the raw context and trusts it to decide. No classifiers reading your query, no code parsing the model's words to branch on them, no hint fragments slipped into its context. When the model genuinely needs guidance, that guidance lives in one place, the prompt, where I can read the whole thing at once.

The guard is a test that fails the build the moment a new classifier or keyword-parser or injected hint shows up in the agent code. You cannot merge a box back in without it stopping you. We call it a ratchet: kill a class of bug, leave a tripwire so it cannot come back quietly. It is the single most useful habit I picked up on this project.

Here is the split that took me longest to see. Inside Viola's runtime, the answer was to trust the model completely. But the agents building that runtime are exactly the ones that keep reaching for the box, so those I had to keep honest, with rules and ratchets and a lot of reading the actual output. Trust the model in the product; verify the models that build the product. I wrote up how that fleet actually runs in How We Build Viola.

And under all of it was one dull discipline that did the real work: read the actual thing. Not the token count, not the green checkmark, not the transcript on my phone. The real response, the real trace, opened and read end to end. Every bug above was misdiagnosed until someone stopped trusting the summary and looked at the artifact.

The part I have not solved. Trusting the model has a price. When you refuse to hard-code behavior, you are at the mercy of whether the model follows its instructions. Viola is told to hand big jobs to a sub-agent. She has the tool. Sometimes she just does the whole thing herself and ignores the instruction. The honest fix is not another classifier sneaking the control back. It is better prompting, and accepting that a system built on judgment will sometimes use judgment I did not ask for. I would still take that over a thousand brittle if-statements.

Giving her hands and eyes

Trusting the model gets you an assistant that can decide. It does not, on its own, get you one that can do anything. For that, Viola needed hands and eyes, and we built almost none of that ourselves. We stood on open source, and I would make that call again.

Her eyes are the ability to see a screen or a web page. She captures what is in front of her and looks at it, using open tools like Playwright, which drives a real web browser, and fast screen-capture libraries. The seeing itself is the model's own vision; the open-source pieces put the picture in front of it.

Her hands are the ability to act. Filling in a form and checking out on a site, again through Playwright. Placing a phone call, through the open-source Pipecat voice pipeline running over the Telnyx carrier. Reading a calendar, through the open CalDAV, Google, and iCloud libraries. Each of those is a piece we leaned on instead of reinventing. They connect through the Model Context Protocol, an open standard, so Viola's hands are not a fixed list she is stuck with, but a set of abilities she can reason about and choose from.

None of this would have been realistic to build alone in the time we had, and I am glad it existed. I wrote more about what we borrowed, and why borrowing beat building, in Build, Buy, or Borrow.

Today's assistants let me down, and I want to do better. Whether Viola is better is not for me to say. But she is out of the box now, with real hands and eyes to use, and that feels like the right place to start.