May 16, 2026 · Dylan Grech

Directing Agents, Not Writing Code: How Chalie Develops Itself

An inside look at the autonomous development workflow behind Chalie — where AI agents build features, run regression gates, and improve themselves nightly while a human steers direction.

engineering ai automation development

Most people think building an AI assistant means writing a lot of code. For Chalie, the reality is different. The codebase evolves through autonomous agents that research, implement, test, and ship — while I direct what gets built and why. This isn’t a future aspiration. It’s how every feature in the current release was developed.

Here’s how the system works.

The nightly harness: 50+ scenarios that define “correct”

Every night, a test harness spins up a fresh container, installs Chalie from the current branch, and runs over fifty black-box scenario tests against it. Each scenario mimics a real user interaction — sending messages, uploading files, asking questions — and asserts on the outcome through database state, HTTP responses, and log patterns.

The scenarios cover everything from basic health checks to multi-step workflows: memory recall across conversation boundaries, tool routing for weather and email, subagent dispatch and return, calendar sync, home automation control, document lifecycle, and policy enforcement. They’re written in YAML and designed to be deterministic — pass or fail, no ambiguity. A separate judge model observes each run for non-functional anomalies: excessive token usage, slow responses, hallucinated content, redundant tool calls. The scenarios tell you if the feature works; the judge tells you if it works well.
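To make that concrete, a scenario might look something like this (the schema shown here is illustrative, not the harness's actual format):

```yaml
# Illustrative scenario only; field names are not Chalie's real schema.
name: memory-recall-across-conversations
steps:
  - send_message: "My dentist appointment is next Tuesday at 3pm."
  - new_conversation: true
  - send_message: "When is my dentist appointment?"
assertions:
  - http_status: 200
  - response_contains: "Tuesday"
  - db_row_exists:
      table: memories
      where: "topic LIKE '%dentist%'"
  - log_matches: "memory_recall"
```

The point is the shape: every step is an action a real user could take, and every assertion is something a machine can check without interpretation.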

The harness is the source of truth. If Chalie doesn’t pass its scenarios, it doesn’t ship. If a feature doesn’t have scenarios, it isn’t finished.

The improvement loop: subtract before you add

Once a nightly run completes, an autonomous improvement agent wakes up. It reads the test results, identifies the weakest scenarios, pulls up the judge’s diagnostic notes, and picks one concrete opportunity. Then it implements a fix, commits it to a branch, runs the targeted scenarios again, and compares before-and-after scores.

If scores improve and the targeted anomaly disappears, it opens a pull request. If not, it deletes the branch, records what it tried and why it failed, and moves on.

The philosophy is subtractive. The first hypothesis is always that Chalie is doing too much — too many instructions, too much context, too many special cases. Remove noise first. Only add complexity when removal has been ruled out. Every attempt is tagged as subtractive or additive, creating a clear record of which approaches work and which don't for each category of problem.
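In pseudocode, one cycle looks roughly like this (the helper names are hypothetical stand-ins for the real agent tooling, not Chalie's internals):

```python
# Rough sketch of one nightly improvement cycle. All helpers are hypothetical.

def improvement_cycle(nightly_results, judge_notes):
    weakest = min(nightly_results, key=lambda s: s.score)
    opportunity = pick_opportunity(weakest, judge_notes)   # one concrete change, not a rewrite

    # Subtractive first: assume the system is doing too much before adding anything.
    kind = "subtractive" if opportunity.removes_something else "additive"

    branch = create_branch(f"improve/{weakest.name}")
    apply_change(branch, opportunity)

    before = weakest.score
    after = run_scenarios(branch, only=[weakest.name]).score

    if after > before and not judge_flags(branch, opportunity.anomaly):
        open_pull_request(branch, evidence={"before": before, "after": after, "kind": kind})
    else:
        record_failed_attempt(weakest.name, opportunity, kind)  # future cycles can skip this angle
        delete_branch(branch)
```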

This loop has opened dozens of PRs. Most are small: a single line added, a paragraph removed, a routing decision changed. But each one is backed by measurable evidence — a scenario that was failing and now passes, or an anomaly that appeared in the judge diagnostic and is now gone.

Feature development: test-first, agent-executed

When a new feature is needed — say, voice input, or a new tool integration — the workflow follows a rigid pipeline. Research comes first: agents scan the affected code, read existing test scenarios, and surface potential conflicts. Only after research is complete are the test scenarios written.

This is deliberate. The scenarios are written before any implementation code exists, and they must fail before development starts. This proves the scenarios actually test the new behavior and not something that already works by accident.

Then a coding agent implements the feature in an isolated worktree. A critic agent reviews the implementation alongside the nightly test results. If tests fail, the code gets fixed — never the tests. After a maximum of two fix attempts, if tests still fail, the system stops and escalates to a human. No infinite retry loops. No tests rewritten to pass.

The final gate is static analysis. The code must clear linting, dead-code detection, and quality scanners with zero new issues. Only after all gates pass does the feature get declared complete.
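Put together as a single pipeline (again with hypothetical helpers standing in for the actual agents), the ordering and the two-attempt limit look roughly like this:

```python
# Pseudocode-style sketch of the feature pipeline; helper names are hypothetical.

MAX_FIX_ATTEMPTS = 2

def develop_feature(spec):
    research = research_affected_code(spec)        # scan code, read existing scenarios
    scenarios = write_scenarios(spec, research)    # written before any implementation exists

    if not run_scenarios(scenarios).all_failed:
        raise ValueError("new scenarios must fail before implementation starts")

    worktree = create_isolated_worktree(spec.name)
    implement(worktree, spec)

    for attempt in range(MAX_FIX_ATTEMPTS + 1):
        results = run_scenarios(scenarios, worktree)
        review = critic_review(worktree, results)
        if results.all_passed and review.approved:
            break
        if attempt == MAX_FIX_ATTEMPTS:
            return escalate_to_human(spec, results, review)   # no infinite retry loops
        fix_code(worktree, results, review)                   # fix the code, never the tests

    static = run_static_analysis(worktree)         # lint, dead code, quality scanners
    if static.new_issues:
        return escalate_to_human(spec, static)
    return mark_complete(spec, worktree)
```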

Visual testing: browser in the loop

For anything that touches the UI, a separate testing workflow takes over. It spins up a fresh Chalie container, waits for it to become healthy, then drives the interface through a real browser — clicking buttons, filling forms, toggling themes, verifying layouts. A background agent monitors container health and database state while the main agent exercises the frontend.

This catches an entire category of bugs that backend tests miss: broken dark-mode styling, misaligned components, forms that submit but don’t render feedback, features that work in the API but not in the interface.
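A stripped-down version of that kind of check, assuming Playwright as the browser driver and using invented selectors and endpoints, might look like this:

```python
# Minimal sketch of a browser-in-the-loop check using Playwright.
# The base URL, health endpoint, and selectors are assumptions for illustration.
import time

import requests
from playwright.sync_api import sync_playwright

BASE_URL = "http://localhost:8080"   # assumed address of the fresh container

def wait_for_healthy(timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=5).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(2)
    raise RuntimeError("container never became healthy")

def check_dark_mode_and_send():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(BASE_URL)
        page.click("[data-testid='theme-toggle']")           # assumed selector
        page.fill("[data-testid='message-input']", "hello")  # assumed selector
        page.click("[data-testid='send-button']")
        page.wait_for_selector(".message-response")          # submits AND renders feedback
        # assumed theming convention: a "dark" class on the root element
        assert "dark" in (page.locator("html").get_attribute("class") or "")
        browser.close()

wait_for_healthy()
check_dark_mode_and_send()
```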

The tracking layer: persistent memory and a task board

All of this work needs a record. Two purpose-built tools hold that together.

Taskie is an MCP-native project board. Every improvement cycle creates a ticket, records what was attempted, links the resulting PR (or documents why the branch was deleted), and tracks the before-and-after scores. Past tickets prevent agents from retrying failed approaches — if a particular angle was disproved three cycles ago, the next agent can see that and pick a different strategy.

IDE Memory is a persistent memory server that any MCP-compatible coding agent can connect to. Agents write decisions, conventions, and lessons learned into memory after each piece of work. When a future agent starts a new cycle, it reads from memory first — checking what approaches have been validated, what patterns cause regressions, and what the current conventions are. This means the system accumulates institutional knowledge without anyone manually maintaining a wiki.

Together, they give the autonomous loop something most AI systems lack: the ability to learn from its own history without re-discovering the same dead ends.
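Neither tool's schema is spelled out here, so the following is only a conceptual sketch, with invented fields and placeholder values, of the kind of record one cycle writes and a later agent reads back:

```python
# Conceptual sketch only. Taskie and IDE Memory have their own MCP interfaces;
# these structures and the helper below are invented for illustration.

ticket = {
    "cycle": "nightly-example",
    "scenario": "calendar-sync-recurring-events",
    "attempt_kind": "subtractive",
    "change": "removed a duplicate routing instruction",
    "scores": {"before": 0.6, "after": 0.9},   # placeholder numbers
    "outcome": "pr_opened",                     # or "branch_deleted"
}

memory_entry = {
    "kind": "lesson",
    "topic": "tool routing",
    "text": "Adding more routing examples regressed weather scenarios; "
            "prefer removing ambiguous instructions instead.",
}

def should_skip(opportunity, past_tickets):
    """Skip an approach that an earlier cycle already disproved."""
    return any(t["scenario"] == opportunity["scenario"]
               and t["change"] == opportunity["change"]
               and t["outcome"] == "branch_deleted"
               for t in past_tickets)
```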

What my day actually looks like

I don’t write implementation code. I don’t manually run tests. I don’t debug failing scenarios by reading logs. Here’s what I actually do:

Set direction. I decide what features Chalie needs next, what the priorities are, and what trade-offs to accept. This happens in natural language — a description of what I want, why it matters, and any constraints the agents should respect.

Review pull requests. The agents open PRs with evidence: before-and-after scores, judge diagnostics, and the specific anomalies that were resolved. I read the diff, check the evidence, and merge or request changes.

Make judgment calls. When the system escalates — repeated test failures, conflicting constraints, architectural decisions that could go either way — I step in with a decision and the agents continue from there.

Refine the process. The development skills themselves are just instructions. When I notice a pattern of failures or a category of bugs that keeps appearing, I adjust the workflow — add a new gate, change the improvement philosophy, tighten a constraint. The changes take effect on the next cycle.

Building your own version

This setup isn’t magic, and it doesn’t require proprietary infrastructure. The core ingredients are: a comprehensive test suite that defines correct behavior, a loop that can identify failures and propose fixes, evaluation that compares before-and-after with numeric scores, and persistent tracking that prevents redundant work.

The test suite is the hardest part to get right. Everything downstream — the improvement loop, the feature pipeline, the quality gates — is only as good as the tests that anchor it. If your tests are flaky, your loop will chase noise. If your tests don’t cover the real behavior, your loop will optimize for the wrong thing.

Start with scenarios that mimic human behavior and assert on deterministic outcomes. Build from there.
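As a starting point, and nothing more than that, a first harness can be a few dozen lines: load YAML scenarios, drive the system over HTTP, report pass or fail. The endpoint and schema below are assumptions, not Chalie's, and only HTTP-level assertions are handled; database and log checks come later.

```python
# A minimal scenario runner, not Chalie's harness.
import glob

import requests
import yaml

BASE_URL = "http://localhost:8080"   # assumed address of the system under test

def run_scenario(scenario):
    resp = None
    for step in scenario["steps"]:
        if "send_message" in step:
            resp = requests.post(f"{BASE_URL}/messages",
                                 json={"text": step["send_message"]})
    for check in scenario["assertions"]:
        if "http_status" in check and resp.status_code != check["http_status"]:
            return False
        if "response_contains" in check and check["response_contains"] not in resp.text:
            return False
    return True

results = {}
for path in sorted(glob.glob("scenarios/*.yaml")):
    with open(path) as f:
        scenario = yaml.safe_load(f)
    results[scenario["name"]] = run_scenario(scenario)

failed = [name for name, ok in results.items() if not ok]
print(f"{len(results) - len(failed)}/{len(results)} passed; failing: {failed}")
```

Once something like this reports stable results night after night, the improvement loop and the feature pipeline have a foundation to anchor to.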

The numbers

Over the past 46 days: 765 commits, 89 merged pull requests, roughly 7,000 lines of code touched per day. The net delta is negative — about 26,000 lines removed overall. The system is genuinely subtractive.

Most people assume this kind of velocity costs a fortune in tokens. It doesn’t. The pipeline splits into two buckets: Claude Code for the development agents (research, implementation, review, improvement cycles) and Ollama-hosted models as the inference provider inside Chalie itself. The total cost is under $250 per month in subscriptions, burning approximately 13 billion tokens monthly across both buckets. That’s a fraction of the salary of a single junior developer.

The result

On a typical day, multiple improvement cycles run against the current branch. Each targets a specific weakness, implements a focused fix, and either proves its value through the test harness or gets discarded with a clear record of why it failed. Features ship with test scenarios written before the first line of implementation code, reviewed by a critic agent alongside nightly results, and gated by static analysis.

The improvement cycles trend toward subtraction — dead code gets removed, instructions get tightened, special cases get eliminated. New features add lines when they need to, but the default posture is to make the system faster and more reliable by doing less, not more. Capabilities grow while internal complexity shrinks.

The throughput is real. What would take a team of six developers three months of manual work — researching, implementing, testing, reviewing, iterating — collapses into three or four hours of specification writing. The rest is fully autonomous. And the result is more stable than what I could build by hand, because every change is regression-gated before it ships and the system never skips a test out of impatience.

That’s not a pitch — it’s Tuesday.