April 6, 2026 · Dylan Grech
Building an AI That Improves Itself: The Meta-Harness
How we built an autonomous improvement loop that uses Claude Opus to analyze test failures, propose fixes, and create pull requests — with no human writing code.
What if your test suite could fix the bugs it finds?
That’s the question we started asking about two weeks ago. Chalie has a comprehensive nightly test suite — 56 scenario tests and 60 benchmark tasks across 9 dimensions — but every failing test still required a human to read the evidence, understand why it failed, write a fix, and verify the fix didn’t break something else. We were spending more time triaging test results than building features.
So we built a system that does it autonomously. We call it the meta-harness.
What it does
The meta-harness is a continuous improvement loop. Each iteration:
- Runs a full baseline of all benchmarks and scenario tests against the current codebase
- Asks Claude Opus to analyze the results, read the codebase, and propose improvement opportunities
- Dispatches a local coding agent to implement each opportunity on its own git branch
- Evaluates the changes against the targeted tests and benchmarks
- Keeps what works, deletes what doesn’t, and creates pull requests for the survivors
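The steps above can be sketched as a single loop. Everything here — the function names, the dict-shaped opportunities, the fake agents in the toy run — is illustrative, not Chalie's actual API:

```python
# A minimal, self-contained sketch of one meta-harness iteration.

def run_iteration(baseline, opportunities, implement, evaluate, threshold=0.01):
    """Keep opportunities whose targeted scores don't regress; return survivors."""
    survivors = []
    for opp in opportunities:
        changed = implement(opp)             # local coding agent, own git branch
        if not changed:                      # no-diff safety check
            continue
        scores = evaluate(opp["targets"])    # re-run only the targeted tests
        regressed = any(
            scores[t] < baseline[t] - threshold for t in opp["targets"]
        )
        if not regressed:
            survivors.append(opp)            # candidate for an Opus-reviewed PR
    return survivors

# Toy run with fake agents:
baseline = {"D1-recall": 0.60, "D2-routing": 0.70}
opps = [
    {"id": "prompt-tweak", "targets": ["D1-recall"]},
    {"id": "fts-index", "targets": ["D2-routing"]},
]
implement = lambda opp: opp["id"] == "prompt-tweak"  # second opp makes no diff
evaluate = lambda targets: {t: 0.65 for t in targets}
print([o["id"] for o in run_iteration(baseline, opps, implement, evaluate)])
# → ['prompt-tweak']
```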
No human writes code during this process. The system proposes, implements, evaluates, and ships — or rolls back if things get worse.
Architecture: Three agents, one loop
The meta-harness uses three distinct AI agents, each with a different role:
Claude Opus is the research brain. It maintains a persistent conversation that survives Python restarts (session IDs are saved to disk and resumed via claude --resume). Opus reads test results, explores the Chalie codebase, checks its long-term MCP memory for past attempts, and produces a list of improvement opportunities. Each opportunity includes the exact files and functions to change, what’s failing, and why the proposed fix should help. Opus is explicitly forbidden from proposing certain anti-patterns we’ve learned are dead ends — like pre-injecting tool schemas instead of improving the discovery mechanism.
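The session-persistence trick reduces to: save the session ID to disk after each call, and pass it back on the next one. A sketch — the file path is made up, and only `--resume` comes from the post (`--print` for non-interactive output is an assumption about the `claude` CLI):

```python
import json
import os

SESSION_FILE = "opus_session.json"  # illustrative path, not Chalie's

def save_session(session_id):
    """Persist the Opus session ID so the conversation survives restarts."""
    with open(SESSION_FILE, "w") as f:
        json.dump({"session_id": session_id}, f)

def build_claude_cmd(prompt):
    """Build the CLI invocation: resume the saved session if one exists,
    otherwise start a fresh conversation."""
    cmd = ["claude", "--print"]
    if os.path.exists(SESSION_FILE):
        with open(SESSION_FILE) as f:
            cmd += ["--resume", json.load(f)["session_id"]]
    return cmd + [prompt]
```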
A local coding agent (Gemma 4 26B, running on our own hardware via Ollama) does the implementation work. It gets a precise task description, a directory map of the codebase, and two passes: an implementation pass and a self-reflection pass where it reviews its own changes against codebase health metrics. If it makes no code changes across both passes — which happens more often than we’d like — the opportunity is immediately marked as failed and skipped.
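The no-diff safety check is just "fingerprint the worktree before and after both passes." A self-contained sketch, using a file-tree hash as a cheap stand-in for the `git diff` the real worktree-based system would consult:

```python
import hashlib
import os

def tree_fingerprint(root):
    """Hash every file path and its contents under root."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()

def run_coding_agent(worktree, implement_pass, reflect_pass):
    """Two-pass protocol with the no-diff safety check: if neither the
    implementation pass nor the self-reflection pass changes any file,
    mark the opportunity failed and skip it immediately."""
    before = tree_fingerprint(worktree)
    implement_pass(worktree)
    reflect_pass(worktree)
    return "ok" if tree_fingerprint(worktree) != before else "failed: no diff"
```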
An Ollama judge (Gemma 4 31B) evaluates every test and benchmark. For scenario tests, it scores each step on four dimensions: correctness, completeness, latency, and efficiency. For benchmarks, it produces a single score per task, focused on harness quality — how well the system’s prompts, routing, and context assembly amplify the base LLM. The judge was originally Claude itself, invoked via a CLI subprocess. We switched to Ollama for speed and cost — roughly 10x faster per verdict.
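The dimension names come from our judge rubric, but how they collapse into a step score isn't shown above — the equal-weight averaging here is an illustrative assumption:

```python
STEP_DIMENSIONS = ("correctness", "completeness", "latency", "efficiency")

def score_step(verdict):
    """Collapse one step's four judge dimensions (each 0..1) into a step
    score. Equal weighting is an assumption, not the actual rubric."""
    missing = [d for d in STEP_DIMENSIONS if d not in verdict]
    if missing:
        raise ValueError(f"judge verdict missing dimensions: {missing}")
    return sum(verdict[d] for d in STEP_DIMENSIONS) / len(STEP_DIMENSIONS)

def score_scenario(step_verdicts):
    """Score a scenario as the mean of its step scores."""
    return sum(score_step(v) for v in step_verdicts) / len(step_verdicts)
```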
The CHII scoring system
We evaluate Chalie’s harness across 9 weighted dimensions, producing a composite score called CHII (Chalie Harness Intelligence Index):
| Dimension | What it measures | Weight |
|---|---|---|
| D1 Context Recall | Can it retrieve the right knowledge when asked? | 18% |
| D2 Skill Routing | Does it pick the right tool for the job? | 13% |
| D3 Cross-Source | Can it synthesize across memory, docs, and tools? | 13% |
| D9 Proactive Intel | Does it surface relevant info before being asked? | 11% |
| D4 Correction | Do corrections propagate through the knowledge graph? | 9% |
| D5 Temporal | Can it reason about time, recency, and sequences? | 9% |
| D6 Methodology | Does it learn and reuse effective approaches? | 9% |
| D7 Identity | Does it maintain consistent personality under pressure? | 9% |
| D8 Efficiency | Does it get there without wasting tokens or tool calls? | 9% |
Each benchmark task includes specific signals for the judge to look for — “Response contains Elena”, “No hallucinated siblings”, “Does not say I don’t know”. This makes evaluation concrete and reproducible, even with an LLM judge.
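Under the weights above, the composite is presumably a plain weighted sum, and each signal reduces to a present/absent assertion. A sketch — the combination rule and the substring encoding of signals are our simplifications:

```python
import math

# Weights from the table above (fractions of 1.0).
CHII_WEIGHTS = {
    "D1": 0.18, "D2": 0.13, "D3": 0.13, "D9": 0.11,
    "D4": 0.09, "D5": 0.09, "D6": 0.09, "D7": 0.09, "D8": 0.09,
}

def chii(dimension_scores):
    """Weighted composite of per-dimension scores (each 0..1). A weighted
    sum is the obvious reading of the table, not a documented formula."""
    assert math.isclose(sum(CHII_WEIGHTS.values()), 1.0)
    return sum(w * dimension_scores[d] for d, w in CHII_WEIGHTS.items())

def check_signals(response, signals):
    """Judge signals modeled as (substring, should_be_present) pairs —
    e.g. ("Elena", True), ("I don't know", False). A simplification of
    the natural-language signals the judge actually receives."""
    return all((sub in response) == present for sub, present in signals)
```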
The regression guard
Every opportunity runs through a strict pipeline before it can become a pull request:
- Baseline scores are captured from the database at the moment opportunities are created — before any implementation begins. This prevents contamination: if opportunity A changes scores before opportunity B reads its baseline, B would be comparing against a shifted target.
- The coding agent works in a git worktree, completely isolated from the main branch. Changes are committed and pushed to a feature branch.
- Only the benchmark tasks and scenario tests that the opportunity explicitly targets are re-run. No need to run all 116 tests for a change that only affects memory recall.
- If any targeted score drops more than 0.01 below baseline, the remote branch is deleted and the opportunity is marked failed. No exceptions.
- Passing branches go to Opus for review. Opus decides whether to create a PR or delete the branch, and saves memories about what worked and what didn’t for future iterations.
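The contamination-prevention step is worth a sketch of its own: the baseline must be a frozen copy, detached from the live scores that later opportunities will shift. Illustrative, not Chalie's actual schema:

```python
import copy

class FrozenBaseline:
    """Snapshot per-test scores at the moment opportunities are created,
    so later comparisons aren't contaminated by earlier opportunities
    shifting the live scores."""

    def __init__(self, live_scores):
        self._frozen = copy.deepcopy(live_scores)  # detach from the live dict

    def guard(self, test_id, new_score, tolerance=0.01):
        """The regression guard: pass only if the targeted score hasn't
        dropped more than `tolerance` (the 0.01 above) below baseline."""
        return new_score >= self._frozen[test_id] - tolerance

# Usage: later score changes don't move the comparison target.
live = {"t1": 0.80}
base = FrozenBaseline(live)
live["t1"] = 0.95                 # another opportunity shifts the live score
print(base.guard("t1", 0.795))    # within tolerance of the frozen 0.80 → True
print(base.guard("t1", 0.78))     # regression → branch deleted → False
```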
What we learned the hard way
Prompt changes are the highest-leverage intervention. Our first five iterations tried code-level fixes: adjusting confidence thresholds, tweaking FTS indexing, modifying trait classification logic. All either produced zero code changes or neutral-to-negative results. Iteration seven added five lines of behavioral guidance to the system prompt. Result: 7 of 9 targeted scenarios improved, with three jumping from WARN to PASS. The system prompt is the actual control surface because it determines whether the model invokes capabilities at all.
The local coder is the weakest link. Gemma 4 26B, despite being a strong model, struggles to navigate a large unfamiliar codebase and make targeted multi-file changes. In early runs, it would “review” the code, agree with the proposed changes, and then… do nothing. We added a no-diff safety check: if both passes produce zero code changes, skip immediately. We also rewrote the prompt from exploratory (“You are improving…”) to directive (“Implement the smallest, highest-impact change”). This helped, but the fundamental issue remains — for complex changes, a local model isn’t enough.
Infrastructure bugs dominate the signal. Our worst debugging session wasn’t a code bug — it was a DNS resolution failure. The meta-harness was attaching to the test container using a Docker-internal hostname (chalie-test:8081) that the host machine couldn’t resolve. Every scenario test scored 0.0, and the Ollama judge faithfully reported “hostname resolution error” as a pipeline failure. The scores looked like Chalie had catastrophically regressed, but nothing had actually changed about Chalie — the test runner just couldn’t reach it.
Small improvements are invisible. LLM-based tests carry roughly ±5% noise per run from non-determinism, while our regression guard threshold is 0.01 — just 1% on the 0–1 score scale. This means improvements smaller than the noise floor get randomly accepted or rejected. Changes need to be architecturally significant — affecting whether a capability is invoked at all, not just the quality of what’s invoked.
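A toy Monte Carlo model makes the problem concrete — uniform ±5% noise on both the baseline and the post-change measurement, with the 0.01 guard threshold (the uniform-noise assumption is ours):

```python
import random

def acceptance_rate(true_delta, noise=0.05, threshold=0.01,
                    trials=20000, seed=0):
    """Fraction of trials in which a change with true score delta
    `true_delta` survives the regression guard, when each measurement
    carries uniform ±`noise` judge non-determinism."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(trials):
        measured_baseline = 0.5 + rng.uniform(-noise, noise)
        measured_after = 0.5 + true_delta + rng.uniform(-noise, noise)
        if measured_after >= measured_baseline - threshold:
            passes += 1
    return passes / trials
```

In this toy model a no-op change passes about 60% of the time, and a genuine +2% improvement is still rejected roughly a quarter of the time — the guard only discriminates reliably once the true effect clears the noise floor by a wide margin.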
Millions of tokens, three lines of code. Our first 16-hour run consumed millions of tokens across Opus analysis, local agent passes, and judge evaluations. The net output: three lines of code changes. The cost-per-line was absurd. But those three lines were in the system prompt, and they moved real scores. The meta-harness isn’t a code factory — it’s a search process. Most of the work is exploring the space of possible improvements and ruling out dead ends.
Results so far
After a week of iteration:
- 24 opportunities proposed and evaluated
- 11 passed the regression guard (46%)
- 13 failed — some from score regression, some from zero code changes, some from infrastructure errors
- CHII score improved from 0.085 to 0.350 (the early near-zero was partly from a broken judge pipeline, but real gains followed)
- 2 PRs created autonomously in the latest cycle
- The single most impactful PR across the entire project was Opus-driven: five behavioral principles added to the system prompt
Where it’s going
The meta-harness is still early. Comparing to the research paper that inspired it (Lee et al., Stanford/MIT/KRAFTON, 2026), we’re operating at a fraction of the scale — 3-5 ideas per iteration versus the paper’s 60 candidates, a local model versus the paper’s cloud-tier agent, and no execution trace feedback to the proposer. These are known gaps we’re actively closing.
The bigger insight is that autonomous improvement of AI harnesses is viable even at small scale, even with local models, even with noisy evaluation. The loop works. It just needs to run more efficiently — better targeting, fewer wasted passes, and a coder agent that can actually execute on what the research agent discovers.