June 23, 2026 · Dylan Grech

Ten Models, One Assistant: Which Engine Drives Chalie Best?

We put ten language models — frontier and open-weight — through the same battery of everyday tasks inside Chalie, then scored each on accuracy, speed, cost and how cleanly it drove. The spread wasn't what we expected.

Tags: benchmark, models, ai, chalie

When people ask how Chalie works, the honest answer has two halves. One half is Chalie itself — the memory, the tools, the way it plans a job and checks its own work. The other half is the language model underneath, the engine doing the raw thinking. You can swap that engine. So the obvious question is: which engine drives Chalie best?

We finally sat down and measured it. Ten models — a mix of frontier cloud systems and open-weight models you could run yourself — each took the wheel of the exact same Chalie build and worked through the exact same set of everyday tasks. Then we scored every one of them on the same four things. This post is the spread.

One caveat up front. These numbers come from a single run against a pre-release build of Chalie v1.0.0-beta. They measure one specific skill, how well a model drives Chalie, not how “smart” a model is in general. An official, regularly updated benchmark page is due on the website in the coming weeks, with more models and repeated runs. Treat this as an early look, not a verdict.

How we scored it

Each model ran through a fixed battery of about a dozen real-world tasks — the kinds of things people actually ask an assistant to do. Remembering a detail you mentioned in passing and using it later. Reading your calendar and rescheduling around a clash. Pulling an answer out of a document you uploaded. Building and tidying a list. Looking something up on the live web and actually seeing the page. Triaging an inbox. Saving a multi-step routine so you never have to explain it twice. Crunching numbers in a file. Finding the right person in a crowded address book.

Every model saw the same tasks, the same build, the same conditions. We then combined four scores into a single number out of ten:

  • Accuracy (60%) — did it actually get the task done, correctly and completely? This is the heart of it.
  • Speed (20%) — how long the whole battery took, end to end.
  • Token economy (10%) — how much it “spent” thinking and talking to get there.
  • Cleanliness (10%) — did it drive smoothly, or thrash around with wasted steps, loops, and second-guessing?
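The weighting above can be sketched in a few lines of Python. This is only an illustration: it assumes each of the four sub-scores has already been normalised to a 0 to 10 scale, which the post does not describe, and the function name, dictionary keys and example numbers are all made up for the sketch.

```python
# Weights taken from the list above; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.60,
    "speed": 0.20,
    "token_economy": 0.10,
    "cleanliness": 0.10,
}


def weighted_score(subscores):
    """Combine four sub-scores (assumed to be on a 0-10 scale) into one 0-10 total."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical sub-scores, not taken from the benchmark run.
example = {"accuracy": 9.0, "speed": 8.0, "token_economy": 7.0, "cleanliness": 6.0}
print(round(weighted_score(example), 2))  # 8.3 with these made-up inputs
```

With accuracy at 60%, a model that is perfect on speed, token economy and cleanliness but gets nothing right would still top out at 4.0.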

The headline

  • GLM-5.2: 9.02
  • GLM-5.1: 8.83
  • Gemini 3.5 Flash: 8.07
  • Gemma-4 31B: 7.35
  • MiniMax-M3: 7.25
  • DeepSeek-V4 Flash: 7.03
  • GPT-OSS 120B: 6.03
  • Nemotron-3 Nano: 3.26
  • Nemotron-Ultra 550B: 2.86
  • Qwen3-Next 80B: 2.00

Overall weighted score out of 10. Gemma-4 31B is the one model here small enough to run on a single high-end workstation.

The top of the board is closer than we expected, and it isn’t a clean “biggest model wins” story.

GLM-5.2 took first place at 9.02. It paired near-top accuracy with one of the fastest, leanest runs of the whole field — the strongest open-weight showing by a wide margin. GLM-5.1 came in right behind it. Gemini 3.5 Flash was the single most accurate driver — it completed every task in the battery — but it was slower and far more token-hungry than the GLMs, which is what pulled its weighted total to third.

If you only care about raw correctness, Gemini is the reference to beat. If you care about correctness per second and per token, the GLM pair quietly wins.

The spread

Averages hide the interesting part. Here’s every model plotted by accuracy against how fast it finished — the upper-right corner is the dream (accurate and quick).

[Scatter plot, one dot per model. Horizontal axis: accuracy, how much of each task it handled well (0% to 100%). Vertical axis: time for the whole suite in minutes (ticks at 30m, 40m and 50m), with faster runs higher up.]
Each dot is one model: further right is more accurate, higher is faster. The front-runners cluster tightly in the upper right; a long tail trails off to the left.

Two things jump out. First, there’s a clear front pack — five or six models that genuinely drive Chalie well — and then a cliff. Second, size is not the story. The single largest model in the test, a 550-billion-parameter system, sits down in the bottom-left: it marched through every task without actually satisfying many of them. An 80-billion-parameter model fared worse still. Meanwhile a 31-billion-parameter open model lands comfortably in the front pack. Raw parameter count told us almost nothing about who could operate a real assistant.
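A quick way to check the size point against the published numbers is to rank the models by parameter count and by overall score, then compare the two orderings. The sketch below covers only the four models whose names carry a size; reading the parameter count off the model name is an assumption, and `names_sorted_by` is a hypothetical helper.

```python
# (model, parameters in billions as read from the model name, overall score out of 10)
sized_models = [
    ("Gemma-4 31B", 31, 7.35),
    ("Qwen3-Next 80B", 80, 2.00),
    ("GPT-OSS 120B", 120, 6.03),
    ("Nemotron-Ultra 550B", 550, 2.86),
]


def names_sorted_by(rows, column):
    """Return model names ordered from highest to lowest value in the given column."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered]


by_size = names_sorted_by(sized_models, column=1)
by_score = names_sorted_by(sized_models, column=2)

for rank, (largest, best) in enumerate(zip(by_size, by_score), start=1):
    print(f"{rank}. by size: {largest:<20} by score: {best}")
```

In this subset the smallest model tops the score ranking and the largest sits third of four.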

What “accuracy” looks like

Because accuracy carries the most weight, it’s worth seeing on its own:

  • GLM-5.2: 91%
  • GLM-5.1: 86%
  • Gemini 3.5 Flash: 94%
  • Gemma-4 31B: 81%
  • MiniMax-M3: 82%
  • DeepSeek-V4 Flash: 79%
  • GPT-OSS 120B: 64%
  • Nemotron-3 Nano: 19%
  • Nemotron-Ultra 550B: 16%
  • Qwen3-Next 80B: 14%

Share of each task handled correctly, averaged across the battery. The drop from the front pack to the tail is steep.

The front pack lands between roughly 79% and 94%. Below that, the floor falls away fast: the weaker models don’t just score a bit lower, they frequently fail to complete a task at all — opening a tool and never landing the result, or talking confidently without doing the work. Agentic driving turns out to be a sharp dividing line. A model can be a perfectly good chatbot and still be unable to reliably operate an assistant that has real tools and real consequences.
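Since accuracy carries 60% of the weight, the published figures can be split into an accuracy part and everything else. The sketch below assumes the accuracy percentage maps linearly onto the 0 to 10 scale (so 91% counts as 9.1), which the post does not confirm. `split_score` is a hypothetical helper, and only the overall scores and accuracy percentages come from the results above.

```python
ACCURACY_WEIGHT = 0.60  # from the scoring breakdown

# model: (overall score out of 10, accuracy in percent)
published = {
    "GLM-5.2": (9.02, 91),
    "GLM-5.1": (8.83, 86),
    "Gemini 3.5 Flash": (8.07, 94),
    "Gemma-4 31B": (7.35, 81),
}


def split_score(overall, accuracy_pct):
    """Split an overall score into (accuracy points, points from everything else).

    Assumes accuracy_pct / 10 is the accuracy sub-score on a 0-10 scale.
    """
    accuracy_points = ACCURACY_WEIGHT * accuracy_pct / 10
    other_points = overall - accuracy_points
    return round(accuracy_points, 2), round(other_points, 2)


for model, (overall, accuracy_pct) in published.items():
    accuracy_points, other_points = split_score(overall, accuracy_pct)
    print(f"{model:<18} accuracy: {accuracy_points:.2f}  other: {other_points:.2f} of 4.00")
```

Under that assumption, Gemini 3.5 Flash earns a little more from accuracy than GLM-5.2 (5.64 against 5.46) but keeps only 2.43 of the remaining 4 points where GLM-5.2 keeps 3.56, which is consistent with its third-place total.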

The open-weight surprise

The result we keep coming back to is Gemma-4 31B. It scored 81% accuracy and landed fourth overall — inside the front pack, within striking distance of frontier cloud models that are an order of magnitude larger and run only in someone else’s data centre. A model this size fits on a single high-end desktop GPU. You could, in principle, run your assistant’s brain entirely on hardware sitting under your own desk.

We read that as a statement about Chalie as much as about Gemma. When the scaffolding around a model is doing its job (planning the work, keeping tools tidy, catching mistakes), a mid-sized open model doesn’t need to be a giant to behave like a capable assistant. The engine matters, but a good chassis lets a smaller engine punch well above its weight.

One footnote for the technically inclined: how a model is served matters as much as which model you pick. The same open weights can behave noticeably differently depending on how aggressively they’re compressed and who’s hosting them. A model that underwhelms through one provider can shine through another running it at full fidelity. If you’re choosing a local or hosted open model, the serving setup deserves as much attention as the name on the box.

What this is, and isn’t

This is a single snapshot, on a pre-release build, measuring one narrow-but-demanding skill: keeping a real assistant on the rails through real tasks. It is not a general ranking of these models, and a low score here is about driving Chalie specifically — not a judgement on a model’s wider abilities. Run the battery again next week and the tail, in particular, will wobble.

That’s exactly why we’re turning this into something living. An official benchmark page is coming to the website soon. It’ll carry the full scoreboard, refresh with every major Chalie release, and grow to include more models over time — so you can watch both the field and Chalie itself improve, out in the open.

For now, the short version: the best engine for Chalie today is a tight race between an open-weight model and a frontier one — and you don’t need the biggest model in the room to get there.