Evolving from Tool Metrics to Methodology Learning

Beyond Tool Reliability

We’ve taken a significant step in how Chalie learns from its own execution. Previously, the CriticService focused on tracking the reliability of specific tools. While useful, it missed the forest for the trees—knowing a tool works is less valuable than knowing how to approach a specific type of goal. I’ve replaced the critic with a new PostLoopReflectionService and a methodology learning framework. Instead of scoring tool success, the system now reflects on the entire execution loop to synthesize procedural guidance for future similar goals.

Methodology Learning v2

The core of this update is the “always-create-new-record” design. To avoid the common pitfall of embedding drift—where updating a single knowledge record’s text eventually mismatches its original vector—we now treat methodology records as immutable snapshots. Each time a loop finishes, we create a new record preserving the original goal’s embedding.

To ensure knowledge actually compounds, the reflection process now uses a KNN search to find the most recent, relevant methodology cluster. We pass the existing guidance into the reflection prompt, allowing the LLM to build upon prior lessons rather than starting from scratch. We’ve tuned the retrieval to prefer the most recent records within a tight 0.05 semantic distance, ensuring the “freshest” compounded knowledge is always at the top of the stack.

Refined Heuristics and Cleanup

This shift allowed for a massive cleanup of the codebase. We’ve deleted the CriticService entirely, along with its associated tests and the InvocationTracker reward logic.

On the retrieval side, I’ve moved away from rigid absolute distance gates (L2 < 0.6) in favor of relative clustering. We also introduced a very_slow decay class (approx. 111-day half-life) for these methodology records, reflecting that procedural knowledge stays relevant much longer than transient facts. The system now focuses on quality at write-time; if a reflection doesn’t meet a minimum quality bar, it isn’t stored, which allowed us to remove the “minimum 3 observations” gate on the read side for faster cold-start learning.