This document defines the cognitive architecture for mode routing and response generation. User input flows through classification, deterministic mode routing (~5ms), and mode-specific LLM generation.
Mode selection is decoupled from response generation. A mathematical router selects the engagement mode using observable signals, then a mode-specific prompt drives the LLM to generate the response. A small LLM tie-breaker handles ambiguous cases.
Most systems route through an LLM — asking it “what should I do?” before asking it “what should I say?” This doubles latency and introduces unpredictability. Chalie separates the two: a fast mathematical router selects the engagement mode from observable conversation signals in ~5ms. The LLM only enters the loop for response generation, shaped by the mode the router already decided. The result is predictable, auditable, and fast — and routing decisions are logged to a PostgreSQL audit trail for inspection and improvement.
Routing (deterministic): Which engagement mode to use — decided by a mathematical scoring function over observable signals (~5ms).
Generation (creative): What to say in that mode — decided by the LLM using a mode-specific prompt (~2-15s depending on mode).
This separation eliminates the extra LLM round-trip for deciding what to do: routing adds no meaningful latency, and its behavior is fully predictable and auditable.
Multiple monitors observe routing quality but none modify weights directly. They log pressure signals. A single RoutingStabilityRegulator (24h cycle) is the only entity that mutates router weights, with bounded corrections (max ±0.02/day) and 48h cooldown per parameter.
The router naturally shifts behavior as memory accumulates:
Each mode has its own prompt template:

- ACT: frontal-cortex-act.md (no soul.md — pure action planning)
- RESPOND: frontal-cortex-respond.md + soul.md
- CLARIFY: frontal-cortex-clarify.md + soul.md
- ACKNOWLEDGE: frontal-cortex-acknowledge.md (no soul.md — lightweight)

The ACT loop uses 8 innate cognitive skills. All are non-LLM operations (fast, sub-cortical).
| Skill | Category | Speed | Purpose |
|---|---|---|---|
| recall | memory | <500ms | Unified retrieval across ALL memory layers (working memory, gists, facts, episodes, concepts, user_traits) |
| memorize | memory | <50ms | Store gists (short-term) and/or facts (medium-term) |
| introspect | perception | <100ms | Self-examination: context_warmth, FOK signal, recall_failure_rate, skill stats, world state, decision explanations (routing audit), recent autonomous actions |
| associate | cognition | <500ms | Spreading activation from seed concepts through semantic graph |
| schedule | scheduling | <100ms | Create/list/cancel reminders and tasks stored in Chalie's own memory |
| autobiography | narrative | <500ms | Retrieve synthesized user narrative covering identity, relationship arc, values, patterns, active threads |
| list | lists | <50ms | Create and manage deterministic lists (shopping, to-do, chores); add/remove/check items, view, history |
| focus | attention | <50ms | Focus session management: set, check, clear. Distraction detection |
| Old Name | Maps To |
|---|---|
| memory_query | recall |
| memory_write | memorize |
| world_state_read | introspect |
| internal_reasoning | recall |
| semantic_query | recall |
User Input → Topic Classifier (embedding-based)
→ Generate embedding (L2-normalised, 768-dim)
→ AdaptiveBoundaryDetector.update(embedding, best_similarity)
├─ Cold start (< 5 msgs): static 0.55 threshold
└─ Active: NEWMA + Transient Surprise → Leaky Accumulator
→ is_boundary? → create new topic : match existing
→ {topic, confidence, switch_score, is_new_topic, boundary_diagnostics}
Boundary diagnostics logged per classification: acc= (accumulator), bound= (dynamic threshold), newma= (drift signal), surprise= (similarity-drop signal).
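The cold-start / NEWMA / leaky-accumulator path above can be sketched as follows. This is a minimal standalone sketch: the class name, EWMA decay rates, leak factor, and the fixed bound are illustrative assumptions; the real AdaptiveBoundaryDetector adapts its threshold over time.

```python
class LeakyBoundaryDetector:
    """Sketch: NEWMA drift + transient surprise feed a leaky accumulator."""

    def __init__(self, a_fast=0.5, a_slow=0.05, leak=0.7, bound=1.0):
        self.a_fast, self.a_slow = a_fast, a_slow   # EWMA decay rates
        self.leak, self.bound = leak, bound
        self.fast = self.slow = None                # two EWMAs of similarity
        self.acc = 0.0                              # leaky evidence accumulator
        self.n = 0

    def update(self, best_similarity: float) -> bool:
        self.n += 1
        if self.fast is None:
            self.fast = self.slow = best_similarity
        else:
            self.fast += self.a_fast * (best_similarity - self.fast)
            self.slow += self.a_slow * (best_similarity - self.slow)
        if self.n <= 5:                             # cold start: static threshold
            return best_similarity < 0.55
        newma = abs(self.fast - self.slow)          # drift signal
        surprise = max(0.0, self.slow - best_similarity)  # similarity-drop signal
        self.acc = self.leak * self.acc + newma + surprise
        return self.acc > self.bound                # is_boundary?
```

The leak means isolated similarity dips decay away, while a sustained drop (a genuine topic shift) accumulates past the bound.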
Classification Result → Load Context:
- Gists, facts, working memory, world state
- Episodes + concepts (vector similarity)
- Calculate context_warmth (0.0-1.0)
Routing Signals → ModeRouterService.route()
→ Score all modes → Select highest
→ If ambiguous: LLM tie-breaker (qwen3:4b, ~2s)
→ {selected_mode, confidence, scores, tiebreaker_used}
If IGNORE → return empty (no LLM call)
If ACT → generate_with_act_loop() → re-route → generate_for_mode()
Otherwise → generate_for_mode(selected_mode)
→ Mode-specific prompt + context → LLM → response
The router collects signals from existing services (all Redis reads, ~5ms total) plus NLP regex patterns (<1ms):
Context Signals (from Redis):
- context_warmth (float 0-1)
- working_memory_turns (int 0-4)
- gist_count (int, excluding cold_start type)
- fact_count (int 0-50), fact_keys (list)
- world_state_present (bool)
- topic_confidence, is_new_topic (from classifier)
- session_exchange_count (int)

NLP Signals (from raw text, regex):

- prompt_token_count, has_question_mark, interrogative_words
- greeting_pattern (hey/hi/hello/yo/sup/etc.)
- explicit_feedback ('positive'/'negative'/None)
- information_density (unique tokens / total tokens)
- implicit_reference ("you remember", "we discussed", "last time")

Each mode gets a weighted composite score:
| Mode | Base | Primary Boosters | Primary Penalties |
|---|---|---|---|
| RESPOND | 0.50 | context_warmth, fact_density, gist_density, question+context | cold start |
| CLARIFY | 0.30 | cold context, question+no_facts, new_topic+question | warm context (>0.6) |
| ACT | 0.20 | question+moderate_context, interrogative+gap_in_facts, implicit_reference | very cold, very warm+facts |
| ACKNOWLEDGE | 0.10 | greeting_pattern (+0.60), positive_feedback (+0.40) | has_question (-0.30) |
| IGNORE | -0.50 | empty_input only (+1.0) | everything else |
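A minimal sketch of the composite scoring, using the base scores and the three coefficients the table states explicitly; the other booster and penalty coefficients shown are hypothetical stand-ins for values that live in the generated router config.

```python
BASE = {'RESPOND': 0.50, 'CLARIFY': 0.30, 'ACT': 0.20,
        'ACKNOWLEDGE': 0.10, 'IGNORE': -0.50}

def score_modes(sig: dict) -> dict:
    s = dict(BASE)
    s['RESPOND'] += 0.30 * sig['context_warmth']          # warmth booster (hypothetical weight)
    if sig['context_warmth'] < 0.2:                       # cold start
        s['RESPOND'] -= 0.20                              # penalty (hypothetical)
        s['CLARIFY'] += 0.15                              # cold-context boost (hypothetical)
    if sig['has_question_mark'] and sig['fact_count'] == 0:
        s['CLARIFY'] += 0.20                              # question + no facts (hypothetical)
    if sig['implicit_reference']:
        s['ACT'] += 0.20                                  # "you remember..." (hypothetical)
    if sig['greeting_pattern']:
        s['ACKNOWLEDGE'] += 0.60                          # from the table
    if sig['has_question_mark']:
        s['ACKNOWLEDGE'] -= 0.30                          # from the table
    if sig['empty_input']:
        s['IGNORE'] += 1.0                                # from the table
    return s

def route(sig: dict):
    scores = score_modes(sig)
    return max(scores, key=scores.get), scores
```

Because every term is additive over observable signals, any routing decision can be replayed from the logged signal_snapshot and weight_snapshot.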
Per-request ephemeral adjustments (NOT weight mutations):
- previous_mode == 'ACT' and ACT was unproductive → act_score -= 0.15
- previous_mode == 'CLARIFY' → respond_score += 0.05 (user just answered a question)

The router also tracks router_confidence for the last 3 exchanges on the same topic. If all 3 were below 0.15 (a low-confidence streak), it widens the tie-breaker margin by +0.05 for that topic. Resets when confidence recovers.
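The low-confidence streak rule is a small piece of per-topic state; a sketch, with the rolling window implemented as a bounded deque (class and method names are illustrative):

```python
from collections import deque

class TopicHysteresis:
    """Widen the tie-breaker margin after 3 consecutive low-confidence
    routes on the same topic; the window resets as confidence recovers."""

    def __init__(self, window=3, threshold=0.15, widen=0.05):
        self.window, self.threshold, self.widen = window, threshold, widen
        self.history = {}   # topic -> deque of recent confidences

    def extra_margin(self, topic: str, confidence: float) -> float:
        h = self.history.setdefault(topic, deque(maxlen=self.window))
        h.append(confidence)
        if len(h) == self.window and all(c < self.threshold for c in h):
            return self.widen
        return 0.0
```

The bounded deque gives the reset behavior for free: one confident route pushes a high value into the window and the streak condition stops holding.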
When top 2 modes are within effective margin, invokes small LLM (qwen3:4b, ~2s):
effective_margin = base - (base - min) × warmth + semantic_uncertainty, where base = 0.20 and min = 0.08
Semantic uncertainty widens margin for:
- implicit_reference (+0.05)
- information_density (+0.03)
- interrogative_words without question mark (+0.03)

The tie-breaker prompt presents only the top 2 candidates with context. Falls back to the higher-scoring mode on failure.
router_confidence = (top_score - second_score) / max(abs(top_score), 0.001)
Used for: offline tuning, detecting unstable routing regions, hysteresis trigger.
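Both formulas are small enough to state directly. A sketch, assuming boolean uncertainty flags (the text gives the increments but not how information_density is thresholded, so that flag is an assumption):

```python
def effective_margin(warmth: float, implicit_reference=False,
                     high_information_density=False,
                     soft_interrogative=False) -> float:
    """effective_margin = base - (base - min) * warmth + semantic_uncertainty,
    with base = 0.20 and min = 0.08 ('min' renamed floor to avoid the builtin)."""
    base, floor = 0.20, 0.08
    uncertainty = (0.05 * bool(implicit_reference)
                   + 0.03 * bool(high_information_density)
                   + 0.03 * bool(soft_interrogative))
    return base - (base - floor) * warmth + uncertainty

def router_confidence(top_score: float, second_score: float) -> float:
    """Normalised gap between the top two mode scores (0.001 guards div-by-zero)."""
    return (top_score - second_score) / max(abs(top_score), 0.001)
```

At warmth 0 the margin is the full 0.20; at warmth 1 it narrows to 0.08, so warm, well-grounded contexts rarely invoke the tie-breaker.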
The ACT loop executes internal actions with safety limits. No decision gate or net value evaluation — the router already decided this is an ACT situation.
The ACT prompt template (frontal-cortex-act.md) is a skeleton with a `` placeholder. The CognitiveTriageService decides which of the 8 innate skills to inject into each ACT prompt — only the relevant skill docs are included, reducing prompt size significantly.
Cognitive primitives (recall, memorize, introspect) are always injected for ACT regardless of triage output. Up to 3 contextual skills are added based on triage reasoning.
Skill doc files live in backend/prompts/skills/{skill}.md — one file per skill. FrontalCortexService._get_injected_skills() loads only the selected files at call time.
Triage output (JSON field "skills": [...]) is validated through a whitelist (_VALID_SKILLS), deduplicated, primitives enforced, and contextual skills sorted and capped at MAX_CONTEXTUAL_SKILLS = 3. The result flows from CognitiveTriageService → TriageResult.skills → context_snapshot['triage_selected_skills'] → tool_worker / generate_with_act_loop → FrontalCortexService.generate_response(selected_skills=...).
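The validation path is a short pipeline; a standalone sketch (the helper name and the ordering of the returned list are assumptions; the whitelist matches the skill table above):

```python
_VALID_SKILLS = {'recall', 'memorize', 'introspect', 'associate',
                 'schedule', 'autobiography', 'list', 'focus'}
_PRIMITIVES = {'recall', 'memorize', 'introspect'}   # always injected for ACT
MAX_CONTEXTUAL_SKILLS = 3

def select_skills(triage_skills: list) -> list:
    """Whitelist, dedupe, enforce primitives, then sort and cap
    the contextual skills at MAX_CONTEXTUAL_SKILLS."""
    valid = {s for s in triage_skills if s in _VALID_SKILLS}   # whitelist + dedupe
    contextual = sorted(valid - _PRIMITIVES)[:MAX_CONTEXTUAL_SKILLS]
    return sorted(_PRIMITIVES) + contextual
```

Anything the triage LLM hallucinates is silently dropped by the whitelist, so a malformed triage result degrades to primitives-only rather than breaking the ACT prompt.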
Token impact: ACT static template ~300 tokens (was ~2,787). Typical ACT call injects ~300–550 tokens of skill docs (4 files) vs. ~1,200 tokens always before.
The ACT loop plans actions with frontal-cortex-act.md (skill docs injected via its `` placeholder); the final user-facing response is produced by generate_for_mode(). Loop continuation is gated by safety limits:

def can_continue(self):
    if self.elapsed >= self.cumulative_timeout:       # 60s default
        return False, 'timeout'
    if self.iteration_number >= self.max_iterations:  # 5 default
        return False, 'max_iterations'
    return True, None
Exit reasons:

- timeout — cumulative timeout reached (safety limit)
- max_iterations — iteration cap reached

After generation, detect router misclassification using user behavior signals from the NEXT exchange:
| Signal | Indicates | Logged As |
|---|---|---|
| User immediately clarifies/repeats | RESPOND was wrong → should be CLARIFY | misroute (missed_clarify) |
| User asks memory-related follow-up | RESPOND was wrong → should be ACT | misroute (missed_act) |
| Negative reward after ACKNOWLEDGE | Should have been RESPOND | misroute (under_engagement) |
| Positive reward after any mode | Routing was correct | correct_route |
Feedback is stored in routing_decisions.feedback (JSONB).
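The table above maps cleanly onto a small labeling function; a sketch in which the next-exchange signal keys and the label strings are illustrative, not the service's actual schema:

```python
def classify_feedback(prev_mode: str, nxt: dict):
    """Label the previous routing decision from next-exchange behavior."""
    if nxt.get('reward') == 'positive':
        return 'correct_route'
    if prev_mode == 'RESPOND' and nxt.get('user_clarified'):
        return 'misroute:missed_clarify'       # should have been CLARIFY
    if prev_mode == 'RESPOND' and nxt.get('memory_followup'):
        return 'misroute:missed_act'           # should have been ACT
    if prev_mode == 'ACKNOWLEDGE' and nxt.get('reward') == 'negative':
        return 'misroute:under_engagement'     # should have been RESPOND
    return None                                # no signal: leave feedback empty
```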
Single authority for weight mutation. Follows TopicStabilityRegulatorService pattern:
- routing_decisions table (last 24h)
- configs/generated/mode_router_config.json

Strong LLM (qwen3:14b) reviews past routing decisions as a consultant, not an authority:
- reflection-queue

Anti-authority safeguards:
Healthy mode distribution ranges:
| Mode | Healthy Range | Red Flag |
|---|---|---|
| RESPOND | 50-75% | >85% (overconfident) or <40% (under-committing) |
| CLARIFY | 8-20% | >30% (over-questioning) or <3% (never clarifying) |
| ACT | 5-15% | <2% (ACT death) or >25% (over-processing) |
| ACKNOWLEDGE | 3-12% | <1% (ignoring social cues) or >20% (trivializing) |
| IGNORE | <2% | >5% (dropping messages) |
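The healthy ranges above can be checked mechanically over the last 24h of routing_decisions; a sketch (function name is illustrative; the IGNORE range is taken as 0-2%):

```python
HEALTHY = {'RESPOND': (0.50, 0.75), 'CLARIFY': (0.08, 0.20),
           'ACT': (0.05, 0.15), 'ACKNOWLEDGE': (0.03, 0.12),
           'IGNORE': (0.00, 0.02)}

def distribution_flags(counts: dict) -> dict:
    """Return each mode whose share of decisions falls outside its
    healthy range, mapped to the observed share."""
    total = sum(counts.values()) or 1
    flags = {}
    for mode, (lo, hi) in HEALTHY.items():
        share = counts.get(mode, 0) / total
        if not lo <= share <= hi:
            flags[mode] = round(share, 3)
    return flags
```

A non-empty result is a pressure signal for the RoutingStabilityRegulator, not a direct weight change.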
Every routing decision is logged to routing_decisions table:
CREATE TABLE routing_decisions (
id UUID PRIMARY KEY,
topic TEXT NOT NULL,
exchange_id TEXT,
selected_mode TEXT NOT NULL,
router_confidence FLOAT,
scores JSONB NOT NULL, -- all mode scores
tiebreaker_used BOOLEAN,
tiebreaker_candidates JSONB,
margin FLOAT,
effective_margin FLOAT,
signal_snapshot JSONB NOT NULL, -- full signal vector
weight_snapshot JSONB,
routing_time_ms FLOAT,
feedback JSONB, -- filled post-exchange
reflection JSONB, -- filled during idle
previous_mode TEXT,
created_at TIMESTAMP
);
ACT loop iterations continue to log to cortex_iterations table for backward compatibility. Simplified fields (decision gate columns use zero-value placeholders).
[ROUTER] Mode selected: RESPOND (confidence: 0.85, 2.3ms)
[ROUTER] Tie-breaker invoked: RESPOND vs CLARIFY → RESPOND
[MODE:ACT] [ACT LOOP] Iteration 0: executing 2 actions
[MODE:RESPOND] Generating response via frontal-cortex-respond.md
The cognitive drift engine models the brain’s Default Mode Network — generating spontaneous internal thoughts during idle periods. These thoughts emerge from residual activation in the semantic memory network and are grounded by episodic experience.
All queues idle? ──no──→ skip
│yes
Recent episodes? ──no──→ skip (nothing to think about)
│yes
Fatigued? ──yes──→ skip (budget exhausted)
│no
Select seed concept (weighted random)
│
Spreading activation (depth 2)
│
Activation energy > 0.4? ──no──→ skip (weak associations)
│yes
Retrieve grounding episode
│
LLM synthesis → reflection | question | hypothesis
│
Store as drift gist (surfaces in frontal cortex context)
| Strategy | Weight | Source |
|---|---|---|
| Decaying | 40% | Concepts with fading strength (0.2 < strength < 2.0), ordered by weakest first |
| Recent | 30% | Concepts linked to the most recent episode |
| Salient | 20% | Concepts related to the highest-salience episode in the last 7 days |
| Random | 10% | Any active concept with confidence >= 0.4 |
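The strategy mix reduces to a weighted random choice over whichever pools are non-empty; a sketch (the pool-building queries are omitted, and renormalising over non-empty pools is an assumption):

```python
import random

def pick_seed(decaying, recent, salient, random_pool, rng=random):
    """Pick a seed-selection strategy by the 40/30/20/10 weights above,
    then a concept from that strategy's pool."""
    pools = [('decaying', decaying, 0.4), ('recent', recent, 0.3),
             ('salient', salient, 0.2), ('random', random_pool, 0.1)]
    available = [(n, p, w) for n, p, w in pools if p]   # skip empty pools
    if not available:
        return None, None                               # nothing to think about
    names = [n for n, _, _ in available]
    weights = [w for _, _, w in available]
    name = rng.choices(names, weights=weights, k=1)[0]
    pool = {n: p for n, p, _ in available}[name]
    return name, rng.choice(pool)
```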
The system currently produces reactive responses (user-prompted) and associative drift thoughts (DMN). The next step is goal-oriented thought — forming intentions and pursuing them across time without user prompting.
Prerequisites:

- Shift from complete-turn encoding to per-message encoding, where each message triggers its own independent memory cycle.
The Adaptive Layer (services/adaptive_layer_service.py) sits between the context assembly step and the LLM call. It translates the user’s detected communication style into concrete, behavioral response directives that are injected as `` in RESPOND, CLARIFY, and ACKNOWLEDGE prompts.
The memory_chunker_worker extracts 9 communication style dimensions per exchange and merges them into a user trait using Exponential Moving Average (EMA). Cold-start uses a faster 0.5/0.5 EMA for the first 5 observations; stable state uses 0.3/0.7.
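The EMA merge is the same per dimension; a sketch using the stated 0.5/0.5 cold-start and 0.3/0.7 stable weights (treating n_observations as the count of prior observations is an assumption):

```python
def merge_style(old: dict, new: dict, n_observations: int) -> dict:
    """EMA merge per style dimension: 0.5 weight on the new observation
    for the first 5 observations (cold start), 0.3 afterwards (stable)."""
    alpha = 0.5 if n_observations < 5 else 0.3
    return {dim: (1 - alpha) * old.get(dim, new[dim]) + alpha * new[dim]
            for dim in new}
```

The faster cold-start EMA lets the trait converge quickly from the default, then the 0.3/0.7 blend damps noise from any single exchange.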
| Dimension | Meaning |
|---|---|
| verbosity | Preference for short vs. long responses (1-10) |
| directness | Indirect suggestion vs. clear assertion (1-10) |
| formality | Casual vs. formal register (1-10) |
| abstraction_level | Concrete action vs. abstract reasoning (1-10) |
| emotional_valence | Logical vs. emotional framing (1-10) |
| certainty_level | Hedging/questioning vs. declarative/confident (1-10) |
| challenge_appetite | Seeks validation vs. seeks counterpoints (1-10) |
| depth_preference | Surface/practical vs. deep/exploratory (1-10) |
| pacing | Rapid short messages vs. slow deliberate ones (1-10) |
AdaptiveLayerService.generate_directives() uses a slot system to prevent over-biasing:
- _observation_count >= 2

| System | Description |
|---|---|
| Micro-preferences | Regex-extracted explicit format requests stored as micro_preference traits. Faster decay (0.015/cycle) than style dimensions. |
| Challenge calibration | challenge_tolerance trait tracks how the user reacts to pushback (positive → increase, negative → decrease). Appetite sets the ceiling; tolerance calibrates within it. |
| Energy mirroring | Per-request comparison of baseline verbosity vs. current message length. Fires when deviation is notable. |
| Interaction forks | Offered when style dimensions are in the ambiguous mid-range (4-7). Conversational choice points (“I can…”), 5-exchange cooldown. |
| Cognitive load regulation | Estimates load from working-memory turn length trends and question density. HIGH/OVERLOAD → simplify-and-structure directive takes first slot. |
| Growth pattern awareness | 30-min background service comparing current style against a slowly-updated baseline. Persistent shifts (3+ cycles) stored as growth_signal:{dim} traits and surfaced sparingly as growth reflections (24h cooldown). |
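Slot filling itself is simple; a sketch of how load regulation pre-empts the first slot (the slot cap, the candidate shape, and the function name are assumptions not given in the text):

```python
MAX_DIRECTIVE_SLOTS = 3   # hypothetical cap

def fill_slots(candidates: list, cognitive_load: str = 'NORMAL') -> list:
    """Slot-capped directive selection: under HIGH/OVERLOAD load the
    simplify-and-structure directive takes the first slot, then the
    remaining slots fill by descending priority.
    candidates: list of (directive_text, priority) pairs."""
    slots = []
    if cognitive_load in ('HIGH', 'OVERLOAD'):
        slots.append('simplify-and-structure')
    for text, _priority in sorted(candidates, key=lambda c: -c[1]):
        if len(slots) >= MAX_DIRECTIVE_SLOTS:
            break
        slots.append(text)
    return slots
```

Capping the slots is what prevents over-biasing: only the strongest few directives ever reach the prompt.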
All adaptive directives carry a trailing line: “When these directives conflict with your identity voice, your voice takes priority.” Identity vectors (identity_modulation) always outrank adaptive directives.