May 11, 2026

Chalie v0.6.2: ONNX Voice Rebuild and Installer Fixes


Chalie v0.6.2 ships a rebuilt voice subsystem powered by ONNX models, along with installer improvements for several Linux distributions, speech quality fixes, and a token-optimization change to find_tools.

The installer now includes branches for Alpine, Arch, and openSUSE to properly install libsndfile and ffmpeg system dependencies that were previously skipped. It also verifies the four expected voice model files after download, warning if any are missing instead of silently reporting success.
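As an illustration of the post-download check, here is a minimal Python sketch. The file names are placeholders (the notes only say that four model files are expected), and the helper names find_missing_models and report_models are hypothetical rather than the installer's actual code.

```python
from pathlib import Path

# Placeholder names: the release notes say four files are expected
# but do not list them.
EXPECTED_MODEL_FILES = (
    "tts-model.onnx",
    "tts-voices.bin",
    "stt-encoder.onnx",
    "stt-decoder.onnx",
)


def find_missing_models(model_dir, expected=EXPECTED_MODEL_FILES):
    """Return the expected files that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for name in expected:
        path = root / name
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def report_models(model_dir):
    """Return a warning line if anything is missing, a success line otherwise."""
    missing = find_missing_models(model_dir)
    if missing:
        return "WARNING: missing voice model files: " + ", ".join(missing)
    return "OK: all voice model files present"
```

The point of the design is that success is reported only after every expected file has been seen on disk, so a partial download can no longer pass silently.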

The voice stack replaces the moonshine-voice combo with single-purpose ONNX libraries: kokoro-onnx for TTS and moonshine-onnx for STT. espeak-ng is bundled as a wheel, removing the need for a system package. Four model files are baked into the image so the first request avoids a network fetch. TTS returns a single WAV blob, with no streaming and no per-sentence chunking. STT splits audio longer than 60 seconds into 60-second windows, supporting transcriptions of up to 10 minutes. Error responses distinguish a transient loading state (503 with a Retry-After of 3 seconds) from terminal model-missing errors; the frontend auto-retries the former and surfaces hints for the latter.
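The STT windowing can be sketched roughly as follows. This is an illustration, not the shipped code: the 16 kHz sample rate, the choice to reject audio over the 10-minute cap (rather than truncate it), and the names split_into_windows and transcribe_long are all assumptions. The transcribe argument stands in for whatever per-window STT call the backend uses.

```python
SAMPLE_RATE = 16_000   # assumed sample rate in Hz; not stated in the notes
WINDOW_SECONDS = 60    # window length from the release notes
MAX_SECONDS = 600      # 10-minute cap from the release notes


def split_into_windows(samples, sample_rate=SAMPLE_RATE):
    """Split a sequence of samples into consecutive 60-second windows."""
    total = len(samples)
    if total > MAX_SECONDS * sample_rate:
        # Assumption: over-long audio is rejected rather than truncated.
        raise ValueError("audio exceeds the 10-minute limit")
    window = WINDOW_SECONDS * sample_rate
    if total <= window:
        return [samples]  # short audio goes through as a single window
    return [samples[i:i + window] for i in range(0, total, window)]


def transcribe_long(samples, transcribe, sample_rate=SAMPLE_RATE):
    """Transcribe each window with the supplied callable and join the text."""
    parts = [transcribe(chunk) for chunk in split_into_windows(samples, sample_rate)]
    return " ".join(p.strip() for p in parts if p.strip())
```

With these numbers a 10-minute recording becomes at most ten windows, and the last window is simply shorter when the length is not a multiple of 60 seconds.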

Speech quality fixes: list items spoken by espeak now pause correctly, because a period is inserted at the end of any item that does not already end with punctuation. Transcription double-wrapping, caused by an extra batch dimension returned by load_audio, is fixed by stripping the leading axis. URLs are spoken as their hostname with the dots read aloud, e.g., “google dot com”, rather than being silently dropped.
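The two text-side fixes can be approximated in a few lines of Python. This sketch makes several assumptions: the regular expressions for URLs and list markers, the set of characters counted as punctuation, and the function names speak_url and prepare_for_speech are illustrative and not taken from the Chalie source.

```python
import re
from urllib.parse import urlparse

# Assumed patterns; the real rules may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+")
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")
TERMINAL_PUNCTUATION = ".!?:;,"


def speak_url(match):
    """Replace a URL with its hostname, reading the dots aloud."""
    url = match.group(0)
    stripped = url.rstrip(".,;:!?")       # keep sentence punctuation out of the URL
    trailing = url[len(stripped):]
    host = urlparse(stripped).hostname or stripped
    return host.replace(".", " dot ") + trailing


def prepare_for_speech(text):
    """Rewrite URLs and end unpunctuated list items with a period."""
    lines = []
    for line in text.splitlines():
        line = URL_PATTERN.sub(speak_url, line).rstrip()
        if LIST_ITEM.match(line) and line[-1] not in TERMINAL_PUNCTUATION:
            line += "."                    # gives the synthesizer a pause point
        lines.append(line)
    return "\n".join(lines)
```

For example, prepare_for_speech("- apples\nVisit https://google.com/search now") returns "- apples.\nVisit google dot com now".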

A follow-up fix removes the done: true property from TTS error sentinels so the frontend does not incorrectly terminate, and drops the dead sample_rate parameter. Documentation was updated to reflect the TTS wire format changes. The find_tools SUMMARY was shortened from 191 to 78 characters to reduce per-turn LLM token cost.

Installer now supports Alpine, Arch, Manjaro, and openSUSE, and verifies voice model files post-download.
Voice rebuilt on kokoro-onnx + moonshine-onnx with models bundled in the image, eliminating first-request network delay.
TTS simplified to a single WAV blob; STT supports up to 10 minutes via chunking and distinguishes transient loading errors (503, auto-retried by the frontend) from terminal model-missing errors.
List items are spoken with pauses via automatic period insertion at the end of items that lack punctuation.
Transcription fix: stripped redundant batch dimension from load_audio to prevent 500 errors.
find_tools SUMMARY trimmed from 191 to 78 characters, lowering per-turn decision cost.