Use speech models instantly with no complex setup. Built on MLX for high-performance streaming and real-time processing entirely on-device. 100% Offline. No data leaves your machine.
Zero configuration. Just uv run and you are streaming in seconds.
Optimized specifically for Apple Silicon hardware for maximum efficiency.
Low-latency architecture lets you see words the moment they are spoken. Uses a rolling-window buffer to prioritize speed.
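The rolling-window idea can be sketched in a few lines: keep only the most recent N seconds of audio so each transcription pass works on a small, fixed-size window. This is an illustrative sketch, not the project's actual implementation; the class name, the window length, and the 16 kHz sample rate are all assumptions.

```python
from collections import deque


class RollingWindowBuffer:
    """Fixed-capacity audio buffer: the oldest samples are evicted as new
    ones arrive, so transcription always sees the most recent window."""

    def __init__(self, window_seconds: float = 10.0, sample_rate: int = 16_000):
        # deque with maxlen drops items from the left once full.
        self._buf = deque(maxlen=int(window_seconds * sample_rate))

    def push(self, samples):
        # Append a new chunk of samples; old ones fall out automatically.
        self._buf.extend(samples)

    def window(self):
        # Snapshot of the current window, ready to hand to the ASR model.
        return list(self._buf)


# Tiny demonstration: a 2-second window at a toy 4 Hz sample rate (8 samples)
# fed three 1-second chunks. The first chunk is evicted entirely.
buf = RollingWindowBuffer(window_seconds=2.0, sample_rate=4)
for chunk_id in range(3):
    buf.push([chunk_id] * 4)
print(buf.window())  # only chunks 1 and 2 remain
```

Bounding the buffer this way keeps per-pass latency constant no matter how long the session runs, which is the trade the copy above describes: speed over an unbounded history.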
Analyze intent with models like Qwen3 on the fly as the audio streams.
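The streaming-analysis loop can be sketched as below. Everything here is hypothetical scaffolding: `classify_intent` stands in for a real LLM call (e.g. Qwen3 served locally), replaced with a trivial keyword heuristic so the sketch stays self-contained and runnable.

```python
def classify_intent(utterance: str) -> str:
    """Stand-in for an on-device LLM call; a keyword heuristic keeps the
    sketch dependency-free. A real pipeline would prompt the model instead."""
    text = utterance.lower()
    if any(word in text for word in ("play", "pause", "stop")):
        return "media_control"
    if text.rstrip().endswith("?"):
        return "question"
    return "statement"


def stream_intents(transcript_chunks):
    # As finalized utterances stream out of the ASR, tag each with an intent
    # immediately rather than waiting for the full recording to end.
    for utterance in transcript_chunks:
        yield utterance, classify_intent(utterance)


# Simulated live transcript chunks arriving from the recognizer.
live_transcript = ["play some jazz", "what time is it?"]
for utterance, intent in stream_intents(live_transcript):
    print(f"{intent}: {utterance}")
```

The generator shape matters: intents are emitted per utterance as audio streams in, which is what makes on-the-fly analysis possible.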
Swap out ASR or LLM models, fine-tune VAD sensitivity for noisy environments, select specific audio hardware, and pipe clean text output directly into other CLI tools.
The current default model. Built on the Qwen3-Omni foundation, it supports robust speech recognition across 52 languages and dialects and delivers high accuracy in complex acoustic environments.
Hugging Face →