model support per CrispASR — pure C++ inference with GGUF quantisation
#40
by cstr - opened
We've built a complete C++ runtime for Voxtral-Mini-4B-Realtime in CrispASR, using ggml for inference. One binary, one GGUF file — no Python, no PyTorch.
What works:
- Decent performance (e.g. 3.8x faster than voxtral.c on a 4-core Intel Xeon, no GPU)
- Full transcription (causal RoPE encoder + 3.4B LLM with streaming audio injection)
- GGUF quantisation — Q4_K shrinks the model from 8.3 GB to ~2.5 GB
- Temperature sampling + best-of-N
- Streaming from mic/stdin (--stream, --mic)
- GPU acceleration via CUDA / Metal / Vulkan
- Word timestamps via forced alignment (-am qwen3-forced-aligner.gguf)
- Speaker diarisation, language ID, all output formats
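For anyone wondering where the ~2.5 GB figure comes from, here is a rough back-of-the-envelope sketch. It assumes the 8.3 GB original stores weights at 2 bytes each (BF16/F16), which is an assumption and not something stated above. It uses ggml's Q4_K layout of 144 bytes per super-block of 256 weights. The function name `estimate_q4k_size_gb` is purely illustrative and not part of CrispASR.

```python
# Rough size estimate for a Q4_K-quantised model (illustrative only).
# ggml's Q4_K packs 256 weights into a 144-byte super-block.
Q4_K_BLOCK_BYTES = 144
Q4_K_BLOCK_WEIGHTS = 256


def estimate_q4k_size_gb(source_size_gb, source_bytes_per_weight=2.0):
    """Estimate the size if every weight were stored as Q4_K.

    source_bytes_per_weight=2.0 assumes a BF16/F16 source checkpoint
    (an assumption; adjust if the original uses another dtype).
    """
    n_weights = source_size_gb * 1e9 / source_bytes_per_weight
    quantised_bytes = n_weights * Q4_K_BLOCK_BYTES / Q4_K_BLOCK_WEIGHTS
    return quantised_bytes / 1e9


bits_per_weight = 8 * Q4_K_BLOCK_BYTES / Q4_K_BLOCK_WEIGHTS  # 4.5 bits
estimate = estimate_q4k_size_gb(8.3)

print(f"Q4_K uses {bits_per_weight} bits per weight")
print(f"Estimated pure-Q4_K size: {estimate:.2f} GB")
```

Under these assumptions the estimate comes out at about 2.33 GB. The gap to the actual ~2.5 GB file would be consistent with some tensors being kept at higher precision, plus metadata, though that is a guess and not something measured here.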
Quick start:
git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend voxtral4b -m auto -f audio.wav
Pre-quantised GGUFs: cstr/voxtral-mini-4b-realtime-GGUF
CrispASR supports 11 ASR backends total — pick the right one for your use case with a single --backend flag.
cstr changed discussion title from CrispASR — pure C++ inference with GGUF quantisation, 3.8x faster than voxtral.c on CPU to CrispASR — pure C++ inference with GGUF quantisation
cstr changed discussion title from CrispASR — pure C++ inference with GGUF quantisation to model support per CrispASR — pure C++ inference with GGUF quantisation
thanks