model support per CrispASR — pure C++ inference with GGUF quantisation

#40
by cstr - opened

We've built a complete C++ runtime for Voxtral-Mini-4B-Realtime in CrispASR, using ggml for inference. One binary, one GGUF file — no Python, no PyTorch.

What works:

  • Decent performance (e.g. 3.8x faster than voxtral.c on a 4-core Intel Xeon, no GPU)
  • Full transcription (causal RoPE encoder + 3.4B LLM with streaming audio injection)
  • GGUF quantisation — Q4_K shrinks the model from 8.3 GB to ~2.5 GB
  • Temperature sampling + best-of-N
  • Streaming from mic/stdin (--stream, --mic)
  • GPU acceleration via CUDA / Metal / Vulkan
  • Word timestamps via forced alignment (-am qwen3-forced-aligner.gguf)
  • Speaker diarisation, language ID, all output formats
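
For anyone wondering what "temperature sampling + best-of-N" means in practice, here is a minimal sketch of the general technique. It is not CrispASR's actual code: `LogitsFn`, `Candidate`, `sample_once` and `best_of_n` are hypothetical names made up for illustration, and ranking candidates by mean log-probability under the tempered distribution is an assumption about one common scoring choice.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Softmax over logits divided by the temperature (must be > 0).
// Lower temperatures sharpen the distribution, higher ones flatten it.
std::vector<double> softmax_with_temperature(const std::vector<float>& logits,
                                             float temperature) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtracting the max keeps exp() from overflowing.
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// One sampled hypothesis plus the statistics used to rank it.
struct Candidate {
    std::vector<int> tokens;
    double logprob = 0.0;  // sum of log-probabilities of the sampled tokens
    int steps = 0;         // number of sampling steps taken (EOS included)
    double mean_logprob() const { return steps > 0 ? logprob / steps : 0.0; }
};

// Hypothetical stand-in for a decoder step: token prefix -> next-token logits.
using LogitsFn = std::function<std::vector<float>(const std::vector<int>&)>;

// Draw one hypothesis token by token until EOS or max_tokens is reached.
Candidate sample_once(const LogitsFn& next_logits, int eos_id, int max_tokens,
                      float temperature, std::mt19937& rng) {
    Candidate c;
    for (int step = 0; step < max_tokens; ++step) {
        const std::vector<double> probs =
            softmax_with_temperature(next_logits(c.tokens), temperature);
        std::discrete_distribution<int> dist(probs.begin(), probs.end());
        const int tok = dist(rng);
        c.logprob += std::log(probs[tok]);
        ++c.steps;
        if (tok == eos_id) break;
        c.tokens.push_back(tok);
    }
    return c;
}

// Best-of-N: sample n hypotheses and keep the one with the highest
// mean log-probability per token.
Candidate best_of_n(const LogitsFn& next_logits, int eos_id, int max_tokens,
                    float temperature, int n, std::mt19937& rng) {
    Candidate best = sample_once(next_logits, eos_id, max_tokens, temperature, rng);
    for (int i = 1; i < n; ++i) {
        Candidate c = sample_once(next_logits, eos_id, max_tokens, temperature, rng);
        if (c.mean_logprob() > best.mean_logprob()) best = std::move(c);
    }
    return best;
}
```

In a real decoder the `LogitsFn` would wrap a forward pass of the model. Averaging the log-probability over the number of steps keeps longer hypotheses from being penalised purely for their length.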

Quick start:

git clone https://github.com/CrispStrobe/CrispASR && cd CrispASR
cmake -S . -B build && cmake --build build -j8
./build/bin/crispasr --backend voxtral4b -m auto -f audio.wav

Pre-quantised GGUFs: cstr/voxtral-mini-4b-realtime-GGUF

CrispASR supports 11 ASR backends in total; pick the right one for your use case with the --backend flag.


thanks
