MERaLiON-2-10B-TurboQuant-MLX-8bit

MLX 8-bit TurboQuant quantization of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B for Apple Silicon inference.

TurboQuant applies mixed-precision quantization that preserves critical attention layers at higher precision while aggressively quantizing less sensitive feed-forward layers, optimizing for speed without sacrificing quality.

Model Specifications

Property	Value
Base Model	aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B
Parameters	~10B
Architecture	Whisper encoder + Gemma-2-9B-IT decoder
Quantization	TurboQuant 8-bit (MLX)
Disk Size	~10 GB
Peak RAM	~11 GB
License	Apache 2.0
Task	Automatic Speech Recognition / Speech-to-Text

Quickstart

Installation

pip install mlx-lm mlx-whisper

Inference

from mlx_lm import load, generate
from mlx_lm.cache import TurboQuantCache

model, tokenizer = load("majentik/MERaLiON-2-10B-TurboQuant-MLX-8bit")

# Create TurboQuant-optimized KV cache
cache = TurboQuantCache(model)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following audio."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    cache=cache,
)
print(response)

Quantization Details

TurboQuant is a mixed-precision quantization strategy that:

Retains attention projection layers at higher precision (8-bit)
Quantizes MLP/feed-forward layers more aggressively where precision loss is tolerable
Optimizes KV-cache memory layout for faster autoregressive decoding on Apple Silicon

This 8-bit variant offers the best quality preservation among the quantized variants while still providing significant memory savings over the full-precision model.

Supported Languages

MERaLiON-2 supports speech recognition in Southeast Asian languages including English, Mandarin Chinese, Malay, Tamil, and Indonesian.

Memory Estimates

Device	Feasibility
MacBook Air M1 (8 GB)	Not recommended
MacBook Pro M1/M2 (16 GB)	Feasible with limited headroom
MacBook Pro M2/M3 (32 GB)	Comfortable
Mac Studio M2 Ultra (64 GB+)	Recommended for production

Quant trade-off (MLX lane)

Bits	Approx size	Use case	Recommendation
2-bit	~2.6 GB	Aggressive quantization	Very low-RAM Macs
3-bit	~3.6 GB	Lossy but small	Low-RAM Macs
4-bit	~4.2 GB	Balanced default	Recommended for most Macs
5-bit	~5.0 GB	Higher fidelity	Quality-sensitive
6-bit	~6.0 GB	Approaching FP16 quality	High-fidelity
8-bit	~7.6 GB	Near-lossless reference	Fidelity-critical work

(Current variant — 8bit — is bolded.)

Variants in this family

(Showing 8 sibling variants under majentik/meralion2-10b-*. The current variant — TurboQuant-MLX-8bit — is bolded.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-MLX-2bit	mlx-lm	~3.2 GB	Apple Silicon, smallest
RotorQuant-MLX-4bit	mlx-lm	~6.2 GB	Apple Silicon balanced
RotorQuant-MLX-8bit	mlx-lm	~12 GB	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-MLX-2bit	mlx-lm	~3.2 GB	Apple Silicon, smallest
TurboQuant-MLX-4bit	mlx-lm	~6.2 GB	Apple Silicon balanced
TurboQuant-MLX-8bit	mlx-lm	~12 GB	Apple Silicon reference

Downloads last month: 37

MLX

Hardware compatibility

8-bit

majentik
/

MERaLiON-2-10B-TurboQuant-MLX-8bit