MERaLiON-2-10B-TurboQuant-MLX-4bit

MLX 4-bit TurboQuant quantization of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B for Apple Silicon inference.

TurboQuant applies mixed-precision quantization: it keeps the critical attention layers at higher precision and quantizes the less sensitive feed-forward layers more aggressively, aiming for faster inference and lower memory use with limited quality loss.

Model Specifications

Property	Value
Base Model	aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B
Parameters	~10B
Architecture	Whisper encoder + Gemma-2-9B-IT decoder
Quantization	TurboQuant 4-bit (MLX)
Disk Size	~5 GB
Peak RAM	~6 GB
License	Apache 2.0
Task	Automatic Speech Recognition / Speech-to-Text

Quickstart

Installation

pip install mlx-lm mlx-whisper
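Because this checkpoint targets Apple Silicon, it can be useful to confirm the Python interpreter is running natively on arm64 before installing. The following is a minimal sketch using only the standard library; the helper name `is_apple_silicon` is hypothetical and not part of any package mentioned here.

```python
import platform


def is_apple_silicon() -> bool:
    """Return True when Python runs natively on an Apple Silicon Mac."""
    # A Python build running under Rosetta reports "x86_64" here,
    # so this also catches a non-native interpreter on an M-series Mac.
    return platform.system() == "Darwin" and platform.machine() == "arm64"


if __name__ == "__main__":
    print("Apple Silicon detected:", is_apple_silicon())
```

If this prints `False` on an M-series Mac, the interpreter is likely an x86_64 build, and installing a native arm64 Python is usually the first thing to try.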

Inference

from mlx_lm import load, generate
from mlx_lm.cache import TurboQuantCache

model, tokenizer = load("majentik/MERaLiON-2-10B-TurboQuant-MLX-4bit")

# Create TurboQuant-optimized KV cache
cache = TurboQuantCache(model)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following audio."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    cache=cache,
)
print(response)

Quantization Details

TurboQuant is a mixed-precision quantization strategy that:

Retains attention projection layers at higher precision
Quantizes MLP/feed-forward layers more aggressively where precision loss is tolerable
Optimizes KV-cache memory layout for faster autoregressive decoding on Apple Silicon

This 4-bit variant balances model quality against memory efficiency: it needs roughly half the memory of the 8-bit variant, and the expected quality degradation is small.
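The mixed-precision idea described above can be sketched as a per-layer rule that maps a layer path to quantization settings. This is an illustration only: the bit widths, group size, and layer-name patterns below are assumptions based on common decoder naming, not the values or names used to produce this checkpoint, and `quant_config_for` is a hypothetical helper.

```python
import re

# Hypothetical settings, chosen for illustration only.
ATTENTION_BITS = 8   # attention projections kept at higher precision
MLP_BITS = 4         # feed-forward layers quantized more aggressively
GROUP_SIZE = 64

# Assumed layer-name suffixes (common in decoder-only models, not confirmed here).
ATTENTION_PATTERN = re.compile(r"\.(q_proj|k_proj|v_proj|o_proj)$")
MLP_PATTERN = re.compile(r"\.(gate_proj|up_proj|down_proj)$")


def quant_config_for(layer_path: str):
    """Return quantization settings for a layer path, or None to skip it."""
    if ATTENTION_PATTERN.search(layer_path):
        return {"bits": ATTENTION_BITS, "group_size": GROUP_SIZE}
    if MLP_PATTERN.search(layer_path):
        return {"bits": MLP_BITS, "group_size": GROUP_SIZE}
    # Anything else (norms, embeddings) is left unquantized in this sketch.
    return None


if __name__ == "__main__":
    for path in (
        "model.layers.0.self_attn.q_proj",
        "model.layers.0.mlp.down_proj",
        "model.layers.0.input_layernorm",
    ):
        print(path, "->", quant_config_for(path))
```

A function of this shape could serve as the basis for a per-layer quantization predicate; wiring it into a specific conversion tool is not shown here.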

Supported Languages

MERaLiON-2 supports speech recognition in languages spoken across Southeast Asia, including English, Mandarin Chinese, Malay, Tamil, and Indonesian.

Memory Estimates

Device	Feasibility
MacBook Air M1 (8 GB)	Feasible with limited headroom
MacBook Pro M1/M2 (16 GB)	Comfortable
MacBook Pro M2/M3 (32 GB)	Recommended
Mac Studio M2 Ultra (64 GB+)	Ideal for production

Quantization trade-offs (MLX)

Bits	Approx size	Use case	Recommendation
2-bit	~2.6 GB	Aggressive quantization	Very low-RAM Macs
3-bit	~3.6 GB	Lossy but small	Low-RAM Macs
4-bit	~4.2 GB	Balanced default	Recommended for most Macs
5-bit	~5.0 GB	Higher fidelity	Quality-sensitive
6-bit	~6.0 GB	Approaching FP16 quality	High-fidelity
8-bit	~7.6 GB	Near-lossless reference	Fidelity-critical work

(The current variant is the 4-bit row.)
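As a rough cross-check on sizes like those in the table, weight storage can be estimated from the parameter count and bit width. The sketch below assumes every parameter is quantized uniformly and that each group of 64 weights stores a 16-bit scale and a 16-bit bias; both are assumptions, so the results will not match the published figures exactly, since those depend on which layers are quantized and at what precision. `estimate_weight_gb` is a hypothetical helper.

```python
def estimate_weight_gb(
    num_params: float,
    bits: int,
    group_size: int = 64,
    overhead_bits: int = 32,
) -> float:
    """Back-of-envelope size of uniformly quantized weights, in GB (10^9 bytes)."""
    # Per-group scale and bias (assumed 16 bits each) are amortized over the group.
    bits_per_weight = bits + overhead_bits / group_size
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (2, 3, 4, 5, 6, 8):
        print(f"{bits}-bit: ~{estimate_weight_gb(10e9, bits):.1f} GB")
```

Setting `overhead_bits=0` gives the simpler `parameters × bits / 8` estimate, which for ~10B parameters at 4 bits is about 5 GB.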

Variants in this family

(Showing 8 sibling variants under majentik/meralion2-10b-*. The current variant is TurboQuant-MLX-4bit.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-MLX-2bit	mlx-lm	~3.2 GB	Apple Silicon, smallest
RotorQuant-MLX-4bit	mlx-lm	~6.2 GB	Apple Silicon balanced
RotorQuant-MLX-8bit	mlx-lm	~12 GB	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-MLX-2bit	mlx-lm	~3.2 GB	Apple Silicon, smallest
TurboQuant-MLX-4bit	mlx-lm	~6.2 GB	Apple Silicon balanced
TurboQuant-MLX-8bit	mlx-lm	~12 GB	Apple Silicon reference
