MERaLiON-2-10B-TurboQuant-MLX-4bit

MLX 4-bit TurboQuant quantization of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B for Apple Silicon inference.

TurboQuant applies mixed-precision quantization: it keeps the attention layers, which are most sensitive to precision loss, at higher precision and quantizes the less sensitive feed-forward layers more aggressively. The aim is faster inference with little loss in quality.

Model Specifications

| Property | Value |
|---|---|
| Base Model | aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B |
| Parameters | ~10B |
| Architecture | Whisper encoder + Gemma-2-9B-IT decoder |
| Quantization | TurboQuant 4-bit (MLX) |
| Disk Size | ~5 GB |
| Peak RAM | ~6 GB |
| License | Apache 2.0 |
| Task | Automatic Speech Recognition / Speech-to-Text |

Quickstart

Installation

pip install mlx-lm mlx-whisper

Inference

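from mlx_lm import load, generate
# TurboQuantCache is used here as given by this card. It assumes an mlx-lm
# build that ships this class; if the import fails, your installed mlx-lm
# does not provide it.
from mlx_lm.cache import TurboQuantCache

# Download (on first use) and load the quantized weights and tokenizer
model, tokenizer = load("majentik/MERaLiON-2-10B-TurboQuant-MLX-4bit")

# Create TurboQuant-optimized KV cache
cache = TurboQuantCache(model)

# Build a chat-formatted prompt string. Audio input is not shown in this
# snippet; only the text instruction is passed.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following audio."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate up to 512 new tokens, reusing the cache created above
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    cache=cache,
)
print(response)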

Quantization Details

TurboQuant is a mixed-precision quantization strategy that:

  • Retains attention projection layers at higher precision
  • Quantizes MLP/feed-forward layers more aggressively where precision loss is tolerable
  • Optimizes KV-cache memory layout for faster autoregressive decoding on Apple Silicon
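The first two points can be pictured as a per-layer bit-width policy. The following is a minimal sketch only: the bit-widths (8 for attention, 4 elsewhere) and the layer-name patterns (Gemma-2-style `self_attn.*_proj` and `mlp.*_proj`) are assumptions for illustration, and `bits_for_layer` is a hypothetical helper, not part of TurboQuant or mlx-lm.

```python
import re

# Hypothetical bit-widths: this card does not state the exact values used.
ATTENTION_BITS = 8
MLP_BITS = 4
DEFAULT_BITS = 4

# Assumed layer naming, following common Gemma-2-style module paths.
ATTENTION_PATTERN = re.compile(r"\.self_attn\.(q|k|v|o)_proj$")
MLP_PATTERN = re.compile(r"\.mlp\.(gate|up|down)_proj$")


def bits_for_layer(name: str) -> int:
    """Return the bit-width this illustrative policy assigns to a layer."""
    if ATTENTION_PATTERN.search(name):
        return ATTENTION_BITS  # keep attention projections at higher precision
    if MLP_PATTERN.search(name):
        return MLP_BITS  # quantize feed-forward layers more aggressively
    return DEFAULT_BITS  # everything else falls back to the default


layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "model.embed_tokens",
]
plan = {name: bits_for_layer(name) for name in layers}
for name, bits in plan.items():
    print(f"{name}: {bits}-bit")
```

A name-based policy like this keeps the precision decision separate from the quantization kernel, so the same rule can be reused when estimating sizes or inspecting a checkpoint.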

This 4-bit variant balances model quality against memory efficiency: it roughly halves the memory footprint of the 8-bit variant with minimal quality degradation.
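As a rough cross-check of the sizes quoted in this card, weight storage can be estimated from the parameter count and the bits per weight. The sketch below assumes uniform group-wise affine quantization with one 16-bit scale and one 16-bit bias per group of 64 weights (these defaults are an assumption, not something this card states), and `quantized_size_gb` is a hypothetical helper. A mixed-precision scheme will land somewhere above the uniform estimate.

```python
def quantized_size_gb(n_params, bits, group_size=64, scale_bits=16, bias_bits=16):
    """Estimate weight storage in decimal GB for group-wise quantization.

    Pass group_size=None to ignore the per-group scale/bias overhead.
    """
    overhead_bits = 0.0
    if group_size:
        # Each group stores one scale and one bias shared by group_size weights.
        overhead_bits = (scale_bits + bias_bits) / group_size
    total_bits = n_params * (bits + overhead_bits)
    return total_bits / 8 / 1e9


N_PARAMS = 10e9  # "~10B" from the specifications above

for bits in (4, 8):
    raw = quantized_size_gb(N_PARAMS, bits, group_size=None)
    grouped = quantized_size_gb(N_PARAMS, bits)
    print(f"{bits}-bit: {raw:.2f} GB raw, {grouped:.2f} GB with group overhead")
```

Under these assumptions the raw 4-bit figure is 5 GB, in line with the disk size listed above, and the 4-bit estimate stays at roughly half of the 8-bit one with or without the per-group overhead.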

Supported Languages

MERaLiON-2 supports speech recognition in languages widely spoken in Southeast Asia, including English, Mandarin Chinese, Malay, Tamil, and Indonesian.

Memory Estimates

| Device | Feasibility |
|---|---|
| MacBook Air M1 (8 GB) | Feasible with limited headroom |
| MacBook Pro M1/M2 (16 GB) | Comfortable |
| MacBook Pro M2/M3 (32 GB) | Recommended |
| Mac Studio M2 Ultra (64 GB+) | Ideal for production |
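Headroom matters because the KV cache grows with context length on top of the model weights. The sketch below estimates the size of an unquantized 16-bit cache. The decoder shape (42 layers, 8 KV heads, head dimension 256) is an assumed Gemma-2-9B-like configuration, not a value taken from this card, and the estimate ignores sliding-window attention and any cache compression TurboQuant applies. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The leading factor 2 covers one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed decoder shape (Gemma-2-9B-like); adjust to the real model config.
ASSUMED_SHAPE = dict(n_layers=42, n_kv_heads=8, head_dim=256)

for seq_len in (1024, 4096, 8192):
    size_gb = kv_cache_bytes(seq_len=seq_len, **ASSUMED_SHAPE) / 1e9
    print(f"{seq_len} tokens: ~{size_gb:.2f} GB of 16-bit KV cache")
```

On an 8 GB machine, a cache of this size on top of the ~6 GB peak listed above leaves little room for the operating system, which is consistent with the "limited headroom" entry in the table.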

See Also

Quant trade-off (MLX lane)

| Bits | Approx size | Use case | Recommendation |
|---|---|---|---|
| 2-bit | ~2.6 GB | Aggressive quantization | Very low-RAM Macs |
| 3-bit | ~3.6 GB | Lossy but small | Low-RAM Macs |
| **4-bit** | **~4.2 GB** | **Balanced default** | **Recommended for most Macs** |
| 5-bit | ~5.0 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~6.0 GB | Approaching FP16 quality | High-fidelity |
| 8-bit | ~7.6 GB | Near-lossless reference | Fidelity-critical work |

(The current variant, 4-bit, is bolded.)

Variants in this family

(Showing 8 sibling variants under majentik/meralion2-10b-*. The current variant, TurboQuant-MLX-4bit, is bolded.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-MLX-2bit | mlx-lm | ~3.2 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~6.2 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~12 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-MLX-2bit | mlx-lm | ~3.2 GB | Apple Silicon, smallest |
| **TurboQuant-MLX-4bit** | **mlx-lm** | **~6.2 GB** | **Apple Silicon balanced** |
| TurboQuant-MLX-8bit | mlx-lm | ~12 GB | Apple Silicon reference |