MERaLiON-2-10B-RotorQuant-MLX-2bit

MLX 2-bit RotorQuant quantization of aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B for Apple Silicon inference.

RotorQuant applies rotation-based quantization that decorrelates weight matrices before quantization, distributing outlier magnitudes more evenly across channels for improved accuracy at low bit-widths.

Model Specifications

Property       Value
Base Model     aisingapore/MERaLiON-AudioLLM-Whisper-SEA-LION-V3-10B
Parameters     ~10B
Architecture   Whisper encoder + Gemma-2-9B-IT decoder
Quantization   RotorQuant 2-bit (MLX)
Disk Size      ~3 GB
Peak RAM       ~4 GB
License        Apache 2.0
Task           Automatic Speech Recognition / Speech-to-Text

Quickstart

Installation

pip install mlx-lm mlx-whisper

Inference

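from mlx_lm import load, generate
# IsoQuantCache is imported as given in this card; the import fails if the
# installed mlx-lm build does not provide mlx_lm.cache.IsoQuantCache.
from mlx_lm.cache import IsoQuantCache

# Download (or load from the local cache) the quantized weights and tokenizer.
model, tokenizer = load("majentik/MERaLiON-2-10B-RotorQuant-MLX-2bit")

# Create IsoQuantCache for RotorQuant models
cache = IsoQuantCache(model)

# Build a chat-formatted prompt. Note: this snippet sends a text-only
# instruction; no audio input is attached here.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Transcribe the following audio."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate up to 512 new tokens using the quantized KV cache.
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    cache=cache,
)
print(response)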

Quantization Details

RotorQuant is a rotation-based quantization strategy that:

  • Applies learned rotation matrices to decorrelate weight channels before quantization
  • Reduces the impact of outlier weights that typically degrade quantized model quality
  • Provides more uniform weight distributions, leading to better accuracy retention
  • Pairs with IsoQuantCache for consistent KV-cache quantization during inference

This 2-bit variant offers the smallest memory footprint. RotorQuant's rotation-based approach is especially valuable at 2-bit, where outlier sensitivity causes the most quality degradation in naive quantization schemes.
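
The effect described above can be illustrated with a small NumPy experiment. This is only a sketch of the general idea, not RotorQuant's actual algorithm: it swaps the learned rotation for a random orthogonal matrix, uses simple per-row min/max quantization, and the helper names (`quantize_dequantize`, `rel_error`) and the synthetic weight matrix are assumptions made for illustration.

```python
import numpy as np


def quantize_dequantize(w, bits=2):
    """Per-row affine (min/max) quantization, then dequantization."""
    steps = 2**bits - 1                      # 2-bit -> 4 levels, 3 steps
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / steps
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.round((w - lo) / scale)            # integer codes in [0, steps]
    return q * scale + lo


def rel_error(approx, exact):
    """Relative Frobenius-norm reconstruction error."""
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)


rng = np.random.default_rng(0)

# Synthetic weights: mostly unit-scale values plus a few outlier channels.
w = rng.standard_normal((64, 256))
w[:, :4] *= 30.0

# Stand-in for a learned rotation: a random orthogonal matrix from QR.
rot, _ = np.linalg.qr(rng.standard_normal((256, 256)))

# Naive: quantize the weights directly.
naive = quantize_dequantize(w)

# Rotated: rotate, quantize, then undo the rotation (rot is orthogonal).
rotated = quantize_dequantize(w @ rot) @ rot.T

naive_err = rel_error(naive, w)
rotated_err = rel_error(rotated, w)
print(f"naive 2-bit relative error:   {naive_err:.3f}")
print(f"rotated 2-bit relative error: {rotated_err:.3f}")
```

In the naive case the outlier channels stretch each row's quantization range, so the four available levels are spent covering a span that most weights never use. Rotating first spreads the outlier energy across all channels, which narrows the range relative to the typical weight and lowers the reconstruction error.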

Supported Languages

MERaLiON-2 supports speech recognition in languages widely spoken across Southeast Asia, including English, Mandarin Chinese, Malay, Tamil, and Indonesian.

Memory Estimates

Device                         Feasibility
MacBook Air M1 (8 GB)          Feasible
MacBook Pro M1/M2 (16 GB)      Comfortable
MacBook Pro M2/M3 (32 GB)      Ideal
Mac Studio M2 Ultra (64 GB+)   Overkill for this variant
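
As a rough cross-check on the figures above, weight storage can be estimated as parameter count times effective bits per weight. The sketch below assumes group-wise affine quantization with a group size of 64 and a 16-bit scale plus a 16-bit bias per group; that packing is an assumption, not something this card states, and the helper name `estimate_weight_gb` is hypothetical. It ignores the KV cache, activations, and any layers kept at higher precision.

```python
def estimate_weight_gb(n_params, bits, group_size=64, overhead_bits=32):
    """Estimate quantized weight storage in GB (1 GB = 1e9 bytes).

    overhead_bits is the per-group metadata (assumed: fp16 scale + fp16 bias),
    amortized over group_size weights.
    """
    effective_bits = bits + overhead_bits / group_size
    return n_params * effective_bits / 8 / 1e9


for bits in (2, 4, 8):
    size = estimate_weight_gb(10e9, bits)
    print(f"{bits}-bit: ~{size:.2f} GB of weights")
```

Under these assumptions a 10B-parameter model at 2 bits needs roughly 3.1 GB for weights alone, which is consistent with an 8 GB machine being workable but leaving limited headroom.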

See Also

Quant trade-off (MLX lane)

Bits    Approx size   Use case                   Recommendation
2-bit   ~2.6 GB       Aggressive quantization    Very low-RAM Macs
3-bit   ~3.6 GB       Lossy but small            Low-RAM Macs
4-bit   ~4.2 GB       Balanced default           Recommended for most Macs
5-bit   ~5.0 GB       Higher fidelity            Quality-sensitive
6-bit   ~6.0 GB       Approaching FP16 quality   High-fidelity
8-bit   ~7.6 GB       Near-lossless reference    Fidelity-critical work

(The current variant is 2-bit.)

Variants in this family

(Showing 8 sibling variants under majentik/meralion2-10b-*. The current variant is RotorQuant-MLX-2bit.)

Variant               Runtime            Approx size   Use case
RotorQuant            runtime modifier   n/a           KV-cache root (weight-agnostic)
RotorQuant-MLX-2bit   mlx-lm             ~3.2 GB       Apple Silicon, smallest
RotorQuant-MLX-4bit   mlx-lm             ~6.2 GB       Apple Silicon balanced
RotorQuant-MLX-8bit   mlx-lm             ~12 GB        Apple Silicon reference
TurboQuant            runtime modifier   n/a           KV-cache root (weight-agnostic)
TurboQuant-MLX-2bit   mlx-lm             ~3.2 GB       Apple Silicon, smallest
TurboQuant-MLX-4bit   mlx-lm             ~6.2 GB       Apple Silicon balanced
TurboQuant-MLX-8bit   mlx-lm             ~12 GB        Apple Silicon reference