# PersonaPlex 7B – TurboQuant 2-bit (NF2 + Walsh-Hadamard)
Ultra-compressed version of nvidia/personaplex-7b-v1 for deployment on edge devices like NVIDIA Jetson AGX Orin.
Quantized using a TurboQuant-inspired method combining Walsh-Hadamard Transform rotation with Normal Float 2-bit (NF2) optimal scalar quantization.
## Compression Results
| Metric | Original (bf16) | INT4 NF4 | This: NF2+WHT |
|---|---|---|---|
| Size | 15.59 GB | 4.14 GB | 2.07 GB |
| Compression | 1x | 3.8x | 7.5x |
| Eff. bits/weight | 16 | ~4.2 | ~2.13 |
| Cosine similarity | 1.0 | 0.994 | 0.944 |
| Model load test | PASS | PASS | PASS (0 missing keys) |
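The ~2.13 effective bits/weight figure follows directly from the storage layout (2-bit codes plus one fp16 scale shared by each 128-weight group), which a one-line calculation confirms:

```python
# Effective bits per weight: 2-bit code + one fp16 scale per 128-weight group
group_size = 128
eff_bits = 2 + 16 / group_size
print(eff_bits)  # 2.125, matching the ~2.13 reported above
```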
## Quantization Method
TurboQuant v2 – adapted from Google Research TurboQuant (arXiv:2504.19874):
- Per-group normalization (group_size=128): RMS scaling to unit variance
- Walsh-Hadamard Transform: Random rotation to spread weight outliers uniformly across dimensions, making the distribution more Gaussian
- NF2 quantization: Lloyd-Max optimal 4-level scalar quantizer for Gaussian distributions
  - Centroids: {-1.5104, -0.4528, 0.4528, 1.5104}
  - Decision boundaries: {-0.9816, 0.0, 0.9816}
- Packed storage: 4 codes per byte (2 bits each) + fp16 per-group scales
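The per-group pipeline above (RMS-normalize → rotate → NF2-code → pack) can be sketched as follows. This is a minimal illustration using the centroids and boundaries listed above; the function names are hypothetical, and the repo's real implementation lives in `linear2bit.py`:

```python
import torch

# NF2 constants from this card (Lloyd-Max optimal 4-level quantizer for N(0, 1))
CENTROIDS = torch.tensor([-1.5104, -0.4528, 0.4528, 1.5104])
BOUNDARIES = torch.tensor([-0.9816, 0.0, 0.9816])

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Walsh-Hadamard matrix via Sylvester construction (n = power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / n ** 0.5

def quantize_group(w: torch.Tensor):
    """Quantize one 128-weight group: RMS-normalize, rotate, emit 2-bit NF2 codes."""
    scale = w.pow(2).mean().sqrt()          # RMS scaling to unit variance
    x = hadamard(w.numel()) @ (w / scale)   # rotation spreads outliers, Gaussianizing x
    codes = torch.bucketize(x, BOUNDARIES)  # maps each value to its NF2 centroid index
    return codes.to(torch.uint8), scale.to(torch.float16)

def pack(codes: torch.Tensor) -> torch.Tensor:
    """Pack four 2-bit codes into each byte."""
    c = codes.view(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
```

Dequantization reverses the steps: look up each code's centroid, multiply by the group's fp16 scale, and apply the inverse (transposed) Hadamard rotation.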
**What stays in fp16 (77 tensors):** Layer norms, biases, embeddings, positional encodings, RoPE – these are sensitive to quantization and small enough to keep full precision.
## Per-Tensor Quality
## Target Hardware
| Device | VRAM/Memory | Weight Load Time | Real-time? |
|---|---|---|---|
| Jetson AGX Orin 64GB | 64 GB unified | ~10ms (204.8 GB/s) | Yes with INT2 GEMM |
| Jetson AGX Orin 32GB | 32 GB unified | ~10ms | Yes with INT2 GEMM |
| RTX 3060 12GB | 12 GB | ~3ms (360 GB/s) | Yes |
| RTX 4090 24GB | 24 GB | ~1.3ms (1 TB/s) | Yes |
PersonaPlex requires 12.5 frames/second (80ms/frame budget). At 2.07 GB, weight loading is well under budget on any modern GPU.
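The budget claim is easy to check for the slowest device listed: streaming all 2.07 GB of packed weights at the Jetson AGX Orin's 204.8 GB/s takes roughly 10 ms, about an eighth of the frame budget.

```python
# Sanity check: streaming every packed weight once per frame on the slowest listed device
size_gb = 2.07                 # packed 2-bit model size
bandwidth_gbs = 204.8          # Jetson AGX Orin unified-memory bandwidth
frame_budget_ms = 1000 / 12.5  # 12.5 frames/s -> 80 ms per frame
load_ms = size_gb / bandwidth_gbs * 1000
print(f"{load_ms:.1f} ms of {frame_budget_ms:.0f} ms budget")  # 10.1 ms of 80 ms budget
```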
## Quick Start – Native 2-Bit Inference
This path keeps the packed 2-bit weights on the GPU and dequantizes per matmul: ~10 GB peak VRAM vs ~19 GB for bf16.
```python
import torch
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# Download this repo's files
weight_path = hf_hub_download("cudabenchmarktest/personaplex-7b-turbo2bit", "model-turbo2bit.safetensors", token=False)
linear2bit_path = hf_hub_download("cudabenchmarktest/personaplex-7b-turbo2bit", "linear2bit.py", token=False)

# Import the Linear2bit module
import importlib.util
spec = importlib.util.spec_from_file_location("linear2bit", linear2bit_path)
linear2bit_mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(linear2bit_mod)

# Load model (requires moshi-personaplex package)
from moshi.models.loaders import _lm_kwargs, LMModel
lm_kwargs = dict(_lm_kwargs)
lm_kwargs["dep_q"] = 16
model = LMModel(device="meta", dtype=torch.bfloat16, **lm_kwargs)
state_2bit = load_file(weight_path, device="cpu")
model = linear2bit_mod.replace_linears_with_2bit(model, state_2bit, device="cuda", dtype=torch.bfloat16)
model.eval()
```
## Dequantize to BF16 (if you prefer standard weights)
```bash
python dequantize.py --input model-turbo2bit.safetensors --output model-bf16.safetensors
# Output: 15.59 GB bf16 file
```
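For dequantizing a single packed tensor in-process, the core logic of `dequant-loader.py` presumably resembles the sketch below. The tensor layout and function name here are assumptions, and the inverse Walsh-Hadamard rotation is omitted for brevity:

```python
import torch

# NF2 centroids from the quantization-method section above
CENTROIDS = torch.tensor([-1.5104, -0.4528, 0.4528, 1.5104], dtype=torch.float16)

def dequantize_tensor(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """Recover fp16 values from packed 2-bit codes and per-group fp16 scales.

    Illustrative only: the real loader also applies the inverse
    Walsh-Hadamard rotation, which is skipped here.
    """
    # Unpack four 2-bit codes from each byte (low bits first)
    codes = torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=-1).view(-1)
    values = CENTROIDS[codes.long()].view(-1, group_size)
    return (values * scales.view(-1, 1)).view(-1)
```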
## Server with personaplex-setup
```bash
# Native 2-bit (~10GB peak, no HF token needed)
NO_CUDA_GRAPH=1 python -m moshi.server --moshi-weight model-turbo2bit.safetensors --native-2bit --device cuda

# + CPU Mimi codec (saves ~840MB)
NO_CUDA_GRAPH=1 python -m moshi.server --moshi-weight model-turbo2bit.safetensors --native-2bit --cpu-mimi --device cuda
```
## Voice Cloning
```bash
python clone-voice.py --input my_voice.wav --name MyVoice --device cuda
```
## File Structure
| File | Size | Description |
|---|---|---|
| `model-turbo2bit.safetensors` | 2.07 GB | 2-bit quantized weights |
| `linear2bit.py` | 14 KB | Native 2-bit Linear module + Embedding8bit |
| `dequant-loader.py` | 6 KB | Programmatic dequantizer |
| `dequantize.py` | 5 KB | CLI dequantizer |
| `clone-voice.py` | 15 KB | Voice cloning with LuxTTS preprocessing |
| `tokenizer-*.safetensors` | 367 MB | Mimi audio codec |
| `tokenizer_spm_32k_3.model` | 540 KB | Text tokenizer |
## Quantization Hardware
Quantized on 3× NVIDIA A100 80 GB PCIe GPUs in 50 seconds using a CUDA-accelerated Walsh-Hadamard Transform.
## Related Models
- nvidia/personaplex-7b-v1 – Original bf16 weights (15.59 GB)
- cudabenchmarktest/personaplex-7b-nf4 – INT4 NF4 quantized (4.14 GB, 99.4% cosine sim)
## License
Same as base model: NVIDIA Open Model License
## Credits
Quantized by open-agents-ai using a TurboQuant-inspired NF2+WHT method. Base model by NVIDIA Research (PersonaPlex paper).