# PersonaPlex 7B – TurboQuant 2-bit (NF2 + Walsh-Hadamard)
Ultra-compressed version of nvidia/personaplex-7b-v1 for deployment on edge devices like NVIDIA Jetson AGX Orin.
Quantized using a TurboQuant-inspired method combining Walsh-Hadamard Transform rotation with Normal Float 2-bit (NF2) optimal scalar quantization.
## Compression Results
| Metric | Original (bf16) | INT4 NF4 | This: NF2+WHT |
|---|---|---|---|
| Size | 15.59 GB | 4.14 GB | 2.07 GB |
| Compression | 1x | 3.8x | 7.5x |
| Eff. bits/weight | 16 | ~4.2 | ~2.13 |
| Cosine similarity | 1.0 | 0.994 | 0.944 |
| Model load test | PASS | PASS | PASS (0 missing keys) |
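The ~2.13 effective bits/weight figure follows directly from the storage layout (2-bit codes plus one fp16 scale shared by each 128-weight group), which a one-line calculation confirms:

```python
# Effective bits per weight: 2-bit code + one fp16 scale per 128-weight group
group_size = 128
eff_bits = 2 + 16 / group_size
print(eff_bits)  # 2.125, matching the ~2.13 reported above
```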
## Quantization Method
TurboQuant v2 – adapted from Google Research TurboQuant (arXiv:2504.19874):
- Per-group normalization (group_size=128): RMS scaling to unit variance
- Walsh-Hadamard Transform: Random rotation to spread weight outliers uniformly across dimensions, making the distribution more Gaussian
- NF2 quantization: Lloyd-Max optimal 4-level scalar quantizer for Gaussian distributions
  - Centroids: {-1.5104, -0.4528, 0.4528, 1.5104}
  - Decision boundaries: {-0.9816, 0.0, 0.9816}
- Packed storage: 4 codes per byte (2 bits each) + fp16 per-group scales
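The per-group pipeline above (RMS-normalize → rotate → NF2-code → pack) can be sketched as follows. This is a minimal illustration using the centroids and boundaries listed above; the function names are hypothetical, and the repo's real implementation lives in `linear2bit.py`:

```python
import torch

# NF2 constants from this card (Lloyd-Max optimal 4-level quantizer for N(0, 1))
CENTROIDS = torch.tensor([-1.5104, -0.4528, 0.4528, 1.5104])
BOUNDARIES = torch.tensor([-0.9816, 0.0, 0.9816])

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Walsh-Hadamard matrix via Sylvester construction (n = power of 2)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / n ** 0.5

def quantize_group(w: torch.Tensor):
    """Quantize one 128-weight group: RMS-normalize, rotate, emit 2-bit NF2 codes."""
    scale = w.pow(2).mean().sqrt()          # RMS scaling to unit variance
    x = hadamard(w.numel()) @ (w / scale)   # rotation spreads outliers, Gaussianizing x
    codes = torch.bucketize(x, BOUNDARIES)  # maps each value to its NF2 centroid index
    return codes.to(torch.uint8), scale.to(torch.float16)

def pack(codes: torch.Tensor) -> torch.Tensor:
    """Pack four 2-bit codes into each byte."""
    c = codes.view(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
```

Dequantization reverses the steps: look up each code's centroid, multiply by the group's fp16 scale, and apply the inverse (transposed) Hadamard rotation.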
**What stays in fp16 (77 tensors):** Layer norms, biases, embeddings, positional encodings, RoPE – these are sensitive to quantization and small enough to keep full precision.
## Per-Tensor Quality
## Target Hardware
| Device | VRAM/Memory | Weight Load Time | Real-time? |
|---|---|---|---|
| Jetson AGX Orin 64GB | 64 GB unified | ~10ms (204.8 GB/s) | Yes with INT2 GEMM |
| Jetson AGX Orin 32GB | 32 GB unified | ~10ms | Yes with INT2 GEMM |
| RTX 3060 12GB | 12 GB | ~3ms (360 GB/s) | Yes |
| RTX 4090 24GB | 24 GB | ~1.3ms (1 TB/s) | Yes |
PersonaPlex requires 12.5 frames/second (80ms/frame budget). At 2.07 GB, weight loading is well under budget on any modern GPU.
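The budget claim is easy to check for the slowest device listed: streaming all 2.07 GB of packed weights at the Jetson AGX Orin's 204.8 GB/s takes roughly 10 ms, about an eighth of the frame budget.

```python
# Sanity check: streaming every packed weight once per frame on the slowest listed device
size_gb = 2.07                 # packed 2-bit model size
bandwidth_gbs = 204.8          # Jetson AGX Orin unified-memory bandwidth
frame_budget_ms = 1000 / 12.5  # 12.5 frames/s -> 80 ms per frame
load_ms = size_gb / bandwidth_gbs * 1000
print(f"{load_ms:.1f} ms of {frame_budget_ms:.0f} ms budget")  # 10.1 ms of 80 ms budget
```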
## Quick Start – Native 2-Bit Inference
This path keeps the packed 2-bit weights on the GPU and dequantizes per matmul: ~10 GB peak VRAM vs ~19 GB for bf16.
```python
import torch
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# Download this repo's files
weight_path = hf_hub_download("cudabenchmarktest/personaplex-7b-turbo2bit", "model-turbo2bit.safetensors", token=False)
linear2bit_path = hf_hub_download("cudabenchmarktest/personaplex-7b-turbo2bit", "linear2bit.py", token=False)

# Import the Linear2bit module
import importlib.util
spec = importlib.util.spec_from_file_location("linear2bit", linear2bit_path)
linear2bit_mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(linear2bit_mod)

# Load model (requires moshi-personaplex package)
from moshi.models.loaders import _lm_kwargs, LMModel
lm_kwargs = dict(_lm_kwargs)
lm_kwargs["dep_q"] = 16
model = LMModel(device="meta", dtype=torch.bfloat16, **lm_kwargs)
state_2bit = load_file(weight_path, device="cpu")
model = linear2bit_mod.replace_linears_with_2bit(model, state_2bit, device="cuda", dtype=torch.bfloat16)
model.eval()
```
## Dequantize to BF16 (if you prefer standard weights)
```bash
python dequantize.py --input model-turbo2bit.safetensors --output model-bf16.safetensors
# Output: 15.59 GB bf16 file
```
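For dequantizing a single packed tensor in-process, the core logic of `dequant-loader.py` presumably resembles the sketch below. The tensor layout and function name here are assumptions, and the inverse Walsh-Hadamard rotation is omitted for brevity:

```python
import torch

# NF2 centroids from the quantization-method section above
CENTROIDS = torch.tensor([-1.5104, -0.4528, 0.4528, 1.5104], dtype=torch.float16)

def dequantize_tensor(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """Recover fp16 values from packed 2-bit codes and per-group fp16 scales.

    Illustrative only: the real loader also applies the inverse
    Walsh-Hadamard rotation, which is skipped here.
    """
    # Unpack four 2-bit codes from each byte (low bits first)
    codes = torch.stack([(packed >> s) & 0x3 for s in (0, 2, 4, 6)], dim=-1).view(-1)
    values = CENTROIDS[codes.long()].view(-1, group_size)
    return (values * scales.view(-1, 1)).view(-1)
```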
## Server with personaplex-setup
```bash
# Native 2-bit (~10GB peak, no HF token needed)
NO_CUDA_GRAPH=1 python -m moshi.server --moshi-weight model-turbo2bit.safetensors --native-2bit --device cuda

# + CPU Mimi codec (saves ~840MB)
NO_CUDA_GRAPH=1 python -m moshi.server --moshi-weight model-turbo2bit.safetensors --native-2bit --cpu-mimi --device cuda
```
## Voice Cloning
```bash
python clone-voice.py --input my_voice.wav --name MyVoice --device cuda
```
## File Structure
| File | Size | Description |
|---|---|---|
| `model-turbo2bit.safetensors` | 2.07 GB | 2-bit quantized weights |
| `linear2bit.py` | 14 KB | Native 2-bit Linear module + Embedding8bit |
| `dequant-loader.py` | 6 KB | Programmatic dequantizer |
| `dequantize.py` | 5 KB | CLI dequantizer |
| `clone-voice.py` | 15 KB | Voice cloning with LuxTTS preprocessing |
| `tokenizer-*.safetensors` | 367 MB | Mimi audio codec |
| `tokenizer_spm_32k_3.model` | 540 KB | Text tokenizer |
## Quantization Hardware
Quantized on 3× NVIDIA A100 80 GB PCIe GPUs in 50 seconds using a CUDA-accelerated Walsh-Hadamard Transform.
## Related Models
- nvidia/personaplex-7b-v1 – Original bf16 weights (15.59 GB)
- cudabenchmarktest/personaplex-7b-nf4 – INT4 NF4 quantized (4.14 GB, 99.4% cosine sim)
## License
Same as base model: NVIDIA Open Model License
## Credits
Quantized by open-agents-ai using a TurboQuant-inspired NF2+WHT method. Base model by NVIDIA Research (PersonaPlex paper).