# PersonaPlex 7B: TurboQuant 2-bit (NF2 + Walsh-Hadamard)

An ultra-compressed version of nvidia/personaplex-7b-v1 for deployment on edge devices such as the NVIDIA Jetson AGX Orin.

Quantized with a TurboQuant-inspired method that combines a Walsh-Hadamard Transform rotation with Normal Float 2-bit (NF2) optimal scalar quantization.

## Compression Results

|                   | Original (bf16) | INT4/NF4 | This: NF2+WHT |
|-------------------|-----------------|----------|---------------|
| Size              | 15.59 GB        | 4.14 GB  | 2.07 GB       |
| Compression       | 1x              | 3.8x     | 7.5x          |
| Eff. bits/weight  | 16              | ~4.2     | ~2.13         |
| Cosine similarity | 1.0             | 0.994    | 0.944         |
| Model load test   | PASS            | PASS     | PASS (0 missing keys) |

## Quantization Method

TurboQuant v2, adapted from Google Research TurboQuant (arXiv:2504.19874):

1. **Per-group normalization** (group_size=128): RMS scaling to unit variance
2. **Walsh-Hadamard Transform**: a random rotation that spreads weight outliers uniformly across dimensions, making the distribution more Gaussian
3. **NF2 quantization**: Lloyd-Max optimal 4-level scalar quantizer for Gaussian distributions
   - Centroids: {-1.5104, -0.4528, 0.4528, 1.5104}
   - Decision boundaries: {-0.9816, 0.0, 0.9816}
4. **Packed storage**: 4 codes per byte (2 bits each) + fp16 per-group scales
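The four steps above can be sketched in NumPy. This is an illustrative reconstruction, not the repo's quantizer: the function names are mine, and it uses a plain (non-randomized) Hadamard rotation for simplicity.

```python
import numpy as np

# Lloyd-Max optimal 4-level quantizer for a unit Gaussian (values from the list above)
NF2_CENTROIDS = np.array([-1.5104, -0.4528, 0.4528, 1.5104])
NF2_BOUNDARIES = np.array([-0.9816, 0.0, 0.9816])

def hadamard(n):
    """Orthonormal Sylvester-construction Hadamard matrix; n must be a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_group(w, group_size=128):
    """Steps 1-3: RMS-normalize per group, rotate with WHT, snap to NF2 centroids."""
    g = w.reshape(-1, group_size)
    scale = np.sqrt((g ** 2).mean(axis=1, keepdims=True)) + 1e-8   # step 1: RMS scale
    rotated = (g / scale) @ hadamard(group_size).T                 # step 2: WHT rotation
    codes = np.searchsorted(NF2_BOUNDARIES, rotated)               # step 3: codes in 0..3
    return codes.astype(np.uint8), scale.astype(np.float16)

def dequantize_group(codes, scale, group_size=128):
    """Inverse: look up centroids, undo the (orthonormal) rotation, rescale."""
    return (NF2_CENTROIDS[codes] @ hadamard(group_size)) * scale

def pack_codes(codes):
    """Step 4: pack 4 two-bit codes into each byte."""
    flat = codes.reshape(-1, 4)
    return (flat[:, 0] | (flat[:, 1] << 2) | (flat[:, 2] << 4) | (flat[:, 3] << 6)).astype(np.uint8)
```

On Gaussian weights, a quantize/dequantize round trip with this sketch lands near the ~0.94 cosine similarity reported in the table above, which is the expected distortion of a 4-level Lloyd-Max quantizer.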

**What stays in fp16 (77 tensors):** layer norms, biases, embeddings, positional encodings, RoPE. These are sensitive to quantization and small enough to keep at full precision.

## Per-Tensor Quality

## Target Hardware

| Device               | VRAM/Memory   | Weight Load Time    | Real-time?          |
|----------------------|---------------|---------------------|---------------------|
| Jetson AGX Orin 64GB | 64 GB unified | ~10 ms (204.8 GB/s) | Yes, with INT2 GEMM |
| Jetson AGX Orin 32GB | 32 GB unified | ~10 ms              | Yes, with INT2 GEMM |
| RTX 3060 12GB        | 12 GB         | ~3 ms (360 GB/s)    | Yes                 |
| RTX 4090 24GB        | 24 GB         | ~1.3 ms (1 TB/s)    | Yes                 |

PersonaPlex requires 12.5 frames/second (80ms/frame budget). At 2.07 GB, weight loading is well under budget on any modern GPU.
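A quick sanity check on that budget arithmetic, using the table's quoted 204.8 GB/s Jetson unified-memory bandwidth (a peak figure; achieved bandwidth will be lower):

```python
weights_gb = 2.07                # packed 2-bit checkpoint size
frame_budget_ms = 1000 / 12.5    # 12.5 frames/s -> 80 ms per frame
jetson_bw_gbps = 204.8           # assumed peak memory bandwidth, GB/s

load_ms = weights_gb / jetson_bw_gbps * 1000
print(f"{load_ms:.1f} ms weight traffic vs {frame_budget_ms:.0f} ms frame budget")
# -> 10.1 ms weight traffic vs 80 ms frame budget
```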

## Quick Start: Native 2-Bit Inference

Keeps the packed 2-bit weights on the GPU and dequantizes them per matmul: ~10 GB peak VRAM vs. 19 GB for bf16.

```python
import importlib.util

import torch
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

# Download this repo's files
weight_path = hf_hub_download("cudabenchmarktest/personaplex-7b-turbo2bit", "model-turbo2bit.safetensors", token=False)
linear2bit_path = hf_hub_download("cudabenchmarktest/personaplex-7b-turbo2bit", "linear2bit.py", token=False)

# Import the Linear2bit module from the downloaded file
spec = importlib.util.spec_from_file_location("linear2bit", linear2bit_path)
linear2bit_mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(linear2bit_mod)

# Build the model skeleton on the meta device (requires the moshi-personaplex package)
from moshi.models.loaders import _lm_kwargs, LMModel
lm_kwargs = dict(_lm_kwargs)
lm_kwargs["dep_q"] = 16
model = LMModel(device="meta", dtype=torch.bfloat16, **lm_kwargs)

# Load the packed 2-bit state dict and swap in Linear2bit layers
state_2bit = load_file(weight_path, device="cpu")
model = linear2bit_mod.replace_linears_with_2bit(model, state_2bit, device="cuda", dtype=torch.bfloat16)
model.eval()
```
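What "dequantizes per-matmul" means can be sketched like this. This is an illustrative NumPy mock-up, not the repo's linear2bit.py, and it omits the inverse Walsh-Hadamard rotation that the real module applies:

```python
import numpy as np

NF2_CENTROIDS = np.array([-1.5104, -0.4528, 0.4528, 1.5104], dtype=np.float32)

class Linear2bitSketch:
    """Keeps packed 2-bit codes + fp16 per-group scales; dequantizes on each call."""
    def __init__(self, packed, scales, out_features, in_features, group_size=128):
        self.packed = packed          # uint8, 4 two-bit codes per byte
        self.scales = scales          # fp16, one scale per group
        self.shape = (out_features, in_features)
        self.group_size = group_size

    def dequantize(self):
        # Unpack 4 two-bit codes from each byte, low bits first
        codes = np.stack([(self.packed >> s) & 3 for s in (0, 2, 4, 6)], axis=-1).reshape(-1)
        w = NF2_CENTROIDS[codes].reshape(-1, self.group_size)
        w = w * self.scales.astype(np.float32)[:, None]   # per-group rescale
        # NOTE: the real pipeline would also undo the Walsh-Hadamard rotation here
        return w.reshape(self.shape)

    def __call__(self, x):
        # Materialize full-precision weights only for the duration of the matmul
        return x @ self.dequantize().T
```

Dequantizing inside the forward pass trades a small amount of compute for keeping only the 2-bit codes and scales resident in VRAM.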

## Dequantize to BF16

If you prefer standard bf16 weights:

```bash
python dequantize.py --input model-turbo2bit.safetensors --output model-bf16.safetensors
# Output: 15.59 GB bf16 file
```

## Server with personaplex-setup

```bash
# Native 2-bit (~10 GB peak, no HF token needed)
NO_CUDA_GRAPH=1 python -m moshi.server --moshi-weight model-turbo2bit.safetensors --native-2bit --device cuda

# + CPU Mimi codec (saves ~840 MB)
NO_CUDA_GRAPH=1 python -m moshi.server --moshi-weight model-turbo2bit.safetensors --native-2bit --cpu-mimi --device cuda
```

## Voice Cloning

```bash
python clone-voice.py --input my_voice.wav --name MyVoice --device cuda
```

## File Structure

| File                        | Size    | Description                                 |
|-----------------------------|---------|---------------------------------------------|
| model-turbo2bit.safetensors | 2.07 GB | 2-bit quantized weights                     |
| linear2bit.py               | 14 KB   | Native 2-bit Linear module + Embedding8bit  |
| dequant-loader.py           | 6 KB    | Programmatic dequantizer                    |
| dequantize.py               | 5 KB    | CLI dequantizer                             |
| clone-voice.py              | 15 KB   | Voice cloning with LuxTTS preprocessing     |
| tokenizer-*.safetensors     | 367 MB  | Mimi audio codec                            |
| tokenizer_spm_32k_3.model   | 540 KB  | Text tokenizer                              |

## Quantization Hardware

Quantized on 3x NVIDIA A100 80GB PCIe GPUs in 50 seconds, using a CUDA-accelerated Walsh-Hadamard Transform.

## Related Models

## License

Same as the base model: NVIDIA Open Model License.

## Credits

Quantized by open-agents-ai using the TurboQuant-inspired NF2+WHT method. Base model by NVIDIA Research (PersonaPlex paper).
