Nemotron-3-Super-92B

REAP-pruned nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.

At a glance

  • Base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
  • Format: BF16
  • Total params: 92B
  • Active / token: 12B
  • Experts / layer: 384
  • Layers: 88 blocks (40 MoE, 40 Mamba, 8 attention)
  • Hidden size: 4096
  • Context: 262,144 tokens
  • On-disk size: 185 GB
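
As a rough sanity check on the figures above, the on-disk size follows from the parameter count if every weight is stored in BF16 (2 bytes per parameter). This is a back-of-the-envelope sketch: it assumes decimal gigabytes and ignores safetensors metadata and any tensors kept in another precision. `estimate_size_gb` is an illustrative helper, not something shipped in this repo.

```python
def estimate_size_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


# 92B is a rounded parameter count, so expect a small gap to the listed 185 GB.
print(estimate_size_gb(92e9))  # 184.0
```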

Which variant should I pick?

Variant                        Format   Link
Nemotron-3-Super-64B           BF16     link
Nemotron-3-Super-64B-W4A16     W4A16    link
Nemotron-3-Super-92B (this)    BF16     link
Nemotron-3-Super-92B-W4A16     W4A16    link

This repo is a draft REAP-derived checkpoint based on nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.

Provenance

Research References

Pruning details

  • Experts per MoE layer in upstream base: 512
  • Experts retained per layer in this variant: 384
  • Experts pruned per layer in this variant: 128
  • Expected safetensors shard count in this draft repo: 4
  • Source merged observation workflow: nemotron_super_merged_long50_short15120_v2
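
The per-layer counts above imply a 25% expert pruning ratio. A minimal check of that arithmetic, using only the numbers listed in this section (the constant names are illustrative):

```python
UPSTREAM_EXPERTS = 512  # experts per MoE layer in the upstream base model
RETAINED_EXPERTS = 384  # experts kept per layer in this variant

pruned = UPSTREAM_EXPERTS - RETAINED_EXPERTS
pruned_fraction = pruned / UPSTREAM_EXPERTS

print(pruned, f"{pruned_fraction:.0%}")  # 128 25%
```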

Method summary

The pruning signal comes from layer-wise REAP observations collected over a mixed calibration corpus: mostly personal AI-session history, plus a small, bounded slice of public prompts.
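
To make "REAP observations" concrete: as we read the REAP paper cited at the end of this card, each expert receives a saliency score that is roughly the average, over the tokens routed to it, of its router gate value times the L2 norm of its output, and the lowest-scoring experts are pruned. The sketch below illustrates that idea on toy data. The record layout `(expert_id, gate_value, output_norm)` and both helper names are assumptions made for illustration; they are not this repo's observation format or pipeline.

```python
from collections import defaultdict


def reap_saliency(observations):
    """Mean of gate_value * output_norm over the tokens routed to each expert.

    `observations` yields (expert_id, gate_value, output_norm) tuples, one per
    (token, active expert) pair. This layout is hypothetical.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate_value, output_norm in observations:
        totals[expert_id] += gate_value * output_norm
        counts[expert_id] += 1
    return {expert: totals[expert] / counts[expert] for expert in totals}


def experts_to_keep(saliency, num_experts, keep):
    """Return the ids of the `keep` highest-saliency experts.

    Experts that never fired during calibration are scored 0.
    """
    ranked = sorted(
        range(num_experts),
        key=lambda expert: saliency.get(expert, 0.0),
        reverse=True,
    )
    return sorted(ranked[:keep])


# Toy layer: 4 experts, keep 3 (this variant keeps 384 of 512 per layer).
obs = [(0, 0.5, 2.0), (0, 0.7, 2.0), (1, 0.1, 1.0), (2, 0.4, 3.0)]
scores = reap_saliency(obs)
print(experts_to_keep(scores, num_experts=4, keep=3))  # [0, 1, 2]
```

Expert 3 never fires in the toy data, so it has the lowest score and is the one dropped.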

Validated observation lanes used in the merged signal:

  • nemotron_super_long50_16k_v3
    • longest personal trajectories first
    • 50 trajectories
    • capped at 16384 tokens each
  • nemotron_super_short_mix_15120_t1024_b8192_v4
    • 15000 short personal prompts plus 120 bounded public prompts
    • capped at 1024 tokens each
    • packed under a safe 8192 token batch budget
  • merged canonical state: nemotron_super_merged_long50_short15120_v2
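
The short-mix lane above caps each prompt at 1024 tokens and packs prompts under an 8192-token batch budget. One simple way to do that is greedy sequential packing, sketched below. The packing strategy actually used for this lane is not documented here, so `pack_batches` is an assumption, and it counts unpadded tokens only.

```python
def pack_batches(prompt_lengths, max_tokens_per_prompt=1024, batch_token_budget=8192):
    """Group prompts in input order so each batch stays within the token budget.

    Returns a list of batches; each batch is a list of (prompt_index, tokens).
    """
    batches, current, current_tokens = [], [], 0
    for index, length in enumerate(prompt_lengths):
        capped = min(length, max_tokens_per_prompt)  # truncate over-long prompts
        # Start a new batch when adding this prompt would exceed the budget.
        if current and current_tokens + capped > batch_token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((index, capped))
        current_tokens += capped
    if current:
        batches.append(current)
    return batches


lengths = [300, 5000, 1024, 900] * 5  # made-up prompt lengths in tokens
batches = pack_batches(lengths)
print([sum(tokens for _, tokens in batch) for batch in batches])
```

With the caps listed above, the two lanes contribute at most 50 × 16384 = 819,200 tokens and 15,120 × 1024 = 15,482,880 tokens respectively.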

Model facts from the merged observation lane:

  • runtime architecture class: NemotronHForCausalLM
  • total blocks: 88
  • MoE blocks: 40
  • Mamba blocks: 40
  • attention blocks: 8
  • routed experts per token: 22
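
The block counts above add up to the reported total, and pruning changes how sparse routing is within each MoE layer. The quick check below assumes the router still selects 22 experts per token after pruning and that every MoE block holds the same number of routed experts; neither is stated explicitly for the pruned checkpoint.

```python
blocks = {"moe": 40, "mamba": 40, "attention": 8}
TOTAL_BLOCKS = 88
ROUTED_PER_TOKEN = 22
EXPERTS_BEFORE, EXPERTS_AFTER = 512, 384

# The three block types should account for every block in the model.
assert sum(blocks.values()) == TOTAL_BLOCKS

# Fraction of a layer's routed experts that fire for a single token.
active_before = ROUTED_PER_TOKEN / EXPERTS_BEFORE
active_after = ROUTED_PER_TOKEN / EXPERTS_AFTER
print(f"{active_before:.1%} -> {active_after:.1%}")  # 4.3% -> 5.7%

# Routed experts held across all MoE blocks after pruning.
total_experts_after = blocks["moe"] * EXPERTS_AFTER
print(total_experts_after)  # 15360
```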

Intended use

This draft checkpoint is published for research into expert activation structure, residency planning, CPU offloading, and prompt-conditioned expert selection. It is not claimed to be production-ready, and it is not an NVIDIA release.

Draft caveats

  • This is a draft derived checkpoint.
  • We have not yet completed full serving or quality benchmarks for this release.
  • The repo preserves provenance back to the upstream NVIDIA release and should be evaluated in that context.

License and terms

Distribution of this derived checkpoint is intended to comply with the NVIDIA Open Model License included in LICENSE.txt. The required attribution notice is included in NOTICE.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
