Instructions for using 0xSero/Nemotron-3-Super-92B with libraries and local apps.

Libraries

How to use 0xSero/Nemotron-3-Super-92B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="0xSero/Nemotron-3-Super-92B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

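# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("0xSero/Nemotron-3-Super-92B", trust_remote_code=True)
# The BF16 weights are about 185 GB on disk, so keep the checkpoint dtype and
# shard across available devices. device_map="auto" assumes `accelerate` is installed.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/Nemotron-3-Super-92B",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))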

Local Apps

vLLM

How to use 0xSero/Nemotron-3-Super-92B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/Nemotron-3-Super-92B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Nemotron-3-Super-92B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
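
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the server is already running on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str,
         base_url: str = "http://localhost:8000", timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text (needs a live server)."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call, once the server is up:
# print(chat("0xSero/Nemotron-3-Super-92B", "What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should let the same helper talk to the SGLang server described later in this document, since both expose the same endpoint path.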

Use Docker

docker model run hf.co/0xSero/Nemotron-3-Super-92B

SGLang

How to use 0xSero/Nemotron-3-Super-92B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "0xSero/Nemotron-3-Super-92B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Nemotron-3-Super-92B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "0xSero/Nemotron-3-Super-92B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Nemotron-3-Super-92B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use 0xSero/Nemotron-3-Super-92B with Docker Model Runner:
```
docker model run hf.co/0xSero/Nemotron-3-Super-92B
```


Nemotron-3-Super-92B

A REAP-pruned variant of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.

At a glance


Property	Value
Base model	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Format	BF16
Total params	92B
Active / token	12B
Experts / layer	384
Layers	—
Hidden size	4096
Context	262,144 tokens
On-disk size	185 GB
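
As a quick plausibility check on the table above, BF16 stores each parameter in 2 bytes, so the nominal parameter count should roughly predict the on-disk size. The sketch below assumes the rounded 92B figure and decimal gigabytes; the function name is hypothetical.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 is a 16-bit format


def estimate_checkpoint_gb(total_params: float,
                           bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


# 92e9 is the nominal, rounded total from the table; the real count may differ slightly.
estimated_gb = estimate_checkpoint_gb(92e9)
print(f"Estimated BF16 weights: {estimated_gb:.0f} GB (table reports 185 GB on disk)")
```

The estimate lands within about 1 GB of the reported size, which is consistent with rounding in the parameter count plus a small amount of non-weight files.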

Which variant should I pick?

Variant	Format	Link
`Nemotron-3-Super-64B`	BF16	link
`Nemotron-3-Super-64B-W4A16`	W4A16	link
`Nemotron-3-Super-92B` (this)	BF16	link
`Nemotron-3-Super-92B-W4A16`	W4A16	link

This repo is a draft REAP-derived checkpoint based on nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.

Provenance

Research References

Paper: arXiv:2510.13999
Code and experiment source: 0xSero/reap-expert-swap
Support and research funding: donate.sybilsolutions.ai
Upstream base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Relationship: REAP-derived expert-pruned checkpoint, not a mirror of the original base release
Draft status: draft research release
This repo publishes the derived checkpoint weights and runtime files produced from our REAP pruning workflow.

Pruning details

Experts per MoE layer in upstream base: 512
Experts retained per layer in this variant: 384
Experts pruned per layer in this variant: 128
Expected safetensor shard count in this draft repo: 4
Source merged observation workflow: nemotron_super_merged_long50_short15120_v2
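
The expert counts above imply a 75% retention rate per MoE layer. The sketch below works through that arithmetic and, as a rough order-of-magnitude estimate only, divides the nominal parameter difference (120B in the upstream name versus 92B here) by the number of removed experts. It assumes all 40 MoE blocks listed under "Model facts" are pruned equally and that the rounded totals are close to the true counts.

```python
UPSTREAM_EXPERTS_PER_LAYER = 512
RETAINED_EXPERTS_PER_LAYER = 384
MOE_BLOCKS = 40  # taken from the "Model facts" list in this card

pruned_per_layer = UPSTREAM_EXPERTS_PER_LAYER - RETAINED_EXPERTS_PER_LAYER
retained_fraction = RETAINED_EXPERTS_PER_LAYER / UPSTREAM_EXPERTS_PER_LAYER
total_pruned_experts = pruned_per_layer * MOE_BLOCKS

# Nominal, rounded totals from the model names; treat the result as a rough estimate.
NOMINAL_UPSTREAM_PARAMS = 120e9
NOMINAL_PRUNED_PARAMS = 92e9
approx_params_per_expert = (NOMINAL_UPSTREAM_PARAMS - NOMINAL_PRUNED_PARAMS) / total_pruned_experts

print(f"Pruned per layer: {pruned_per_layer}")
print(f"Retained fraction: {retained_fraction:.0%}")
print(f"Experts removed overall: {total_pruned_experts}")
print(f"Rough params per expert: {approx_params_per_expert / 1e6:.1f}M")
```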

Method summary

The pruning signal comes from layerwise REAP observations collected over a mixed calibration corpus: mostly personal AI-session history, plus a bounded slice of public prompts.
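
For orientation, the cited REAP paper scores each expert by the router gate value multiplied by the norm of that expert's output, averaged over the tokens routed to it, and prunes the lowest-scoring experts. The toy sketch below illustrates that idea for a single layer. It is not the code from 0xSero/reap-expert-swap; the function names and the flat observation format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One observation: (expert_id, router gate weight, L2 norm of that expert's output)
# recorded for a token that was routed to the expert.
Observation = Tuple[int, float, float]


def reap_style_saliency(observations: Iterable[Observation]) -> Dict[int, float]:
    """Average gate * output-norm per expert over the tokens routed to it."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for expert_id, gate, output_norm in observations:
        totals[expert_id] += gate * output_norm
        counts[expert_id] += 1
    return {expert_id: totals[expert_id] / counts[expert_id] for expert_id in totals}


def experts_to_keep(saliency: Dict[int, float], num_experts: int, keep: int) -> List[int]:
    """Return the ids of the `keep` highest-saliency experts, in ascending id order."""
    # Experts that never received a token are treated as saliency 0.0 (an assumption).
    ranked = sorted(range(num_experts), key=lambda e: saliency.get(e, 0.0), reverse=True)
    return sorted(ranked[:keep])


# Toy layer with 4 experts, keeping 3 (the card's real setting is 512 -> 384).
toy_observations = [(0, 0.5, 2.0), (0, 0.5, 4.0), (1, 0.1, 1.0), (2, 0.9, 1.0)]
toy_saliency = reap_style_saliency(toy_observations)
print(experts_to_keep(toy_saliency, num_experts=4, keep=3))
```

In the toy data, expert 3 never receives a token, so it is the one dropped.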

Validated observation lanes used in the merged signal:

nemotron_super_long50_16k_v3
- longest personal trajectories first
- 50 trajectories
- capped at 16384 tokens each
nemotron_super_short_mix_15120_t1024_b8192_v4
- 15000 short personal prompts plus 120 bounded public prompts
- capped at 1024 tokens each
- packed under a safe 8192 token batch budget
merged canonical state: nemotron_super_merged_long50_short15120_v2
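
The short-prompt lane is described as capped at 1024 tokens per prompt and packed under an 8192-token batch budget. One simple way to do that is greedy sequential packing, sketched below. This is an illustrative assumption, not the workflow's actual packer: it counts the sum of capped lengths against the budget, whereas a real implementation might count padded tokens instead. The function name is hypothetical.

```python
from typing import List


def pack_by_token_budget(lengths: List[int], max_len: int = 1024,
                         budget: int = 8192) -> List[List[int]]:
    """Group prompt indices, in order, into batches whose capped lengths sum to <= budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, length in enumerate(lengths):
        capped = min(length, max_len)  # truncate over-long prompts to the cap
        if current and used + capped > budget:
            # Adding this prompt would exceed the budget, so close the batch.
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += capped
    if current:
        batches.append(current)
    return batches


# Nine prompts that all hit the 1024-token cap: eight fit in one batch, one spills over.
print(pack_by_token_budget([2000] * 9))
```

Under these assumptions, at most eight fully capped prompts fit in a batch, since 8 × 1024 = 8192.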

Model facts from the merged observation lane:

runtime architecture class: NemotronHForCausalLM
total blocks: 88
MoE blocks: 40
Mamba blocks: 40
attention blocks: 8
routed experts per token: 22
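
The figures above can be cross-checked with a few lines of arithmetic. The sketch below confirms that the three block types sum to the stated total and computes what share of the retained experts is routed per token, assuming the top-22 routing is unchanged by pruning and using the 384 retained experts per layer from the pruning details. The `BlockLayout` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockLayout:
    """Block counts as listed in the model facts."""
    moe: int
    mamba: int
    attention: int

    @property
    def total(self) -> int:
        return self.moe + self.mamba + self.attention


layout = BlockLayout(moe=40, mamba=40, attention=8)

ROUTED_EXPERTS_PER_TOKEN = 22
RETAINED_EXPERTS_PER_LAYER = 384  # from the pruning details in this card
routed_fraction = ROUTED_EXPERTS_PER_TOKEN / RETAINED_EXPERTS_PER_LAYER

print(f"Total blocks: {layout.total}")
print(f"Share of retained experts routed per token: {routed_fraction:.1%}")
```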

Intended use

This draft checkpoint is published for research into expert activation structure, residency planning, CPU offloading, and prompt-conditioned expert selection. It makes no claim of production readiness, and it is not an NVIDIA release.

Draft caveats

This is a draft derived checkpoint.
We have not yet completed full serving and quality benchmark campaigns for this release.
The repo preserves provenance back to the upstream NVIDIA release and should be evaluated in that context.

License and terms

Distribution of this derived checkpoint is intended to comply with the NVIDIA Open Model License included in LICENSE.txt. The required attribution notice is included in NOTICE.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}


