Darwin-9B-NEG – The First Native Entropy Gating Model
Qwen3.5-9B backbone · 8.95B parameters · BF16 · Thinking Mode · Apache 2.0
The first NEG-enabled model – self-regulating reasoning with no extra library.
Abstract
Darwin-9B-NEG is the first model in the Darwin series to feature Native Entropy Gating (NEG) – a proprietary Darwin architectural innovation that embeds a sense of self-confidence directly into the model weights. Unlike external multi-turn iteration (MTI) techniques that require 3×–8× extra inference, NEG operates inside the single decoding loop and activates in fewer than 5 % of generation steps, lifting reasoning accuracy by more than 12 percentage points at 1× inference cost.
On the GPQA Diamond PhD-level reasoning benchmark (198 questions), Darwin-9B-NEG scores 84.34 % with the full 3-stage ensemble protocol – surpassing even the published Qwen3.5-9B leaderboard result (81.7 %).
What Makes Darwin-9B-NEG Different
🧬 Darwin Series – Evolutionary Model Merging
The Darwin family is produced by Darwin V7, an evolutionary breeding engine that recombines two parent LLMs into a single descendant, preserving hybrid vigour across reasoning and knowledge capabilities. Darwin-9B-Opus, this model's base, is the Qwen3.5-family member of the Darwin series, previously published as a stand-alone reasoning model.
⚡ NEG – Native Entropy Gating (Darwin V8)
NEG is a proprietary Darwin technology that gives the language model an architecturally internalised sense of self-confidence. Two tiny learnable modules ride alongside the transformer:
- NEG-Head (≈ 4 M params, ~0.05 % of total weights) predicts, at each step, the entropy of the next-token distribution from the last hidden state.
- NEG-Gate (1 learnable threshold) decides, on a per-token basis, whether the model is "confident enough" to commit to its top choice, or whether it should restrict its choice to a narrow top-k subset.
Because NEG is carried inside the model weights themselves, there is nothing extra to ship or to install: standard transformers loading with trust_remote_code=True attaches the modules automatically. The model file is the feature.
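The gating decision described above can be sketched in a few lines. This is an illustrative reconstruction, not the released implementation: the threshold value, top-k width, and helper names (`neg_gate`, `entropy`) are assumptions; only the mechanism – pass the logits through when predicted entropy is low, restrict to a narrow top-k subset when it is high – comes from the description above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def neg_gate(logits, predicted_entropy, threshold=1.5, top_k=3):
    """Hypothetical sketch of the NEG-Gate decision: when predicted
    entropy exceeds the (learnable) threshold, mask everything outside
    the top-k logits; otherwise pass the base logits through."""
    if predicted_entropy <= threshold:
        return logits                          # confident: no intervention
    keep = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]
    return [x if i in keep else float("-inf")  # uncertain: restrict to top-k
            for i, x in enumerate(logits)]

# A flat (high-entropy) next-token distribution triggers the gate
logits = [2.0, 1.9, 1.8, 1.7, 1.6, 0.1, 0.0]
h = entropy(softmax(logits))
guided = neg_gate(logits, h, threshold=1.5, top_k=3)
print(sum(math.isfinite(x) for x in guided))  # 3 surviving candidates
```

A confident (low-entropy) step would return the logits unchanged, which is why the gate adds essentially no latency on the ~95 % of steps where it stays closed.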
Why it matters
- 1× inference cost – no multi-sample voting, no multi-turn loops
- < 5 % gate activation – negligible latency overhead versus the base model
- +12.63 %p on GPQA Diamond vs. the NEG-free Darwin-9B-Opus baseline (same greedy decoding, same prompt, same tokens)
- Single-file deployment – drops into vLLM / SGLang / TGI / transformers, no new engine required
- No trade-secret leaks – the merge recipe is kept internal; only the final model weights are released under Apache 2.0
🏗️ Architecture Overview
Input Text
     │
[Darwin-9B-Opus backbone (frozen during NEG training)]
     │
Transformer Layers × 32
     │
last hidden state ────┐
     │                │
     ▼                ▼
  LM Head          NEG-Head
     │                │
base logits    predicted entropy
     │                │
     └───▶ NEG-Gate ◀─┘
              │
              ▼
       guided logits
              │
              ▼
        next token
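The NEG-Head branch in the diagram can be illustrated with a stand-in module. The spec table below only states "2-layer MLP with softplus output"; the inner width (512), ReLU activation, and initialisation here are assumptions, so treat this as a sketch of the shape of the component rather than the released weights.

```python
import numpy as np

class NEGHead:
    """Illustrative stand-in for the NEG-Head: a 2-layer MLP whose
    softplus output guarantees a non-negative entropy prediction.
    Inner width and ReLU are assumptions, not released details."""
    def __init__(self, hidden=4096, inner=512, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (hidden, inner))
        self.b1 = np.zeros(inner)
        self.w2 = rng.normal(0.0, 0.02, (inner, 1))
        self.b2 = np.zeros(1)

    def __call__(self, hidden_state):
        x = np.maximum(hidden_state @ self.w1 + self.b1, 0.0)  # assumed ReLU
        z = (x @ self.w2 + self.b2)[0]
        return float(np.log1p(np.exp(z)))                      # softplus >= 0

head = NEGHead()
h = np.random.default_rng(1).normal(size=4096)
print(head(h) >= 0.0)  # True – entropy predictions are always non-negative
```

The softplus output matters because entropy is non-negative by definition; a raw linear head could predict impossible negative entropies.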
Key Specifications
| Component | Value |
|---|---|
| Architecture | Qwen3.5 decoder-only transformer (32 layers, hidden 4096) |
| Total parameters | 8.95 B (base) + ≈ 4 M (NEG modules) |
| NEG-Head | 2-layer MLP with softplus output |
| NEG-Gate | top-k masking gate with learnable entropy threshold |
| Precision | bfloat16 |
| Context length | inherited from Darwin-9B-Opus |
| License | Apache 2.0 |
📊 Benchmark Results – GPQA Diamond (198 PhD-level questions)
Darwin-9B-NEG ships three decoding modes from the same model weights, allowing users to trade inference cost for accuracy:
| Mode | Decoding Protocol | Inference Cost | Accuracy |
|---|---|---|---|
| 0 · Baseline | Darwin-9B-Opus greedy (NEG disabled) | 1× | 51.01 % |
| 1 · Pure NEG | greedy decoding with NEG enabled | 1× | 63.64 % |
| 2 · Permutation | NEG + choice-order permutation (4 orderings, majority) | 4× | 76.26 % |
| 3 · Ensemble Refinement | NEG + permutation + temperature-sampled ensemble | ≈ 20× | 🥇 84.34 % |
Improvements:
- Pure NEG (mode 1) vs. baseline: +12.63 %p at identical inference cost
- Ensemble (mode 3) vs. baseline: +33.33 %p
- Ensemble vs. Qwen3.5-9B leaderboard score (81.7 %): +2.64 %p
Gate activation rate: 4.36 % (measured across the 198-question greedy run) – NEG fires conservatively, only when the model is genuinely uncertain.
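For clarity, the activation rate reported above is simply the fraction of decoding steps on which the gate fired. A minimal sketch of that bookkeeping, with made-up entropy values and an assumed threshold:

```python
def activation_rate(step_entropies, threshold):
    """Fraction of decoding steps whose predicted entropy exceeds the
    gate threshold, i.e. the share of steps where NEG intervenes.
    Trace values and threshold are illustrative, not measured."""
    fired = sum(1 for e in step_entropies if e > threshold)
    return fired / len(step_entropies)

# Toy trace: 2 uncertain steps out of 40 -> 5 % activation
trace = [0.4] * 38 + [2.1, 2.4]
print(round(activation_rate(trace, threshold=1.5), 3))  # 0.05
```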
🚀 Usage
Quick start – Pure NEG greedy (mode 1, the default)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tok = AutoTokenizer.from_pretrained(
"FINAL-Bench/Darwin-9B-NEG",
trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
"FINAL-Bench/Darwin-9B-NEG",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
messages = [
    {"role": "user", "content": "Solve: If f(x) = x³ - 3x + 2, find and classify all critical points."}
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
print(tok.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
Using the bundled NEG loader helper
modeling_darwin_neg.py is shipped inside the repo and provides a convenience loader:
from modeling_darwin_neg import load_darwin_neg
model = load_darwin_neg(
"FINAL-Bench/Darwin-9B-NEG",
hf_token="hf_xxx",
)
Mode selection
- Mode 1 (Pure NEG): the default – do_sample=False, NEG always on.
- Mode 2 (Permutation): shuffle the option order 4 times, decode each ordering greedily, then majority-vote.
- Mode 3 (Ensemble): production protocol combining permutation, temperature sampling and second-opinion re-query (internal; reproduction scripts are released separately).
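Mode 2 can be sketched as follows. This is an assumed reconstruction of the protocol described above (the reproduction scripts are released separately): `answer_fn` is a hypothetical stand-in for one greedy model call that returns the index of the chosen option, and the vote maps each pick back to its original option before tallying.

```python
import random
from collections import Counter

def permutation_vote(question, options, answer_fn, n_orderings=4, seed=0):
    """Sketch of Mode 2 (assumed): present the options in several
    shuffled orders, run one greedy pass per ordering, map each pick
    back to its original identity, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_orderings):
        order = list(range(len(options)))
        rng.shuffle(order)
        shuffled = [options[i] for i in order]
        picked_pos = answer_fn(question, shuffled)  # index into shuffled list
        votes.append(order[picked_pos])             # original option index
    winner, _ = Counter(votes).most_common(1)[0]
    return options[winner]

# Toy "model" that always recognises the correct option text
def toy_model(q, opts):
    return opts.index("blue")

print(permutation_vote("Sky colour?", ["red", "blue", "green", "gold"], toy_model))
```

Shuffling defends against position bias: a model that merely prefers option "A" will scatter its votes across orderings instead of winning the majority.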
🧬 Model Lineage
Qwen/Qwen3.5-9B  +  (Opus-distilled sibling)
         ╲           ╱
    Darwin V7 evolutionary merge
               │
               ▼
Darwin-9B-Opus ── stand-alone reasoning model (Apache 2.0)
               │
               ▼
NEG-Head / NEG-Gate training (Darwin V8)
               │
               ▼
Darwin-9B-NEG ── THIS MODEL
- Base: FINAL-Bench/Darwin-9B-Opus (weights frozen during NEG training)
- Technology generation: Darwin V8 (Native Entropy Gating) – successor to Darwin V7 (evolutionary merging)
🎯 Recommended Use-Cases
- Graduate-level STEM reasoning – physics, chemistry, biology, mathematics (GPQA-style)
- Mathematical problem solving (MATH, AIME-style)
- Code reasoning and debugging (HumanEval-style)
- Complex chain-of-thought tasks where a small reasoning model with a big boost is desired
⚠️ Limitations
- Optimised for English first, with secondary support for Korean / Chinese / Japanese.
- At 8.95 B parameters, knowledge coverage is smaller than that of the larger Darwin models (27B / 31B / 36B) – for pure world-knowledge tasks consider Darwin-36B-Opus.
- The Ensemble mode (84.34 %) uses ≈ 20× inference; choose Pure NEG (mode 1) for cost-sensitive deployments.
📖 Citation
@misc{darwin9b_neg_2026,
title = {Darwin-9B-NEG: Native Entropy Gating for Self-Regulated Reasoning at 1x Inference Cost},
author = {FINAL-Bench / Darwin Research Team},
year = {2026},
howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-9B-NEG}},
note = {Darwin V8 -- Native Entropy Gating technology generation}
}
📚 Related Darwin Models
- Darwin-36B-Opus – MoE 36B, Qwen3.6-35B-A3B × Opus distilled, GPQA 88.4 %
- Darwin-31B-Opus – 31B multilingual-strong reasoning
- Darwin-27B-Opus – 27B dense, GPQA 86.9 %
- Darwin-28B-Opus – Qwen3.6-27B × rico03 Opus distilled (new 2026-04)
- Darwin-9B-Opus – this model's base, Qwen3.5-9B family
- Darwin-4B-Genesis – smallest member, Gemma4 family
Darwin V8 · Sealed 2026-04-24 · FINAL-Bench