Valendra Qwen3.5-4B Demon Angel (Experimental model)

Valendra Qwen3.5-4B Demon Angel is a merged model, created by applying the LoRA adapter trained in this repository to the Qwen/Qwen3.5-4B base model. The name is deliberately literal: it reflects the model's core internal opposition between a demon that attacks weak reasoning and an angel that proposes the answer.

Overview

This model was trained to internalize a structured self-debate pattern before emitting a visible answer.

  • An angel proposes a solution.
  • A demon attacks weak assumptions, blind spots, and overconfidence.
  • A judge synthesizes the outcome and chooses the final stance.

The intent is not to expose chain-of-thought in production. The intent is to make the visible answer stronger by forcing internal critique and synthesis first.
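As a rough illustration of that three-role pattern, the internal block can be pictured as a structured template followed by the visible answer. This is a sketch only: the tag names and layout below are assumptions for illustration and may not match the format actually used in training.

```python
# Hypothetical internal debate layout. The <debate>/<angel>/<demon>/<judge>
# tag names are assumptions, not the model's confirmed trained format.
INTERNAL_TEMPLATE = (
    "<debate>\n"
    "<angel>{angel}</angel>\n"
    "<demon>{demon}</demon>\n"
    "<judge>{judge}</judge>\n"
    "</debate>\n"
    "{answer}"
)

sample = INTERNAL_TEMPLATE.format(
    angel="Propose: the answer is 42, because the constraint implies ...",
    demon="Attack: the proposal assumes the constraint holds without checking edge cases.",
    judge="Synthesis: the objection is minor; the proposal stands.",
    answer="42",
)
# Everything before the final line is internal; only "{answer}" is served.
```

The key property is that the debate block is a prefix that can be cleanly separated from the answer at serving time.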

Relation to SDRL

This model is aligned in spirit with "Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning" (arXiv:2601.22297v1).

It is not a reproduction of SDRL. Instead, it follows the same broad intuition inside this repository's own stack: a single model should improve when it learns to work across multiple reasoning trajectories instead of solving every prompt in isolation.

Details

  • Base model: Qwen/Qwen3.5-4B
  • Suggested repo: valendra/qwen3.5-4b-demon-angel
  • Training flow: LoRA SFT, then GRPO-style reinforcement learning, then local merge
  • Internal format: a single block with angel, demon, and judge roles
  • Serving goal: expose only the visible answer after the internal reasoning block
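Under that serving goal, the post-processing step could look like the following minimal sketch. It assumes a hypothetical `<debate> ... </debate>` wrapper around the internal block; the actual delimiter used by the model may differ.

```python
import re


def extract_visible_answer(raw_output: str) -> str:
    """Strip a leading internal reasoning block so only the visible
    answer is served. The <debate> tag name is an assumption; adjust
    it to whatever delimiter the model actually emits."""
    return re.sub(r"(?s)^.*?</debate>\s*", "", raw_output, count=1)
```

If no internal block is present, the function returns the output unchanged, which makes it safe to apply unconditionally in a serving pipeline.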

Intended Use

Use this model for experiments where you want stronger internal critique and synthesis than a plain instruction-tuned baseline, while still serving only a final answer.

Limitations

  • This model was trained with synthetic and programmatic supervision, so it should be validated on real downstream prompts before production use.
  • It is designed around a learned internal debate format, not around unrestricted free-form reasoning traces.
  • This model card describes the merged artifact produced in this repository. It does not claim benchmark parity with SDRL or paper-level reproduction.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "valendra/qwen3.5-4b-demon-angel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt and generate a completion.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Why is the sky blue?"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))