Model Card for CS336 A5 Expert Iteration Model

Model Details

This model is a fine-tuned checkpoint based on Qwen/Qwen2.5-Math-1.5B for mathematical reasoning. It was trained as part of CS336 Assignment 5 using an Expert Iteration (EI) style pipeline.

The model is designed for math question answering and reasoning-style generation. Training starts from the base Qwen2.5-Math model and iteratively improves the policy by generating candidate responses, scoring them, and using selected trajectories for supervised fine-tuning.

This is a course project artifact intended for experimentation, evaluation, and analysis rather than production deployment.

Model Sources

  • Base model: Qwen/Qwen2.5-Math-1.5B
  • Model repository: rahulrao493/cs336-a5-ei-model
  • Code repository: https://github.com/VictorHuu/assignment5-alignment

Intended Use

This model is intended for:

  • experimentation with reasoning-oriented fine-tuning,
  • educational use in the context of CS336,
  • analysis of Expert Iteration training dynamics,
  • evaluation on math-style question answering tasks.

This model is not intended for high-stakes use, production deployment, or as a verified mathematical reasoning system.

How to Use

Example usage with transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "rahulrao493/cs336-a5-ei-model"

# Load the fine-tuned checkpoint and its tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve: 12 * 13 ="
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) mirrors the deterministic
# evaluation setup described in this card.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Base Model

The model is initialized from:

  • Qwen/Qwen2.5-Math-1.5B

Qwen2.5-Math-1.5B is a math-specialized base model. Its model page states that it mainly supports solving English and Chinese math problems.

Training Data

The model was trained using the CS336 Assignment 5 dataset repository:

  • garg-aayush/sft-cs336-assign5-datasets

Files used in this project:

  • Training file: sft-reason/sft_gpt-oss-120b_filtered.jsonl
  • Validation file: sft-reason/val.jsonl

The training task focuses on mathematical reasoning, where examples consist of math problems together with reasoning-style responses and grading signals.
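As a rough illustration, each JSONL line can be read as a single JSON object. The field names below (`problem`, `response`, `reward`) are assumptions for illustration only and may not match the actual schema of the sft-cs336-assign5-datasets files:

```python
import json

# Hypothetical example record; field names are illustrative only and
# may not match the dataset's actual schema.
line = '{"problem": "Solve: 12 * 13 =", "response": "12 * 13 = 156", "reward": 1.0}'
record = json.loads(line)
print(record["problem"], "->", record["reward"])
```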

Training Procedure

The model was trained with an Expert Iteration style pipeline. At each EI step, the current model generates candidate trajectories for each question, the trajectories are scored, and the selected responses are used for supervised fine-tuning updates.

For the uploaded experiment, the run configuration is:

  • Run tag: g8_ep2_b2048_r8
  • Rollout count: 8
  • SFT epochs per EI step: 2
  • EI batch size: 2048
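The generate-score-filter loop of one EI step can be sketched in miniature. This is a toy sketch rather than the assignment's actual implementation: `generate` and `score` are stand-ins for model rollouts and answer grading, and the filtering rule (keep positively rewarded trajectories, with at least one kept per question) mirrors the configuration described in this card:

```python
import random

def expert_iteration_step(questions, generate, score, rollouts=8):
    """One toy EI step: sample rollouts per question, score them,
    and keep positively rewarded trajectories (min 1 per question)
    as the SFT dataset for the next update."""
    sft_data = []
    for q in questions:
        candidates = [generate(q) for _ in range(rollouts)]
        scored = [(resp, score(q, resp)) for resp in candidates]
        kept = [(q, resp) for resp, r in scored if r > 0]
        if not kept:  # always keep at least one trajectory per question
            best = max(scored, key=lambda pair: pair[1])
            kept = [(q, best[0])]
        sft_data.extend(kept)
    return sft_data

# Toy demonstration with a random "model" and an exact-match grader.
random.seed(42)
answers = {"2+2": "4", "3*3": "9"}
gen = lambda q: random.choice([answers[q], "wrong"])
grade = lambda q, resp: 1.0 if resp == answers[q] else 0.0
data = expert_iteration_step(list(answers), gen, grade, rollouts=8)
print(len(data), "trajectories kept")
```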

Training Hyperparameters

  • Training method: Expert Iteration (EI)
  • Seed: 42
  • Batch size: 4
  • Gradient accumulation steps: 8
  • SFT epochs: 2
  • Max iterations per EI step: 1024
  • Number of EI steps: 5
  • Minimum kept trajectories per question: 1
  • Learning rate (max): 1e-5
  • Learning rate (min): 1e-5
  • Weight decay: 0.01
  • Warmup steps: 20
  • LoRA: disabled
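With a per-device batch size of 4 and 8 gradient accumulation steps, the effective SFT batch size is 4 × 8 = 32 examples per optimizer update. Because the maximum and minimum learning rates are equal, the schedule reduces to a linear warmup followed by a constant rate. A minimal sketch of that schedule, reconstructed from the listed hyperparameters rather than taken from the assignment's code:

```python
MAX_LR = 1e-5
WARMUP_STEPS = 20

def learning_rate(step: int) -> float:
    """Linear warmup to MAX_LR over WARMUP_STEPS, then constant
    (max LR == min LR, so there is no decay phase)."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    return MAX_LR

effective_batch = 4 * 8  # per-device batch size x grad accumulation steps
print(effective_batch, learning_rate(10), learning_rate(500))
```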

Rollout Configuration

  • Temperature: 0.7
  • Max rollout tokens: 384
  • Min rollout tokens: 4

Evaluation

Evaluation Data

The model was evaluated on the validation split from the CS336 Assignment 5 dataset repository:

  • sft-reason/val.jsonl

Evaluation Configuration

Evaluation used deterministic decoding with the following settings:

  • Temperature: 0.0
  • Max generation tokens: 512
  • Min generation tokens: 4
  • Max model length: 1024
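Temperature 0.0 corresponds to greedy decoding: instead of dividing logits by the temperature and sampling (which would divide by zero), the decoder takes the argmax token at each step. A toy illustration of the two modes, not the actual evaluation code:

```python
import math
import random

def sample_token(logits, temperature):
    """Pick a token id from logits: greedy argmax at temperature 0,
    otherwise sample from the temperature-scaled softmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(logits) - 1

logits = [0.1, 2.5, -1.0]
print(sample_token(logits, 0.0))  # greedy: index of the largest logit
```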

Metrics

The primary task-level metric is:

  • Accuracy

In addition, training dynamics were analyzed using diagnostic metrics including:

  • Format reward
  • Average reward
  • Average token entropy
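Average token entropy here refers to the mean entropy of the model's next-token distributions over generated tokens; lower values mean more peaked, more confident predictions. A minimal sketch of the computation, assuming natural-log entropy averaged over per-step probability distributions (the assignment's exact definition may differ):

```python
import math

def average_token_entropy(step_probs):
    """Mean Shannon entropy (in nats) across per-token probability
    distributions, e.g. one softmax distribution per generated token."""
    entropies = []
    for probs in step_probs:
        h = -sum(p * math.log(p) for p in probs if p > 0)
        entropies.append(h)
    return sum(entropies) / len(entropies)

# A uniform 2-way distribution has entropy ln(2) ~= 0.693,
# while a near-deterministic distribution has entropy near 0.
dists = [[0.5, 0.5], [0.99, 0.01]]
print(round(average_token_entropy(dists), 3))
```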

Results

[Figures: format_reward_ei (format reward over training) and reward_ei (average reward over training).]

For the `g8_ep2_b2048_r8` run, training showed strong improvement in formatting behavior and reward-based metrics over EI steps.

Observed trends from logged evaluation curves:

  • Format reward increased from approximately 0.0 at the start of training to about 0.9 by the end of training.
  • Average reward increased from approximately 0.0 to about 0.15.
  • Average token entropy decreased from roughly 0.31 to about 0.08.

These trends indicate that the model became more format-consistent and more confident in its token-level predictions over training.

Across multiple EI configurations, entropy curves tended to converge to a similar low range later in training, while reward and format learning dynamics differed more noticeably. This suggests that the tested hyperparameter variations affected optimization dynamics more strongly than the final entropy level of the learned response distribution.

Entropy should not be treated as a direct proxy for mathematical correctness. A model can produce lower-entropy responses while still making incorrect conclusions, so entropy is best interpreted alongside task-level accuracy and reward.

Summary of Findings

  • The uploaded model is an EI fine-tuned version of Qwen/Qwen2.5-Math-1.5B.
  • The uploaded experiment corresponds to g8_ep2_b2048_r8.
  • Format reward improved substantially during training.
  • Average reward also improved over training.
  • Average token entropy decreased sharply, indicating increasingly peaked output distributions.
  • Different EI settings showed more variation in convergence speed than in final entropy.