Model Card for CS336 A5 Expert Iteration Model
Model Details
This model is a fine-tuned checkpoint based on Qwen/Qwen2.5-Math-1.5B for mathematical reasoning. It was trained as part of CS336 Assignment 5 using an Expert Iteration (EI) style pipeline.
The model is designed for math question answering and reasoning-style generation. Training starts from the base Qwen2.5-Math model and iteratively improves the policy by generating candidate responses, scoring them, and using selected trajectories for supervised fine-tuning.
This is a course project artifact intended for experimentation, evaluation, and analysis rather than production deployment.
Model Sources
- Base model: Qwen/Qwen2.5-Math-1.5B
- Model repository: rahulrao493/cs336-a5-ei-model
- Code repository: https://github.com/VictorHuu/assignment5-alignment
Intended Use
This model is intended for:
- experimentation with reasoning-oriented fine-tuning,
- educational use in the context of CS336,
- analysis of Expert Iteration training dynamics,
- evaluation on math-style question answering tasks.
This model is not intended for high-stakes use, production deployment, or as a verified mathematical reasoning system.
How to Use
Example usage with transformers:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "rahulrao493/cs336-a5-ei-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve: 12 * 13 ="
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy (deterministic) decoding
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
Base Model
The model is initialized from:
Qwen/Qwen2.5-Math-1.5B
Qwen2.5-Math-1.5B is a math-specialized base model. Its model page states that it mainly supports solving English and Chinese math problems.
Training Data
The model was trained using the CS336 Assignment 5 dataset repository:
garg-aayush/sft-cs336-assign5-datasets
Files used in this project:
- Training file: sft-reason/sft_gpt-oss-120b_filtered.jsonl
- Validation file: sft-reason/val.jsonl
The training task focuses on mathematical reasoning, where examples consist of math problems together with reasoning-style responses and grading signals.
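To make the data format concrete, here is a hypothetical sketch of what one JSONL training record might look like. The field names (`prompt`, `response`, `reward`) and the tag-based response format are assumptions for illustration; the actual schema in the dataset repository may differ.

```python
import json

# Hypothetical training record; field names are illustrative, not the
# actual schema of sft-reason/sft_gpt-oss-120b_filtered.jsonl.
record = {
    "prompt": "What is 12 * 13? Show your reasoning.",
    "response": "<think>12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156</think>"
                "<answer>156</answer>",
    "reward": 1.0,  # grading signal: 1.0 if the final answer is correct
}

line = json.dumps(record)   # JSONL stores one JSON object per line
parsed = json.loads(line)
```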
Training Procedure
The model was trained with an Expert Iteration (EI) style pipeline. At each EI step, the current model generates candidate trajectories, those trajectories are scored, and the selected responses are used for supervised fine-tuning (SFT) updates.
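The generate-score-select loop described above can be sketched as follows. This is a toy illustration, not the assignment's actual code: the policy, rollout sampler, and binary exact-match scorer are all hypothetical stand-ins, and the real pipeline would run SFT on the kept trajectories.

```python
import random

random.seed(42)

def generate_rollouts(policy, question, n=8):
    """Hypothetical stand-in for sampling n candidate answers from the model."""
    return [policy(question) + random.choice(["", " (wrong)"]) for _ in range(n)]

def score(answer, reference):
    """Toy binary reward: 1.0 if the candidate matches the reference answer."""
    return 1.0 if answer == reference else 0.0

def expert_iteration_step(policy, dataset, n_rollouts=8, min_keep=1):
    """One EI step: generate candidates, score them, keep correct trajectories."""
    sft_examples = []
    for question, reference in dataset:
        rollouts = generate_rollouts(policy, question, n=n_rollouts)
        kept = [r for r in rollouts if score(r, reference) > 0.0]
        if len(kept) >= min_keep:  # cf. "minimum kept trajectories per question"
            sft_examples.extend((question, r) for r in kept)
    # In the real pipeline, sft_examples would now drive an SFT update.
    return sft_examples

toy_policy = lambda q: "156"
examples = expert_iteration_step(toy_policy, [("12 * 13 =", "156")])
```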
For the uploaded experiment, the run configuration is:
- Run tag: g8_ep2_b2048_r8
- Rollout count: 8
- SFT epochs per EI step: 2
- EI batch size: 2048
Training Hyperparameters
- Training method: Expert Iteration (EI)
- Seed: 42
- Batch size: 4
- Gradient accumulation steps: 8
- SFT epochs: 2
- Max iterations per EI step: 1024
- Number of EI steps: 5
- Minimum kept trajectories per question: 1
- Learning rate (max): 1e-5
- Learning rate (min): 1e-5
- Weight decay: 0.01
- Warmup steps: 20
- LoRA: disabled
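Since the maximum and minimum learning rates are both 1e-5, the schedule is effectively constant after warmup. A minimal sketch of such a schedule, assuming linear warmup over the 20 warmup steps (the exact warmup shape used by the assignment code is an assumption):

```python
def learning_rate(step, max_lr=1e-5, min_lr=1e-5, warmup_steps=20):
    """Linear warmup to max_lr; because min_lr == max_lr in this run,
    any decay between them is flat, i.e. the rate stays constant."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return min_lr
```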
Rollout Configuration
- Temperature: 0.7
- Max rollout tokens: 384
- Min rollout tokens: 4
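The rollout settings above map naturally onto standard generation arguments. The mapping below onto Hugging Face transformers `generate()` keyword arguments is an assumption for illustration; the assignment code may use a different generation backend (e.g. vLLM) with equivalent parameters.

```python
# Hypothetical mapping of the rollout configuration onto transformers
# generate() arguments; the actual backend used in training may differ.
rollout_kwargs = dict(
    do_sample=True,        # stochastic sampling for diverse rollouts
    temperature=0.7,       # "Temperature"
    max_new_tokens=384,    # "Max rollout tokens"
    min_new_tokens=4,      # "Min rollout tokens"
)
```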
Evaluation
Evaluation Data
The model was evaluated on the validation split from the CS336 Assignment 5 dataset repository:
sft-reason/val.jsonl
Evaluation Configuration
Evaluation used deterministic decoding with the following settings:
- Temperature: 0.0
- Max generation tokens: 512
- Min generation tokens: 4
- Max model length: 1024
Metrics
The primary task-level metric is:
- Accuracy
In addition, training dynamics were analyzed using diagnostic metrics including:
- Format reward
- Average reward
- Average token entropy
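For reference, average token entropy is the mean Shannon entropy of the model's next-token distributions over a generated sequence; low values indicate peaked, confident predictions. A simplified stdlib illustration (the actual metric is computed from model logits over the full vocabulary):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def average_token_entropy(distributions):
    """Mean per-token entropy over a sequence, as tracked during training."""
    return sum(token_entropy(p) for p in distributions) / len(distributions)

peaked = [0.97, 0.01, 0.01, 0.01]  # confident prediction -> low entropy
flat = [0.25, 0.25, 0.25, 0.25]    # uncertain prediction -> high entropy
```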
Results
Figures (logged curves, not reproduced here): format reward over training; average reward over training.
Observed trends from logged evaluation curves:
- Format reward increased from approximately 0.0 at the start of training to about 0.9 by the end of training.
- Average reward increased from approximately 0.0 to about 0.15.
- Average token entropy decreased from roughly 0.31 to about 0.08.
These trends indicate that the model became more format-consistent and more confident in its token-level predictions over training.
Across multiple EI configurations, entropy curves tended to converge to a similar low range later in training, while reward and format learning dynamics differed more noticeably. This suggests that the tested hyperparameter variations affected optimization dynamics more strongly than the final entropy level of the learned response distribution.
Entropy should not be treated as a direct proxy for mathematical correctness. A model can produce lower-entropy responses while still making incorrect conclusions, so entropy is best interpreted alongside task-level accuracy and reward.
Summary of Findings
- The uploaded model is an EI fine-tuned version of Qwen/Qwen2.5-Math-1.5B.
- The uploaded experiment corresponds to g8_ep2_b2048_r8.
- Format reward improved substantially during training.
- Average reward also improved over training.
- Average token entropy decreased sharply, indicating increasingly peaked output distributions.
- Different EI settings showed more variation in convergence speed than in final entropy.