---
license: apache-2.0
datasets:
- garg-aayush/sft-cs336-assign5-datasets
language:
- en
- zh
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-Math-1.5B
pipeline_tag: question-answering
library_name: transformers
tags:
- cs336
- assignment
- expert_iteration
- fine-tuned
- math
- reasoning
---

# Model Card for CS336 A5 Expert Iteration Model

## Model Details

This model is a fine-tuned checkpoint of `Qwen/Qwen2.5-Math-1.5B` for mathematical reasoning, trained as part of CS336 Assignment 5 using an Expert Iteration (EI) pipeline.

The model targets math question answering and reasoning-style generation. Training starts from the base Qwen2.5-Math model and iteratively improves the policy by generating candidate responses, scoring them, and using the selected trajectories for supervised fine-tuning.

This is a course project artifact intended for experimentation, evaluation, and analysis rather than production deployment.

### Model Sources

- **Base model:** `Qwen/Qwen2.5-Math-1.5B`
- **Model repository:** `rahulrao493/cs336-a5-ei-model`
- **Code repository:** `https://github.com/VictorHuu/assignment5-alignment`

## Intended Use

This model is intended for:

- experimentation with reasoning-oriented fine-tuning,
- educational use in the context of CS336,
- analysis of Expert Iteration training dynamics,
- evaluation on math-style question answering tasks.

It is not intended for high-stakes use, production deployment, or use as a verified mathematical reasoning system.

## How to Use

Example usage with `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "rahulrao493/cs336-a5-ei-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve: 12 * 13 ="
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Base Model

The model is initialized from:

- `Qwen/Qwen2.5-Math-1.5B`

Qwen2.5-Math-1.5B is a math-specialized base model. Its model page states that it mainly supports solving English and Chinese math problems.

### Training Data

The model was trained using the CS336 Assignment 5 dataset repository:

- `garg-aayush/sft-cs336-assign5-datasets`

Files used in this project:

- **Training file:** `sft-reason/sft_gpt-oss-120b_filtered.jsonl`
- **Validation file:** `sft-reason/val.jsonl`

The training task focuses on mathematical reasoning: examples consist of math problems together with reasoning-style responses and grading signals.

### Training Procedure

The model was trained with an Expert Iteration (EI) pipeline. At each EI step, the current model generates candidate trajectories, the trajectories are scored, and the selected responses are used for supervised fine-tuning updates.

For the uploaded experiment, the run configuration is:

- **Run tag:** `g8_ep2_b2048_r8`
- **Rollout count:** 8
- **SFT epochs per EI step:** 2
- **EI batch size:** 2048

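The generate/score/keep loop above can be sketched as follows. Note that `generate_rollouts` and `score` are hypothetical stand-ins for the real sampler and grader, not the assignment's actual API:

```python
import random

# Toy stand-ins for the real sampler and grader (illustrative assumptions,
# not the assignment's actual API).
def generate_rollouts(question, n_rollouts, temperature=0.7):
    # The real pipeline samples reasoning traces from the current policy.
    return [f"{question} -> candidate {i}" for i in range(n_rollouts)]

def score(question, response):
    # The real pipeline grades each trace (e.g. answer correctness, format).
    return random.random()

def expert_iteration_step(questions, n_rollouts=8, min_keep=1):
    """One EI step: generate candidates, score them, keep the best for SFT."""
    sft_data = []
    for q in questions:
        rollouts = generate_rollouts(q, n_rollouts)
        ranked = sorted(rollouts, key=lambda r: score(q, r), reverse=True)
        # Keep at least `min_keep` trajectories per question (here: the top ones).
        sft_data.extend((q, r) for r in ranked[:min_keep])
    return sft_data  # fed to the supervised fine-tuning update

pairs = expert_iteration_step(["2 + 2 = ?", "12 * 13 = ?"])
print(len(pairs))  # 2: one kept trajectory per question
```

The selected pairs are then used for the SFT epochs of that EI step, and the loop repeats with the updated model.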
### Training Hyperparameters

- **Training method:** Expert Iteration (EI)
- **Seed:** 42
- **Batch size:** 4
- **Gradient accumulation steps:** 8
- **SFT epochs:** 2
- **Max iterations per EI step:** 1024
- **Number of EI steps:** 5
- **Minimum kept trajectories per question:** 1
- **Learning rate (max):** 1e-5
- **Learning rate (min):** 1e-5
- **Weight decay:** 0.01
- **Warmup steps:** 20
- **LoRA:** disabled

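With a batch size of 4 and 8 gradient-accumulation steps, each optimizer update effectively sees 32 examples. A minimal numeric sketch of why accumulation matches a single large batch (illustrative only, not the training code):

```python
micro_batch_size = 4
accum_steps = 8

def loss_grad(example):
    # Toy per-example gradient; the real code backprops through the model.
    return example * 0.1

examples = list(range(micro_batch_size * accum_steps))  # one effective batch of 32

accumulated = 0.0
for step in range(accum_steps):
    micro = examples[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Average within the micro-batch, then scale by 1/accum_steps so the
    # accumulated gradient matches a single pass over all 32 examples.
    accumulated += sum(loss_grad(x) for x in micro) / micro_batch_size / accum_steps

full_batch = sum(loss_grad(x) for x in examples) / len(examples)
print(abs(accumulated - full_batch) < 1e-9)  # True
```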
### Rollout Configuration

- **Temperature:** 0.7
- **Max rollout tokens:** 384
- **Min rollout tokens:** 4

## Evaluation

### Evaluation Data

The model was evaluated on the validation split from the CS336 Assignment 5 dataset repository:

- `sft-reason/val.jsonl`

### Evaluation Configuration

Evaluation used deterministic (greedy) decoding with the following settings:

- **Temperature:** 0.0
- **Max generation tokens:** 512
- **Min generation tokens:** 4
- **Max model length:** 1024

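These settings map roughly onto `transformers` generation arguments as shown below. The actual evaluation harness may use a different backend or parameter names, so treat this mapping as an assumption:

```python
# Temperature 0.0 corresponds to greedy decoding, i.e. do_sample=False;
# the min/max token bounds map onto min_new_tokens / max_new_tokens.
eval_generation_config = {
    "do_sample": False,     # temperature 0.0 -> greedy decoding
    "max_new_tokens": 512,  # max generation tokens
    "min_new_tokens": 4,    # min generation tokens
}
max_model_len = 1024        # cap on prompt + generated tokens

print(eval_generation_config["do_sample"])  # False
```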
### Metrics

The primary task-level metric is:

- **Accuracy**

In addition, training dynamics were analyzed using diagnostic metrics, including:

- **Format reward**
- **Average reward**
- **Average token entropy**

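As a minimal stdlib sketch of what average token entropy measures, assuming the standard Shannon entropy over the model's per-step next-token distributions:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def average_token_entropy(step_distributions):
    """Mean entropy across the per-step distributions of a generated response."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# A peaked (confident) distribution has low entropy; a flat one has high entropy.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(average_token_entropy([peaked]) < average_token_entropy([flat]))  # True
```

A falling average token entropy, as reported below, therefore indicates increasingly peaked output distributions, not necessarily more correct answers.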
### Results

<table>
  <tr>
    <td align="center">
      <img src="https://cdn-uploads.huggingface.co/production/uploads/69a97fd5cf51f3cc27e769d6/Y0Ds_VidzKsxrwKM8mH_b.png" alt="format_reward_ei" width="100%" />
      <br />
      <em>Format reward over training</em>
    </td>
    <td align="center">
      <img src="https://cdn-uploads.huggingface.co/production/uploads/69a97fd5cf51f3cc27e769d6/rOpXKm-XpNqXvLTlO4Q-j.png" alt="reward_ei" width="100%" />
      <br />
      <em>Average reward over training</em>
    </td>
  </tr>
</table>

For the `g8_ep2_b2048_r8` run, training showed strong improvement in formatting behavior and reward-based metrics over the EI steps.

Observed trends from the logged evaluation curves:

- **Format reward** increased from approximately **0.0** at the start of training to about **0.9** by the end.
- **Average reward** increased from approximately **0.0** to about **0.15**.
- **Average token entropy** decreased from roughly **0.31** to about **0.08**.

These trends indicate that the model became more format-consistent and more confident in its token-level predictions over training.

Across multiple EI configurations, entropy curves tended to converge to a similar low range late in training, while reward and format-learning dynamics differed more noticeably. This suggests that the tested hyperparameter variations affected optimization dynamics more strongly than the final entropy level of the learned response distribution.

Entropy should not be treated as a direct proxy for mathematical correctness: a model can produce lower-entropy responses while still reaching incorrect conclusions, so entropy is best interpreted alongside task-level accuracy and reward.

### Summary of Findings

- The uploaded model is an EI fine-tuned version of `Qwen/Qwen2.5-Math-1.5B`.
- The uploaded experiment corresponds to `g8_ep2_b2048_r8`.
- Format reward improved substantially during training.
- Average reward also improved over training.
- Average token entropy decreased sharply, indicating increasingly peaked output distributions.
- Different EI settings showed more variation in convergence speed than in final entropy.