---
license: apache-2.0
datasets:
- garg-aayush/sft-cs336-assign5-datasets
language:
- en
- zh
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-Math-1.5B
pipeline_tag: question-answering
library_name: transformers
tags:
- cs336
- assignment
- expert_iteration
- fine-tuned
- math
- reasoning
---

# Model Card for CS336 A5 Expert Iteration Model

## Model Details

This model is a fine-tuned checkpoint of `Qwen/Qwen2.5-Math-1.5B` for mathematical reasoning, trained as part of CS336 Assignment 5 using an Expert Iteration (EI) pipeline.

The model targets math question answering and reasoning-style generation. Training starts from the base Qwen2.5-Math model and iteratively improves the policy by generating candidate responses, scoring them, and using the selected trajectories for supervised fine-tuning.

This is a course project artifact intended for experimentation, evaluation, and analysis rather than production deployment.

### Model Sources

- **Base model:** `Qwen/Qwen2.5-Math-1.5B`
- **Model repository:** `rahulrao493/cs336-a5-ei-model`
- **Code repository:** `https://github.com/VictorHuu/assignment5-alignment`

## Intended Use

This model is intended for:

- experimentation with reasoning-oriented fine-tuning,
- educational use in the context of CS336,
- analysis of Expert Iteration training dynamics,
- evaluation on math-style question answering tasks.

It is not intended for high-stakes use, production deployment, or use as a verified mathematical reasoning system.

## How to Use

Example usage with `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "rahulrao493/cs336-a5-ei-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Solve: 12 * 13 ="
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps the output deterministic.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Base Model

The model is initialized from:

- `Qwen/Qwen2.5-Math-1.5B`

Qwen2.5-Math-1.5B is a math-specialized base model. Its model page states that it mainly supports solving English and Chinese math problems.

### Training Data

The model was trained using the CS336 Assignment 5 dataset repository:

- `garg-aayush/sft-cs336-assign5-datasets`

Files used in this project:

- **Training file:** `sft-reason/sft_gpt-oss-120b_filtered.jsonl`
- **Validation file:** `sft-reason/val.jsonl`

The training task focuses on mathematical reasoning: examples consist of math problems together with reasoning-style responses and grading signals.

### Training Procedure

The model was trained with an Expert Iteration (EI) pipeline. At each EI step, the current model generates candidate trajectories, the trajectories are scored, and the selected responses are used for supervised fine-tuning updates.

For the uploaded experiment, the run configuration is:

- **Run tag:** `g8_ep2_b2048_r8`
- **Rollout count:** 8
- **SFT epochs per EI step:** 2
- **EI batch size:** 2048

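The generate/score/keep loop above can be sketched as follows. Note that `generate_rollouts` and `score` are hypothetical stand-ins for the real sampler and grader, not the assignment's actual API:

```python
import random

# Toy stand-ins for the real sampler and grader (illustrative assumptions,
# not the assignment's actual API).
def generate_rollouts(question, n_rollouts, temperature=0.7):
    # The real pipeline samples reasoning traces from the current policy.
    return [f"{question} -> candidate {i}" for i in range(n_rollouts)]

def score(question, response):
    # The real pipeline grades each trace (e.g. answer correctness, format).
    return random.random()

def expert_iteration_step(questions, n_rollouts=8, min_keep=1):
    """One EI step: generate candidates, score them, keep the best for SFT."""
    sft_data = []
    for q in questions:
        rollouts = generate_rollouts(q, n_rollouts)
        ranked = sorted(rollouts, key=lambda r: score(q, r), reverse=True)
        # Keep at least `min_keep` trajectories per question (here: the top ones).
        sft_data.extend((q, r) for r in ranked[:min_keep])
    return sft_data  # fed to the supervised fine-tuning update

pairs = expert_iteration_step(["2 + 2 = ?", "12 * 13 = ?"])
print(len(pairs))  # 2: one kept trajectory per question
```

The selected pairs are then used for the SFT epochs of that EI step, and the loop repeats with the updated model.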
### Training Hyperparameters

- **Training method:** Expert Iteration (EI)
- **Seed:** 42
- **Batch size:** 4
- **Gradient accumulation steps:** 8
- **SFT epochs:** 2
- **Max iterations per EI step:** 1024
- **Number of EI steps:** 5
- **Minimum kept trajectories per question:** 1
- **Learning rate (max):** 1e-5
- **Learning rate (min):** 1e-5
- **Weight decay:** 0.01
- **Warmup steps:** 20
- **LoRA:** disabled

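With a batch size of 4 and 8 gradient-accumulation steps, each optimizer update effectively sees 32 examples. A minimal numeric sketch of why accumulation matches a single large batch (illustrative only, not the training code):

```python
micro_batch_size = 4
accum_steps = 8

def loss_grad(example):
    # Toy per-example gradient; the real code backprops through the model.
    return example * 0.1

examples = list(range(micro_batch_size * accum_steps))  # one effective batch of 32

accumulated = 0.0
for step in range(accum_steps):
    micro = examples[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Average within the micro-batch, then scale by 1/accum_steps so the
    # accumulated gradient matches a single pass over all 32 examples.
    accumulated += sum(loss_grad(x) for x in micro) / micro_batch_size / accum_steps

full_batch = sum(loss_grad(x) for x in examples) / len(examples)
print(abs(accumulated - full_batch) < 1e-9)  # True
```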
### Rollout Configuration

- **Temperature:** 0.7
- **Max rollout tokens:** 384
- **Min rollout tokens:** 4

## Evaluation

### Evaluation Data

The model was evaluated on the validation split from the CS336 Assignment 5 dataset repository:

- `sft-reason/val.jsonl`

### Evaluation Configuration

Evaluation used deterministic (greedy) decoding with the following settings:

- **Temperature:** 0.0
- **Max generation tokens:** 512
- **Min generation tokens:** 4
- **Max model length:** 1024

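These settings map roughly onto `transformers` generation arguments as shown below. The actual evaluation harness may use a different backend or parameter names, so treat this mapping as an assumption:

```python
# Temperature 0.0 corresponds to greedy decoding, i.e. do_sample=False;
# the min/max token bounds map onto min_new_tokens / max_new_tokens.
eval_generation_config = {
    "do_sample": False,     # temperature 0.0 -> greedy decoding
    "max_new_tokens": 512,  # max generation tokens
    "min_new_tokens": 4,    # min generation tokens
}
max_model_len = 1024        # cap on prompt + generated tokens

print(eval_generation_config["do_sample"])  # False
```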
### Metrics

The primary task-level metric is:

- **Accuracy**

In addition, training dynamics were analyzed using diagnostic metrics, including:

- **Format reward**
- **Average reward**
- **Average token entropy**

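As a minimal stdlib sketch of what average token entropy measures, assuming the standard Shannon entropy over the model's per-step next-token distributions:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def average_token_entropy(step_distributions):
    """Mean entropy across the per-step distributions of a generated response."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# A peaked (confident) distribution has low entropy; a flat one has high entropy.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
print(average_token_entropy([peaked]) < average_token_entropy([flat]))  # True
```

A falling average token entropy, as reported below, therefore indicates increasingly peaked output distributions, not necessarily more correct answers.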
### Results

<table>
  <tr>
    <td align="center">
      <img src="https://cdn-uploads.huggingface.co/production/uploads/69a97fd5cf51f3cc27e769d6/Y0Ds_VidzKsxrwKM8mH_b.png" alt="format_reward_ei" width="100%" />
      <br />
      <em>Format reward over training</em>
    </td>
    <td align="center">
      <img src="https://cdn-uploads.huggingface.co/production/uploads/69a97fd5cf51f3cc27e769d6/rOpXKm-XpNqXvLTlO4Q-j.png" alt="reward_ei" width="100%" />
      <br />
      <em>Average reward over training</em>
    </td>
  </tr>
</table>

For the `g8_ep2_b2048_r8` run, training showed strong improvement in formatting behavior and reward-based metrics over the EI steps.

Observed trends from the logged evaluation curves:

- **Format reward** increased from approximately **0.0** at the start of training to about **0.9** by the end.
- **Average reward** increased from approximately **0.0** to about **0.15**.
- **Average token entropy** decreased from roughly **0.31** to about **0.08**.

These trends indicate that the model became more format-consistent and more confident in its token-level predictions over training.

Across multiple EI configurations, entropy curves tended to converge to a similar low range late in training, while reward and format-learning dynamics differed more noticeably. This suggests that the tested hyperparameter variations affected optimization dynamics more strongly than the final entropy level of the learned response distribution.

Entropy should not be treated as a direct proxy for mathematical correctness: a model can produce lower-entropy responses while still reaching incorrect conclusions, so entropy is best interpreted alongside task-level accuracy and reward.

### Summary of Findings

- The uploaded model is an EI fine-tuned version of `Qwen/Qwen2.5-Math-1.5B`.
- The uploaded experiment corresponds to `g8_ep2_b2048_r8`.
- Format reward improved substantially during training.
- Average reward also improved over training.
- Average token entropy decreased sharply, indicating increasingly peaked output distributions.
- Different EI settings showed more variation in convergence speed than in final entropy.