SmolLM2-360M - DPO
Model Description
This model is a Merged Standalone Model fine-tuned from HuggingFaceTB/SmolLM2-360M using the DPO training method.
Direct Preference Optimization (DPO) - paired preference optimization that learns directly from chosen/rejected response pairs without a separate reward model
This model was developed as part of thesis research on LLM Alignment using Preference Optimization Methods.
Model Details
| Property | Value |
|---|---|
| Base Model | HuggingFaceTB/SmolLM2-360M |
| Training Method | DPO |
| Model Type | Merged Standalone Model |
| Training Date | December 2025 |
| Framework | PyTorch + Transformers + PEFT |
Benchmark Results
| Benchmark | Score |
|---|---|
| HellaSwag (10-shot) | 0.550 |
| TruthfulQA (0-shot MC2) | 0.361 |
| MMLU-Mini (5-shot) | 0.264 |
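Scores of this kind are typically produced with EleutherAI's lm-evaluation-harness. As a hedged sketch (not the exact evaluation pipeline used for this card), the HellaSwag number could be reproduced roughly as follows; the task name, few-shot setting, and harness version are assumptions:

import lm_eval

# Evaluate the merged model on HellaSwag (10-shot) with lm-evaluation-harness.
# The repository id and task configuration below are assumptions.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Nishef/SmolLM2-360M-Full_DPO_20251225_043457-merged",
    tasks=["hellaswag"],
    num_fewshot=10,
)
print(results["results"]["hellaswag"])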
Comparative Analysis
The following chart compares this method against other training approaches on the same base model:
Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| Max Sequence Length | 512 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Dataset | UltraFeedback Binarized |
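For orientation, here is a minimal sketch of how the hyperparameters above could be expressed with PEFT's LoraConfig and TRL's DPOTrainer. This is not the actual training script: the target modules, dataset id and split, output directory, and the use of TRL at all are assumptions.

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# LoRA settings from the table above; target modules left at the library default.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Optimization settings from the table above (effective batch size 2 x 8 = 16).
training_args = DPOConfig(
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    max_length=512,
    output_dir="smollm2-360m-dpo",
)

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
# Dataset id and split are assumptions for "UltraFeedback Binarized".
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()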
Usage
Direct Loading (Merged Model)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Nishef/SmolLM2-360M-Full_DPO_20251225_043457-merged"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Methodology
DPO
Direct Preference Optimization trains the policy directly on paired preference data (chosen vs. rejected responses) without a separate reward model, using a frozen copy of the starting policy as the reference. A minimal loss sketch follows the feature list below.
Key Features:
- Paired preference optimization
- Direct policy optimization without reward model
- Efficient single-stage training
- Bradley-Terry preference modeling
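To make the Bradley-Terry formulation concrete, here is a minimal sketch of the per-pair DPO loss. It assumes precomputed summed log-probabilities of each full response under the trained policy and under the frozen reference policy; the beta value shown is an assumption rather than the setting used for this model.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry objective: push the chosen log-ratio above the rejected one.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()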
Citation
If you use this model in your research, please cite:
@misc{smollm2_360m_dpo_2025,
title = {SmolLM2-360M Fine-tuned with DPO},
author = {Thesis Research},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/Nishef/SmolLM2-360M-Full_DPO_20251225_043457}
}
Repository Structure
.
├── adapter_config.json # LoRA configuration
├── adapter_model.safetensors # Model weights
├── tokenizer files # Tokenizer configuration
├── eval_summary.csv # Evaluation results
├── thesis_plots/ # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md # This file
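The file list above contains LoRA adapter files rather than merged weights. If a checkout exposes only adapter_config.json and adapter_model.safetensors, the adapter can be applied to the base model with PEFT; the adapter repository id below is taken from the citation URL and is an assumption:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and apply the DPO LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
model = PeftModel.from_pretrained(base, "Nishef/SmolLM2-360M-Full_DPO_20251225_043457")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

# Optionally fold the adapter into the base weights for standalone use.
model = model.merge_and_unload()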
Acknowledgments
- Base Model: HuggingFaceTB/SmolLM2-360M
- Training Framework: Hugging Face Transformers
- Fine-tuning Library: PEFT
License
This model is released under the Apache 2.0 license.