SmolLM2-360M - DPO

Model Description

This model is a merged, standalone model fine-tuned from HuggingFaceTB/SmolLM2-360M using the DPO training method.

Direct Preference Optimization (DPO): paired preference optimization without an explicit reward model

This model was developed as part of thesis research on LLM Alignment using Preference Optimization Methods.

Model Details

| Property | Value |
|---|---|
| Base Model | HuggingFaceTB/SmolLM2-360M |
| Training Method | DPO |
| Model Type | Merged Standalone Model |
| Training Date | December 2025 |
| Framework | PyTorch + Transformers + PEFT |

Benchmark Results

| Benchmark | Score |
|---|---|
| HellaSwag (10-shot) | 0.550 |
| TruthfulQA (0-shot, MC2) | 0.361 |
| MMLU-Mini (5-shot) | 0.264 |
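
One common way to reproduce scores of this kind is EleutherAI's lm-evaluation-harness; the evaluation pipeline actually used is not specified here, so the sketch below is only illustrative. It assumes the harness's Python entry point (which changes across versions) and the merged repository id, and "MMLU-Mini" is not a standard harness task.

```python
# Illustrative: 10-shot HellaSwag with lm-evaluation-harness (assumed tooling).
# "MMLU-Mini" presumably denotes a reduced MMLU subset and has no standard task name.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Nishef/SmolLM2-360M-Full_DPO_20251225_043457-merged",
    tasks=["hellaswag"],
    num_fewshot=10,
)
print(results["results"]["hellaswag"])
```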

Comparative Analysis

A chart comparing this method against other training approaches on the same base model is included in the repository's thesis_plots/ directory.

Training Loss Curves

(Training loss curve: thesis_plots/training_loss.png)

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| Max Sequence Length | 512 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Dataset | UltraFeedback Binarized |
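
For orientation, the hyperparameters above map onto a TRL-based DPO + LoRA run roughly as sketched below. This is not the original training script: it assumes TRL's DPOTrainer/DPOConfig, PEFT's LoraConfig with default target modules, and the HuggingFaceH4/ultrafeedback_binarized dataset, and argument names vary across TRL versions.

```python
# Illustrative DPO + LoRA setup matching the configuration table above (assumptions noted).
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "HuggingFaceTB/SmolLM2-360M"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined

# LoRA rank 16, alpha 32; target modules left to PEFT defaults (not documented above)
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Effective batch size 16 = 2 (per device) x 8 (gradient accumulation steps)
args = DPOConfig(
    output_dir="smollm2-360m-dpo",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    max_length=512,
)

# UltraFeedback Binarized (assumed to be the HuggingFaceH4 release of the dataset)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions take `tokenizer=` instead
    peft_config=peft_config,
)
trainer.train()
```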

Usage

Direct Loading (Merged Model)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Merged model repository id
model_id = "Nishef/SmolLM2-360M-Full_DPO_20251225_043457-merged"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

Training Methodology

DPO

Direct Preference Optimization (DPO): paired preference optimization without an explicit reward model

Key Features:

  • Paired preference optimization
  • Direct policy optimization without reward model
  • Efficient single-stage training
  • Bradley-Terry preference modeling (sketched below)
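
Under the Bradley-Terry formulation, DPO reduces the preference objective to a closed-form loss over (chosen, rejected) pairs, using the frozen reference model's log-probabilities as a baseline. A minimal per-example sketch follows; beta = 0.1 is a common default and is not taken from the configuration above.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities (tensors)."""
    # Implicit rewards: beta-scaled log-probability ratios against the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry: maximize the probability that the chosen response is preferred
    return -F.logsigmoid(chosen_reward - rejected_reward)
```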

Citation

If you use this model in your research, please cite:

```bibtex
@misc{smollm2_360m_dpo_2025,
  title = {SmolLM2-360M Fine-tuned with DPO},
  author = {Thesis Research},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Nishef/SmolLM2-360M-Full_DPO_20251225_043457}
}
```

Repository Structure

```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # Model weights
├── tokenizer files            # Tokenizer configuration
├── eval_summary.csv           # Evaluation results
├── thesis_plots/              # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                  # This file
```
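
The file list above shows LoRA adapter artifacts rather than merged weights. If a checkout contains only the adapter, it can be applied to the base model with PEFT; this is a hedged sketch, and the adapter repository id is assumed from the citation URL above.

```python
# Loading the LoRA adapter onto the base model (only needed if merged weights are absent).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "HuggingFaceTB/SmolLM2-360M"
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Adapter repository id assumed from the citation above
model = PeftModel.from_pretrained(base, "Nishef/SmolLM2-360M-Full_DPO_20251225_043457")
```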

Acknowledgments

License

This model is released under the Apache 2.0 license.


This model was created as part of thesis research on LLM alignment using preference optimization methods.
