Publish mgpt2 dpo checkpoint (step 420, val_loss 0.001878)


	---
	language:
	- en
	- hi
	- kn
	license: mit
	tags:
	- causal-lm
	- multilingual
	- indic
	- hindi
	- kannada
	- instruction-tuned
	- dpo
	- preference-alignment
	pipeline_tag: text-generation
	base_model: ace-1/mgpt2-sft
	---

	# mgpt2-dpo — Multilingual GPT-2 (Preference-Aligned)

	The recommended model from this project: `mgpt2-sft`, further aligned with
	Direct Preference Optimization (DPO, β=0.1) on 13,500 preference pairs for toxic
	prompts from ai4bharat/indic-align (HHRLHF-T). Chosen responses are Llama2-70B-Chat
	safety refusals (translated or romanised with IndicTrans2 for Hindi and Kannada);
	rejected responses are raw continuations from the pretrained model.

	Relative to the SFT checkpoint, DPO increased the log-probability of
	safety-refusal responses by 6.1% and decreased the log-probability of rejected
	responses by 16.8%. In generation comparisons, the SFT model attempts to comply
	with harmful prompts, while the DPO model redirects. See the [project report](https://github.com)
	for the full analysis.
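
For readers new to DPO, the objective behind the numbers above can be sketched for a single preference pair. This is a minimal illustration of the standard DPO loss, not the project's training code. It assumes the frozen reference model is the SFT checkpoint and that each argument is a summed sequence log-probability; the function name `dpo_loss` and its parameter names are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    Assumption: the reference model is the frozen SFT checkpoint.
    """
    # Implicit rewards: how far the policy has moved away from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Before any update the policy equals the reference, so the loss is log(2).
print(dpo_loss(-50.0, -60.0, -50.0, -60.0))
# Raising the chosen log-probability and lowering the rejected one shrinks the loss.
print(dpo_loss(-45.0, -80.0, -50.0, -60.0))
```

The log-probability values in the two calls are made up for illustration. The loss falls only as the chosen response becomes more likely and the rejected response less likely relative to the reference, which matches the direction of the two shifts reported above.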

	## Quick start

	```python
	import sys, torch
	import torch.nn.functional as F
	from huggingface_hub import snapshot_download

	local = snapshot_download("ace-1/mgpt2-dpo")
	sys.path.insert(0, local)
	from model import GPT
	from tokenizer.regex_tokenizer import RegexTokenizer

	ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
	model = GPT(ckpt["config"])
	model.load_state_dict(ckpt["model"])
	model.eval()

	enc = RegexTokenizer()
	enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

	prompts = [
	    "Explain what photosynthesis is.",  # English
	    "प्रकाश संश्लेषण क्या है?",  # Hindi (Devanagari)
	    "ದ್ಯುತಿಸಂಶ್ಲೇಷಣೆ ಎಂದರೇನು?",  # Kannada script
	]

	for prompt in prompts:
	    ids = enc.encode(prompt)
	    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
	    with torch.no_grad():
	        for _ in range(120):  # generate at most 120 new tokens
	            logits, _ = model(x[:, -1024:])  # crop to the 1,024-token context
	            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
	            next_id = torch.multinomial(probs, num_samples=1)
	            if next_id.item() == 50256:  # stop on token id 50256
	                break
	            x = torch.cat([x, next_id], dim=1)
	    print(f"Prompt : {prompt}")
	    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
	    print()
	```

	## Intended use

	Good for:
	- Multilingual instruction following with light safety alignment (en/hi/kn)
	- Research: DPO alignment dynamics at 124M scale
	- Demo of end-to-end LLM pipeline: pretrain → SFT → DPO

	Not for: production or safety-critical applications. The alignment here is
	format-preference alignment (coherent refusals vs incoherent noise), not full
	safety alignment. At 124M parameters the pretrained model could not generate
	coherent harmful content, so the DPO preference signal is weaker than in
	production RLHF setups.

	## Model details

	| Property | Value |
	|---|---|
	| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
	| Parameters | ~124M |
	| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
	| Context length | 1,024 tokens |
	| Training stage | DPO (preference-aligned) |
	| Git commit | `e463752bb14b` |

	## Training configuration

	| Parameter | Value |
	|---|---|
	| `seed` | `1337` |
	| `batch_size` | `32` |
	| `micro_batch_size` | `4` |
	| `beta` | `0.1` |
	| `max_lr` | `1e-06` |
	| `min_lr_ratio` | `0.1` |
	| `warmup_steps` | `20` |
	| `epochs` | `1` |
	| `weight_decay` | `0.1` |
	| `eval_interval` | `50` |
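
The table implies some arithmetic that is easy to check. The sketch below derives the gradient-accumulation factor and the number of optimizer steps in one epoch, then shows one plausible learning-rate schedule. Two things are assumptions rather than facts from the training code: that `batch_size` counts preference pairs per optimizer step, and that the schedule is linear warmup followed by cosine decay. The function name `lr_at` is hypothetical.

```python
import math

# Values taken from the training configuration and the 13,500-pair train set.
TRAIN_PAIRS = 13_500
BATCH_SIZE = 32
MICRO_BATCH_SIZE = 4
MAX_LR = 1e-6
MIN_LR_RATIO = 0.1
WARMUP_STEPS = 20

# Micro-batches accumulated before each optimizer step.
grad_accum_steps = BATCH_SIZE // MICRO_BATCH_SIZE
# Full batches available in a single epoch.
total_steps = TRAIN_PAIRS // BATCH_SIZE

def lr_at(step):
    """Assumed schedule: linear warmup, then cosine decay to MAX_LR * MIN_LR_RATIO."""
    min_lr = MAX_LR * MIN_LR_RATIO
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    return min_lr + 0.5 * (MAX_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

print(grad_accum_steps, total_steps)
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

Under these assumptions one epoch is 421 optimizer steps of 8 micro-batches each, which lines up with a checkpoint taken at step 420.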

	## Evaluation

	| Metric | Value | Notes |
	|---|---|---|
	| Preference win-rate | 1.000 | Held-out DPO pairs (n=1,496) |
	| DPO val loss | 0.0019 | 0.001878 at step 420; training converged fully |
	| SFT loss regression | +1.2% | Within 5% threshold (regression_ok=True) |
	| Chosen log-p Δ | +6.1% | vs SFT checkpoint on same pairs |
	| Rejected log-p Δ | −16.8% | vs SFT checkpoint on same pairs |
	| Preference margin Δ | +29.1% | chosen − rejected margin widened |

	> 100% win-rate reflects format-preference alignment (coherent refusals vs word-salad),
	> not full safety alignment. See project report for full generation comparison.
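
As a rough guide to how two of these metrics could be computed, here is a small sketch. It assumes a "win" means the chosen response scores strictly higher than the rejected one under the model, and that the regression check compares SFT validation loss before and after DPO against a relative threshold. Both function names and the example numbers are hypothetical, and the project's evaluation script may define these differently.

```python
def preference_win_rate(pairs):
    """Fraction of pairs where the chosen response outscores the rejected one.

    Each pair is (chosen_score, rejected_score). The score is assumed to be a
    sequence log-probability or a DPO implicit reward.
    """
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return wins / len(pairs)

def regression_ok(sft_loss_before, sft_loss_after, threshold=0.05):
    """True if the relative increase in SFT loss stays within the threshold."""
    relative_change = (sft_loss_after - sft_loss_before) / sft_loss_before
    return relative_change <= threshold

# Illustrative scores only; these are not values from the evaluation run.
scores = [(-10.0, -20.0), (-5.0, -30.0), (-12.0, -11.0)]
print(preference_win_rate(scores))
# A 1.2% increase passes a 5% threshold; a 10% increase does not.
print(regression_ok(2.0, 2.024), regression_ok(2.0, 2.2))
```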

	## Training data

	| Language | Count | Chosen source | Rejected source |
	|---|---|---|---|
	| English (`eng_Latn`) | 8,250 | Llama2-70B-Chat safety refusals | Phase C pretrained mgpt2 |
	| Hindi Devanagari (`hin_Deva`) | 2,700 | IndicTrans2-translated refusals | Phase C pretrained mgpt2 |
	| Kannada script (`kan_Knda`) | 1,950 | IndicTrans2-translated refusals | Phase C pretrained mgpt2 |
	| Hindi Latin (`hin_Latn`) | 1,050 | IndicTrans2 romanisation | Phase C pretrained mgpt2 |
	| Kannada Latin (`kan_Latn`) | 1,050 | IndicTrans2 romanisation | Phase C pretrained mgpt2 |

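	Split: 13,500 train / 1,499 validation pairs. The per-language counts above sum to 15,000, so they appear to describe the pool before the split. Source: [ai4bharat/indic-align](https://huggingface.co/datasets/ai4bharat/indic-align) HHRLHF-T config.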

	## Tokenizer

	Custom multilingual regex + BPE tokenizer (`mgpt2`), trained on the same corpus mixture as the model.
	It has the same vocabulary size as tiktoken-gpt2 (50,257 tokens) but uses Indic-aware merge priorities,
	which lowers the number of tokens needed per kB of text:

	| Bucket | tiktoken-gpt2 | mgpt2 | Δ |
	|---|---:|---:|---:|
	| Overall | 480 tok/kB | 223 tok/kB | −54% |
	| Devanagari | 592 tok/kB | 215 tok/kB | −64% |
	| Kannada | 981 tok/kB | 213 tok/kB | −78% |
	| Latin | 257 tok/kB | 230 tok/kB | −10% |

	Tokenizer published separately: [ace-1/mgpt2-tokenizer](https://huggingface.co/ace-1/mgpt2-tokenizer)
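
The tok/kB figures can be reproduced in spirit with a few lines. The sketch below is not the project's benchmark script. It assumes 1 kB means 1,000 UTF-8 bytes, and it uses a byte-level stand-in tokenizer so that it runs without downloading anything. The names `tokens_per_kb` and `byte_level_encode` are hypothetical.

```python
def tokens_per_kb(text, encode, bytes_per_kb=1000):
    """Tokens emitted per kB of UTF-8 input; lower means better compression."""
    n_bytes = len(text.encode("utf-8"))
    return len(encode(text)) * bytes_per_kb / n_bytes

def byte_level_encode(text):
    # Stand-in tokenizer for illustration: one token per UTF-8 byte.
    return list(text.encode("utf-8"))

sample = "ದ್ಯುತಿಸಂಶ್ಲೇಷಣೆ ಎಂದರೇನು?"
# A byte-level tokenizer sits at the ceiling of one token per byte.
print(tokens_per_kb(sample, byte_level_encode))
```

To measure the real tokenizer, pass `enc.encode` from the Quick start in place of `byte_level_encode`. The stand-in also gives some intuition for the table: a tokenizer with few merges for a script falls back to near one token per byte, which would be consistent with the 981 tok/kB reported for Kannada under tiktoken-gpt2.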

	## Known limitations

	- Format-preference alignment, not full safety alignment. At 124M parameters, the pretrained model generates incoherent text for toxic prompts, so the DPO preference signal trains format preference (coherent refusals vs noise) rather than genuine safety reasoning.
	- Script drift in transliterated Latin text (inherited from the SFT checkpoint): `hin_Latn`/`kan_Latn` outputs may switch scripts mid-generation.
	- 124M parameters. Factual accuracy and multi-step reasoning are limited.
	- Research checkpoint — not evaluated for production use.
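
The script drift noted above can be detected after generation with a simple Unicode range check. This is a minimal sketch rather than part of the released code. It only distinguishes ASCII Latin letters, the Devanagari block (U+0900 to U+097F), and the Kannada block (U+0C80 to U+0CFF), and the function names are hypothetical.

```python
def scripts_used(text):
    """Return the set of scripts (Latin, Devanagari, Kannada) found in text."""
    found = set()
    for ch in text:
        code = ord(ch)
        if 0x0900 <= code <= 0x097F:
            found.add("Devanagari")
        elif 0x0C80 <= code <= 0x0CFF:
            found.add("Kannada")
        elif ch.isascii() and ch.isalpha():
            found.add("Latin")
    return found

def has_script_drift(text, expected="Latin"):
    """True if the text contains letters from a script other than the expected one."""
    return bool(scripts_used(text) - {expected})

print(has_script_drift("prakash sanshleshan kya hai"))
print(has_script_drift("prakash संश्लेषण kya hai"))
```

A caller could use such a check to flag or resample `hin_Latn`/`kan_Latn` responses that switch into a native script partway through.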

	## Citation

	```bibtex
	@misc{mgpt2,
	title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
	year = {2026},
	note = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
	url = {https://huggingface.co/ace-1/mgpt2-dpo}
	}
	```