# Khmer Text Summarization – mT5-base Fine-Tuned
Fine-tuned `google/mt5-base` on a Khmer-language news summarization dataset (~6,500 article–summary pairs).
## Model Details
| Property | Value |
|---|---|
| Base model | google/mt5-base |
| Language | Khmer (km) |
| Task | Abstractive text summarization |
| Max input length | 1024 tokens |
| Max summary length | 256 tokens |
| Beam search | 4 beams |
| No-repeat n-gram | 3 |
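For convenience, the decoding settings from the table can be collected into a single kwargs dict and passed to either the pipeline or `model.generate` (a minimal sketch; `GENERATION_KWARGS` is an illustrative name, and the keys follow the Hugging Face `generate` parameter names):

```python
# Decoding settings from the table above, expressed as keyword
# arguments for transformers' pipeline()/generate() calls.
GENERATION_KWARGS = dict(
    max_new_tokens=256,       # max summary length
    num_beams=4,              # beam search width
    length_penalty=1.0,
    no_repeat_ngram_size=3,   # block repeated trigrams
    early_stopping=True,
)
```

Usage looks like `summarizer("summarize: " + article, **GENERATION_KWARGS)`.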
## Quick Start
```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="taravirak/khmer-mt5-summarization",
    tokenizer="taravirak/khmer-mt5-summarization",
    device=0,  # set to -1 for CPU
)

article = "..."  # a Khmer news article

result = summarizer(
    "summarize: " + article,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
)
print(result[0]["summary_text"])
```
## Manual Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taravirak/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

article = "..."  # a Khmer news article

inputs = tokenizer(
    "summarize: " + article,
    return_tensors="pt",
    max_length=1024,
    truncation=True,
)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        num_beams=4,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
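Articles longer than the 1024-token input limit are silently truncated by the tokenizer. One workaround (a sketch, not part of the model repo; `chunk_token_ids` is a hypothetical helper) is to summarize overlapping token windows and then join or re-summarize the partial outputs:

```python
def chunk_token_ids(ids, max_len=1024, overlap=128):
    """Split a token-id list into overlapping windows of at most max_len.

    The overlap keeps some context shared between consecutive windows,
    so sentences cut at a boundary still appear whole in one window.
    """
    if max_len <= overlap:
        raise ValueError("max_len must be greater than overlap")
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
        start += max_len - overlap
    return chunks

# With the tokenizer from above (sketch):
# ids = tokenizer("summarize: " + long_article)["input_ids"]
# windows = chunk_token_ids(ids, max_len=1024, overlap=128)
```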
## Training Notes
- Dataset: web-scraped Khmer news articles paired with human-written summaries
- Input prefix: `summarize: ` prepended to each article
- Mixed precision: bf16 with gradient checkpointing
- Effective batch size: 32 (per-device batch = 2, gradient accumulation = 16)
- Optimizer: AdamW, lr = 5e-5, weight_decay = 0.01, 10 epochs
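The hyperparameters above roughly correspond to a `Seq2SeqTrainingArguments` configuration like the following (a sketch reconstructed from the notes, not the exact training script; `output_dir` is illustrative):

```python
from transformers import Seq2SeqTrainingArguments

# Configuration implied by the training notes above.
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-mt5-summarization",  # illustrative path
    learning_rate=5e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch size 32
    bf16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
)
```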