# Khmer Text Summarization: mT5-base Fine-Tuned

A fine-tune of `google/mt5-base` on a Khmer-language news summarization dataset of roughly 6,500 article–summary pairs.

## Model Details

| Property | Value |
|---|---|
| Base model | `google/mt5-base` |
| Language | Khmer (`km`) |
| Task | Abstractive text summarization |
| Max input length | 1,024 tokens |
| Max summary length | 256 tokens |
| Beam search | 4 beams |
| No-repeat n-gram size | 3 |

## Quick Start

```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="taravirak/khmer-mt5-summarization",
    tokenizer="taravirak/khmer-mt5-summarization",
    device=0,  # set to -1 for CPU
)

# "Please enter the Khmer article here..."
article = "សូមបញ្ចូលអត្ថបទជាភាសាខ្មែរនៅទីនេះ..."
result = summarizer(
    "summarize: " + article,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
)
print(result[0]["summary_text"])
```

## Manual Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "taravirak/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

# "Please enter the Khmer article here..."
article = "សូមបញ្ចូលអត្ថបទជាភាសាខ្មែរនៅទីនេះ..."
inputs = tokenizer(
    "summarize: " + article,
    return_tensors="pt",
    max_length=1024,
    truncation=True,
)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        num_beams=4,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
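Note that `truncation=True` silently drops everything past the 1,024-token input limit. One possible workaround (not part of the released model; the helper name `chunk_ids` and the 128-token overlap are illustrative choices) is to split the encoded ids into overlapping windows, summarize each window separately, and join the partial summaries. The windowing step, shown here with plain integers standing in for token ids:

```python
def chunk_ids(ids, max_len=1024, overlap=128):
    """Split a token-id list into overlapping windows of at most max_len ids.

    Consecutive windows share `overlap` ids so that a sentence cut at a
    window boundary still appears whole in the next window.
    """
    if len(ids) <= max_len:
        return [ids]
    step = max_len - overlap
    return [ids[i:i + max_len] for i in range(0, len(ids) - overlap, step)]

# Toy example: 2,500 "token ids" become three windows of at most 1,024.
windows = chunk_ids(list(range(2500)))
print([len(w) for w in windows])  # → [1024, 1024, 708]
```

Each window would then be passed through `model.generate` as above, and the decoded partial summaries concatenated (or summarized a second time) to produce the final output.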

## Training Notes

- Dataset: Khmer news articles scraped from the web, paired with human-written summaries
- Input prefix: `summarize: ` prepended to each article
- Mixed precision: bf16, with gradient checkpointing enabled
- Effective batch size: 32 (per-device batch 2 × gradient accumulation 16)
- Optimizer: AdamW, lr 5e-5, weight decay 0.01, 10 epochs
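Assuming the standard `transformers` `Seq2SeqTrainer` API (the original training script is not published), the notes above map roughly onto the following configuration sketch; the `output_dir` is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the listed hyperparameters, not the exact script used.
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer-mt5-summarization",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch size: 2 x 16 = 32
    learning_rate=5e-5,              # AdamW is the Trainer default optimizer
    weight_decay=0.01,
    num_train_epochs=10,
    bf16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
)
```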