
Tokkonizer-KM V3f

A production-ready Khmer-native tokenizer that outperforms or matches Google's mT5 and Meta's XLM-R on every Khmer metric, with a vocabulary 31x smaller.

Live Demo: angkor-ai.com/labs

Tokenization Examples

"ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
→ [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។]
  4 tokens, TPC 0.143 ✅

"នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
→ [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា]
  3 tokens, TPC 0.094 ✅

Sanskrit/Pali: ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅
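The TPC figures above are tokens per character; the exact normalization is not stated in this card, but dividing the token count by the Unicode codepoint count of the input reproduces the reported numbers. A minimal sketch under that assumption:

```python
def tokens_per_char(tokens, text):
    """Tokens per character (TPC): lower means better compression.
    Assumes TPC = token count / Unicode codepoint count of the input."""
    return len(tokens) / len(text)

text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
tokens = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]
print(round(tokens_per_char(tokens, text), 3))  # → 0.143
```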

Performance

Metric                  V3f (8K)   mT5 (250K)   XLM-R (250K)
TPC (Khmer)             0.293      0.348        0.327
Sanskrit/Pali           93.3%      21.4%        28.6%
Cultural preservation   91.7%      75.0%        91.7%
UNK rate                0%         0%           0%
Lossless round-trip     Yes        No           No
Speed                   15M/s      3.3M/s       2.8M/s
ALT F1 (5K sentences)   99.94%     n/a          n/a

Intended Uses

  • Khmer text preprocessing for NLP pipelines
  • Semantic search / RAG over Khmer documents
  • Keyboard prediction engine
  • Spell checking (with companion lexicon)

Not intended for: text generation, translation, non-Khmer languages.

How to Use

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer")
tokens = tokenizer.encode("ព្រះពុទ្ធសាសនា")   # token ids
decoded = tokenizer.decode(tokens)             # lossless round-trip

Or with SentencePiece directly:

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
pieces = sp.encode("កម្ពុជា", out_type=str)  # ["▁", "កម្ពុជា"]

Training

  • Algorithm: SentencePiece Unigram
  • Vocabulary: 8,000 tokens
  • Corpus: 648MB cleaned Khmer text (957K lines) — Wikipedia, news, government, religious texts
  • Character coverage: 1.0 (full Khmer Unicode)
  • User-defined symbols: 7 Sanskrit/Pali terms
  • Key finding: 7 UDS outperformed 500 UDS; less manual intervention gave better results
  • Hardware: Apple M3 Pro, ~30 min training
  • CO2: negligible (CPU only)

Graph Regularization (Layer 2)

When paired with graph-regularized GPT-2 (separate model):

Metric            Baseline   Graph-Reg
Coherence@10      0.32%      15.5% (48x)
Collapse          0%         0.2%
Perplexity cost   n/a        +2.8%
Retrieval MRR     0.417      0.460 (+10.4%)

Companion: Khmer NLP Engine (26MB SQLite)

A complete prediction + correction + emoji engine built on this tokenizer:

  • 60K word-pair predictions (IDF-weighted)
  • 28K phrase predictions
  • 12,677 validated words (spell check)
  • 552 romanization mappings (Latin→Khmer)
  • 400 contextual emoji suggestions
  • 282 consonant cluster validations
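The card does not publish the engine's schema, so the table layout below is purely hypothetical; it only illustrates how IDF-weighted word-pair predictions might be served from SQLite.

```python
import sqlite3

# HYPOTHETICAL schema and sample rows -- the real 26MB engine's
# layout is not documented in this card.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_pairs (prev TEXT, next TEXT, idf REAL)")
conn.executemany("INSERT INTO word_pairs VALUES (?, ?, ?)", [
    ("ប្រទេស", "កម្ពុជា", 4.2),  # illustrative scores only
    ("ប្រទេស", "ថៃ", 3.1),
])

def predict(prev, k=5):
    """Top-k next-word candidates for a previous word, by IDF weight."""
    rows = conn.execute(
        "SELECT next FROM word_pairs WHERE prev = ? ORDER BY idf DESC LIMIT ?",
        (prev, k),
    )
    return [r[0] for r in rows]

print(predict("ប្រទេស"))  # → ['កម្ពុជា', 'ថៃ']
```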

Demo: angkor-ai.com/labs

Limitations & Caveats

  • Sanskrit/Pali circularity: 7 of 15 test terms were user-defined symbols (guaranteed preservation). True EM optimizer success rate on non-UDS terms: 87.5% (7/8).
  • ALT F1 in-domain: 99.94% boundary F1 benefits from shared ZWSP segmentation conventions between training data and ALT. Cross-domain word-level F1 estimated ~95-97%.
  • Retrieval MRR: +10.4% on 20 questions — preliminary, not statistically significant (overlapping bootstrap CIs).
  • Grapheme break rate: 1.08% (target 1.0%)
  • Corpus bias: formal/news text overrepresented vs conversational
  • Foreign names fragment into individual characters
  • සමាធិ (samadhi) is the only Sanskrit term that still fragments

Version History

Version           Vocab   TPC     Status
V6.5 (Aug 2025)   32K     0.664   Failed
V7 (Sep 2025)     16K     0.294   Deployed
V3f (Mar 2026)    8K      0.293   Production

Citation

@software{delrieu2026tokkonizer,
  author = {Delrieu, Nicolas},
  title = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer},
  year = {2026},
  url = {https://github.com/khopilot/tokkonizer-km}
}

Contact
