# Tokkonizer-KM V3f
A production-ready Khmer-native tokenizer that matches or outperforms Google's mT5 and Meta's XLM-R on every reported Khmer metric, with a vocabulary 31x smaller (8K vs. 250K tokens).
Live Demo: angkor-ai.com/labs
## Tokenization Examples
"ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
→ [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។]
4 tokens, TPC 0.143 ✅
"នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
→ [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា]
3 tokens, TPC 0.094 ✅
Sanskrit/Pali: ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅
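TPC here reads as tokens per Unicode codepoint, which matches the figures above. A minimal sketch of the metric (the helper name `tpc` is ours, not from the release):

```python
def tpc(tokens, text):
    """Tokens-per-character: token count divided by Unicode codepoints."""
    return len(tokens) / len(text)

tokens = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]
text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
print(round(tpc(tokens, text), 3))  # → 0.143 (4 tokens / 28 codepoints)
```

Lower is better: fewer tokens per codepoint means longer, more word-like pieces.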
## Performance
| Metric | V3f (8K) | mT5 (250K) | XLM-R (250K) |
|---|---|---|---|
| TPC (Khmer) | 0.293 | 0.348 | 0.327 |
| Sanskrit/Pali | 93.3% | 21.4% | 28.6% |
| Cultural preservation | 91.7% | 75.0% | 91.7% |
| UNK rate | 0% | 0% | 0% |
| Lossless round-trip | Yes | No | No |
| Speed | 15M/s | 3.3M/s | 2.8M/s |
| ALT F1 (5K sentences) | 99.94% | — | — |
## Intended Uses
- Khmer text preprocessing for NLP pipelines
- Semantic search / RAG over Khmer documents
- Keyboard prediction engine
- Spell checking (with companion lexicon)
Not intended for: text generation, translation, non-Khmer languages.
## How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer")
tokens = tokenizer.encode("ព្រះពុទ្ធសាសនា")
decoded = tokenizer.decode(tokens)  # 100% lossless round-trip
```
Or with SentencePiece directly:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
pieces = sp.encode("កម្ពុជា", out_type=str)  # ["▁", "កម្ពុជា"]
```
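The lossless round-trip claim follows from SentencePiece's reversible whitespace handling, where U+2581 (▁) marks word starts. A minimal sketch of the default decode rule (ours, not the library's code):

```python
def detokenize(pieces):
    """SentencePiece default decode: concatenate pieces, turn the word-start
    marker U+2581 (▁) back into a space, then drop the leading space."""
    text = "".join(pieces).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text

print(detokenize(["▁", "កម្ពុជា"]))  # → កម្ពុជា
```

Because every input character survives in some piece, `detokenize(encode(text)) == text` holds for any input.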
## Training
- Algorithm: SentencePiece Unigram
- Vocabulary: 8,000 tokens
- Corpus: 648MB cleaned Khmer text (957K lines) — Wikipedia, news, government, religious texts
- Character coverage: 1.0 (full Khmer Unicode)
- User-defined symbols: 7 Sanskrit/Pali terms
- Key finding: 7 user-defined symbols outperformed 500; less manual intervention gave better results
- Hardware: Apple M3 Pro, ~30 min training
- CO2: negligible (CPU only)
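The settings above map directly onto SentencePiece trainer options. A sketch of the implied invocation; the file names and the three example UDS terms (taken from the preservation examples earlier) are assumptions, since the full list of 7 is not published here:

```python
# Hyperparameters from the card; paths are placeholders.
train_args = dict(
    input="corpus.txt",           # 648MB cleaned Khmer text, 957K lines
    model_prefix="tokenizer",     # writes tokenizer.model / tokenizer.vocab
    model_type="unigram",         # SentencePiece Unigram algorithm
    vocab_size=8000,
    character_coverage=1.0,       # full Khmer Unicode coverage
    user_defined_symbols=["ធម៌", "កម្ម", "និព្វាន"],  # 3 of the 7 UDS
)

# Requires the corpus file and the sentencepiece package:
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**train_args)
```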
## Graph Regularization (Layer 2)
When paired with graph-regularized GPT-2 (separate model):
| Metric | Baseline | Graph-Reg |
|---|---|---|
| Coherence@10 | 0.32% | 15.5% (48x) |
| Collapse | 0% | 0.2% |
| Perplexity cost | — | +2.8% |
| Retrieval MRR | 0.417 | 0.460 (+10.4%) |
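For reference, retrieval MRR is mean reciprocal rank: the average over queries of 1/rank of the first relevant document. A quick sketch (function name is ours):

```python
def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the first relevant hit
    per query; use rank 0 for queries with no relevant hit (scored as 0)."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

# e.g. first relevant hit at ranks 1, 2, and 4 across three queries
print(round(mrr([1, 2, 4]), 3))  # → 0.583
```

An MRR of 0.460 roughly means the first relevant document sits near rank 2 on average.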
## Companion: Khmer NLP Engine (26MB SQLite)
A complete prediction + correction + emoji engine built on this tokenizer:
- 60K word-pair predictions (IDF-weighted)
- 28K phrase predictions
- 12,677 validated words (spell check)
- 552 romanization mappings (Latin→Khmer)
- 400 contextual emoji suggestions
- 282 consonant cluster validations
Demo: angkor-ai.com/labs
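The card does not publish the SQLite schema, so the table and column names below are purely illustrative. A toy sketch of how an IDF-weighted word-pair prediction lookup might work:

```python
import sqlite3

# Hypothetical schema; the real 26MB database's layout is not documented here.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bigram (prev TEXT, next TEXT, idf_weight REAL)")
db.executemany(
    "INSERT INTO bigram VALUES (?, ?, ?)",
    [("បាន", "ថ្លែង", 2.7), ("បាន", "ធ្វើ", 1.9), ("បាន", "មក", 1.1)],
)

def predict(prev, k=2):
    """Top-k next-word candidates for `prev`, highest IDF weight first."""
    rows = db.execute(
        "SELECT next FROM bigram WHERE prev = ? ORDER BY idf_weight DESC LIMIT ?",
        (prev, k),
    )
    return [r[0] for r in rows]

print(predict("បាន"))  # → ['ថ្លែង', 'ធ្វើ']
```

IDF weighting favors pairs whose second word is informative rather than merely frequent, which suits keyboard prediction.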
## Limitations & Caveats
- Sanskrit/Pali circularity: 7 of 15 test terms were user-defined symbols (guaranteed preservation). True EM optimizer success rate on non-UDS terms: 87.5% (7/8).
- ALT F1 in-domain: 99.94% boundary F1 benefits from shared ZWSP segmentation conventions between training data and ALT. Cross-domain word-level F1 estimated ~95-97%.
- Retrieval MRR: +10.4% on 20 questions — preliminary, not statistically significant (overlapping bootstrap CIs).
- Grapheme break rate: 1.08%, slightly above the 1.0% target
- Corpus bias: formal/news text overrepresented vs conversational
- Foreign names fragment into individual characters
- සමាធិ (samadhi) is the only Sanskrit term that still fragments
## Version History
| Version | Vocab | TPC | Status |
|---|---|---|---|
| V6.5 (Aug 2025) | 32K | 0.664 | Failed |
| V7 (Sep 2025) | 16K | 0.294 | Deployed |
| V3f (Mar 2026) | 8K | 0.293 | Production |
## Citation

```bibtex
@software{delrieu2026tokkonizer,
  author = {Delrieu, Nicolas},
  title  = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer},
  year   = {2026},
  url    = {https://github.com/khopilot/tokkonizer-km}
}
```