# Tokkonizer-KM V3f
A production-ready Khmer-native tokenizer that matches or outperforms Google's mT5 and Meta's XLM-R on every reported Khmer metric, with a vocabulary 31x smaller (8K vs. 250K tokens).
Live Demo: angkor-ai.com/labs
## Tokenization Examples
"ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
→ [▁ | ព្រះពុទ្ធសាសនា | មានសារៈសំខាន់ | ។]
4 tokens, TPC 0.143 ✅
"នាយករដ្ឋមន្ត្រីបានថ្លែងសុន្ទរកថា"
→ [▁នាយករដ្ឋមន្ត្រី | បានថ្លែង | សុន្ទរកថា]
3 tokens, TPC 0.094 ✅
Sanskrit/Pali: ធម៌ → 1 token ✅ | កម្ម → 1 token ✅ | និព្វាន → 1 token ✅
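TPC here reads as tokens per Unicode codepoint, which matches the figures above. A minimal sketch of the metric (the helper name `tpc` is ours, not from the release):

```python
def tpc(tokens, text):
    """Tokens-per-character: token count divided by Unicode codepoints."""
    return len(tokens) / len(text)

tokens = ["▁", "ព្រះពុទ្ធសាសនា", "មានសារៈសំខាន់", "។"]
text = "ព្រះពុទ្ធសាសនាមានសារៈសំខាន់។"
print(round(tpc(tokens, text), 3))  # → 0.143 (4 tokens / 28 codepoints)
```

Lower is better: fewer tokens per codepoint means longer, more word-like pieces.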
## Performance
| Metric | V3f (8K) | mT5 (250K) | XLM-R (250K) |
|---|---|---|---|
| TPC (Khmer) | 0.293 | 0.348 | 0.327 |
| Sanskrit/Pali | 93.3% | 21.4% | 28.6% |
| Cultural preservation | 91.7% | 75.0% | 91.7% |
| UNK rate | 0% | 0% | 0% |
| Lossless round-trip | Yes | No | No |
| Speed | 15M/s | 3.3M/s | 2.8M/s |
| ALT F1 (5K sentences) | 99.94% | — | — |
## Intended Uses
- Khmer text preprocessing for NLP pipelines
- Semantic search / RAG over Khmer documents
- Keyboard prediction engine
- Spell checking (with companion lexicon)
Not intended for: text generation, translation, non-Khmer languages.
## How to Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("khopilot/km-tokenizer-khmer")
tokens = tokenizer.encode("ព្រះពុទ្ធសាសនា")
decoded = tokenizer.decode(tokens)  # 100% lossless round-trip
```
Or with SentencePiece directly:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
pieces = sp.encode("កម្ពុជា", out_type=str)  # ["▁", "កម្ពុជា"]
```
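The lossless round-trip claim follows from SentencePiece's reversible whitespace handling, where U+2581 (▁) marks word starts. A minimal sketch of the default decode rule (ours, not the library's code):

```python
def detokenize(pieces):
    """SentencePiece default decode: concatenate pieces, turn the word-start
    marker U+2581 (▁) back into a space, then drop the leading space."""
    text = "".join(pieces).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text

print(detokenize(["▁", "កម្ពុជា"]))  # → កម្ពុជា
```

Because every input character survives in some piece, `detokenize(encode(text)) == text` holds for any input.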
## Training
- Algorithm: SentencePiece Unigram
- Vocabulary: 8,000 tokens
- Corpus: 648MB cleaned Khmer text (957K lines) — Wikipedia, news, government, religious texts
- Character coverage: 1.0 (full Khmer Unicode)
- User-defined symbols: 7 Sanskrit/Pali terms
- Key finding: 7 user-defined symbols outperformed 500; less manual intervention gave better results
- Hardware: Apple M3 Pro, ~30 min training
- CO2: negligible (CPU only)
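The settings above map directly onto SentencePiece trainer options. A sketch of the implied invocation; the file names and the three example UDS terms (taken from the preservation examples earlier) are assumptions, since the full list of 7 is not published here:

```python
# Hyperparameters from the card; paths are placeholders.
train_args = dict(
    input="corpus.txt",           # 648MB cleaned Khmer text, 957K lines
    model_prefix="tokenizer",     # writes tokenizer.model / tokenizer.vocab
    model_type="unigram",         # SentencePiece Unigram algorithm
    vocab_size=8000,
    character_coverage=1.0,       # full Khmer Unicode coverage
    user_defined_symbols=["ធម៌", "កម្ម", "និព្វាន"],  # 3 of the 7 UDS
)

# Requires the corpus file and the sentencepiece package:
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**train_args)
```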
## Graph Regularization (Layer 2)
When paired with graph-regularized GPT-2 (separate model):
| Metric | Baseline | Graph-Reg |
|---|---|---|
| Coherence@10 | 0.32% | 15.5% (48x) |
| Collapse | 0% | 0.2% |
| Perplexity cost | — | +2.8% |
| Retrieval MRR | 0.417 | 0.460 (+10.4%) |
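For reference, retrieval MRR is mean reciprocal rank: the average over queries of 1/rank of the first relevant document. A quick sketch (function name is ours):

```python
def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the first relevant hit
    per query; use rank 0 for queries with no relevant hit (scored as 0)."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

# e.g. first relevant hit at ranks 1, 2, and 4 across three queries
print(round(mrr([1, 2, 4]), 3))  # → 0.583
```

An MRR of 0.460 roughly means the first relevant document sits near rank 2 on average.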
## Companion: Khmer NLP Engine (26MB SQLite)
A complete prediction + correction + emoji engine built on this tokenizer:
- 60K word-pair predictions (IDF-weighted)
- 28K phrase predictions
- 12,677 validated words (spell check)
- 552 romanization mappings (Latin→Khmer)
- 400 contextual emoji suggestions
- 282 consonant cluster validations
Demo: angkor-ai.com/labs
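The card does not publish the SQLite schema, so the table and column names below are purely illustrative. A toy sketch of how an IDF-weighted word-pair prediction lookup might work:

```python
import sqlite3

# Hypothetical schema; the real 26MB database's layout is not documented here.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bigram (prev TEXT, next TEXT, idf_weight REAL)")
db.executemany(
    "INSERT INTO bigram VALUES (?, ?, ?)",
    [("បាន", "ថ្លែង", 2.7), ("បាន", "ធ្វើ", 1.9), ("បាន", "មក", 1.1)],
)

def predict(prev, k=2):
    """Top-k next-word candidates for `prev`, highest IDF weight first."""
    rows = db.execute(
        "SELECT next FROM bigram WHERE prev = ? ORDER BY idf_weight DESC LIMIT ?",
        (prev, k),
    )
    return [r[0] for r in rows]

print(predict("បាន"))  # → ['ថ្លែង', 'ធ្វើ']
```

IDF weighting favors pairs whose second word is informative rather than merely frequent, which suits keyboard prediction.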
## Limitations & Caveats
- Sanskrit/Pali circularity: 7 of 15 test terms were user-defined symbols (guaranteed preservation). True EM optimizer success rate on non-UDS terms: 87.5% (7/8).
- ALT F1 in-domain: 99.94% boundary F1 benefits from shared ZWSP segmentation conventions between training data and ALT. Cross-domain word-level F1 estimated ~95-97%.
- Retrieval MRR: +10.4% on 20 questions — preliminary, not statistically significant (overlapping bootstrap CIs).
- Grapheme break rate: 1.08%, slightly above the 1.0% target
- Corpus bias: formal/news text overrepresented vs conversational
- Foreign names fragment into individual characters
- සමាធិ (samadhi) is the only Sanskrit term that still fragments
## Version History
| Version | Vocab | TPC | Status |
|---|---|---|---|
| V6.5 (Aug 2025) | 32K | 0.664 | Failed |
| V7 (Sep 2025) | 16K | 0.294 | Deployed |
| V3f (Mar 2026) | 8K | 0.293 | Production |
## Citation

```bibtex
@software{delrieu2026tokkonizer,
  author = {Delrieu, Nicolas},
  title  = {Tokkonizer-KM: Graph-Regularized Tokenization for Khmer},
  year   = {2026},
  url    = {https://github.com/khopilot/tokkonizer-km}
}
```