DF-Arc is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" by combining Morphological Pre-tokenization with PMI-based Phrase Merging.
It achieves a fertility of 1.16 tokens per word, close to the 1:1 ideal, and high semantic density.
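
To make the phrase-merging idea concrete, here is a minimal sketch of pointwise mutual information (PMI) scoring over adjacent tokens, where PMI(a, b) = log2(p(a, b) / (p(a) p(b))). It is an illustration under stated assumptions, not DF-Arc's actual implementation: the function names `pmi_scores` and `merge_phrases`, the toy English corpus, the greedy left-to-right merge, the threshold of 3.0, and the `_` joiner are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def pmi_scores(sentences):
    """Return the PMI (in bits) of every adjacent token pair in a tokenized corpus."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

def merge_phrases(sentence, scores, threshold):
    """Greedily join adjacent tokens whose PMI meets the threshold (hypothetical policy)."""
    merged = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) >= threshold:
            merged.append(pair[0] + "_" + pair[1])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Toy corpus, already split into tokens (stands in for morphologically pre-tokenized text).
corpus = [
    ["i", "like", "artificial", "intelligence"],
    ["artificial", "intelligence", "is", "useful"],
    ["i", "like", "tea"],
    ["tea", "is", "useful"],
]

scores = pmi_scores(corpus)
print(merge_phrases(["tea", "artificial", "intelligence"], scores, threshold=3.0))
```

Pairs that co-occur more often than their individual frequencies predict score high and become single tokens, which is how a merge step can push fertility toward, or even below, one token per word.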
LlamaTokenizer).

| Model | Fertility (tokens/word) | Total Tokens | Total Words |
|---|---|---|---|
| DF-Arc | 1.162 | 133,485 | 114,882 |
| GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
| AraBERT v2 | 1.555 | 178,609 | 114,882 |
| AraT5 | 1.193 | 137,107 | 114,882 |
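
The fertility column is the total token count divided by the total word count. As a quick check, the sketch below recomputes it from the totals reported in the table; the variable and function names are illustrative only, and how the benchmark counted words is not specified here.

```python
# Token and word totals copied from the benchmark table above.
TOTAL_WORDS = 114_882
TOTAL_TOKENS = {
    "DF-Arc": 133_485,
    "GPT-4 (cl100k)": 423_743,
    "AraBERT v2": 178_609,
    "AraT5": 137_107,
}

def fertility(total_tokens, total_words):
    """Average number of tokens produced per word."""
    return total_tokens / total_words

# Round to three decimals to match the precision used in the table.
fertilities = {
    name: round(fertility(tokens, TOTAL_WORDS), 3)
    for name, tokens in TOTAL_TOKENS.items()
}

for name, value in fertilities.items():
    print(f"{name}: {value:.3f}")
```

A lower value means fewer tokens for the same text, so more Arabic content fits in a fixed context window.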
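
```python
from transformers import AutoTokenizer

# Load the DF-Arc tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")

# Sample sentence mixing Classical Arabic with Egyptian dialect.
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# tokenize() returns the token strings rather than integer IDs.
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
```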

```bibtex
@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax & Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}
```