|6,139|102|
## Feature Overview
1. +81,492 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only the tokens listed in the `Replaced tokens` table were replaced; tokens from other writing systems were not affected.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
4. Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
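
The ID-preservation and vocabulary-size claims above can be checked by comparing two `token -> id` mappings. The following is a minimal sketch, not the procedure used to build this tokenizer: `diff_vocabs` is a hypothetical helper, and the toy dictionaries stand in for the mappings that `tokenizer.get_vocab()` returns.

```python
def diff_vocabs(base_vocab, new_vocab):
    """Compare two token -> id mappings (hypothetical helper).

    Returns (kept, replaced_ids):
      kept         - tokens present in both vocabularies with the same ID
      replaced_ids - IDs whose token string differs between the two
    """
    base_by_id = {idx: tok for tok, idx in base_vocab.items()}
    new_by_id = {idx: tok for tok, idx in new_vocab.items()}

    kept = {tok for tok, idx in base_vocab.items() if new_vocab.get(tok) == idx}
    replaced_ids = sorted(
        idx for idx, tok in base_by_id.items()
        if idx in new_by_id and new_by_id[idx] != tok
    )
    return kept, replaced_ids


# Toy data standing in for tokenizer.get_vocab() output (illustrative only).
base = {"<bos>": 0, "the": 1, "xyz": 2, "abc": 3}
new = {"<bos>": 0, "the": 1, "оптимізм": 2, "красиві": 3}

kept, replaced_ids = diff_vocabs(base, new)
print(sorted(kept))           # tokens that keep their ID
print(replaced_ids)           # IDs that now map to a different token
print(len(base) == len(new))  # vocabulary size is unchanged
```

With real tokenizers, `base` and `new` would come from calling `get_vocab()` on the original Gemma-3 tokenizer and on this one, respectively.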
## Simple example
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [55939, 124769, 117298, 199258] - only 4 tokens 💪🏻
```
## Metrics
Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).
||lang-uk/malyuk |100k texts|allenai/c4(en)| 100k texts|allenai/c4(es, fr, it, de) | 400k texts |QIRIM/crh_monocorpus(Cyrillic) | 94 texts |allenai/c4(ru) | 100k texts|allenai/c4(bg) | 100k texts|allenai/c4(be)| 100k texts|
|--------------------------------|-------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|-----------------------------------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|
|word count | 22,898,164 | | 36,170,971 | | 198,173,216 | | 1,868,259 | | 42,557,519 | | 44,627,199 | | 43,153,645 | |
||||||||||||||||
|tokenizers |tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|
|Qwen/Qwen3-8B |84,408,084 |3.686 |46,884,593 |1.296 |395,581,536 |1.996 |7,956,741 |4.259 |116,115,062 |2.728 |132,597,427 |2.971 |173,571,099 |4.022 |
|meta-llama/Llama-3.1-8B-Instruct|57,226,997 |2.499 |46,085,724 |1.274 |382,143,751 |1.928 |7,386,873 |3.954 |104,974,733 |2.467 |119,123,733 |2.669 |150,189,294 |3.48 |
|microsoft/Phi-4-mini-instruct |59,447,036 |2.596 |45,423,925 |**1.256** |335,188,687 |**1.691** |5,995,822 |3.209 |91,824,464 |**2.158** |102,472,523 |**2.296** |119,587,038 |**2.771** |
|CohereLabs/aya-expanse-8b |50,973,632 |2.226 |47,364,187 |1.309 |353,221,932 |1.782 |6,614,719 |3.541 |93,089,697 |2.187 |112,612,668 |2.523 |141,262,943 |3.273 |
|google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 |
|**tereshchenkoblue-tokenizer (Ours)** |37,277,244 |**1.628**🤩 |47,315,375 |1.308 |354,316,113 |1.788 |4,400,824 |**2.356** |108,791,712 |2.556 |112,179,836 |2.514 |131,907,954 |3.057 |
|Comments | Significant improvement over the original Gemma 3. | | English tokenisation is unchanged (AllenAI / C4 contains a small amount of mixed-language text). | | Tereshchenko Blue retains all EU-language tokens, so the statistics stay the same apart from language-overlap effects. | | Significant improvement on QIRIM Cyrillic. | | Russian efficiency drops owing to the Ukrainian-centric changes, but still beats Qwen. | | Bulgarian drops modestly (2.329 → 2.514 toks/word). | | Belarusian is nearly unchanged (3.045 → 3.057 toks/word). | |
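
The `toks/word` columns are the total token count divided by the total word count for each dataset. The sketch below shows one way to compute that ratio. It is an illustration under assumptions: `tokens_per_word` and `toy_encode` are hypothetical names, and counting words by whitespace splitting is a guess, since the word-counting method used for the evaluation is not stated here.

```python
def tokens_per_word(texts, encode):
    """Return (total_tokens, total_words, tokens / words) for a corpus.

    `encode` is any callable mapping a string to a list of token IDs.
    Words are counted by whitespace splitting, which is an assumption.
    """
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens, total_words, total_tokens / total_words


# Toy encoder: one "token" per 3 characters of each word (illustrative only).
def toy_encode(text):
    return [0 for w in text.split() for _ in range(0, len(w), 3)]


tokens, words, ratio = tokens_per_word(
    ["Всі красиві зберігають оптимізм"], toy_encode
)
print(tokens, words, round(ratio, 3))

# Sanity check against two cells of the lang-uk/malyuk column.
print(round(84_408_084 / 22_898_164, 3))  # Qwen/Qwen3-8B -> 3.686
print(round(37_277_244 / 22_898_164, 3))  # tereshchenkoblue-tokenizer -> 1.628
```

With a real tokenizer, `encode` could be `lambda t: tokenizer(t, add_special_tokens=False).input_ids`.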
## Contents
- [tokenizer.json](tokenizer.json): Byte‐level tokenizer spec (vocab, merges, model settings).
- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Gemma-3-style tokenizer trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic QIRIM slice (oversampled 3x).
- [merge_info.json](merge_info.json): Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](malyuk_qirim_tokenizer.json).
- [tokenizer_config.json](tokenizer_config.json): Configuration metadata.
- [special_tokens_map.json](special_tokens_map.json): Mapping of special tokens (the same as in Gemma-3).
## Initialisation of embeddings for new tokens in Gemma 3 models
Tokens that were not replaced keep their Gemma 3 IDs, so their embeddings can be reused as-is. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest alternative, which is often effective, is to initialise the new embeddings randomly and train them with a warm-up schedule.
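
As a rough illustration of the random-initialisation option, the sketch below re-samples only the rows of an embedding matrix that belong to replaced token IDs and leaves every other row untouched. It is framework-agnostic and uses plain Python lists. `reinit_rows` is a hypothetical helper, and sampling from a normal distribution fitted to the untouched rows is just one common choice, assumed here rather than prescribed. The replaced IDs would come from `merge_info.json`, whose exact schema is not described here, so they are passed in as a plain list.

```python
import random
import statistics


def reinit_rows(embeddings, new_token_ids, seed=0):
    """Re-initialise selected embedding rows in place (hypothetical helper).

    embeddings    - list of rows, each a list of floats
    new_token_ids - row indices whose tokens were replaced
    New rows are sampled from a normal distribution whose mean and standard
    deviation are estimated from the rows that are left untouched.
    """
    rng = random.Random(seed)
    replaced = set(new_token_ids)
    kept_values = [
        v for i, row in enumerate(embeddings) if i not in replaced for v in row
    ]
    mean = statistics.fmean(kept_values)
    std = statistics.pstdev(kept_values)
    for i in replaced:
        embeddings[i] = [rng.gauss(mean, std) for _ in embeddings[i]]
    return embeddings


# Toy 4 x 3 embedding matrix; rows 2 and 3 play the role of replaced tokens.
emb = [[0.1, -0.2, 0.3], [0.0, 0.2, -0.1], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]]
before = [row[:] for row in emb]
reinit_rows(emb, [2, 3])
print(emb[:2] == before[:2])  # untouched rows are preserved
print(emb[2] != before[2])    # replaced rows get fresh values
```

In a real model the same idea would be applied to the input embedding tensor (and to the output projection if it is not tied) before fine-tuning.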
### P.S.
In my opinion, Ukraine’s language-tech orientation toward the EU and the English-speaking world makes the tokens cut from the original Gemma-3 tokenizer a lower priority for any future national LLM.
## Citation
**BibTeX:**
```bibtex
@misc{zaduha2025post9194,
author = "{Bohdan Didenko}",
title = "{Post \#9194 on Telegram Channel Zaduha}",
howpublished = "\url{https://t.me/zaduha/9194}",
month = jun,
year = {2025},
note = "[Online; accessed 26 June 2025]"
}
```
|