---
inference: false
library_name: transformers
base_model: google/gemma-3-12b-it
language:
  - uk
datasets:
  - lang-uk/malyuk
  - QIRIM/crh_monocorpus
multilinguality:
  - multilingual
tags:
  - gemma-3-tokenizer
  - ukraine
  - corpus-linguistics
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---
# Tereshchenko Blue — Gemma‑3 tokenizer faceted to let Ukrainian shine.

<figure>
    <img src="tereshchenkoblue.png" width="300px" style="margin-left:auto; margin-right:auto; display:block" caption=""/>
    <figcaption><a href="https://en.wikipedia.org/wiki/Tereshchenko_diamond">Tereshchenko Blue is the second-largest blue diamond in the world</a></figcaption>
</figure>

### By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, Tereshchenko Blue makes Ukrainian the core language in the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.

### How is this possible?
More than 16 of the world's most widely used writing systems were analyzed.
Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned.
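
To get a feel for this kind of analysis, here is a minimal sketch that assigns each token to a writing system by looking at the Unicode name of its first letter. It is an illustration only: the card does not describe how tokens were classified or which ones were chosen for pruning, and `SCRIPT_PREFIXES`, `token_script`, and `toy_vocab` are hypothetical names covering just a subset of scripts.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode character-name prefixes to writing-system
# labels; it covers only a subset of scripts and can be extended as needed.
SCRIPT_PREFIXES = {
    "CJK": "Han",
    "HIRAGANA": "Hiragana / Katakana",
    "KATAKANA": "Hiragana / Katakana",
    "HANGUL": "Hangul",
    "DEVANAGARI": "Devanagari",
    "BENGALI": "Bengali",
    "ARABIC": "Arabic",
    "THAI": "Thai",
    "CYRILLIC": "Cyrillic",
    "LATIN": "Latin",
}


def token_script(token: str) -> str:
    """Return the script of the first letter in `token`, or 'Other'."""
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, punctuation and the "▁" word-boundary marker
        prefix = unicodedata.name(ch, "").split(" ")[0]
        return SCRIPT_PREFIXES.get(prefix, "Other")
    return "Other"


# Toy vocabulary standing in for the real 256K-token list (assumption).
toy_vocab = ["▁hello", "▁привіт", "中文", "한국", "ไทย", "123"]
counts = Counter(token_script(t) for t in toy_vocab)
print(counts)
```

On a real tokenizer the same function could be mapped over the keys of `tokenizer.get_vocab()` to obtain per-script counts.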

### Replaced tokens
|Writing system|Tokens removed|Tokens retained|
|-|-|-|
|Han (Chinese)|16,488|4,122|
|Devanagari (Hindi)|10,976|2,743|
|Bengali|7,983|1,995|
|Arabic|6,730|1,682|
|Hiragana / Katakana (Japanese)|3,944|985|
|Hangul (Korean)|3,744|935|
|Tamil|3,080|770|
|Thai|1,740|435|
|Malayalam|1,566|391|
|Telugu|1,428|356|
|Gujarati|1,080|270|
|Kannada|1,016|253|
|Ethiopic|691|172|
|Hebrew|670|167|
|Khmer|481|119|
|Sinhala|435|108|
|Myanmar|410|102|
|Lao|243|60|
|Gurmukhi|215|53|
|Tibetan|107|26|
|Oriya|100|25|
|Cyrillic|13,398|0|
|Gemma-3 \<unused-*\>|6,139|102|


## Feature Overview:

1. +81,492 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only the tokens listed in the `Replaced tokens` table were replaced; no tokens from any other writing system were affected.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
4. Vocabulary size, special-token set, pre-/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
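
The ID-preservation property in point 3 can be checked mechanically. The sketch below uses two toy vocabularies rather than the real ones, and `split_by_id_stability` is a hypothetical helper; with real tokenizers, the two dictionaries would presumably come from `tokenizer.get_vocab()` on the original Gemma-3 tokenizer and on this one.

```python
def split_by_id_stability(
    base_vocab: dict[str, int], new_vocab: dict[str, int]
) -> tuple[list[int], list[int]]:
    """Split IDs into those whose token string is unchanged and those replaced."""
    base_by_id = {i: tok for tok, i in base_vocab.items()}
    unchanged, replaced = [], []
    for tok, i in new_vocab.items():
        # An ID is "unchanged" if it maps to the same token string in both vocabs.
        (unchanged if base_by_id.get(i) == tok else replaced).append(i)
    return sorted(unchanged), sorted(replaced)


# Toy vocabularies (assumption): IDs 2 and 3 were reassigned to Ukrainian tokens.
base = {"<bos>": 0, "▁the": 1, "中": 2, "ক": 3}
new = {"<bos>": 0, "▁the": 1, "▁привіт": 2, "▁дякую": 3}

unchanged_ids, replaced_ids = split_by_id_stability(base, new)
print(unchanged_ids)  # [0, 1]
print(replaced_ids)   # [2, 3]
```

Embedding rows for the IDs in `unchanged_ids` can be carried over as they are; only the rows in `replaced_ids` need new values.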

## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [55939, 124769, 117298, 199258] - only 4 tokens 💪🏻
```


## Metrics
	
Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).
||lang-uk/malyuk                                                                                              |100k texts|allenai/c4(en)| 100k texts|allenai/c4(es, fr, it, de) | 400k texts                                                              |QIRIM/crh_monocorpus(Cyrillic) | 94 texts                                                   |allenai/c4(ru)                                                                                                             | 100k texts|allenai/c4(bg)                                                                                                                                                                    | 100k texts|allenai/c4(be)| 100k texts|
|--------------------------------|-------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|-----------------------------------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|
|words count                      <td colspan=2>22,898,164                                                                                                                 <td colspan=2>36,170,971         <td colspan=2>198,173,216        <td colspan=2>1,868,259             <td colspan=2>42,557,519                <td colspan=2>44,627,199                           <td colspan=2>43,153,645                |
||||||||||||||||
|tokenizers             |tokens                                                                                                             |toks/word|tokens               |toks/word|tokens                                                                                         |toks/word|tokens                                                                                |toks/word|tokens                                                                                                                            |toks/word|tokens                                                                                                                                                                                  |toks/word|tokens               |toks/word|
|Qwen/Qwen3-8B                   |84,408,084                                                                                                         |3.686    |46,884,593           |1.296    |395,581,536                                                                                    |1.996    |7,956,741                                                                             |4.259    |116,115,062                                                                                                                       |2.728    |132,597,427                                                                                                                                                                             |2.971    |173,571,099          |4.022    |
|meta-llama/Llama-3.1-8B-Instruct|57,226,997                                                                                                         |2.499    |46,085,724           |1.274    |382,143,751                                                                                    |1.928    |7,386,873                                                                             |3.954    |104,974,733                                                                                                                       |2.467    |119,123,733                                                                                                                                                                             |2.669    |150,189,294          |3.48     |
|microsoft/Phi-4-mini-instruct   |59,447,036                                                                                                         |2.596    |45,423,925           |**1.256**    |335,188,687                                                                                    |**1.691**    |5,995,822                                                                             |3.209    |91,824,464                                                                                                                        |**2.158**    |102,472,523                                                                                                                                                                             |2.296    |119,587,038          |**2.771**    |
|CohereLabs/aya-expanse-8b       |50,973,632                                                                                                         |2.226    |47,364,187           |1.309    |353,221,932                                                                                    |1.782    |6,614,719                                                                             |3.541    |93,089,697                                                                                                                        |2.187    |112,612,668                                                                                                                                                                             |**2.523**    |141,262,943          |3.273    |
|google/gemma-3-12b-it           |57,388,402                                                                                                         |2.506    |47,285,432           |1.307    |354,241,840                                                                                    |1.788    |6,240,944                                                                             |3.341    |95,520,817                                                                                                                        |2.245    |103,950,626                                                                                                                                                                             |2.329    |131,398,147          |3.045    |
|**tereshchenkoblue-tokenizer (Ours)**                |37,277,244                                                                                              |**1.628**🤩   |47,315,375           |1.308    |354,316,113                                                                                   |1.788     |4,400,824                                                                             |**2.356**    |108,791,712                                                                                                                       |2.556    |112,179,836                                                                                                                                                                             |2.514    |131,907,954          |3.057    |
|Comments                        <td colspan=2> Significant improvement over the original Gemma 3<td colspan=2>English tokenisation is essentially unchanged; the small difference comes from the mixed-language text in allenai/c4.<td colspan=2>Tereshchenko Blue retains all EU-language tokens, so the statistics stay the same apart from language-overlap effects.<td colspan=2>Shows significant improvement on QIRIM Cyrillic<td colspan=2>Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.<td colspan=4> Efficiency for other Cyrillic languages, such as Bulgarian and Belarusian, drops only slightly.|

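The `toks/word` columns are the total token count divided by the total word count. A minimal sketch of that metric follows, assuming words are whitespace-delimited (the card does not state how words were counted); `tokens_per_word` is a hypothetical helper, and the constants are copied from the `lang-uk/malyuk` columns of the table.

```python
from typing import Callable, Iterable


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], list]) -> float:
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())  # assumption: words are split on whitespace
    return total_tokens / total_words


# Totals reported in the table for lang-uk/malyuk (100k texts).
words = 22_898_164
gemma_tokens = 57_388_402
ours_tokens = 37_277_244

print(round(gemma_tokens / words, 3))          # 2.506
print(round(ours_tokens / words, 3))           # 1.628
print(f"{1 - ours_tokens / gemma_tokens:.1%}")  # 35.0%
```

Read this way, the table shows about 35% fewer tokens than the original Gemma-3 tokenizer on the same Ukrainian texts.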

## Contents

- [tokenizer.json](tokenizer.json): Byte‐level tokenizer spec (vocab, merges, model settings).

- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Gemma-3-style tokenizer trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic QIRIM corpus (3x oversampled).

- [merge_info.json](merge_info.json): Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](malyuk_qirim_tokenizer.json).

- [tokenizer_config.json](tokenizer_config.json): Configuration metadata.

- [special_tokens_map.json](special_tokens_map.json): Mapping of special tokens (the same as in Gemma-3).

## Initialisation of embeddings for new tokens in Gemma 3 models
Tokens that were not replaced are identical to those in the original Gemma 3 tokenizer and can keep their existing embeddings. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
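
The random-initialisation option might look roughly like the sketch below. Plain Python lists stand in for the embedding matrix so the example runs without a model; `init_replaced_rows` is a hypothetical helper, `std=0.02` is an assumed value, and `replaced_ids` is assumed to be derived from `merge_info.json`, whose exact schema is not described here.

```python
import random


def init_replaced_rows(embeddings, replaced_ids, std=0.02, seed=0):
    """Return a copy of `embeddings` in which only rows in `replaced_ids` are re-drawn.

    Rows for unchanged token IDs are copied as they are, so they keep their
    pretrained values; replaced rows are filled with N(0, std^2) noise.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    replaced = set(replaced_ids)
    return [
        [rng.gauss(0.0, std) for _ in range(dim)] if i in replaced else list(row)
        for i, row in enumerate(embeddings)
    ]


# Toy 3-token, 2-dimensional embedding matrix (assumption); ID 1 was replaced.
old_embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
new_embeddings = init_replaced_rows(old_embeddings, replaced_ids=[1])

print(new_embeddings[0], new_embeddings[2])  # unchanged rows are preserved
print(new_embeddings[1])                     # freshly initialised row
```

On a real model, the same idea would apply to the rows of `model.get_input_embeddings().weight`, and to the output projection as well if its weights are not tied to the input embeddings.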


### P.S.
In my opinion, Ukraine’s language-tech orientation toward the EU and the English-speaking world makes the tokens cut from the original Gemma-3 tokenizer a lower priority for any future national LLM.

## Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9194,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9194 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9194}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 26 June 2025]"
}
```