---
inference: false
library_name: transformers
base_model: google/gemma-3-12b-it
language:
- uk
datasets:
- lang-uk/malyuk
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- gemma-3-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---

# Tereshchenko Blue: Gemma-3 tokenizer faceted to let Ukrainian shine.

The Tereshchenko Blue is the second-largest blue diamond in the world.

### By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, Tereshchenko Blue makes Ukrainian the core language in the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.

### How is this possible

More than 16 of the most popular writing systems in the world were analyzed. Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned.

### Replaced tokens

|Writing system|Tokens removed|Tokens retained|
|-|-|-|
|Han (Chinese)|16,488|4,122|
|Devanagari (Hindi)|10,976|2,743|
|Bengali|7,983|1,995|
|Arabic|6,730|1,682|
|Hiragana / Katakana (Japanese)|3,944|985|
|Hangul (Korean)|3,744|935|
|Tamil|3,080|770|
|Thai|1,740|435|
|Malayalam|1,566|391|
|Telugu|1,428|356|
|Gujarati|1,080|270|
|Kannada|1,016|253|
|Ethiopic|691|172|
|Hebrew|670|167|
|Khmer|481|119|
|Sinhala|435|108|
|Myanmar|410|102|
|Lao|243|60|
|Gurmukhi|215|53|
|Tibetan|107|26|
|Oriya|100|25|
|Cyrillic|13,398|0|
|Gemma-3|6,139|102|

## Feature Overview:

1. +81,492 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on **3 million** texts from the [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only the tokens listed in the `Replaced tokens` table were replaced; no tokens from any other writing system were affected.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings.
4. Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.

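
The ID-preservation property in the list above can be checked by comparing the two vocabularies. The following is a minimal sketch, not the author's tooling: the helper names `diff_vocab` and `compare_hub_tokenizers` are hypothetical, and the only library calls assumed are `AutoTokenizer.from_pretrained` and `get_vocab` from `transformers`.

```python
from typing import Dict, List, Tuple


def diff_vocab(base: Dict[str, int], new: Dict[str, int]) -> Tuple[List[int], List[int]]:
    """Split the base token IDs into kept (same string at the same ID) and replaced."""
    id_to_new = {idx: tok for tok, idx in new.items()}
    kept: List[int] = []
    replaced: List[int] = []
    for tok, idx in base.items():
        if id_to_new.get(idx) == tok:
            kept.append(idx)
        else:
            replaced.append(idx)
    return sorted(kept), sorted(replaced)


def compare_hub_tokenizers(base_repo: str, new_repo: str) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: load both tokenizers from the Hub and diff their vocabularies."""
    # Imported lazily so that diff_vocab can be used without transformers installed.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_repo).get_vocab()  # token -> ID
    new = AutoTokenizer.from_pretrained(new_repo).get_vocab()
    return diff_vocab(base, new)
```

Calling `compare_hub_tokenizers("google/gemma-3-12b-it", "transhumanist-already-exists/tereshchenkoblue-tokenizer")` needs network access and may require an authenticated Hugging Face session for the Gemma repository; the second repository name is the one used in this card's example code.
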
## Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/tereshchenkoblue-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [55939, 124769, 117298, 199258] - only 4 tokens 💪🏻
```

## Metrics

Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).

||lang-uk/malyuk|100k texts|allenai/c4(en)|100k texts|allenai/c4(es, fr, it, de)|400k texts|QIRIM/crh_monocorpus(Cyrillic)|94 texts|allenai/c4(ru)|100k texts|allenai/c4(bg)|100k texts|allenai/c4(be)|100k texts|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|words count|22,898,164||36,170,971||198,173,216||1,868,259||42,557,519||44,627,199||43,153,645||
|tokenizers|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|
|Qwen/Qwen3-8B|84,408,084|3.686|46,884,593|1.296|395,581,536|1.996|7,956,741|4.259|116,115,062|2.728|132,597,427|2.971|173,571,099|4.022|
|meta-llama/Llama-3.1-8B-Instruct|57,226,997|2.499|46,085,724|1.274|382,143,751|1.928|7,386,873|3.954|104,974,733|2.467|119,123,733|2.669|150,189,294|3.480|
|microsoft/Phi-4-mini-instruct|59,447,036|2.596|45,423,925|**1.256**|335,188,687|**1.691**|5,995,822|3.209|91,824,464|**2.158**|102,472,523|**2.296**|119,587,038|**2.771**|
|CohereLabs/aya-expanse-8b|50,973,632|2.226|47,364,187|1.309|353,221,932|1.782|6,614,719|3.541|93,089,697|2.187|112,612,668|2.523|141,262,943|3.273|
|google/gemma-3-12b-it|57,388,402|2.506|47,285,432|1.307|354,241,840|1.788|6,240,944|3.341|95,520,817|2.245|103,950,626|2.329|131,398,147|3.045|
|**tereshchenkoblue-tokenizer (Ours)**|37,277,244|**1.628**🤩|47,315,375|1.308|354,316,113|1.788|4,400,824|**2.356**|108,791,712|2.556|112,179,836|2.514|131,907,954|3.057|
|Comments|Significant improvement over the original Gemma 3.||English tokenisation is unchanged (AllenAI / C4 contains a small amount of mixed-language text).||Tereshchenko Blue retains all EU-language tokens, so the statistics stay the same apart from language-overlap effects.||Significant improvement on QIRIM Cyrillic.||Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.||Other Cyrillic languages, such as Bulgarian and Belarusian, degrade only slightly.|||

## Contents

- [tokenizer.json](tokenizer.json): Byte-level tokenizer spec (vocab, merges, model settings).
- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Gemma-3-style tokenizer trained on 3 million texts from the Malyuk Ukrainian corpus plus the Cyrillic QIRIM corpus (3x oversampled).
- [merge_info.json](merge_info.json): Lists the replaced Gemma-3 token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](malyuk_qirim_tokenizer.json).
- [tokenizer_config.json](tokenizer_config.json): Configuration metadata.
- [special_tokens_map.json](special_tokens_map.json): Mapping of special tokens (the same as in Gemma-3).

## Initialisation of embeddings for new tokens in Gemma 3 models

Some tokens are identical to those in the original Gemma 3 tokenizer.
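
Because unchanged tokens keep their IDs, only the embedding rows at replaced IDs need new values. The sketch below illustrates that idea under loud assumptions: plain Python lists stand in for the model's embedding matrix, and the function name `init_embeddings` and the defaults `std=0.02` and `seed=0` are hypothetical choices, not values from the Gemma-3 configuration. It copies every existing row and overwrites only the replaced rows with small Gaussian noise as a placeholder for whichever initialisation is chosen.

```python
import random
from typing import List, Sequence


def init_embeddings(
    old_rows: Sequence[Sequence[float]],
    replaced_ids: Sequence[int],
    std: float = 0.02,  # hypothetical default, not taken from Gemma-3
    seed: int = 0,      # hypothetical default, for reproducibility only
) -> List[List[float]]:
    """Copy all rows, then overwrite the rows at replaced IDs with Gaussian noise."""
    rng = random.Random(seed)
    dim = len(old_rows[0])
    # Rows at unchanged IDs are reused exactly as they are.
    new_rows = [list(row) for row in old_rows]
    for idx in replaced_ids:
        new_rows[idx] = [rng.gauss(0.0, std) for _ in range(dim)]
    return new_rows
```

In practice the same copy-then-overwrite step would be applied to the model's input embedding tensor, with the list of replaced IDs coming from a vocabulary comparison or from `merge_info.json`.
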
For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest alternative, and often an effective one, is to initialise the new embeddings randomly and train them with a warm-up schedule.

### P.S.

In my opinion, Ukraine's language-tech orientation toward the EU and the English-speaking world makes the tokens cut from the original Gemma-3 tokenizer a lower priority for any future national LLM.

## Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9194,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9194 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9194}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 26 June 2025]"
}
```