Update README.md
README.md
CHANGED
@@ -24,8 +24,6 @@ pretty_name: “gemma-3 - ukrainized gemma tokenizer”
 <img src="tereshchenkoblue.png" width="300px" style="margin-left:'auto' margin-right:'auto' display:'block'" caption=""/>
 <figcaption><a ref="https://en.wikipedia.org/wiki/Tereshchenko_diamond">Tereshchenko Blue is the second biggest blue diamond in the world</a></figcaption>
 </figure>
-<!--
- -->
 
 ### By adding more than 80K Ukrainian tokens **without removing any English or EU languages tokens**, Tereshchenko Blue makes Ukrainian the core language in the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
 
@@ -58,7 +56,7 @@ Roughly four-fifths of tokens in scripts geographically and culturally distant f
 |Tibetan|107|26|
 |Oriya|100|25|
 |Cyrillic|13398|0|
-|Gemma-3 \<unused-*\>|
+|Gemma-3 \<unused-*\>|6139|102|
 
 
 ## Feature Overview:
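The table rows touched by this change report token counts per script (Tibetan, Oriya, Cyrillic). The README diff does not show how those counts were produced, so the following is only a minimal sketch of one way to tally a vocabulary by script. It assumes that the first word of each letter's Unicode character name is a good enough script label, which is a heuristic rather than the Unicode Script property. The names `token_script`, `count_by_script`, and `toy_vocab` are hypothetical and do not come from the repository.

```python
import unicodedata
from collections import Counter


def token_script(token: str) -> str:
    """Return a coarse script label for a token (heuristic, not the authors' method)."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            # Skip digits, punctuation, combining marks, and the "▁" word-boundary marker.
            continue
        name = unicodedata.name(ch, "")
        # Assumption: the first word of the Unicode name identifies the script,
        # e.g. "CYRILLIC SMALL LETTER YI" -> "CYRILLIC".
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    if not scripts:
        return "OTHER"  # no letters at all (numbers, symbols, empty string)
    return scripts.pop() if len(scripts) == 1 else "MIXED"


def count_by_script(vocab):
    """Count how many tokens in an iterable of strings fall under each script label."""
    return Counter(token_script(tok) for tok in vocab)


# Toy vocabulary for illustration only; these are not tokens taken from the tokenizer.
toy_vocab = ["▁the", "▁привіт", "їжа", "བོད", "ଓ", "123", "▁aб"]
counts = count_by_script(toy_vocab)

for script, n in sorted(counts.items()):
    print(f"{script}: {n}")
```

To apply the same tally to a real tokenizer, the toy list could be replaced with the keys of the vocabulary dictionary, assuming the tokenizer object exposes one.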