Hypa-Gemma4 E2B

A multilingual, tool-aware fine-tune of Google's Gemma 4 E2B for low-resource and underrepresented languages.

License: Apache 2.0 | Base: Gemma 4 E2B-it | GitHub: Hypa-Gemma | Blog Post | Trained with Unsloth

Model Description

Hypa-Gemma4 E2B (hypaai/Hypa-Gemma4-E2B-v1) is a LoRA-merged fine-tune of Google DeepMind's Gemma 4 E2B-it, produced by Hypa Intelligence. It is the first model released in our open research line on adapting modern open foundation models for low-resource and underrepresented languages, with a deliberate focus on retaining the base model's tool-aware and agentic prompting structure.

This release covers seventeen languages: English, French, Spanish, and fourteen languages of Nigeria. Several of the smaller languages in this set (including Annang, Eggon, Idoma, Igala, Nupe, and Urhobo) have not been formally represented in large-scale fine-tuning corpora before, or had no settled ISO-style language tag at the time we needed one.

The model is intended for translation, language detection, dictionary-style explanation, and general multilingual instruction-following. It inherits Gemma 4's native chat template, system / user / model role structure, and dedicated formatting for thinking and tool use.

| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it (2.3B effective, 5.1B with PLE) |
| Method | LoRA (r=1024, α=1024) via Unsloth + QLoRA, then merged to 16-bit |
| Trainable parameters | 1.62B / 6.74B (24.04%) |
| Training data | 15.9M examples across 11 concatenated sub-datasets |
| Compute | 1× NVIDIA H200 SXM, 7.41 days |
| Languages | 17 |
| Context window | 128K (inherited from base model) |
| License | Apache 2.0 |

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "hypaai/Hypa-Gemma4-E2B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert Igbo translator."},
    {"role": "user", "content": "Translate to Igbo: Good morning, how are you today?"},
]

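# Build model inputs with the chat template. return_dict=True yields
# input_ids and attention_mask, so generate() also receives the mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))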

For thinking mode, pass enable_thinking=True to apply_chat_template. For tool-use payloads, use the standard Gemma 4 messages-with-tools format.

Languages Covered

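| Code | Language | Code | Language |
|---|---|---|---|
| en | English | ibb | Ibibio |
| ann | Annang | idm | Idoma |
| efi | Efik | igl | Igala |
| ebi | Ebira | ig | Igbo |
| ego | Eggon | nup | Nupe |
| es | Spanish | pg | Pidgin |
| fr | French | tiv | Tiv |
| ha | Hausa | urh | Urhobo |
| yo | Yoruba | | |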

Some of the smaller languages in this set required custom or non-standard tags because no widely-adopted machine-readable code existed at the time of training. Where ISO 639 codes were available (two-letter ISO 639-1 or three-letter ISO 639-3), we used them; where they were not, we documented our internal codes in the data release so downstream users can reproduce splits.
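
For downstream scripts, the table above can be mirrored as a small lookup. The following is a minimal sketch: the codes and names are copied from the table, while `LANGUAGES` and `build_translation_messages` are hypothetical names, and the prompt wording simply follows the Quick Start example rather than any documented training template.

```python
# Codes and names copied from the "Languages Covered" table.
LANGUAGES = {
    "en": "English", "ann": "Annang", "efi": "Efik", "ebi": "Ebira",
    "ego": "Eggon", "es": "Spanish", "fr": "French", "ha": "Hausa",
    "yo": "Yoruba", "ibb": "Ibibio", "idm": "Idoma", "igl": "Igala",
    "ig": "Igbo", "nup": "Nupe", "pg": "Pidgin", "tiv": "Tiv",
    "urh": "Urhobo",
}


def build_translation_messages(text, target_code):
    """Return chat messages asking for a translation into a covered language.

    Hypothetical helper: the prompt wording mirrors the Quick Start example.
    """
    if target_code not in LANGUAGES:
        raise ValueError(f"unsupported language code: {target_code!r}")
    target = LANGUAGES[target_code]
    return [
        {"role": "system", "content": f"You are an expert {target} translator."},
        {"role": "user", "content": f"Translate to {target}: {text}"},
    ]


messages = build_translation_messages("Good morning, how are you today?", "ig")
```

The resulting `messages` list can be passed to `apply_chat_template` exactly as in Quick Start. Rejecting unknown codes early avoids silently prompting for a language outside the seventeen covered here.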

Training Data

Training data comprises 15.9 million examples assembled from eleven Hypa Intelligence sub-datasets, each contributing a different signal:

  1. Synthetic_Dictionary_Text_ONLY, Synthetic_Dictionary_FF_CC_Text_ONLY, JSON_Dictionary_Text_ONLY — three views (prose, cloze, structured JSON) of a curated multilingual dictionary, providing lexical grounding across all target languages.
  2. Fleurs9000_Text_ONLY — ~9,000 text-only examples derived from FLEURS, providing parallel translations across the language set.
  3. CommonVoice_35k_Text_ONLY, CommonVoice_15k_Text_ONLY — transcript-only data drawn from Mozilla CommonVoice, providing real-world spoken-language distribution as text.
  4. cv_15k_translation, cv_15k_detection — CommonVoice transcripts reformulated into translation and language-detection tasks.
  5. cbp_translation, cbp_detection — parallel translation and detection pairs from a community-sourced parallel corpus with broad coverage of the smaller languages.
  6. gpt_oss_synthetic — general-purpose synthetic instruction data generated using open-source LLMs, included specifically to mitigate catastrophic forgetting of the base model's instruction-following ability.

A public 10k subset of the training data is released as hypaai/Hypa-Text-10k. Additional sub-datasets are progressively being released under the hypaai organization.

Prompt Formatting

Every example was formatted using Gemma 4's native chat template, with explicit system, user, and model roles and dedicated control tokens for thinking (<|think|>, <|channel>thought ... <channel|>) and tool use. Loss was computed only on assistant turns via train_on_responses_only with instruction_part="<|turn>user\n" and response_part="<|turn>model\n".
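
To illustrate what computing loss only on assistant turns means, here is a minimal sketch of response-only label masking. It is not Unsloth's `train_on_responses_only` implementation: `mask_non_response` is a hypothetical function, the integer ids are toy stand-ins for the tokenized `instruction_part` and `response_part` markers, and `-100` is the ignore index conventionally used by PyTorch cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy loss


def mask_non_response(input_ids, instruction_ids, response_ids):
    """Return labels that keep only tokens inside model (response) turns.

    Marker tokens and everything in user turns are set to IGNORE_INDEX.
    """
    if not instruction_ids or not response_ids:
        raise ValueError("marker sequences must be non-empty")
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if input_ids[i:i + len(response_ids)] == response_ids:
            in_response = True           # a model turn starts after this marker
            i += len(response_ids)
        elif input_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False          # a user turn starts after this marker
            i += len(instruction_ids)
        else:
            if in_response:
                labels[i] = input_ids[i]  # this token contributes to the loss
            i += 1
    return labels


# Toy ids: [1, 2] stands in for the user marker, [1, 3] for the model marker.
USER, MODEL = [1, 2], [1, 3]
ids = USER + [10, 11] + MODEL + [20, 21] + USER + [12] + MODEL + [22]
labels = mask_non_response(ids, USER, MODEL)
```

In this toy sequence only the tokens `20`, `21`, and `22` keep their labels, so gradient signal comes exclusively from the model turns.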

Training Procedure

| Hyperparameter | Value |
|---|---|
| LoRA rank (r) | 1024 |
| LoRA alpha (α) | 1024 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Vision / audio modules | frozen |
| Quantization | 4-bit base, bf16 compute |
| Optimizer | AdamW 8-bit |
| Learning rate | 1e-5 |
| LR schedule | cosine, 500 warmup steps |
| Weight decay | 0.001 |
| Max grad norm | 1.0 |
| Per-device batch size | 32 |
| Gradient accumulation | 2 |
| Effective batch size | 64 |
| Sequence length | 2048 |
| Epochs | 1 |
| Total steps | 248,862 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (Unsloth) |
| Hardware | 1× NVIDIA H200 SXM (Runpod) |
| Runtime | 7.41 days |
| Compute (approx) | 2.11 × 10²⁰ FLOPs |
| Random seed | 3407 |
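
The hyperparameters above are mutually consistent, which a few lines of arithmetic can confirm. This is a minimal sketch using only the values listed in this section; the `alpha / r` scaling assumes the standard LoRA formulation (not rank-stabilized LoRA), which this card does not state explicitly.

```python
per_device_batch = 32
grad_accum = 2
num_devices = 1            # 1x NVIDIA H200 SXM
total_steps = 248_862
runtime_days = 7.41
lora_r, lora_alpha = 1024, 1024

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices

# Examples seen over the single epoch; should match the ~15.9M reported.
examples_seen = effective_batch * total_steps

# Assumption: standard LoRA scales the adapter update by alpha / r.
lora_scale = lora_alpha / lora_r

# Average wall-clock time per optimizer step.
seconds_per_step = runtime_days * 86_400 / total_steps

print(effective_batch)             # 64
print(examples_seen)               # 15927168
print(lora_scale)                  # 1.0
print(round(seconds_per_step, 2))  # 2.57
```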

Training was performed using Unsloth, which provides day-zero Gemma 4 support and patches the shared-KV-cache interaction with gradient checkpointing that affects naive QLoRA setups in vanilla transformers.

Evaluation and Recommendations

Training metrics

  • Final training loss: 0.41 (smooth monotonic decay)
  • Best evaluation loss: 3.128 at step 40,000
  • Final evaluation loss: 3.577

Honest note on overfitting

Evaluation loss bottomed out at step 40,000 (≈16% of total training) and rose steadily for the remaining ~209,000 steps, ending ~14% above its best value. The training loss continued to decrease throughout. This is a clear overfitting signature: the model memorized the training distribution beyond the point where additional optimization improved generalization.

For downstream use, we recommend the LoRA checkpoint at step 40,000 rather than the final merged 16-bit weights published in this repository. The intermediate checkpoint is available at hypaai/Hypa-Gemma4-E2B-v1-LoRAs. The merged 16-bit weights in this repo represent the end-of-training state and are useful as a reference but are not the strongest checkpoint by evaluation loss.

The full discussion of why this happened and what we will change for v2 is in the public write-up. Headline planned fixes: lower LoRA rank, load_best_model_at_end=True with eval_loss as the selection metric, language-held-out evaluation splits, and a lower step count.
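
The selection rule behind `load_best_model_at_end=True` with `eval_loss` as the metric can be mimicked in plain Python. The sketch below is not the Trainer itself: `select_best_checkpoint` is a hypothetical helper, and `eval_log` contains only the two evaluation points reported in this card, whereas a real run would log many more.

```python
# The two evaluation points reported in this card.
eval_log = [
    (40_000, 3.128),    # best evaluation loss
    (248_862, 3.577),   # final evaluation loss
]


def select_best_checkpoint(log):
    """Return the (step, eval_loss) pair with the lowest evaluation loss."""
    return min(log, key=lambda entry: entry[1])


best_step, best_loss = select_best_checkpoint(eval_log)
final_step, final_loss = eval_log[-1]

# How far the final checkpoint drifted above the best one.
relative_increase = (final_loss - best_loss) / best_loss

# How early in training the best checkpoint occurred.
fraction_of_training = best_step / final_step

print(best_step)                       # 40000
print(f"{relative_increase:.1%}")      # 14.4%
print(f"{fraction_of_training:.1%}")   # 16.1%
```

These figures reproduce the roughly 14% degradation and the roughly 16% position quoted in the overfitting note.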

Qualitative observations

Internal qualitative review on translation tasks shows substantial improvements over the base Gemma 4 E2B for every language in the set, with the largest deltas on the smallest languages (Annang, Efik, Ibibio), where the base model was effectively unusable. Quantitative chrF++, BLEU, and BLEURT results across language pairs will follow in a separate evaluation post.

Intended Use

Direct use cases:

  • Translation between English / French / Spanish and the fourteen covered low-resource languages
  • Language detection across all seventeen languages
  • Dictionary-style lexical lookup and explanation
  • Multilingual instruction-following on dialogue tasks
  • Tool-aware / function-calling-style prompting (inheriting the base model's structure)

Downstream use:

  • Suitable as a starting point for further fine-tuning on more specialized tasks within the supported languages
  • Suitable for adapter stacking (e.g., domain-specific LoRA on top)
  • Suitable for on-device deployment when quantized (E2B at 4-bit fits on mid-range mobile hardware)
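
As a rough guide to the on-device claim, weight memory can be estimated from the parameter count and the bit width. This is a back-of-the-envelope sketch: it counts all 5.1B parameters (including PLE) from the property table, and it ignores quantization metadata, the KV cache, and activations, so real footprints will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 5.1e9  # base model size with PLE, from the property table

for bits in (16, 8, 4):
    print(bits, round(weight_memory_gb(TOTAL_PARAMS, bits), 2))
# 16 10.2
# 8 5.1
# 4 2.55
```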

Out-of-Scope and Limitations

  • Not safety-tuned for sensitive domains. This model has not undergone RLHF or DPO post-training. It should not be used unsupervised for medical, legal, financial, or psychological-counseling applications.
  • Quality varies by language. The smallest languages in the set are underrepresented even within our training mix, so output in those languages should be reviewed by native speakers before production use.
  • Overfitting on the final checkpoint. As noted above, the merged 16-bit weights in this repository correspond to the end of training, not the best evaluation checkpoint. For applications that prioritize generalization, use the step-40,000 LoRA from the companion repository.
  • Vision and audio components are frozen. This is a text-only fine-tune. The base model's image and audio capabilities are preserved at the underlying weight level but were not exercised during training and have not been validated for our target languages.
  • Tokenization quality. Gemma 4's 256K-vocabulary tokenizer fragments low-resource languages less aggressively than smaller-vocabulary tokenizers, but the smallest languages in this release still tokenize at higher cost per character than English. This is the gap we expect future iterations to close.
  • Coverage is finite. The seventeen languages in this release are the start, not the end. Many other underrepresented languages are not yet supported and may produce unreliable output.
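
The tokenization-cost gap mentioned above can be measured as tokens per character. The following is a minimal sketch: `tokens_per_char` is a hypothetical helper, and `toy_tokenize` is a whitespace stand-in used only so the example runs without downloading a model.

```python
def tokens_per_char(text, tokenize):
    """Average number of tokens produced per character of text."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokenize(text)) / len(text)


def toy_tokenize(s):
    """Stand-in tokenizer for illustration only: splits on whitespace."""
    return s.split()


cost = tokens_per_char("Good morning, how are you today?", toy_tokenize)
```

To compare languages with the real tokenizer, pass `tokenizer.tokenize` from the Quick Start snippet in place of `toy_tokenize` and evaluate parallel sentences in each language; a higher value means more tokens are spent per character.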

Bias, Risks, and Limitations

This model inherits the biases and limitations of its base model (Google Gemma 4) and adds the biases of its fine-tuning corpus, which is weighted toward dictionary, religious-parallel, and CommonVoice text. Religious-parallel text in particular is a known cause of register and content bias in low-resource translation models. Users deploying this model in customer-facing applications should evaluate output for cultural appropriateness in their specific use case and language.

The model is not intended to make decisions affecting people's rights, health, finances, or wellbeing. Like all language models, it can produce confident-sounding output that is incorrect, particularly on the smallest languages where training data was thinnest.

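Released Artifacts

  • hypaai/Hypa-Gemma4-E2B-v1: merged 16-bit weights (end-of-training state)
  • hypaai/Hypa-Gemma4-E2B-v1-LoRAs: LoRA checkpoints, including the recommended step-40,000 checkpoint
  • hypaai/Hypa-Text-10k: public 10k subset of the training data
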
Citation

If you use Hypa-Gemma4 E2B or any of the related work, please cite:

@misc{hypaai2026hypagemma4e2b,
  title        = {Hypa-Gemma4 E2B: A Multilingual Fine-Tune of Gemma 4 for Underrepresented Languages},
  author       = {{Hypa Intelligence}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hypaai/Hypa-Gemma4-E2B-v1}},
  note         = {Apache 2.0 License. Blog: \url{https://hypaintelligence.com/updates/tuning-gemma-4-for-multilingual-and-tool-aware-language-understanding}}
}

License

Released under the Apache License 2.0, inheriting the license of the Gemma 4 base model. Free to use, modify, and redistribute for both research and commercial purposes, with no monthly active user caps. Redistribution must retain the license text and existing notices, as Apache 2.0 requires.

Acknowledgments

  • Google DeepMind for releasing Gemma 4 under Apache 2.0 and for its 140+ language pretraining commitment.
  • Unsloth for day-zero Gemma 4 support and for handling the KV-sharing interaction with gradient checkpointing.
  • Runpod for reliable H200 infrastructure.
  • The language communities, speakers, and reviewers whose texts, voices, and feedback grounded this work and keep it honest.

Hypa Intelligence: Website | Hugging Face | GitHub | Blog

Multilingualism is not a feature. It is a prerequisite for AI that represents all of us.
