OmniVoice Amharic

Non-autoregressive discrete diffusion TTS for Amharic, fine-tuned from OmniVoice (Qwen3-0.6B + HiggsAudioV2, 8 codebooks).

Initialized from v3, trained on a multi-dataset Amharic corpus (~81K samples, ~331 hours).

πŸ† Best Result

| Metric | Value |
|---|---|
| Best loss | 3.9518 |
| Best step | 10,000 / 12,000 |
| Epochs | ~8.3 / 10 |

Quick Start β€” Colab Inference


Paste the following cells into a Colab notebook (GPU runtime required; a T4 is sufficient):

Cell 1: Install

!pip install -q omnivoice soundfile

Cell 2: Load Model

import torch
import numpy as np
import soundfile as sf
from IPython.display import Audio, display
from omnivoice import OmniVoice, OmniVoiceGenerationConfig

MODEL_ID = "demeleww/omnivoice-amharic"

model = OmniVoice.from_pretrained(
    MODEL_ID,
    device_map="cuda:0",
    dtype=torch.float16,
    load_asr=True,  # for auto-transcribing reference audio
)
print(f"βœ… Loaded {MODEL_ID}")

Cell 3: Generate Speech (no voice cloning)

# "Hello, welcome everyone. This is an Amharic speech test."
text = "αˆ°αˆ‹αˆα£ αŠ₯αŠ•αŠ³αŠ• α‹°αˆ…αŠ“ αˆ˜αŒ£α‰½αˆα’ α‹­αˆ… α‹¨αŠ αˆ›αˆ­αŠ› αŠ•αŒαŒαˆ­ αˆ™αŠ¨αˆ« αŠα‹α’"

gen_config = OmniVoiceGenerationConfig(
    num_step=32,
    guidance_scale=2.0,
)

audio = model.generate(
    text=text,
    language="Amharic",
    generation_config=gen_config,
)

waveform = (audio[0] * 32767).astype(np.int16)
display(Audio(waveform, rate=24000))
sf.write("output.wav", audio[0], 24000)
print("βœ… Saved to output.wav")
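The int16 conversion above assumes the waveform stays within [-1, 1]; if the model ever emits a value slightly outside that range, the raw cast wraps around audibly. A defensive sketch (plain NumPy, no OmniVoice dependency) that clips first:

```python
import numpy as np

def float_to_int16(wave) -> np.ndarray:
    """Convert a float waveform in [-1, 1] to int16, saturating out-of-range samples."""
    wave = np.asarray(wave, dtype=np.float32)
    # Clip first so a value like 1.0001 saturates at 32767 instead of wrapping to -32768.
    clipped = np.clip(wave, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)
```

This is a drop-in replacement for `(audio[0] * 32767).astype(np.int16)` in the cells above and below.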

Cell 4: Generate with Voice Cloning (optional)

# Upload a reference WAV file to Colab first, or use a URL
REF_AUDIO = "reference.wav"  # path to your reference audio

prompt = model.create_voice_clone_prompt(
    ref_audio=REF_AUDIO,
    ref_text=None,  # auto-transcribe (or provide exact transcript)
)

audio = model.generate(
    text="αˆ°αˆ‹αˆα£ ሡሜ α‹³α‹Šα‰΅ αŠα‹α’ α‹›αˆ¬ αŒ₯ሩ α‰€αŠ• αŠα‹α’",
    language="Amharic",
    voice_clone_prompt=prompt,
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)

waveform = (audio[0] * 32767).astype(np.int16)
display(Audio(waveform, rate=24000))
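A bad reference clip is the most common cloning failure, so it can help to sanity-check it before building the prompt. The model decodes at 24 kHz (the `rate=24000` used above); the 30-second cap and mono expectation in this sketch are assumptions for illustration, not documented OmniVoice constraints:

```python
import numpy as np

def check_ref_audio(data: np.ndarray, sr: int,
                    expected_sr: int = 24000, max_seconds: float = 30.0) -> list:
    """Return a list of warning strings for common voice-cloning reference issues."""
    warnings = []
    if data.ndim > 1:
        warnings.append(f"{data.shape[1]} channels; consider downmixing to mono")
    if sr != expected_sr:
        warnings.append(f"sample rate is {sr} Hz, expected {expected_sr} Hz; consider resampling")
    duration = data.shape[0] / sr
    if duration > max_seconds:
        warnings.append(f"clip is {duration:.1f}s; shorter clips (5-15s) usually clone better")
    return warnings
```

Since `sf.read` returns `(data, samplerate)`, you can call it as `check_ref_audio(*sf.read(REF_AUDIO))` and print whatever comes back.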

Compare All Models

Use this cell to compare the base OmniVoice model vs all fine-tuned versions side by side:

#@title Compare All Models
import torch, gc
import numpy as np
from IPython.display import Audio, display, HTML
from omnivoice import OmniVoice, OmniVoiceGenerationConfig

MODELS = {
    "🌍 Base OmniVoice":            "k2-fsa/OmniVoice",
    "πŸ‡ͺπŸ‡Ή Amharic v3":              "demeleww/omnivoice-amharic-finetuned-v3",
    "πŸ‡ͺπŸ‡Ή Amharic Best (this repo)": "demeleww/omnivoice-amharic",
}

# "Hello, welcome everyone. This is an Amharic speech test."
text = "αˆ°αˆ‹αˆα£ αŠ₯αŠ•αŠ³αŠ• α‹°αˆ…αŠ“ αˆ˜αŒ£α‰½αˆα’ α‹­αˆ… α‹¨αŠ αˆ›αˆ­αŠ› αŠ•αŒαŒαˆ­ αˆ™αŠ¨αˆ« αŠα‹α’"
gen_config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)

for name, model_id in MODELS.items():
    print(f"\n{'='*50}")
    print(f"Loading {name} ({model_id})...")
    m = OmniVoice.from_pretrained(model_id, device_map="cuda:0", dtype=torch.float16)

    audio = m.generate(text=text, language="Amharic", generation_config=gen_config)
    waveform = (audio[0] * 32767).astype(np.int16)

    display(HTML(f"<h3>{name}</h3>"))
    display(Audio(waveform, rate=24000))

    # Free VRAM before loading next model
    del m
    gc.collect()
    torch.cuda.empty_cache()
    print(f"βœ… {name} done")

Tip: On a T4 (16GB), each model uses ~3GB VRAM in fp16. Load one at a time.
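The ~3 GB figure is easy to sanity-check: the weights alone for a 636M-parameter model in fp16 come to roughly 1.2 GiB, and the remainder goes to the audio codec, activation buffers, and CUDA overhead. A back-of-envelope sketch:

```python
def fp16_weight_gib(n_params: float) -> float:
    """Approximate GiB needed just to store model weights in fp16 (2 bytes per parameter)."""
    return n_params * 2 / 1024**3

print(f"{fp16_weight_gib(636e6):.2f} GiB")  # β†’ 1.18 GiB (weights only; runtime overhead comes on top)
```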

Datasets

Total: ~81,731 samples, ~331 hours (token cache)
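Those totals imply an average clip length of roughly 14-15 seconds, which is useful context when picking reference clips or a batch token budget. A quick check:

```python
samples = 81_731
hours = 331

# Average seconds of audio per training sample.
avg_seconds = hours * 3600 / samples
print(f"average clip length: {avg_seconds:.1f}s")  # β†’ average clip length: 14.6s
```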

Training

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Max steps | 12,000 |
| Epochs | ~10 |
| Batch tokens | 28,672 |
| Precision | bf16 |
| Codebook weights | [8, 8, 6, 6, 4, 4, 2, 2] |
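The codebook weights [8, 8, 6, 6, 4, 4, 2, 2] bias the training loss toward the earlier codebooks, which carry more of the coarse acoustic structure. One plausible way to apply them (the sum-to-one normalization here is an assumption for illustration; the actual trainer may combine per-codebook losses differently):

```python
import numpy as np

CODEBOOK_WEIGHTS = np.array([8, 8, 6, 6, 4, 4, 2, 2], dtype=np.float32)

def weighted_codebook_loss(per_codebook_loss: np.ndarray) -> float:
    """Combine 8 per-codebook cross-entropy losses into one scalar.

    Weights are normalized to sum to 1, so a uniform per-codebook loss
    is left unchanged while early codebooks dominate the gradient.
    """
    w = CODEBOOK_WEIGHTS / CODEBOOK_WEIGHTS.sum()
    return float((w * per_codebook_loss).sum())
```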

Training History

| Run | Steps | Best loss | Notes |
|---|---|---|---|
| 1 | 0 β†’ 1,500 | ~4.15 | Init from v3 |
| 2 | 1,500 β†’ 6,000 | 3.9994 (step 4,190) | Checkpoints after step 4,100 lost (storage full) |
| 3 | 2,700 β†’ 12,000 | 3.9518 (step 10,000) | Extended to 10 epochs |

Architecture

  • Backbone: Qwen3-0.6B (636M parameters)
  • Audio tokenizer: HiggsAudioV2 (8 codebooks, vocabulary size 1,025 each)
  • Method: non-autoregressive discrete diffusion
  • Training recipe adapted from NMikka's Georgian fine-tune
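Non-autoregressive discrete diffusion predicts all token positions in parallel: generation starts fully masked and, over `num_step` rounds (hence `num_step=32` in the generation config above), commits the most confident predictions while re-predicting the rest. A toy sketch of that demasking loop, with random scores standing in for the model (purely illustrative, not the OmniVoice sampler):

```python
import numpy as np

def demask(seq_len: int, vocab: int, num_step: int, seed: int = 0) -> np.ndarray:
    """Iteratively unmask a token sequence, committing the most confident positions each round."""
    rng = np.random.default_rng(seed)
    MASK = -1
    tokens = np.full(seq_len, MASK)
    for step in range(num_step):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Stand-in for the model: random logits give a per-position confidence and argmax token.
        logits = rng.random((masked.size, vocab))
        conf = logits.max(axis=1)
        preds = logits.argmax(axis=1)
        # Commit about 1/(remaining steps) of the masked positions, most confident first,
        # so everything is unmasked by the final step.
        k = max(1, int(np.ceil(masked.size / (num_step - step))))
        order = np.argsort(conf)[::-1][:k]
        tokens[masked[order]] = preds[order]
    return tokens
```

More steps trade speed for quality: fewer commits per round mean each prediction is made with more context already fixed.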

All Versions

| Model | Description |
|---|---|
| omnivoice-amharic-finetuned | This model: best checkpoint (step 10,000, loss 3.9518) |
| omnivoice-amharic-finetuned-v3 | Previous version (WaxalNLP only) |
| Checkpoints | All checkpoints (10000-BEST, 12000-final, run1-final) |
