OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Paper: 2604.00688
Non-autoregressive discrete diffusion TTS for Amharic, fine-tuned from OmniVoice (Qwen3-0.6B + HiggsAudioV2, 8 codebooks).
Initialized from v3, trained on a multi-dataset Amharic corpus (~81K samples, ~331 hours).
| Metric | Value |
|---|---|
| Best loss | 3.9518 |
| Best step | 10,000 / 12,000 |
| Epochs | ~8.3 / 10 |
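For intuition: a non-autoregressive discrete diffusion decoder predicts all codec tokens in parallel and refines them over a fixed number of denoising steps (the `num_step` parameter in the cells below). The sketch here is a generic MaskGIT-style confidence-based unmasking loop, shown only to illustrate the idea; it is not OmniVoice's actual sampler.

```python
import torch

# Conceptual MaskGIT-style parallel decoding over discrete tokens —
# illustrative only, not OmniVoice's actual sampling loop.
MASK, T, V, NUM_STEP = -1, 100, 1024, 32

tokens = torch.full((T,), MASK)                  # start fully masked
for step in range(NUM_STEP):
    still_masked = tokens == MASK
    if not still_masked.any():
        break
    logits = torch.randn(T, V)                   # stand-in for model predictions
    conf, pred = logits.softmax(-1).max(-1)
    conf[~still_masked] = -1.0                   # only fill masked positions
    k = max(1, int(still_masked.sum() / (NUM_STEP - step)))
    commit = conf.topk(k).indices                # commit the k most confident
    tokens[commit] = pred[commit]
```

Fewer steps decode faster; more steps give the model more refinement passes.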
Paste this into a Colab notebook (GPU runtime required; a T4 is enough):
```python
!pip install -q omnivoice soundfile

import torch
import numpy as np
import soundfile as sf
from IPython.display import Audio, display

from omnivoice import OmniVoice, OmniVoiceGenerationConfig

MODEL_ID = "demeleww/omnivoice-amharic"

model = OmniVoice.from_pretrained(
    MODEL_ID,
    device_map="cuda:0",
    dtype=torch.float16,
    load_asr=True,  # for auto-transcribing reference audio
)
print(f"✅ Loaded {MODEL_ID}")
text = "α°ααα£ α₯αα³α α°α
α αα£α½αα’ αα
α¨α ααα αααα αα¨α« ααα’"
gen_config = OmniVoiceGenerationConfig(
    num_step=32,
    guidance_scale=2.0,
)

audio = model.generate(
    text=text,
    language="Amharic",
    generation_config=gen_config,
)

# Convert the float waveform to 16-bit PCM for playback
waveform = (audio[0] * 32767).astype(np.int16)
display(Audio(waveform, rate=24000))

sf.write("output.wav", audio[0], 24000)
print("✅ Saved to output.wav")
```
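`num_step` and `guidance_scale` trade speed against quality. A quick way to hear the difference, reusing `model` and `text` from the cell above (the step counts are just illustrative values):

```python
# Sweep diffusion steps to hear the speed/quality trade-off.
for steps in (8, 16, 32, 64):
    cfg = OmniVoiceGenerationConfig(num_step=steps, guidance_scale=2.0)
    audio = model.generate(text=text, language="Amharic", generation_config=cfg)
    print(f"num_step={steps}")
    display(Audio((audio[0] * 32767).astype(np.int16), rate=24000))
```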
```python
# Upload a reference WAV file to Colab first, or use a URL
REF_AUDIO = "reference.wav"  # path to your reference audio

prompt = model.create_voice_clone_prompt(
    ref_audio=REF_AUDIO,
    ref_text=None,  # auto-transcribe (or provide the exact transcript)
)

# Sample Amharic text: "Hello, this is a test."
audio = model.generate(
    text="ሰላም፣ ይህ ሙከራ ነው።",
    language="Amharic",
    voice_clone_prompt=prompt,
    generation_config=OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0),
)

waveform = (audio[0] * 32767).astype(np.int16)
display(Audio(waveform, rate=24000))
```
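If you want to start from a hosted clip, one option is to download it to a local path first; whether `create_voice_clone_prompt` accepts URLs directly is not assumed here. In this sketch `REF_URL` is a placeholder and the `ref_text` value is a hypothetical transcript of the clip:

```python
import urllib.request

REF_URL = "https://example.com/speaker.wav"  # placeholder URL
urllib.request.urlretrieve(REF_URL, "reference.wav")

prompt = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="ሰላም።",  # hypothetical exact transcript; use None to auto-transcribe
)
# Then call model.generate(..., voice_clone_prompt=prompt) as above.
```

Providing the exact transcript skips the ASR pass and avoids transcription errors propagating into the clone prompt.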
Use this cell to compare the base OmniVoice model and the fine-tuned versions side by side:
```python
#@title Compare All Models
import gc

import torch
import numpy as np
from IPython.display import Audio, display, HTML

from omnivoice import OmniVoice, OmniVoiceGenerationConfig

MODELS = {
    "Base OmniVoice": "k2-fsa/OmniVoice",
    "🇪🇹 Amharic v3": "demeleww/omnivoice-amharic-finetuned-v3",
    "🇪🇹 Amharic Best (this repo)": "demeleww/omnivoice-amharic",
}

# Sample Amharic text: "Hello, how are you? Today is a beautiful day."
text = "ሰላም፣ እንዴት ናችሁ? ዛሬ ቆንጆ ቀን ነው።"
gen_config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)

for name, model_id in MODELS.items():
    print(f"\n{'=' * 50}")
    print(f"Loading {name} ({model_id})...")
    m = OmniVoice.from_pretrained(model_id, device_map="cuda:0", dtype=torch.float16)

    audio = m.generate(text=text, language="Amharic", generation_config=gen_config)
    waveform = (audio[0] * 32767).astype(np.int16)
    display(HTML(f"<h3>{name}</h3>"))
    display(Audio(waveform, rate=24000))

    # Free VRAM before loading the next model
    del m
    gc.collect()
    torch.cuda.empty_cache()
    print(f"✅ {name} done")
```
Tip: On a T4 (16GB), each model uses ~3GB VRAM in fp16. Load one at a time.
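To verify this on your own runtime, you can read PyTorch's allocator statistics (standard `torch.cuda` APIs, not part of omnivoice):

```python
import torch

# Report peak VRAM on the current GPU since the last reset.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak allocated: {peak_gib:.2f} GiB")

# Reset between models in the comparison loop to measure each one separately.
torch.cuda.reset_peak_memory_stats()
```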
| Dataset | Notes |
|---|---|
| google/WaxalNLP | Amharic subset |
| gheero-Leyu/leyu-amharic-addis-ababa-dialect | Addis Ababa dialect only |
| surafelabebe/amharic_clear_audio_tts | |
| chappM/amharic-bdu-asr | |
Total: ~81,731 samples, ~331 hours (token cache)
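A rough sketch of how such a corpus could be assembled with the `datasets` library; the config name, splits, and the `audio`/`text` column names are assumptions (check each dataset card), not this repo's actual preprocessing:

```python
from datasets import Audio, concatenate_datasets, load_dataset

sources = [
    ("google/WaxalNLP", "am"),  # Amharic subset (config name assumed)
    ("gheero-Leyu/leyu-amharic-addis-ababa-dialect", None),
    ("surafelabebe/amharic_clear_audio_tts", None),
    ("chappM/amharic-bdu-asr", None),
]

parts = []
for repo, config in sources:
    ds = load_dataset(repo, name=config, split="train")
    ds = ds.select_columns(["audio", "text"])                 # assumed column names
    ds = ds.cast_column("audio", Audio(sampling_rate=24000))  # match the 24 kHz codec
    parts.append(ds)

corpus = concatenate_datasets(parts)
print(f"{len(corpus)} samples")
```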
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Max steps | 12,000 |
| Epochs | ~10 |
| Batch tokens | 28,672 |
| Precision | bf16 |
| Codebook weights | [8, 8, 6, 6, 4, 4, 2, 2] |
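The codebook weights suggest a per-codebook weighted cross-entropy, with the earlier (coarser) codebooks weighted more heavily. A minimal sketch of that idea; the tensor shapes and normalization are assumptions, not this repo's training code:

```python
import torch
import torch.nn.functional as F

CODEBOOK_WEIGHTS = torch.tensor([8., 8., 6., 6., 4., 4., 2., 2.])

def codebook_weighted_loss(logits, targets):
    """logits: (B, 8, T, V) over 8 codebooks; targets: (B, 8, T) token ids."""
    per_codebook = torch.stack([
        F.cross_entropy(logits[:, k].flatten(0, 1), targets[:, k].flatten())
        for k in range(logits.shape[1])
    ])
    # Coarser codebooks carry more perceptual weight; normalize by total weight.
    return (CODEBOOK_WEIGHTS * per_codebook).sum() / CODEBOOK_WEIGHTS.sum()
```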
| Run | Steps | Best Loss | Notes |
|---|---|---|---|
| 1 | 0–1500 | ~4.15 | Init from v3 |
| 2 | 1500–6000 | 3.9994 (step 4190) | Checkpoints after 4100 lost (storage full) |
| 3 | 2700–12000 | 3.9518 (step 10000) | Extended to 10 epochs |
| Model | Description |
|---|---|
| omnivoice-amharic-finetuned | This model; best checkpoint (step 10000, loss 3.9518) |
| omnivoice-amharic-finetuned-v3 | Previous version (WaxalNLP only) |
| Checkpoints | All checkpoints (10000-BEST, 12000-final, run1-final) |