# HiggsAudio-V2 Model Architecture
<img src="../figures/higgs_audio_v2_architecture_combined.png" width=800/>


Our model is built on top of [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B). To enhance its ability to process audio tokens, we incorporate the "DualFFN" architecture as an audio adapter. DualFFN acts as an audio-specific expert that boosts the LLM's performance with minimal computational overhead; our implementation retains 91% of the original LLM's training speed with DualFFN included.

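The core idea of DualFFN is per-token routing between two feed-forward paths inside each decoder layer: text positions go through the original FFN, while audio positions are routed to the audio-specific FFN. The PyTorch sketch below is illustrative only; the class and argument names (`DualFFNDecoderLayer`, `audio_mask`) are ours, and `LayerNorm`, `MultiheadAttention`, and plain two-layer FFNs stand in for Llama's RMSNorm, rotary attention, and gated SwiGLU FFN.

```python
import torch
import torch.nn as nn


class DualFFNDecoderLayer(nn.Module):
    """Simplified decoder layer with a shared attention block and two FFN experts."""

    def __init__(self, hidden_size: int, ffn_dim: int, audio_ffn_dim: int, num_heads: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(hidden_size)
        # Text FFN (original Llama path) and audio FFN (the DualFFN "expert").
        self.text_ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, hidden_size)
        )
        self.audio_ffn = nn.Sequential(
            nn.Linear(hidden_size, audio_ffn_dim), nn.SiLU(), nn.Linear(audio_ffn_dim, hidden_size)
        )

    def forward(self, hidden_states: torch.Tensor, audio_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); audio_mask: (batch, seq) bool, True at audio tokens.
        seq_len = hidden_states.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=hidden_states.device), diagonal=1
        )

        # Shared causal self-attention over the interleaved text/audio sequence.
        residual = hidden_states
        h = self.attn_norm(hidden_states)
        h, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        hidden_states = residual + h

        # Route each position to the text FFN or the audio-specific FFN.
        residual = hidden_states
        h = self.ffn_norm(hidden_states)
        out = torch.where(audio_mask.unsqueeze(-1), self.audio_ffn(h), self.text_ffn(h))
        return residual + out
```

Computing both FFNs and selecting with `torch.where` keeps the sketch short; an efficient implementation would gather the audio positions and run each FFN only on its own tokens, which is what keeps the added overhead small.
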
Since our audio tokenizer is based on Residual Vector-Quantization (RVQ) and contains multiple codebooks, we adopt the [delay pattern](https://proceedings.neurips.cc/paper_files/paper/2023/file/94b472a1842cd7c56dcb125fb2765fbd-Paper-Conference.pdf) to enable simultaneous code generation across codebooks while supporting streaming.

<img src="../figures/higgs_audio_v2_open_source_delay_pattern.png" width=800/>

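As a concrete illustration of the delay pattern, the sketch below (function names and the `PAD` placeholder are ours, not the repo's API) shifts codebook k right by k steps, so at each generation step the model emits one token per codebook, each belonging to a different frame, and audio can be streamed frame by frame.

```python
import torch

PAD = -1  # illustrative placeholder; a real model uses a dedicated pad/BOS audio token


def apply_delay_pattern(codes: torch.Tensor, pad_id: int = PAD) -> torch.Tensor:
    """Shift codebook k right by k steps: (K, T) -> (K, T + K - 1)."""
    num_codebooks, seq_len = codes.shape
    out = torch.full((num_codebooks, seq_len + num_codebooks - 1), pad_id, dtype=codes.dtype)
    for k in range(num_codebooks):
        out[k, k : k + seq_len] = codes[k]
    return out


def revert_delay_pattern(delayed: torch.Tensor) -> torch.Tensor:
    """Undo the shift: (K, T + K - 1) -> (K, T)."""
    num_codebooks, total_len = delayed.shape
    seq_len = total_len - num_codebooks + 1
    return torch.stack([delayed[k, k : k + seq_len] for k in range(num_codebooks)])


# Example: 4 RVQ codebooks, 6 frames. At step t the model predicts codebook 0 of
# frame t, codebook 1 of frame t-1, and so on, so all codebooks are sampled in a
# single forward pass per step.
codes = torch.arange(24).reshape(4, 6)
assert torch.equal(revert_delay_pattern(apply_delay_pattern(codes)), codes)
```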

## DualFFN Performance Ablation Study

To assess the effectiveness of DualFFN, we trained two smaller models based on Llama-3.2-1B: one with DualFFN and one without. Both models were trained for 250K steps with a learning rate of 5e-4 on a subset of the AudioVerse dataset and evaluated on SeedTTS-Eval; the results are shown in the figures below.

- SeedTTS-EN

<img src="../figures/dual_ffn_comparison_seed_tts_en_wer.png" width=400/> <img src="../figures/dual_ffn_comparison_seed_tts_en_sim.png" width=400/>

- SeedTTS-ZH

<img src="../figures/dual_ffn_comparison_seed_tts_zh_wer.png" width=400/> <img src="../figures/dual_ffn_comparison_seed_tts_zh_sim.png" width=400/>

Across both the English and Chinese subsets, the model with DualFFN consistently outperforms the model without it, achieving a lower word error rate (WER) and higher speaker similarity.