Higgs Audio Tokenizer
In this work, we introduce a new discrete audio tokenizer that runs at just 25 frames per second while matching, and in some cases improving on, the audio quality of tokenizers at twice the bitrate. Our model is the first to train on 24 kHz data covering speech, music, and sound events in one unified system. It also uses a simple non-diffusion encoder/decoder for fast batch inference.
Basics of Audio Quantization
An audio signal sampled at $f_s$ Hz is first split into frames by an encoder with hop size $M$, giving a frame rate $f_r = \frac{f_s}{M}\quad\text{(frames/s)}.$ Two common quantizers are:
- Residual Vector Quantization (RVQ): $N_q$ cascaded layers with codebook size $N_{cb}$ each. When $N_q=1$, it reduces to plain single-layer vector quantization (see the sketch after this list).
- Finite Scalar Quantization (FSQ): A single layer ($N_q=1$) with codebook size $N_{cb}$.
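To make the RVQ layering concrete, here is a minimal encoding sketch. The function name `rvq_encode`, the $(T, D)$ frame-embedding shape, and the pre-trained codebooks are illustrative assumptions, not our exact implementation.

```python
import torch

def rvq_encode(z: torch.Tensor, codebooks: list[torch.Tensor]) -> list[torch.Tensor]:
    """Encode frame embeddings z of shape (T, D) with N_q codebooks, each (N_cb, D).

    Every layer quantizes the residual left by the previous layer, so later
    codebooks capture progressively finer detail.
    """
    residual, codes = z, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per frame
        codes.append(idx)
        residual = residual - cb[idx]                   # pass the remainder onward
    return codes  # N_q index tensors; decoding sums cb[idx] over layers
```

Decoding reconstructs each frame as the sum of the selected codewords across layers, which is why the effective vocabulary grows multiplicatively to $N_{cb}^{N_q}$.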
If every combination of codewords is a token, the vocabulary size is $N_{cb}^{N_q}$, and each token needs $N_q\log_2 N_{cb}$ bits. The overall bitrate (bits/s, BPS) is simply $f_r \times N_q \log_2 N_{cb}.$
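As a quick sanity check of the formula, the snippet below computes the bitrate for a 24 kHz signal with hop size 960, i.e. our 25 frames/s setting. The RVQ configuration of 8 layers with 1024-entry codebooks is one illustrative way to reach 2 kbps, not necessarily our exact layout.

```python
import math

def bitrate_bps(f_s: float, hop: int, n_q: int, n_cb: int) -> float:
    """Bits per second: frame rate times bits per token, f_r * N_q * log2(N_cb)."""
    return (f_s / hop) * n_q * math.log2(n_cb)

# 24 kHz audio with hop size 960 gives 25 frames/s; 8 layers * 10 bits = 80
# bits per frame, i.e. 2 kbps.
print(bitrate_bps(24_000, 960, n_q=8, n_cb=1024))  # 2000.0
```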
We aim to push this bitrate as low as possible without hurting audio fidelity.
What Makes Ours Better
- Low Frame Rate: At 25 frames/s, our tokenizer halves the frame rate of many baselines while still maintaining high audio quality.
- Unified 24 kHz Training: We mix speech, music, and sound-event clips in one model, capturing both semantic and acoustic details, which greatly simplifies the training of audio language models.
- Fast Inference: By avoiding diffusion steps, our encoder/decoder processes batches quickly, making it practical for real-time or large-scale tasks.
Data and Evaluation Metrics
We test on four subsets:
- Speech, Music, Sound Event: 1,000 clips per category, each 10 seconds long, randomly sampled from DAPS (Speech), MUSDB (Music), and AudioSet (Sound Event).
- Audiophile: 150 clips, each 30 seconds long, curated from eleven high-fidelity test discs. The clips cover both music and sound events and are selected for audio quality evaluation.
We measure:
- Acoustic Quality: STFT distance between the original and reconstructed audio (see the sketch after this list).
- Semantic Integrity: Preservation of the original audio's semantic content, measured on the English and Chinese subsets of the SeedTTS[15] dataset.
- Aesthetics: Model-based quality assessment with Meta Audiobox Aesthetics[8], a state-of-the-art unified model, reporting Content Enjoyment (CE) and Content Usefulness (CU).
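For reference, a minimal multi-resolution log-magnitude STFT distance might look like the following. The specific FFT sizes, hop ratio, and log floor here are illustrative assumptions rather than our exact metric configuration.

```python
import torch

def stft_distance(x: torch.Tensor, y: torch.Tensor,
                  n_ffts=(512, 1024, 2048)) -> torch.Tensor:
    """L1 distance between log-magnitude STFTs, averaged over resolutions.

    x, y: mono waveforms of shape (num_samples,) at the same sample rate.
    """
    total = torch.tensor(0.0)
    for n_fft in n_ffts:
        win = torch.hann_window(n_fft)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        # log compression emphasizes perceptually relevant low-energy detail
        total = total + (torch.log(X + 1e-5) - torch.log(Y + 1e-5)).abs().mean()
    return total / len(n_ffts)
```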
We compare our tokenizer against a wide range of baselines, ranging from tokenizers built mainly for acoustic reconstruction and compression, to those focused on semantic integrity, to tokenizers used in existing large audio language models. We also compare against tokenizers pretrained specifically on speech or on music.
The tables below summarize the tokenizers evaluated. As shown, our tokenizer achieves a well-rounded balance of efficiency, semantic fidelity, and acoustic quality.
Acoustic Evaluation
We use the STFT metric here for simplicity. The baselines are ordered chronologically and grouped by whether semantic distillation (SD) is applied; the 💬, 🎵, and 🥁 columns mark coverage of speech, music, and sound events, respectively. Although DAC attains the best acoustic quality at 12× our bitrate, our tokenizer leads all other baselines.
| Tokenizer | 💬 | 🎵 | 🥁 | SD | $f_s$ (kHz) | $f_r$ (frames/s) | BPS* (k) ↓ | Speech ↓ | Sound Event ↓ | Music ↓ | Audiophile ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Encodec[3] | ✓ | ✓ | ✓ | | 24 | 75 | 24 | 1.96 | 2.65 | 2.52 | 2.30 |
| DAC[2] | ✓ | ✓ | ✓ | | 24 | 75 | 24 | 1.13 | 1.45 | 1.34 | 1.62 |
| SNAC-24k[6] | ✓ | | | | 24 | (12, 23, 47) | 0.98 | 1.92 | 2.69 | 2.54 | 2.52 |
| SNAC-44.1k[6] | | ✓ | ✓ | | 44.1 | (14, 29, 57, 115) | 2.6 | 1.83 | 2.25 | 2.05 | 2.00 |
| WavTokenizer[7] | | ✓ | ✓ | | 24 | 75 | 0.9 | 1.93 | 2.44 | 2.17 | 2.15 |
| WavTokenizer (Speech)[7] | ✓ | | | | 24 | 75 | 0.9 | 1.78 | 2.47 | 2.42 | 2.47 |
| MuCodec[11] | | ✓ | | | 48 | 25 | 0.35 | 2.87 | 3.69 | 3.36 | 2.97 |
| FlowDec-75m[12] | ✓ | ✓ | ✓ | | 48 | 75 | 7.5 | 1.73 | 2.14 | 2.01 | 2.03 |
| FlowDec-25s[12] | ✓ | ✓ | ✓ | | 48 | 25 | 4 | 1.94 | 2.42 | 2.25 | 2.33 |
| SpeechTokenizer[14] | ✓ | | | ✓ | 16 | 50 | 4 | 3.21 | 3.58 | 3.65 | 3.69 |
| SemantiCodec[5] | ✓ | ✓ | ✓ | ✓ | 16 | 50 | 1.4 | 3.05 | 3.28 | 3.24 | 3.18 |
| Mimi[13] | ✓ | | | ✓ | 24 | 12.5 | 4.4 | 1.77 | 2.40 | 2.30 | 2.15 |
| XCodec[1] | ✓ | ✓ | ✓ | ✓ | 16 | 50 | 4 | 2.95 | 3.16 | 3.00 | 3.03 |
| CosyVoice 2[13] | ✓ | | | ✓ | 16 | 25 | -** | 2.30 | 3.30 | 3.14 | 3.25 |
| XCodec2[9] | ✓ | | | ✓ | 16 | 50 | 0.8 | 3.06 | 3.72 | 3.62 | 3.64 |
| XY-MOSS-TTSD[10] | ✓ | | | ✓ | 24 | 12.5 | 1 | 1.89 | 2.51 | 2.40 | 2.26 |
| Ours | ✓ | ✓ | ✓ | ✓ | 24 | 25 | 2 | 1.62 | 2.03 | 1.85 | 1.80 |
* Bits per second is computed from the checkpoint released by each tokenizer's authors.
** CosyVoice 2 conditions on continuous features rather than discrete tokens, so no bitrate is reported; we include it for completeness.
Semantic Evaluation
Here we only compare with tokenizers that are trained with semantic distillation. SeedTTS is a dataset that includes prompt/target audio and the corresponding texts. We reconstruct the target audio and evaluate semantic integrity with word error rate (WER) and speaker similarity (SIM). SIM is the similarity between the prompt audio and the reconstructed target audio, computed with WavLM-large as the embedding model.
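A minimal sketch of the SIM computation is shown below. The `embed` callable stands in for a WavLM-large-based speaker-embedding model and is a placeholder, not a specific API.

```python
import torch
import torch.nn.functional as F

def speaker_similarity(prompt_wav: torch.Tensor,
                       recon_wav: torch.Tensor,
                       embed) -> float:
    """Cosine similarity between speaker embeddings of two waveforms.

    `embed` is assumed to map a waveform to a fixed-size (D,) vector,
    e.g., a WavLM-large-based speaker-verification model.
    """
    e_prompt = embed(prompt_wav)  # embedding of the prompt audio
    e_recon = embed(recon_wav)    # embedding of the reconstructed target
    return F.cosine_similarity(e_prompt, e_recon, dim=-1).item()
```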
The following table shows that our tokenizer achieves performance comparable to tokenizers with 2.2× the bitrate of our model.
| Model | BPS (k) | en WER ↓ | en SIM ↑ | zh WER ↓ | zh SIM ↑ |
|---|---|---|---|---|---|
| SpeechTokenizer | 4 | 2.82 | 0.63 | 2.04 | 0.65 |
| SemantiCodec | 1.4 | 3.46 | 0.56 | 2.18 | 0.60 |
| Mimi | 4.4 | 2.35 | 0.70 | 1.48 | 0.72 |
| XCodec | 4.0 | 2.68 | 0.63 | 1.66 | 0.66 |
| CosyVoice 2 | - | 3.17 | 0.65 | 2.11 | 0.70 |
| XCodec2 | 0.8 | 2.74 | 0.62 | 1.91 | 0.67 |
| XY-MOSS-TTSD | 1.0 | 2.72 | 0.61 | 1.58 | 0.67 |
| Ours | 2.0 | 2.52 | 0.67 | 1.48 | 0.71 |
Audiobox Aesthetics Evaluation
This model-based evaluation[8] further demonstrates the superiority of our tokenizer. CU denotes Content Usefulness and CE denotes Content Enjoyment; each is rated on a scale of 1-10. Notably, our tokenizer performs best on the Audiophile set, demonstrating a clear advantage when the original audio quality is high.
| Model | BPS (k) | Music CE ↑ | Music CU ↑ | Sound Event CE ↑ | Sound Event CU ↑ | Speech CE ↑ | Speech CU ↑ | Audiophile CE ↑ | Audiophile CU ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Original | - | 6.20 | 7.10 | 4.47 | 5.64 | 5.03 | 4.87 | 7.17 | 7.65 |
| SpeechTokenizer | 4.0 | 3.55 | 5.22 | 3.03 | 4.50 | 4.68 | 4.58 | 3.59 | 5.07 |
| SemantiCodec | 1.4 | 6.01 | 6.83 | 4.22 | 5.30 | 4.28 | 4.12 | 6.97 | 7.43 |
| Mimi | 4.4 | 6.01 | 6.83 | 4.26 | 5.35 | 4.87 | 4.72 | 6.80 | 7.29 |
| XCodec | 4.0 | 6.30 | 7.10 | 4.43 | 5.45 | 4.96 | 4.79 | 7.06 | 7.49 |
| CosyVoice 2 | - | 5.21 | 6.14 | 4.08 | 4.73 | 4.91 | 4.75 | 5.97 | 6.56 |
| XCodec2 | 0.8 | 4.38 | 5.66 | 3.43 | 4.63 | 4.93 | 4.78 | 4.56 | 5.46 |
| XY-MOSS-TTSD | 1.0 | 5.77 | 6.80 | 4.23 | 5.34 | 4.88 | 4.72 | 6.95 | 7.48 |
| Ours | 2.0 | 6.35 | 7.15 | 4.47 | 5.51 | 4.90 | 4.70 | 7.21 | 7.66 |
Note that since some tokenizers are trained on 16 kHz data, we upsample their audio outputs to 24 kHz before computing metrics. Different upsampling methods may cause slight variations (e.g., 4.36 vs. 4.43 for XCodec Sound Event CE). We report the best results we could obtain and highlight any results within 0.05 of the best one.
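As one concrete choice, the upsampling step can be done with torchaudio's sinc-interpolation resampler; this is a sketch of the kind of resampling involved, not necessarily the exact method behind the reported numbers.

```python
import torch
import torchaudio.functional as AF

wav_16k = torch.randn(1, 16_000)  # placeholder: 1 s of mono audio at 16 kHz
# Upsample to 24 kHz before computing metrics; other resamplers (e.g., with
# different filter widths) can shift scores slightly, as noted above.
wav_24k = AF.resample(wav_16k, orig_freq=16_000, new_freq=24_000)
```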
References
[11] Xu, Yaoxun, et al. "MuCodec: Ultra Low-Bitrate Music Codec." arXiv preprint arXiv:2409.13216 (2024).
