---
license: cc-by-nc-4.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-Omni-7B
tags:
- reward model
- BTRM
- RLHF
- evaluation
- naturalness
- speech
datasets:
- RMSnow/SpeechJudge-Data
---

# SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

[![Paper](https://img.shields.io/badge/PAPER-b31b1b?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2511.07931)
[![Demo](https://img.shields.io/badge/DEMO_PAGE-1f6feb?style=for-the-badge&logo=googlechrome&logoColor=white)](https://speechjudge.github.io/)
[![GitHub](https://img.shields.io/badge/GITHUB-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/AmphionTeam/SpeechJudge)
[![HF Model](https://img.shields.io/badge/HF_MODEL-FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/RMSnow/SpeechJudge-GRM)
[![HF Data](https://img.shields.io/badge/HF_DATA-FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/datasets/RMSnow/SpeechJudge-Data)
[![HF Space](https://img.shields.io/badge/HF_SPACE-FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/spaces/yuantuo666/SpeechJudge-GRM)

Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this challenge is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce **SpeechJudge**, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on ***naturalness***, one of the most fundamental subjective metrics for speech synthesis:

- **SpeechJudge-Data**: a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across a wide range of speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference.
- **SpeechJudge-Eval**: a challenging benchmark for speech naturalness judgment.
- **SpeechJudge-GRM**: a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases.

## SpeechJudge-BTRM

### Installation

1. Clone this repository:

```bash
git clone https://github.com/AmphionTeam/SpeechJudge.git
cd SpeechJudge
```

2. Install the required dependencies:

```bash
pip install transformers==4.52.3
pip install accelerate==1.10.0
pip install qwen-omni-utils==0.0.8
```

### Usage

#### Basic Usage

The main entry point is `infer/main_btrm.py`. Here is a basic example:

```python
from utils import download_hugginface_model
from btrm_pipeline import RewardModelInferencePipeline

if __name__ == "__main__":
    # Download and load the base model and the reward model
    qwen_omni_path = "pretrained/Qwen2.5-Omni-7B"
    model_path = "pretrained/SpeechJudge-BTRM"
    download_hugginface_model("Qwen/Qwen2.5-Omni-7B", qwen_omni_path)
    download_hugginface_model("RMSnow/SpeechJudge-BTRM", model_path)
    inference_pipeline = RewardModelInferencePipeline(qwen_omni_path, model_path)

    # The two speech samples to compare (and their shared target text)
    target_text = "The worn leather, once supple and inviting, now hangs limp and lifeless. Its time has passed, like autumn leaves surrendering to winter's chill. I shall cast it aside, making way for new beginnings and fresh possibilities."
    wav_path_a = "examples/wav_a.wav"
    wav_path_b = "examples/wav_b.wav"

    # Compare the two audio outputs
    score_A, score_B = inference_pipeline.get_pairwise_rewards(
        target_text, wav_path_a, wav_path_b
    )

    final_result = "A" if score_A > score_B else "B" if score_A < score_B else "Tie"
    print(f"\n[Final Result] {final_result}")
    print(f"Score of Audio A: {score_A}, Score of Audio B: {score_B}")
```

#### Running the Example

The repository includes example audio files in `infer/examples/`.
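Assuming the "BTRM" in the model name follows the standard Bradley–Terry formulation, the two scalar rewards returned by `get_pairwise_rewards` can also be mapped to a preference probability via a sigmoid over their gap. The sketch below is plain Python (no model required); the `tie_margin` threshold is an illustrative assumption, not part of the released pipeline:

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that sample A is preferred over sample B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def judge(score_a: float, score_b: float, tie_margin: float = 0.05) -> str:
    """Turn two scalar rewards into a verdict.

    The tie margin is a hypothetical choice: if the preference
    probability is within `tie_margin` of 0.5, call it a tie.
    """
    p = preference_probability(score_a, score_b)
    if abs(p - 0.5) < tie_margin:
        return "Tie"
    return "A" if p > 0.5 else "B"
```

For example, rewards of 1.8 vs. 0.6 give a preference probability of about 0.77 for A, so `judge(1.8, 0.6)` returns `"A"`, while equal rewards return `"Tie"`.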
To run the provided example:

```bash
cd infer
python main_btrm.py
```

## Citation

If you use SpeechJudge in your research, please cite our paper:

```bibtex
@article{zhang2025speechjudge,
  title={SpeechJudge: Towards Human-Level Judgment for Speech Naturalness},
  author={Zhang, Xueyao and Wang, Chaoren and Liao, Huan and Li, Ziniu and Wang, Yuancheng and Wang, Li and Jia, Dongya and Chen, Yuanzhe and Li, Xiulin and Chen, Zhuo and Wu, Zhizheng},
  journal={arXiv preprint arXiv:2511.07931},
  year={2025}
}
```