---
tags:
- music-structure-annotation
- transformer
---

# SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

![Python](https://img.shields.io/badge/Python-3.10-brightgreen) ![License](https://img.shields.io/badge/License-CC%20BY%204.0-lightblue) [![arXiv Paper](https://img.shields.io/badge/arXiv-2510.02797-blue)](https://arxiv.org/abs/2510.02797) [![GitHub](https://img.shields.io/badge/GitHub-SongFormer-black)](https://github.com/ASLP-lab/SongFormer) [![HuggingFace Space](https://img.shields.io/badge/HuggingFace-space-yellow)](https://huggingface.co/spaces/ASLP-lab/SongFormer) [![HuggingFace Model](https://img.shields.io/badge/HuggingFace-model-blue)](https://huggingface.co/ASLP-lab/SongFormer) [![Dataset SongFormDB](https://img.shields.io/badge/HF%20Dataset-SongFormDB-green)](https://huggingface.co/datasets/ASLP-lab/SongFormDB) [![Dataset SongFormBench](https://img.shields.io/badge/HF%20Dataset-SongFormBench-orange)](https://huggingface.co/datasets/ASLP-lab/SongFormBench) [![Discord](https://img.shields.io/badge/Discord-join%20us-purple?logo=discord&logoColor=white)](https://discord.gg/p5uBryC4Zs) [![lab](https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey)](http://www.npu-aslp.org/)

Chunbo Hao<sup>1*</sup>, Ruibin Yuan<sup>2,5*</sup>, Jixun Yao<sup>1</sup>, Qixin Deng<sup>3,5</sup>,
Xinyi Bai<sup>4,5</sup>, Wei Xue<sup>2</sup>, Lei Xie<sup>1†</sup>

<sup>*</sup>Equal contribution    <sup>†</sup>Corresponding author

<sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU),
Northwestern Polytechnical University
<sup>2</sup>Hong Kong University of Science and Technology
<sup>3</sup>Northwestern University
<sup>4</sup>Cornell University
<sup>5</sup>Multimodal Art Projection (M-A-P)

---

SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision, accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench to foster fair and reproducible research.

![](https://github.com/ASLP-lab/SongFormer/blob/main/figs/songformer.png?raw=true)

For a more detailed deployment guide, please refer to the [GitHub repository](https://github.com/ASLP-lab/SongFormer/).

## 🚀 QuickStart

### Prerequisites

Before running the model, follow the instructions in the [GitHub repository](https://github.com/ASLP-lab/SongFormer/) to set up the required **Python environment**.

---

### Input: Audio File Path

You can perform inference by providing the path to an audio file:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os

# Download the model from the Hugging Face Hub
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Add the local directory to the Python path and set the environment variable
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Set device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Run inference
result = songformer("path/to/audio/file.wav")
```

---

### Input: Tensor or NumPy Array

Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor:

```python
from transformers import AutoModel
from huggingface_hub import snapshot_download
import sys
import os
import numpy as np

# Download the model
local_dir = snapshot_download(
    repo_id="ASLP-lab/SongFormer",
    repo_type="model",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns="*",
    ignore_patterns=["SongFormer.pt", "SongFormer.safetensors"],
)

# Set up the environment
sys.path.append(local_dir)
os.environ["SONGFORMER_LOCAL_DIR"] = local_dir

# Load the model
songformer = AutoModel.from_pretrained(
    local_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Configure device and switch to evaluation mode
device = "cuda:0"
songformer.to(device)
songformer.eval()

# Generate a dummy audio input (sampling rate: 24,000 Hz; e.g., 60 seconds of audio)
audio = np.random.randn(24000 * 60).astype(np.float32)

# Perform inference
result = songformer(audio)
```

> ⚠️ **Note:** The expected sampling rate for input audio is **24,000 Hz**.

---

### Output Format

The model returns a structured list of segment predictions, with each entry containing timing and label information:

```json
[
  {
    "start": 0.0,     // Start time of the segment (in seconds)
    "end": 15.2,      // End time of the segment (in seconds)
    "label": "verse"  // Predicted segment label
  },
  ...
]
```

A minimal sketch showing how to post-process this output is provided after the Notes below.

## 🔧 Notes

- The initialization logic of **MusicFM** has been modified to eliminate the need to load checkpoint files during instantiation, improving both reliability and startup efficiency.
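
As a usage illustration, the snippet below post-processes the segment list described under **Output Format**: it prints a readable timeline and saves the predictions to disk. It assumes `result` holds the list of segment dictionaries returned by the model; the file name `structure_predictions.json` and the print formatting are illustrative choices, not part of the SongFormer API.

```python
import json

# `result` is assumed to be the list of segment dictionaries returned by the model,
# e.g. result = songformer("path/to/audio/file.wav")

# Print a simple timeline of the predicted song structure
for segment in result:
    print(f'{segment["start"]:7.2f}s - {segment["end"]:7.2f}s  {segment["label"]}')

# Save the predictions for later use (output path is illustrative)
with open("structure_predictions.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)
```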
## 📚 Citation

If you use **SongFormer** in your research or application, please cite our work:

```bibtex
@misc{hao2025songformer,
  title         = {SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision},
  author        = {Chunbo Hao and Ruibin Yuan and Jixun Yao and Qixin Deng and Xinyi Bai and Wei Xue and Lei Xie},
  year          = {2025},
  eprint        = {2510.02797},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2510.02797}
}
```