dataton-v5b-wavlm-finetune
Fine-tuned WavLM-Large for speaker retrieval (Dataton Kryptonite Tembr 2026).
Base Model
- Orange/Speaker-wavLM-id: WavLM-Large pretrained for speaker verification on VoxCeleb
Fine-tuning Details
- Dataset: 673K audio files, 11,049 speakers (Kryptonite Tembr 2026 train set)
- GPU: NVIDIA RTX 3090 (24 GB VRAM) on vast.ai
- Stages:
  - Stage 1 (3 epochs): frozen backbone, head initialization. LR=1e-3, margin=0.2
  - Stage 2 (5 epochs): full fine-tune. LR=5e-5, margin=0.2, 3 s chunks
- Augmentations: MUSAN noise (0.6), RIR reverb (0.6), speed perturbation (0.5), codec (0.3)
- Loss: AAMSoftmax, scale=30, 3 sub-centers per speaker
- Embedding dim: 250
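The loss settings listed above (margin 0.2, scale 30, 3 sub-centers per speaker) can be illustrated with a minimal NumPy sketch of sub-center AAM-Softmax logits. This is an illustration under assumptions, not the training code from the repository: the function name `subcenter_aam_logits`, the weight layout `(speakers, sub_centers, dim)`, and the absence of any easy-margin handling are all hypothetical choices.

```python
import numpy as np

def subcenter_aam_logits(emb, weights, labels, margin=0.2, scale=30.0):
    """Sketch of sub-center AAM-Softmax logits (forward pass only).

    emb:     (B, D) embeddings
    weights: (C, K, D) K sub-centers for each of C speakers (assumed layout)
    labels:  (B,) integer speaker ids
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=2, keepdims=True)
    # Cosine to every sub-center, then keep the closest sub-center per speaker.
    cos = np.einsum("bd,ckd->bck", emb, w).max(axis=2)   # (B, C)
    cos = np.clip(cos, -1.0, 1.0)
    rows = np.arange(len(labels))
    logits = cos.copy()
    # Additive angular margin is applied to the target speaker only.
    logits[rows, labels] = np.cos(np.arccos(cos[rows, labels]) + margin)
    return scale * logits
```

The returned logits would then go into an ordinary softmax cross-entropy over the speaker classes.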
Architecture
Raw waveform, 16 kHz
→ MVN normalization (per-utterance)
→ WavLMModel (24 transformer layers, 1024 hidden)
→ Stats pooling: mean + std → (B, 2048)
→ TopLayers: Conv1d 2048→512 → BN → ReLU → Conv1d 512→250 → BN → ReLU → L2-norm
→ 250-dim embedding
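The two parameter-free steps of this architecture, per-utterance MVN and statistics pooling, can be sketched in NumPy as follows. The function names, the `eps` value, and the use of the population standard deviation are assumptions; the WavLM backbone and the TopLayers head are omitted.

```python
import numpy as np

def mvn(wave, eps=1e-7):
    """Per-utterance mean-variance normalization of a 1-D waveform."""
    return (wave - wave.mean()) / np.sqrt(wave.var() + eps)

def stats_pool(frames):
    """Concatenate mean and std over time: (B, T, H) -> (B, 2 * H)."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)], axis=1)
```

With the 1024-dim hidden states of WavLM-Large, `stats_pool` yields the `(B, 2048)` tensor that feeds the head.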
Performance
Solo model on Kryptonite Tembr public leaderboard:
- P@10 = 0.6167 (rank 15)
In ensemble with pretrained WavLM + ERes2NetV2:
- P@10 = 0.7040 (rank 7)
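For readers unfamiliar with the metric, here is a small sketch of how P@10 is commonly computed for retrieval: the fraction of the top-10 retrieved items that share the query's speaker, averaged over queries. The leaderboard's exact definition is not stated here, so treat this as an assumption; `precision_at_k` is a hypothetical helper name.

```python
import numpy as np

def precision_at_k(ranked_idx, labels, k=10):
    """Mean fraction of the top-k retrieved items with the query's label.

    ranked_idx: (N, >=k) gallery indices per query, best match first
    labels:     (N,) speaker label of every item
    """
    labels = np.asarray(labels)
    hits = labels[ranked_idx[:, :k]] == labels[:, None]
    return float(hits.mean())
```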
Usage
from huggingface_hub import hf_hub_download
import torch
ckpt_path = hf_hub_download("stasvinokur/dataton-v5b-wavlm-finetune", "stage2_best.pt")
state = torch.load(ckpt_path, map_location="cpu")
# model_state_dict and head_state_dict are in the checkpoint
print(state.keys())
Full inference pipeline: https://github.com/trade-stasvinokur/dataton_kryptonite_tembr_2026
Reproducibility: what you need for the P@10 = 0.7040 submission
The 0.7040 result is an ensemble of three speaker embedding models. Only one model weight file (this repository) requires manual hosting; the other two are standard pretrained checkpoints that the code fetches automatically.
| Artifact | Where hosted | Size | How to fetch |
|---|---|---|---|
| stage2_best.pt (our fine-tune) | this HF repo | 3.8 GB | hf_hub_download("stasvinokur/dataton-v5b-wavlm-finetune", "stage2_best.pt") |
| WavLM-Large base (v5a) | HF Orange/Speaker-wavLM-id | 1.2 GB | auto via transformers.AutoModel.from_pretrained(...) |
| ERes2NetV2 (v3b) | modelscope iic/speech_eres2netv2_sv_zh-cn_16k-common | ~70 MB | auto via modelscope.pipelines.pipeline(...) |
Only stage2_best.pt needed manual hosting, and it is now in this repo.
Pipeline (reproduces 0.7040)
- Download all three checkpoints (above)
- Run inference on each to get per-model embeddings for the test set → three emb_avg.npy files
- L2-normalize each → concat (250 + 250 + 192 = 692 dim)
- Post-processing:
  - centering (gallery mean)
  - ZCA whitening (eps=1e-5)
  - L2 normalize
  - cosine retrieval top-50
  - AS-Norm (cohort=200)
  - query expansion (k=10)
  - re-retrieval top-50
  - k-reciprocal re-rank (k1=20, k2=6, λ=0.3)
- Take top-10 → submission.csv
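The fusion and the first post-processing steps of this pipeline (L2-normalize, concat, centering, ZCA whitening, cosine top-50) could look roughly like the NumPy sketch below. It is not the repository code: all function names are hypothetical, and it assumes that every test item serves as both query and gallery entry, that self-matches are excluded, and that the covariance is estimated on the same set. AS-Norm, query expansion, and k-reciprocal re-ranking are left out.

```python
import numpy as np

def l2n(x, eps=1e-12):
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def fuse(embs):
    """L2-normalize each model's embeddings, then concatenate along dim 1."""
    return np.concatenate([l2n(e) for e in embs], axis=1)

def zca_whiten(x, eps=1e-5):
    """Center on the gallery mean, then apply ZCA whitening."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / len(xc)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negatives
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return xc @ w

def retrieve(emb, k=50):
    """Cosine top-k for unit-norm rows, excluding each item's self-match."""
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)
    idx = np.argsort(-sims, axis=1)[:, :k]
    return idx, sims

def fuse_and_retrieve(embs, k=50):
    """embs: list of (N, D_i) arrays, e.g. D_i = 250, 250, 192."""
    return retrieve(l2n(zca_whiten(fuse(embs))), k=k)
```

Calling `fuse_and_retrieve` on the three loaded `emb_avg.npy` arrays would return, for each test item, the indices of its 50 nearest neighbours plus the full similarity matrix.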
Full reproducible code: https://github.com/trade-stasvinokur/dataton_kryptonite_tembr_2026
See src/experiments/v9_postproc_search/results.md for the exact post-processing grid search.
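As a rough guide to two of the post-processing steps named in the pipeline, the sketch below shows one common form of AS-Norm and of average query expansion. It is written under assumptions and may differ from the repository: `cohort=200` is read as the number of top cohort scores kept per item, the choice of cohort set is left to the caller, query expansion is an unweighted mean, and the function names are hypothetical.

```python
import numpy as np

def l2n(x, eps=1e-12):
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def as_norm(sims, cohort_sims, top_n=200):
    """Adaptive symmetric score normalization.

    sims:        (N, N) raw cosine scores between items
    cohort_sims: (N, C) scores of every item against a cohort set
    """
    top = -np.sort(-cohort_sims, axis=1)[:, :top_n]   # top_n highest per item
    mu = top.mean(axis=1)
    sd = top.std(axis=1) + 1e-8
    z_query = (sims - mu[:, None]) / sd[:, None]
    z_gallery = (sims - mu[None, :]) / sd[None, :]
    return 0.5 * (z_query + z_gallery)

def query_expansion(emb, idx, k=10):
    """Replace each embedding by the mean of itself and its top-k neighbours."""
    expanded = emb + emb[idx[:, :k]].sum(axis=1)
    return l2n(expanded)
```

After `query_expansion`, the expanded embeddings would be used for the re-retrieval step before re-ranking.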