dataton-v5b-wavlm-finetune

Fine-tuned WavLM-Large for speaker retrieval (Dataton Kryptonite Tembr 2026).

Base Model

Orange/Speaker-wavLM-id — WavLM-Large pretrained for speaker verification on VoxCeleb

Fine-tuning Details

Dataset: 673K audio files, 11,049 speakers (Kryptonite Tembr 2026 train set)
GPU: NVIDIA RTX 3090 (24 GB VRAM) on vast.ai
Stages:
- Stage 1 (3 epochs): frozen backbone, head initialization. LR=1e-3, margin=0.2
- Stage 2 (5 epochs): full fine-tune. LR=5e-5, margin=0.2, 3 s chunks
Augmentations: MUSAN noise (0.6), RIR reverb (0.6), speed perturbation (0.5), codec (0.3)
Loss: AAMSoftmax, scale=30, 3 sub-centers per speaker
Embedding dim: 250
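
For readers unfamiliar with the loss, the following is a minimal sketch of an additive angular margin softmax with sub-centers, using the hyperparameters listed above (margin=0.2, scale=30, 3 sub-centers, 250-dim embeddings, 11,049 speakers). The class name `SubCenterAAMSoftmax` and its layout are illustrative assumptions, not the training code from the repository, and the sketch omits refinements such as easy-margin handling for very large angles.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SubCenterAAMSoftmax(nn.Module):
    """Illustrative AAM-softmax with k sub-centers per speaker (hypothetical class)."""

    def __init__(self, emb_dim=250, n_classes=11049, k=3, margin=0.2, scale=30.0):
        super().__init__()
        self.k, self.n_classes = k, n_classes
        self.margin, self.scale = margin, scale
        # One row per (speaker, sub-center) pair, stored speaker-major
        self.weight = nn.Parameter(torch.empty(n_classes * k, emb_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, emb, labels):
        # Cosine similarity to every sub-center: (B, n_classes * k)
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        # Keep the best-matching sub-center per speaker: (B, n_classes)
        cos = cos.view(-1, self.n_classes, self.k).max(dim=2).values
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        # Add the angular margin to the target speaker only
        target = torch.cos(torch.acos(cos) + self.margin)
        one_hot = F.one_hot(labels, self.n_classes).bool()
        logits = self.scale * torch.where(one_hot, target, cos)
        return F.cross_entropy(logits, labels)
```

Taking the maximum over sub-centers lets noisy or atypical recordings of a speaker attach to a secondary center instead of pulling the main one.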

Architecture

Raw waveform, 16 kHz
  → MVN normalization (per-utterance)
  → WavLMModel (24 transformer layers, 1024 hidden)
  → Stats pooling: mean + std → (B, 2048)
  → TopLayers: Conv1d 2048→512 → BN → ReLU → Conv1d 512→250 → BN → ReLU → L2-norm
  → 250-dim embedding
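
The stages after the WavLM backbone can be sketched in PyTorch as below. This is an approximation under stated assumptions: the names `mvn`, `stats_pool` and `TopLayers` are illustrative, `kernel_size=1` for the Conv1d layers is a guess (the card gives only channel sizes), and the backbone itself is left out, so the example feeds random tensors where WavLM hidden states would go.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mvn(wave, eps=1e-7):
    """Per-utterance mean/variance normalization of a (B, T) waveform batch."""
    mean = wave.mean(dim=1, keepdim=True)
    std = wave.std(dim=1, keepdim=True)
    return (wave - mean) / (std + eps)


def stats_pool(hidden):
    """Concatenate mean and std over time: (B, T, 1024) -> (B, 2048)."""
    return torch.cat([hidden.mean(dim=1), hidden.std(dim=1)], dim=1)


class TopLayers(nn.Module):
    """Projection head: 2048 -> 512 -> 250, followed by L2 normalization."""

    def __init__(self, in_dim=2048, mid_dim=512, emb_dim=250):
        super().__init__()
        # kernel_size=1 is an assumption; the card lists only channel sizes
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, mid_dim, kernel_size=1), nn.BatchNorm1d(mid_dim), nn.ReLU(),
            nn.Conv1d(mid_dim, emb_dim, kernel_size=1), nn.BatchNorm1d(emb_dim), nn.ReLU(),
        )

    def forward(self, pooled):
        # Conv1d expects (B, C, L); treat the pooled vector as a length-1 sequence
        x = self.net(pooled.unsqueeze(-1)).squeeze(-1)
        return F.normalize(x, dim=1)
```

In a full pipeline, `mvn` would be applied to the waveform before the backbone, and `stats_pool` to the backbone's last hidden state before `TopLayers`.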

Performance

Solo model on Kryptonite Tembr public leaderboard:

P@10 = 0.6167 (rank 15)

In ensemble with pretrained WavLM + ERes2NetV2:

P@10 = 0.7040 (rank 7)
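
As a reading aid for the P@10 figures, here is a small sketch of precision at k. It assumes the metric is the fraction of the top-k retrieved items that share the query's speaker, averaged over queries; the leaderboard's exact definition is not stated here, and `precision_at_k` is a hypothetical helper.

```python
import numpy as np


def precision_at_k(retrieved_labels, query_labels, k=10):
    """Mean fraction of the top-k retrieved items whose label matches the query.

    retrieved_labels: (N, >=k) speaker labels of the ranked results per query.
    query_labels:     (N,) speaker label of each query.
    """
    retrieved = np.asarray(retrieved_labels)[:, :k]
    queries = np.asarray(query_labels)[:, None]
    return float((retrieved == queries).mean())
```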

Usage

from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("stasvinokur/dataton-v5b-wavlm-finetune", "stage2_best.pt")
state = torch.load(ckpt_path, map_location="cpu")

# model_state_dict and head_state_dict are in the checkpoint
print(state.keys())

Full inference pipeline: https://github.com/trade-stasvinokur/dataton_kryptonite_tembr_2026

Reproducibility: what you need for the P@10 = 0.7040 submission

The 0.7040 result is an ensemble of three speaker embeddings. Only one model weight file (this repository) requires manual hosting; the other two are standard pretrained checkpoints that the code fetches automatically.

| Artifact | Where hosted | Size | How to fetch |
|---|---|---|---|
| `stage2_best.pt` (our fine-tune) | this HF repo | 3.8 GB | `hf_hub_download("stasvinokur/dataton-v5b-wavlm-finetune", "stage2_best.pt")` |
| WavLM-Large base (v5a) | HF `Orange/Speaker-wavLM-id` | 1.2 GB | auto via `transformers.AutoModel.from_pretrained(...)` |
| ERes2NetV2 (v3b) | modelscope `iic/speech_eres2netv2_sv_zh-cn_16k-common` | ~70 MB | auto via `modelscope.pipelines.pipeline(...)` |

Only stage2_best.pt needed manual hosting — and it is now in this repo.

Pipeline (reproduces 0.7040)

1. Download all three checkpoints (above).
2. Run inference with each model to get per-model embeddings for the test set → three emb_avg.npy files.
3. L2-normalize each → concatenate (250 + 250 + 192 = 692 dims).
4. Post-processing:
   - centering (gallery mean)
   - ZCA whitening (eps=1e-5)
   - L2 normalize
   - cosine retrieval top-50
   - AS-Norm (cohort=200)
   - query expansion (k=10)
   - re-retrieval top-50
   - k-reciprocal re-rank (k1=20, k2=6, λ=0.3)
5. Take the top-10 → submission.csv
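
The fusion and the first post-processing steps above can be sketched with NumPy as follows. This is a simplified illustration, not the repository's code: the function names are hypothetical, it assumes the test set is searched against itself (so each item is excluded from its own results), it uses a simple mean-of-neighbors form of query expansion, and it leaves out AS-Norm and k-reciprocal re-ranking.

```python
import numpy as np


def l2norm(x, eps=1e-12):
    """Scale every row to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def fuse(embs):
    """L2-normalize each model's (N, d_i) embeddings and concatenate them."""
    return np.concatenate([l2norm(e) for e in embs], axis=1)


def center_and_whiten(x, eps=1e-5):
    """Subtract the mean, apply ZCA whitening, then L2-normalize."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / len(xc)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    zca = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return l2norm(xc @ zca)


def top_k(x, k=50):
    """Indices of the k most cosine-similar rows for every row (self excluded)."""
    sim = x @ x.T  # rows are unit length, so this is cosine similarity
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]


def query_expansion(x, neighbors, k=10):
    """Replace each vector by the mean of itself and its k nearest neighbors."""
    expanded = (x + x[neighbors[:, :k]].sum(axis=1)) / (k + 1)
    return l2norm(expanded)
```

A typical call order under these assumptions would be `fuse` on the three embedding arrays, then `center_and_whiten`, `top_k`, `query_expansion`, and a second `top_k` on the expanded vectors, keeping the first 10 columns for the submission.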

Full reproducible code: https://github.com/trade-stasvinokur/dataton_kryptonite_tembr_2026

See src/experiments/v9_postproc_search/results.md for the exact post-processing grid search.
