dataton-v5b-wavlm-finetune

Fine-tuned WavLM-Large for speaker retrieval (Dataton Kryptonite Tembr 2026).

Base Model

Fine-tuning Details

  • Dataset: 673K audio files, 11,049 speakers (Kryptonite Tembr 2026 train set)
  • GPU: NVIDIA RTX 3090 (24 GB VRAM) on vast.ai
  • Stages:
    • Stage 1 (3 epochs): frozen backbone, head initialization. LR=1e-3, margin=0.2
    • Stage 2 (5 epochs): full fine-tune. LR=5e-5, margin=0.2, 3 s chunks
  • Augmentations: MUSAN noise (0.6), RIR reverb (0.6), speed perturbation (0.5), codec (0.3)
  • Loss: AAMSoftmax, scale=30, 3 sub-centers per speaker
  • Embedding dimension: 250
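
To make the loss settings above concrete, here is a minimal NumPy sketch of how sub-center AAM-Softmax logits are commonly computed with margin=0.2, scale=30 and 3 sub-centers per speaker. It is an illustration, not the training code from the repository: the function name `subcenter_aam_logits` and the weight layout `(speakers, sub_centers, dim)` are assumptions, and the usual safeguards for angles close to π are omitted.

```python
import numpy as np

def subcenter_aam_logits(emb, weight, labels, margin=0.2, scale=30.0):
    """Sub-center additive angular margin logits (illustrative sketch).

    emb:    (B, D) embeddings
    weight: (C, K, D) class weights, K sub-centers per speaker (assumed layout)
    labels: (B,) integer speaker ids
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    weight = weight / np.linalg.norm(weight, axis=2, keepdims=True)
    # Cosine similarity to every sub-center: (B, C, K)
    cos_all = np.einsum("bd,ckd->bck", emb, weight)
    # Each speaker is represented by its closest sub-center: (B, C)
    cos = np.clip(cos_all.max(axis=2), -1.0, 1.0)
    rows = np.arange(emb.shape[0])
    # Add the angular margin only to the target-speaker logit
    theta_target = np.arccos(cos[rows, labels])
    logits = cos.copy()
    logits[rows, labels] = np.cos(theta_target + margin)
    # Scaled logits are then fed to a standard cross-entropy loss
    return scale * logits
```

The margin lowers the target logit, so the network has to pull embeddings closer to the correct speaker's nearest sub-center to reach the same loss.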

Architecture

Raw waveform 16 kHz
  → MVN normalization (per-utterance)
  → WavLMModel (24 transformer layers, 1024 hidden)
  → Stats pooling: mean + std → (B, 2048)
  → TopLayers: Conv1d 2048→512 → BN → ReLU → Conv1d 512→250 → BN → ReLU → L2-norm
  → 250-dim embedding
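
The part of the diagram after the WavLM backbone can be sketched in PyTorch as follows. This is a hedged illustration rather than the repository's module: the class name `StatsPoolHead`, the helper `mvn`, the kernel size of 1, and the assumption that pooling runs over the backbone's last hidden state of shape (B, T, 1024) are all choices made here and may not match the checkpoint's layer names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mvn(wave, eps=1e-5):
    """Per-utterance mean-variance normalization of a (B, samples) waveform batch."""
    mean = wave.mean(dim=-1, keepdim=True)
    std = wave.std(dim=-1, keepdim=True)
    return (wave - mean) / (std + eps)

class StatsPoolHead(nn.Module):
    """Stats pooling followed by the two-layer Conv1d head (hypothetical names)."""

    def __init__(self, hidden=1024, mid=512, emb_dim=250):
        super().__init__()
        self.top = nn.Sequential(
            nn.Conv1d(2 * hidden, mid, kernel_size=1),  # kernel size is an assumption
            nn.BatchNorm1d(mid),
            nn.ReLU(),
            nn.Conv1d(mid, emb_dim, kernel_size=1),
            nn.BatchNorm1d(emb_dim),
            nn.ReLU(),
        )

    def forward(self, frames):
        # frames: (B, T, hidden) frame-level features from the backbone
        stats = torch.cat([frames.mean(dim=1), frames.std(dim=1)], dim=1)  # (B, 2*hidden)
        x = self.top(stats.unsqueeze(-1)).squeeze(-1)  # (B, emb_dim)
        return F.normalize(x, dim=1)  # unit-length embedding
```

In use, `frames` would come from something like `WavLMModel(...)(mvn(wave)).last_hidden_state`, assuming the last layer is the one being pooled.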

Performance

Solo model on Kryptonite Tembr public leaderboard:

  • P@10 = 0.6167 (rank 15)

In ensemble with pretrained WavLM + ERes2NetV2:

  • P@10 = 0.7040 (rank 7)
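
For readers unfamiliar with the metric, the sketch below shows one common definition of P@10: the fraction of the top 10 retrieved items that share the query's speaker, averaged over all queries. The leaderboard's exact scoring rules are not given here, so treat this as an assumption; the function names are illustrative.

```python
def precision_at_k(retrieved_labels, query_label, k=10):
    """Fraction of the first k retrieved labels that match the query's speaker."""
    top = retrieved_labels[:k]
    return sum(1 for label in top if label == query_label) / k

def mean_precision_at_k(all_retrieved, query_labels, k=10):
    """Average precision@k over all queries."""
    scores = [precision_at_k(r, q, k) for r, q in zip(all_retrieved, query_labels)]
    return sum(scores) / len(scores)
```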

Usage

from huggingface_hub import hf_hub_download
import torch

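# Download the fine-tuned checkpoint from the Hugging Face Hub (cached after the first call)
ckpt_path = hf_hub_download("stasvinokur/dataton-v5b-wavlm-finetune", "stage2_best.pt")
# Load on CPU so that no GPU is needed just to inspect the weights
state = torch.load(ckpt_path, map_location="cpu")

# The checkpoint contains model_state_dict and head_state_dict
print(state.keys())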

Full inference pipeline: https://github.com/trade-stasvinokur/dataton_kryptonite_tembr_2026

Reproducibility: what you need for P@10 = 0.7040 submission

The 0.7040 result is an ensemble of three speaker-embedding models. Only one weight file (the one in this repository) required manual hosting; the other two are standard pretrained checkpoints that the code fetches automatically.

| Artifact | Where hosted | Size | How to fetch |
|---|---|---|---|
| stage2_best.pt (our fine-tune) | this HF repo | 3.8 GB | hf_hub_download("stasvinokur/dataton-v5b-wavlm-finetune", "stage2_best.pt") |
| WavLM-Large base (v5a) | HF Orange/Speaker-wavLM-id | 1.2 GB | auto via transformers.AutoModel.from_pretrained(...) |
| ERes2NetV2 (v3b) | modelscope iic/speech_eres2netv2_sv_zh-cn_16k-common | ~70 MB | auto via modelscope.pipelines.pipeline(...) |

Only stage2_best.pt needed manual hosting, and it is now in this repo.

Pipeline (reproduces 0.7040)

  1. Download all three checkpoints (above)
  2. Run inference with each model to get per-model embeddings for the test set → three emb_avg.npy files
  3. L2-normalize each → concatenate (250 + 250 + 192 = 692 dim)
  4. Post-processing:
    • centering (gallery mean)
    • ZCA whitening (eps=1e-5)
    • L2 normalize
    • cosine retrieval top-50
    • AS-Norm (cohort=200)
    • query expansion (k=10)
    • re-retrieval top-50
    • k-reciprocal re-rank (k1=20, k2=6, λ=0.3)
  5. Take top-10 β†’ submission.csv
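
The fusion and the first post-processing steps listed above (per-model L2 normalization, concatenation, gallery centering, ZCA whitening with eps=1e-5, L2 normalization, cosine top-50 retrieval) can be sketched in NumPy as below. This is an illustration under stated assumptions, not the repository's code: the function names are invented here, queries and gallery are passed as separate matrices, and AS-Norm, query expansion and k-reciprocal re-ranking are left out.

```python
import numpy as np

def l2norm(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def fuse(embs):
    """L2-normalize each model's (N, d_i) matrix, then concatenate the features."""
    return np.concatenate([l2norm(e) for e in embs], axis=1)

def fit_zca(gallery, eps=1e-5):
    """Estimate the gallery mean and the ZCA whitening matrix."""
    mean = gallery.mean(axis=0, keepdims=True)
    cov = np.cov(gallery - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # covariance is symmetric
    vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
    whiten = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mean, whiten

def transform(x, mean, whiten):
    """Center with the gallery mean, whiten, then L2-normalize."""
    return l2norm((x - mean) @ whiten)

def topk_cosine(queries, gallery, k=50):
    """Indices and scores of the k most similar gallery rows (inputs are unit-norm)."""
    sims = queries @ gallery.T
    idx = np.argsort(-sims, axis=1)[:, :k]
    return idx, np.take_along_axis(sims, idx, axis=1)
```

A typical call order would be `fused = fuse([emb_a, emb_b, emb_c])`, then `mean, whiten = fit_zca(fused)`, then `topk_cosine(transform(queries, mean, whiten), transform(fused, mean, whiten), k=50)`, where the three input matrices stand for the loaded emb_avg.npy files.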

Full reproducible code: https://github.com/trade-stasvinokur/dataton_kryptonite_tembr_2026

See src/experiments/v9_postproc_search/results.md for the exact post-processing grid search.
