MLX-compatible weights for WeSpeaker ResNet34-LM, converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.
WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. Trained on VoxCeleb for speaker verification and diarization.
Architecture:

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16 kHz)
  │
  ├─ Conv2d(1→32, k=3, p=1) + ReLU
  ├─ Layer1: 3× BasicBlock(32→32)
  ├─ Layer2: 4× BasicBlock(32→64, stride=2)
  ├─ Layer3: 6× BasicBlock(64→128, stride=2)
  ├─ Layer4: 3× BasicBlock(128→256, stride=2)
  │
  ├─ Statistics Pooling: mean + std → [B, 5120]
  └─ Linear(5120→256) → L2 normalize
  │
Output: [B, 256] speaker embedding
```
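A quick sanity check of the 5120 figure in the diagram above (arithmetic only, no model weights needed): the three stride-2 stages shrink the 80 mel bins to 10, the last stage has 256 channels, and statistics pooling concatenates mean and std.

```python
# Sanity check: derive the statistics-pooling width from the diagram above.
mel_bins = 80
channels = 256          # output channels of Layer4
stride2_stages = 3      # Layer2, Layer3, Layer4 each halve the frequency axis

freq_bins = mel_bins // 2**stride2_stages   # 80 / 8 = 10 frequency bins left
per_frame = freq_bins * channels            # 10 * 256 = 2560 features per time step
pooled = 2 * per_frame                      # mean + std concatenated

print(pooled)  # 5120, the input width of the final Linear(5120→256)
```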
BatchNorm is fused into Conv2d at conversion time, so there are no BN layers in the MLX model.
```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```
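Because the embeddings are already L2-normalized, cosine similarity reduces to a plain dot product. A NumPy stand-in for the Swift `cosineSimilarity` call above (the function name and random vectors here are illustrative, not part of the library):

```python
# Sketch: cosine similarity of L2-normalized speaker embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # General form; for unit-norm inputs the denominator is 1.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.standard_normal(256); a /= np.linalg.norm(a)  # stand-in 256-dim embedding
b = rng.standard_normal(256); b /= np.linalg.norm(b)

# With unit-norm embeddings, the two formulations agree.
assert abs(cosine_similarity(a, b) - float(np.dot(a, b))) < 1e-6
```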
Part of speech-swift.
```bash
python3 scripts/convert_wespeaker.py --upload
```
Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:
- BatchNorm fusion: w_fused = w × γ/√(σ² + ε), b_fused = β − μ × γ/√(σ² + ε)
- Weight layout: [O, I, H, W] → [O, H, W, I] for MLX channels-last
- Key renaming: `resnet.` prefix stripped, `seg_1` → `embedding`
- `num_batches_tracked` keys dropped

| PyTorch Key | MLX Key | Shape |
|---|---|---|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |
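The BatchNorm-fusion formulas can be verified numerically. A minimal per-channel sketch (a 1×1 "conv" stands in for the real 3×3 kernels; the original conv has no bias, matching the checkpoint):

```python
# Sketch: conv followed by BN equals a single conv with fused weights.
import numpy as np

rng = np.random.default_rng(1)
O = 4                                  # toy number of channels
w = rng.standard_normal(O)             # per-channel conv weight (1x1 stand-in)
b = np.zeros(O)                        # original conv has no bias
gamma, beta = rng.standard_normal(O), rng.standard_normal(O)
mu, var, eps = rng.standard_normal(O), rng.random(O) + 0.1, 1e-5

scale = gamma / np.sqrt(var + eps)
w_fused = w * scale                    # w_fused = w × γ/√(σ² + ε)
b_fused = beta + (b - mu) * scale      # b_fused = β − μ × γ/√(σ² + ε) when b = 0

x = rng.standard_normal(O)
ref = (w * x + b - mu) / np.sqrt(var + eps) * gamma + beta   # conv, then BN
fused = w_fused * x + b_fused                                # fused conv only
assert np.allclose(ref, fused)
```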
The original WeSpeaker model is released under the MIT License.
Base model: pyannote/wespeaker-voxceleb-resnet34-LM