Whisper Small Fine-tuned for Yoruba (Code-Switching)

Model Description

  • Developed by: LyngualLabs
  • Model type: Fine-tuned OpenAI Whisper Small (Seq2Seq/Transformer)
  • Languages: Yoruba, English (Code-Switched)
  • Finetuned from model: openai/whisper-small
  • Task: Automatic Speech Recognition (transcription)

Intended Uses & Limitations

This model is designed for transcribing speech that frequently alternates between Yoruba and English (Code-Switching). Unlike the base Whisper model, this version is specifically optimized to handle the accents, tonality, and linguistic blending found in Yoruba-English conversations.

πŸ›  Training Procedure

Training Overview

The model was fine-tuned using the Hugging Face Seq2SeqTrainer. Training ran in two stages (due to infrastructure interruptions) on a high-memory NVIDIA A100 (80 GB). The large VRAM budget allowed a per-device batch size of 128, ensuring stable gradients and faster convergence.

Training Logs

The following logs illustrate the model's convergence over the training duration. Data includes metrics from both the initial run and the resumed training session:

| Step  | Training Loss | Validation Loss | WER (%) |
|-------|---------------|-----------------|---------|
| 500   | 0.4292        | 0.4460          | 27.92   |
| 1000  | 0.2418        | 0.3411          | 22.91   |
| 1500  | 0.1400        | 0.3238          | 21.71   |
| 2000* | 0.0865        | 0.3612          | 22.29   |
| 2500* | 0.0578        | 0.3475          | 20.80   |

*Steps from the resumed training phase.

Hyperparameters

The model was trained with high-throughput settings optimized for the A100 80GB GPU:

  • Learning Rate: 1e-5
  • Batch Size: 128 (Per Device)
  • Precision: BF16 (Brain Float 16) with TF32 enabled
  • Optimizer: AdamW
  • Warmup Steps: 500
  • Dataloader: 12 Workers with pinned memory
  • Hardware: NVIDIA A100-SXM4-80GB
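The settings above map onto a Seq2SeqTrainingArguments configuration. The sketch below is an illustrative reconstruction, not the exact training script; output_dir is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-yoruba",  # placeholder path
    per_device_train_batch_size=128,
    learning_rate=1e-5,
    warmup_steps=500,
    bf16=True,                            # Brain Float 16 precision
    tf32=True,                            # TensorFloat-32 matmuls on Ampere GPUs
    optim="adamw_torch",                  # AdamW optimizer
    dataloader_num_workers=12,
    dataloader_pin_memory=True,
)

# After an interruption, Seq2SeqTrainer can resume from the optimizer and
# scheduler state of the latest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```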

Final Evaluation

The model was evaluated on a held-out test set after the final checkpoint.

| Metric   | Value  |
|----------|--------|
| Test WER | 20.76% |
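For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained illustration follows; real evaluations typically use a library such as jiwer or evaluate, usually after text normalization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference -> 20% WER
print(wer("mo fe lo si school", "mo fe lo si shool"))  # 0.2
```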

πŸ’» How to Use

You can use this model directly with the Hugging Face pipeline.

```python
from transformers import pipeline

# 1. Load the pipeline.
# The model defaults to Yoruba language settings via its generation config.
pipe = pipeline("automatic-speech-recognition", model="LyngualLabs/whisper-small-yoruba")

# 2. Run transcription.
# forced_decoder_ids are handled in the config, so no extra arguments are needed.
# For recordings longer than 30 seconds, pass chunk_length_s=30 to enable chunked inference.
result = pipe("path_to_your_audio.wav")

print(result["text"])
```

⚠️ Limitations & Bias

  • Code-Switching Sensitivity: While robust, the model may occasionally transcribe English loanwords using Yoruba orthography (or vice versa) depending on the speaker's accent.
  • Audio Environment: Performance is best on clear, near-field audio. Noisy backgrounds may increase the WER significantly compared to studio-quality recordings.
  • Dialects: The training data focuses on standard Yoruba mixed with English. Deep dialectal variations of Yoruba may result in lower accuracy.
  • Inference: Ensure forced_decoder_ids are set to None (or handled automatically by the pipeline) to allow the model to naturally predict the language tokens.
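When loading the model manually rather than through the pipeline, the forcing can be cleared on the generation config. A minimal sketch, assuming the standard WhisperProcessor workflow (the commented lines show where your audio array would be fed in):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("LyngualLabs/whisper-small-yoruba")
model = WhisperForConditionalGeneration.from_pretrained("LyngualLabs/whisper-small-yoruba")

# Clear any forced language/task tokens so the model predicts them itself
model.generation_config.forced_decoder_ids = None

# inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
# predicted_ids = model.generate(inputs.input_features)
# text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```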

πŸ“œ Citation

If you use this model in your research, please cite it as follows:

@misc{whisper-small-yoruba-cs,
  author = {LyngualLabs},
  title = {Whisper Small Fine-tuned for Yoruba-English Code-Switching},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/LyngualLabs/whisper-small-yoruba}
}