# Whisper Small Fine-tuned for Yoruba (Code-Switching)
## Model Description
- Developed by: LyngualLabs
- Model type: Fine-tuned OpenAI Whisper Small (Seq2Seq/Transformer)
- Languages: Yoruba, English (Code-Switched)
- Finetuned from model: openai/whisper-small
- Task: Automatic Speech Recognition (Transcribe)
## Intended Uses & Limitations
This model is designed for transcribing speech that frequently alternates between Yoruba and English (code-switching). Unlike the base Whisper model, it is specifically optimized for the accents, tonality, and linguistic blending found in Yoruba-English conversations.
## Training Procedure
### Training Overview
The model was fine-tuned with the Hugging Face `Seq2SeqTrainer`. Training ran in two stages (due to infrastructure interruptions) on an NVIDIA A100 (80 GB). The large VRAM budget allowed a per-device batch size of 128, which kept gradients stable and sped up convergence.
### Training Logs
The following logs show the model's convergence over training, combining metrics from the initial run and the resumed session:
| Step | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 500 | 0.4292 | 0.4460 | 27.92 |
| 1000 | 0.2418 | 0.3411 | 22.91 |
| 1500 | 0.1400 | 0.3238 | 21.71 |
| 2000* | 0.0865 | 0.3612 | 22.29 |
| 2500* | 0.0578 | 0.3475 | 20.80 |
*Steps from the resumed training phase.
### Hyperparameters
The model was trained with high-throughput settings optimized for the A100 80GB GPU:
- Learning Rate: 1e-5
- Batch Size: 128 (Per Device)
- Precision: BF16 (Brain Float 16) with TF32 enabled
- Optimizer: AdamW
- Warmup Steps: 500
- Dataloader: 12 Workers with pinned memory
- Hardware: NVIDIA A100-SXM4-80GB
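As a rough illustration, the settings above map onto Hugging Face `Seq2SeqTrainingArguments` keyword arguments as sketched below. Argument names follow the transformers API; the output directory, the `adamw_torch` optimizer string, and any settings not listed above are assumptions, not the model's actual training script.

```python
# Sketch only: the hyperparameters above expressed as keyword arguments
# for Seq2SeqTrainingArguments. Values come from the list above;
# "output_dir" and the optimizer string are illustrative assumptions.
training_kwargs = dict(
    learning_rate=1e-5,               # Learning Rate
    per_device_train_batch_size=128,  # Batch Size (Per Device)
    bf16=True,                        # BF16 precision
    tf32=True,                        # TF32 matmuls enabled
    optim="adamw_torch",              # AdamW optimizer
    warmup_steps=500,                 # Warmup Steps
    dataloader_num_workers=12,        # Dataloader: 12 workers
    dataloader_pin_memory=True,       # ...with pinned memory
)

# Passed to the trainer roughly as:
# args = Seq2SeqTrainingArguments(output_dir="whisper-small-yoruba", **training_kwargs)
# trainer = Seq2SeqTrainer(model=model, args=args, ...)
```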
## Final Evaluation
The model was evaluated on a held-out test set after the final checkpoint.
| Metric | Value |
|---|---|
| Test WER | 20.76% |
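For reference, WER here is the standard word error rate: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. Evaluation typically uses a library metric such as `jiwer`; the pure-Python sketch below is only illustrative of the definition.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,       # deletion
                d[i][j - 1] + 1,       # insertion
                d[i - 1][j - 1] + cost # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in four reference words -> WER of 0.25 (25%).
print(wer("mo fe lo si school today", "mo fe lo si school yesterday"))
```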
## How to Use
You can use this model directly with the Hugging Face `pipeline` API.
```python
from transformers import pipeline

# 1. Load the pipeline.
# The model defaults to Yoruba language settings via its generation config.
pipe = pipeline(
    "automatic-speech-recognition",
    model="LyngualLabs/whisper-small-yoruba",
)

# 2. Run transcription.
# No need to set forced_decoder_ids; the generation config handles them.
result = pipe("path_to_your_audio.wav")
print(result["text"])
```
## Limitations & Bias
- Code-Switching Sensitivity: While robust, the model may occasionally transcribe English loanwords using Yoruba orthography (or vice versa) depending on the speaker's accent.
- Audio Environment: Performance is best on clear, near-field audio. Noisy backgrounds may increase the WER significantly compared to studio-quality recordings.
- Dialects: The training data focuses on standard Yoruba mixed with English. Deep dialectal variations of Yoruba may result in lower accuracy.
- Inference: If you call `model.generate()` directly, set `forced_decoder_ids` to `None` (the pipeline handles this automatically) so the model can predict the language tokens itself.
## Citation
If you use this model in your research, please cite it as follows:
```bibtex
@misc{whisper-small-yoruba-cs,
  author    = {LyngualLabs},
  title     = {Whisper Small Fine-tuned for Yoruba-English Code-Switching},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/LyngualLabs/whisper-small-yoruba}
}
```