Whisper Small Fine-tuned for Yoruba (Code-Switching)

Model Description

  • Developed by: LyngualLabs
  • Model type: Fine-tuned OpenAI Whisper Small (Seq2Seq/Transformer)
  • Languages: Yoruba, English (Code-Switched)
  • Finetuned from model: openai/whisper-small
  • Task: Automatic Speech Recognition (transcription)

Intended Uses & Limitations

This model is designed for transcribing speech that frequently alternates between Yoruba and English (Code-Switching). Unlike the base Whisper model, this version is specifically optimized to handle the accents, tonality, and linguistic blending found in Yoruba-English conversations.

πŸ›  Training Procedure

Training Overview

The model was fine-tuned using the Hugging Face Seq2SeqTrainer. Training ran in two stages (due to infrastructure interruptions) on a high-memory NVIDIA A100 (80 GB). The large VRAM budget allowed a per-device batch size of 128, ensuring stable gradients and faster convergence.

Training Logs

The following logs illustrate the model's convergence over the training duration. Data includes metrics from both the initial run and the resumed training session:

| Step  | Training Loss | Validation Loss | WER (%) |
|-------|---------------|-----------------|---------|
| 500   | 0.4292        | 0.4460          | 27.92   |
| 1000  | 0.2418        | 0.3411          | 22.91   |
| 1500  | 0.1400        | 0.3238          | 21.71   |
| 2000* | 0.0865        | 0.3612          | 22.29   |
| 2500* | 0.0578        | 0.3475          | 20.80   |

*Steps from the resumed training phase.

Hyperparameters

The model was trained with high-throughput settings optimized for the A100 80GB GPU:

  • Learning Rate: 1e-5
  • Batch Size: 128 (Per Device)
  • Precision: BF16 (Brain Float 16) with TF32 enabled
  • Optimizer: AdamW
  • Warmup Steps: 500
  • Dataloader: 12 Workers with pinned memory
  • Hardware: NVIDIA A100-SXM4-80GB
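The settings above map onto a Seq2SeqTrainingArguments configuration. The sketch below is an illustrative reconstruction, not the exact training script; output_dir is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-yoruba",  # placeholder path
    per_device_train_batch_size=128,
    learning_rate=1e-5,
    warmup_steps=500,
    bf16=True,                            # Brain Float 16 precision
    tf32=True,                            # TensorFloat-32 matmuls on Ampere GPUs
    optim="adamw_torch",                  # AdamW optimizer
    dataloader_num_workers=12,
    dataloader_pin_memory=True,
)

# After an interruption, Seq2SeqTrainer can resume from the optimizer and
# scheduler state of the latest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```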

Final Evaluation

The model was evaluated on a held-out test set after the final checkpoint.

| Metric   | Value  |
|----------|--------|
| Test WER | 20.76% |
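For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained illustration follows; real evaluations typically use a library such as jiwer or evaluate, usually after text normalization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference -> 20% WER
print(wer("mo fe lo si school", "mo fe lo si shool"))  # 0.2
```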

πŸ’» How to Use

You can use this model directly with the Hugging Face pipeline.

```python
from transformers import pipeline

# 1. Load the pipeline.
# The model defaults to Yoruba language settings via its generation config.
pipe = pipeline("automatic-speech-recognition", model="LyngualLabs/whisper-small-yoruba")

# 2. Run transcription.
# forced_decoder_ids are handled in the config, so no extra arguments are needed.
# For recordings longer than 30 seconds, pass chunk_length_s=30 to enable chunked inference.
result = pipe("path_to_your_audio.wav")

print(result["text"])
```

⚠️ Limitations & Bias

  • Code-Switching Sensitivity: While robust, the model may occasionally transcribe English loanwords using Yoruba orthography (or vice versa) depending on the speaker's accent.
  • Audio Environment: Performance is best on clear, near-field audio. Noisy backgrounds may increase the WER significantly compared to studio-quality recordings.
  • Dialects: The training data focuses on standard Yoruba mixed with English. Deep dialectal variations of Yoruba may result in lower accuracy.
  • Inference: Ensure forced_decoder_ids are set to None (or handled automatically by the pipeline) to allow the model to naturally predict the language tokens.
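When loading the model manually rather than through the pipeline, the forcing can be cleared on the generation config. A minimal sketch, assuming the standard WhisperProcessor workflow (the commented lines show where your audio array would be fed in):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("LyngualLabs/whisper-small-yoruba")
model = WhisperForConditionalGeneration.from_pretrained("LyngualLabs/whisper-small-yoruba")

# Clear any forced language/task tokens so the model predicts them itself
model.generation_config.forced_decoder_ids = None

# inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
# predicted_ids = model.generate(inputs.input_features)
# text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```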

πŸ“œ Citation

If you use this model in your research, please cite it as follows:

@misc{whisper-small-yoruba-cs,
  author = {LyngualLabs},
  title = {Whisper Small Fine-tuned for Yoruba-English Code-Switching},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/LyngualLabs/whisper-small-yoruba}
}