Whisper-Tiny Fine-Tuned for Kazakh Children's Speech
Automatic Speech Recognition (ASR) for Kazakh Child Voices
This model is a fine-tuned version of OpenAI Whisper-tiny, adapted for Kazakh children's speech (ages 5–12). Training was performed in two stages on a total of 2172 audio samples:
- Stage 1: 236 samples (initial fine-tuning)
- Stage 2: 1936 samples (extended fine-tuning)
Compared with the base Whisper-tiny, the model roughly halves the word error rate (0.80 to 0.42) on short child-spoken Kazakh words and simple phrases.
Key Improvements
| Metric | Base Whisper-tiny | Fine-tuned Model | Relative change |
|---|---|---|---|
| WER | 0.80 | 0.42 | -47.5% |
| Accuracy | 20.43% | 57.95% | +183% |
| Exact Match | ~20% | 61.11% | +199% |
The model performs best on single-word utterances, which make up most of the dataset.
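For readers who want to reproduce the table's arithmetic, here is a minimal sketch of a word-level WER and of the relative change in the last column. It assumes the standard WER definition (word-level edit distance divided by reference length); the function names are illustrative, and the card does not state which tool was used to compute the reported metrics.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_change(before: float, after: float) -> float:
    """Relative change in percent between a baseline and a new value."""
    return (after - before) / before * 100.0


# Values taken from the table above
print(round(relative_change(0.80, 0.42), 1))    # WER row
print(round(relative_change(20.43, 57.95), 1))  # Accuracy row
```

A negative relative change is an improvement for WER, while a positive one is an improvement for accuracy and exact match.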
Dataset Overview
Total dataset used: 2172 audio files
- Stage 1: 236 samples
- Stage 2: 1936 samples
Audio characteristics:
- Format: WAV
- Sample rate: 16 kHz
- Channels: mono
- Bit depth: 16-bit PCM
- Speakers: children aged 5–12
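The audio characteristics above can be checked before training with Python's standard `wave` module. This is a minimal sketch, not the author's validation code; the function name `check_wav_format` is hypothetical.

```python
import wave


def check_wav_format(path, rate=16000, channels=1, sample_width=2):
    """Return a list of mismatches against the expected format (empty if OK).

    sample_width is in bytes, so 2 corresponds to 16-bit PCM.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate {wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channel(s), expected {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected {8 * sample_width}-bit")
    return problems
```

Calling `check_wav_format("sample.wav")` on a conforming file returns an empty list; a file that `wave` cannot parse raises `wave.Error`, which can be treated as a corrupted sample.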
Transcript format:
Transcripts were automatically extracted from the filenames.
Example:
AUDIO-2024-03-22-20-11-05_қарлығаш.wav
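A minimal sketch of this extraction step follows. It assumes that everything after the first underscore in the file stem is the transcript, which matches the example above but is not confirmed for multi-word phrases; the function name and the sample filename in the comment are hypothetical.

```python
from pathlib import Path


def transcript_from_filename(path: str) -> str:
    """Extract the transcript from a name like '<timestamp>_<transcript>.wav'.

    Assumption: the transcript is everything after the first underscore.
    """
    stem = Path(path).stem  # drop the directory and the .wav extension
    _prefix, sep, text = stem.partition("_")
    if not sep or not text.strip():
        raise ValueError(f"no transcript found in filename: {path!r}")
    return text.strip()


# Hypothetical filename following the pattern shown above
print(transcript_from_filename("AUDIO-2024-03-22-20-11-05_бір.wav"))
```

Raising an error for names without a transcript part makes it easy to exclude such files rather than train on empty labels.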
Dataset distribution:
- 1-word samples: 1479 (76.3%)
- 2–5 words: 446 (23.0%)
- 6–10 words: 6 (0.3%)
- 11+ words: 7 (0.4%)
The most frequent words are numerals ("бір", "екі", "үш", "он", "жүз"), which reflects the educational nature of the dataset.
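A length distribution like the one above can be derived directly from the transcripts. The sketch below is illustrative only: the bucket labels mirror the list above, the helper names are hypothetical, and the sample transcripts are made up for the example.

```python
from collections import Counter

# (low, high, label) word-count ranges mirroring the buckets listed above
BUCKETS = [(1, 1, "1 word"), (2, 5, "2-5 words"), (6, 10, "6-10 words")]


def length_bucket(transcript: str) -> str:
    """Map a transcript to its word-count bucket."""
    n = len(transcript.split())
    if n == 0:
        raise ValueError("empty transcript")
    for low, high, label in BUCKETS:
        if low <= n <= high:
            return label
    return "11+ words"


def length_distribution(transcripts):
    """Return {bucket label: (count, percentage)} for a list of transcripts."""
    counts = Counter(length_bucket(t) for t in transcripts)
    total = sum(counts.values())
    return {label: (c, round(100.0 * c / total, 1)) for label, c in counts.items()}


def most_common_words(transcripts, k=5):
    """Return the k most frequent words across all transcripts."""
    return Counter(w for t in transcripts for w in t.split()).most_common(k)


sample = ["бір", "екі", "бір екі үш", "он"]  # made-up sample, not the real data
print(length_distribution(sample))
print(most_common_words(sample, k=2))
```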
Training Approach
Training was conducted on Google Colab (NVIDIA T4 GPU) using PyTorch + Hugging Face Transformers.
Stage 1: Initial Fine-tuning (236 samples)
- Epochs: 1
- Batch size: 1
- Learning rate: 1e-5
- Loss decreased from 8.86 to 0.46
Purpose: Teach the model the basic acoustic structure of Kazakh child speech.
Stage 2: Extended Fine-tuning (1936 samples)
- Initialized from the Stage 1 model
- Loss decreased from 0.32 to 0.23
- Checkpoints saved every 200 steps
- Automatic skipping of corrupted or invalid samples
Purpose: Improve generalization with a more diverse dataset.
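The checkpointing and sample-skipping behavior described for Stage 2 can be sketched as a loop skeleton. This is an assumption-laden illustration, not the author's training script: `train_step` and `save_checkpoint` are hypothetical callables standing in for the real forward/backward pass and model saving, and one sample per step is assumed (as with the batch size of 1 in Stage 1).

```python
def run_training(samples, train_step, save_checkpoint, checkpoint_every=200):
    """Iterate over samples, skip the ones that fail, checkpoint periodically.

    train_step(sample) -> loss and save_checkpoint(step) are hypothetical
    callables supplied by the caller.
    """
    step = 0
    skipped = []
    last_loss = None
    for sample in samples:
        try:
            last_loss = train_step(sample)
        except Exception as err:  # a real loop should catch narrower error types
            skipped.append((sample, repr(err)))
            continue  # a failed sample does not count as a training step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(step)
    return {"steps": step, "skipped": skipped, "last_loss": last_loss}
```

Counting only successful steps keeps the checkpoint interval stable regardless of how many samples are skipped, and the returned `skipped` list makes it possible to inspect the bad files afterwards.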
Inference Example
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_path = "Yersultan/whisper-kk-children"  # replace with your model ID or local path
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def transcribe(audio_path):
    # torchaudio returns a [channels, samples] tensor and the sample rate
    audio, sr = torchaudio.load(audio_path)
    # Downmix to mono so the feature extractor always receives a 1-D signal
    audio = audio.mean(dim=0)
    # Whisper expects 16 kHz input
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)
    features = processor.feature_extractor(
        audio.numpy(),
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features.to(device)
    # Force Kazakh transcription instead of relying on language detection
    forced_ids = processor.get_decoder_prompt_ids(
        language="kazakh",
        task="transcribe",
    )
    with torch.no_grad():
        pred_ids = model.generate(features, forced_decoder_ids=forced_ids)
    return processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
Error Analysis
Common errors:
- Phonetic substitutions (р→л, ж→з, қ→к)
- Vowel confusion (ө→о, ү→у)
- Dropped words in long sequences
- Noisy background affecting predictions
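Substitution patterns like those listed above can be tallied from reference/prediction pairs. The sketch below uses `difflib.SequenceMatcher` from the standard library as an approximate character alignment; it is not a true minimum-edit-distance alignment and is not necessarily how the author performed the analysis. The function name is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_substitutions(reference: str, hypothesis: str) -> Counter:
    """Count (reference_char, predicted_char) substitution pairs.

    Only 'replace' spans of equal length are counted, so that characters
    can be paired one-to-one; insertions and deletions are ignored.
    """
    subs = Counter()
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for ref_char, hyp_char in zip(reference[i1:i2], hypothesis[j1:j2]):
                subs[(ref_char, hyp_char)] += 1
    return subs


# Aggregate over (reference, prediction) pairs; the pairs here are made up
totals = Counter()
for ref, hyp in [("қар", "кар"), ("жүз", "жуз")]:
    totals.update(char_substitutions(ref, hyp))
print(totals.most_common())
```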
Strengths:
- Best results on single-word recognition (66.73%)
- Good for short educational phrases
- Stable on most common Kazakh phonemes
Limitations
- Poor performance on long phrases (6+ words)
- Dataset mostly consists of isolated words
- Trained only on children's voices; not suitable for adult speech
- Whisper-tiny has limited capacity, with the usual small-model trade-offs
Future Improvements
- Fine-tune Whisper-small or Whisper-base
- Increase dataset to 5000+ samples
- Add noisy/augmented variants
- Improve long-phrase recognition
- Quantization (FP16/INT8) for mobile deployment
- Real-time streaming support
Author
Yersultan Kelesbayev, Kazakh National University (KazNU)
Faculty of Information Technology
Special thanks to Duisenbekkyzy Zhansaya for providing the dataset.
License
MIT License
Summary
This project provides one of the first fine-tuned Whisper models specifically for Kazakh children's speech. It reduces WER from 0.80 to 0.42 relative to the base model and can serve as a foundation for:
- educational apps
- children's speech interfaces
- Kazakh ASR research
- low-resource language modeling
Model tree for yersultan-kv/whisper-kk-children-tiny
Base model: openai/whisper-tiny