Whisper-Tiny Fine-Tuned for Kazakh Children's Speech

Automatic Speech Recognition (ASR) for Kazakh Children's Voices

This model is a fine-tuned version of OpenAI Whisper-tiny, specifically adapted for Kazakh children's speech (ages 5–12). Training was performed in two stages, using a total of 2172 audio samples:

  • Stage 1: 236 samples (initial fine-tuning)
  • Stage 2: 1936 samples (extended fine-tuning)

The model significantly improves recognition accuracy for short child-spoken Kazakh words and simple phrases.


📌 Key Improvements

Metric         Base Whisper-tiny    Fine-tuned Model    Improvement
WER            0.80                 0.42                -47.5%
Accuracy       20.43%               57.95%              +183%
Exact Match    ~20%                 61.11%              +199%

Improvement figures are relative changes versus the base model (e.g. WER drops from 0.80 to 0.42, a 47.5% relative reduction).

The model performs best for single-word utterances, which reflects typical children's speech in the dataset.
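Figures like these can be reproduced with standard tooling. A minimal sketch using the jiwer library (an assumption; the card does not say which scorer was used) on toy reference/hypothesis pairs:

import jiwer

refs = ["бір", "екі он", "қарлығаш"]        # toy ground-truth transcripts
hyps = ["бір", "екі он бес", "қарлығаш"]    # toy model outputs

print("WER:", jiwer.wer(refs, hyps))        # corpus-level word error rate
print("Exact match:", sum(r == h for r, h in zip(refs, hyps)) / len(refs))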


📂 Dataset Overview

Total dataset used: 2172 audio files

  • Stage 1: 236 samples
  • Stage 2: 1936 samples

Audio characteristics:

  • Format: WAV
  • Sample rate: 16 kHz
  • Channels: mono
  • Bit depth: 16-bit PCM
  • Speakers: children aged 5–12
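
For files that do not already match this format, a small torchaudio sketch can normalize them (the helper name is illustrative; torchaudio is also used in the inference example below):

import torchaudio

def to_dataset_format(src: str, dst: str) -> None:
    # Downmix to mono, resample to 16 kHz, and write 16-bit PCM WAV.
    audio, sr = torchaudio.load(src)
    audio = audio.mean(dim=0, keepdim=True)       # any channel count -> mono
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)
    torchaudio.save(dst, audio, 16000, encoding="PCM_S", bits_per_sample=16)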

Transcript format:

Transcripts were automatically extracted from the filenames. Example: AUDIO-2024-03-22-20-11-05_қарлығаш.wav → transcript "қарлығаш" ("swallow").
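
A minimal sketch of the extraction logic this naming scheme implies (the helper name is illustrative, not from the training code):

from pathlib import Path

def transcript_from_filename(path: str) -> str:
    # Everything after the first underscore in the stem is the transcript.
    stem = Path(path).stem              # drops the .wav extension
    return stem.split("_", 1)[1]

# transcript_from_filename("AUDIO-2024-03-22-20-11-05_қарлығаш.wav") -> "қарлығаш"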

Dataset distribution:

  • 1-word samples: 1479 (76.3%)
  • 2–5 words: 446 (23.0%)
  • 6–10 words: 6 (0.3%)
  • 11+ words: 7 (0.4%)

Most frequent words are numerals ("бір"/"one", "екі"/"two", "үш"/"three", "он"/"ten", "жүз"/"hundred"), reflecting the dataset's educational focus.
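
Statistics like the length buckets and frequent-word counts above can be recomputed from the extracted transcripts; a sketch with a toy list standing in for the real labels:

from collections import Counter

transcripts = ["бір", "екі", "он", "бір екі үш"]    # toy stand-in for the real label list

buckets = Counter("1 word" if len(t.split()) == 1 else "2+ words"
                  for t in transcripts)
word_freq = Counter(w for t in transcripts for w in t.split())

print(buckets)                      # length distribution
print(word_freq.most_common(5))    # most frequent words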


🧠 Training Approach

Training was conducted on Google Colab (NVIDIA T4 GPU) using PyTorch + Hugging Face Transformers.


🔹 Stage 1 — Initial Fine-tuning (236 samples)

  • Epochs: 1
  • Batch size: 1
  • Learning rate: 1e-5
  • Loss decreased from 8.86 → 0.46

Purpose: Teach the model basic acoustic structure of Kazakh child speech.
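
The card does not include the training script; assuming the standard Hugging Face Seq2SeqTrainer recipe for Whisper, a hypothetical configuration mirroring the Stage 1 hyperparameters above would look like:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-kk-stage1",    # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    fp16=True,                           # mixed precision on the Colab T4
    logging_steps=25,
)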


🔹 Stage 2 — Extended Fine-tuning (1936 samples)

  • Started from Stage 1 model
  • Loss decreased from 0.32 → 0.23
  • Checkpoints saved every 200 steps
  • Automatic skipping of corrupted or invalid samples

Purpose: Improve generalization with a more diverse dataset.
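
Two of the bullets above map onto simple mechanisms. A sketch of the sample filter (names are illustrative; checkpointing every 200 steps corresponds to save_steps=200 in Seq2SeqTrainingArguments):

import torchaudio

def is_valid_sample(path: str) -> bool:
    # Drop files that fail to decode or contain no audio.
    try:
        audio, _ = torchaudio.load(path)
    except Exception:
        return False
    return audio.numel() > 0

all_files = ["a.wav", "b.wav"]                       # placeholder paths
clean_files = [f for f in all_files if is_valid_sample(f)]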


🧪 Inference Example

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_path = "yersultan-kv/whisper-kk-children-tiny"  # or a local checkpoint path
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def transcribe(audio_path):
    # Load audio, downmix to mono, and resample to the 16 kHz rate Whisper expects.
    audio, sr = torchaudio.load(audio_path)
    if audio.shape[0] > 1:
        audio = audio.mean(dim=0, keepdim=True)
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)

    # Convert the waveform to log-mel input features.
    features = processor.feature_extractor(
        audio.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)

    # Force Kazakh transcription (rather than translation or language detection).
    forced_ids = processor.get_decoder_prompt_ids(
        language="kazakh",
        task="transcribe"
    )

    with torch.no_grad():
        pred_ids = model.generate(features, forced_decoder_ids=forced_ids)

    return processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
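
Example call (using the sample filename from the dataset section; any local WAV works):

print(transcribe("AUDIO-2024-03-22-20-11-05_қарлығаш.wav"))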

๐Ÿ” Error Analysis

Common errors:

  • Phonetic substitutions (р→л, ж→з, қ→к)
  • Vowel confusion (ө→о, ү→у)
  • Dropped words in long sequences
  • Noisy background affecting predictions

Strengths:

  • Very strong on single-word recognition (66.73%)
  • Good for short educational phrases
  • Stable on most common Kazakh phonemes

📉 Limitations

  • Poor performance on long phrases (6+ words)
  • Dataset mostly consists of isolated words
  • Only children's voices; not suitable for adult speech
  • Whisper-tiny's limited capacity brings the usual small-model trade-offs

🔧 Future Improvements

  • Fine-tune Whisper-small or Whisper-base
  • Increase dataset to 5000+ samples
  • Add noisy/augmented variants
  • Improve long-phrase recognition
  • Quantization (FP16/INT8) for mobile deployment (see the sketch after this list)
  • Real-time streaming support
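
On the quantization bullet: reusing the model object from the inference example, FP16 casting is a one-liner in PyTorch, and dynamic INT8 quantization is available for CPU inference (a sketch, not a tested deployment path for this model):

import torch

model_fp16 = model.half().to("cuda")     # FP16 weights; requires a GPU
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
)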

👤 Author

Yersultan Kelesbayev
Kazakh National University (KazNU)

Faculty of Information Technology

Email: [email protected]

Special thanks to Duisenbekkyzy Zhansaya for providing the dataset.


📄 License

MIT License


🌟 Summary

This project provides one of the first fine-tuned Whisper models specifically for Kazakh children's speech. It substantially improves recognition accuracy over base Whisper-tiny and can serve as a foundation for:

  • educational apps
  • children's speech interfaces
  • Kazakh ASR research
  • low-resource language modeling
