Whisper-Tiny Fine-Tuned for Kazakh Children's Speech

Automatic Speech Recognition (ASR) for Kazakh Children's Voices

This model is a fine-tuned version of OpenAI Whisper-tiny, specifically adapted for Kazakh children's speech (ages 5–12). Training was performed in two stages, using a total of 2172 audio samples:

  • Stage 1: 236 samples (initial fine-tuning)
  • Stage 2: 1936 samples (extended fine-tuning)

The model significantly improves recognition accuracy for short child-spoken Kazakh words and simple phrases.


📌 Key Improvements

Metric         Base Whisper-tiny    Fine-tuned Model    Improvement
WER            0.80                 0.42                -47.5%
Accuracy       20.43%               57.95%              +183%
Exact Match    ~20%                 61.11%              +199%

Improvement figures are relative changes versus the base model (e.g. WER drops from 0.80 to 0.42, a 47.5% relative reduction).

The model performs best for single-word utterances, which reflects typical children's speech in the dataset.
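Figures like these can be reproduced with standard tooling. A minimal sketch using the jiwer library (an assumption; the card does not say which scorer was used) on toy reference/hypothesis pairs:

import jiwer

refs = ["бір", "екі он", "қарлығаш"]        # toy ground-truth transcripts
hyps = ["бір", "екі он бес", "қарлығаш"]    # toy model outputs

print("WER:", jiwer.wer(refs, hyps))        # corpus-level word error rate
print("Exact match:", sum(r == h for r, h in zip(refs, hyps)) / len(refs))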


📂 Dataset Overview

Total dataset used: 2172 audio files

  • Stage 1: 236 samples
  • Stage 2: 1936 samples

Audio characteristics:

  • Format: WAV
  • Sample rate: 16 kHz
  • Channels: mono
  • Bit depth: 16-bit PCM
  • Speakers: children aged 5–12
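
For files that do not already match this format, a small torchaudio sketch can normalize them (the helper name is illustrative; torchaudio is also used in the inference example below):

import torchaudio

def to_dataset_format(src: str, dst: str) -> None:
    # Downmix to mono, resample to 16 kHz, and write 16-bit PCM WAV.
    audio, sr = torchaudio.load(src)
    audio = audio.mean(dim=0, keepdim=True)       # any channel count -> mono
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)
    torchaudio.save(dst, audio, 16000, encoding="PCM_S", bits_per_sample=16)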

Transcript format:

Transcripts were automatically extracted from the filenames. Example: AUDIO-2024-03-22-20-11-05_қарлығаш.wav → transcript "қарлығаш" ("swallow").
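
A minimal sketch of the extraction logic this naming scheme implies (the helper name is illustrative, not from the training code):

from pathlib import Path

def transcript_from_filename(path: str) -> str:
    # Everything after the first underscore in the stem is the transcript.
    stem = Path(path).stem              # drops the .wav extension
    return stem.split("_", 1)[1]

# transcript_from_filename("AUDIO-2024-03-22-20-11-05_қарлығаш.wav") -> "қарлығаш"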

Dataset distribution:

  • 1-word samples: 1479 (76.3%)
  • 2–5 words: 446 (23.0%)
  • 6–10 words: 6 (0.3%)
  • 11+ words: 7 (0.4%)

Most frequent words are numerals ("бір"/"one", "екі"/"two", "үш"/"three", "он"/"ten", "жүз"/"hundred"), reflecting the dataset's educational focus.
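
Statistics like the length buckets and frequent-word counts above can be recomputed from the extracted transcripts; a sketch with a toy list standing in for the real labels:

from collections import Counter

transcripts = ["бір", "екі", "он", "бір екі үш"]    # toy stand-in for the real label list

buckets = Counter("1 word" if len(t.split()) == 1 else "2+ words"
                  for t in transcripts)
word_freq = Counter(w for t in transcripts for w in t.split())

print(buckets)                      # length distribution
print(word_freq.most_common(5))    # most frequent words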


🧠 Training Approach

Training was conducted on Google Colab (NVIDIA T4 GPU) using PyTorch + Hugging Face Transformers.


🔹 Stage 1 — Initial Fine-tuning (236 samples)

  • Epochs: 1
  • Batch size: 1
  • Learning rate: 1e-5
  • Loss decreased from 8.86 → 0.46

Purpose: Teach the model basic acoustic structure of Kazakh child speech.
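
The card does not include the training script; assuming the standard Hugging Face Seq2SeqTrainer recipe for Whisper, a hypothetical configuration mirroring the Stage 1 hyperparameters above would look like:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-kk-stage1",    # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    fp16=True,                           # mixed precision on the Colab T4
    logging_steps=25,
)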


🔹 Stage 2 — Extended Fine-tuning (1936 samples)

  • Started from Stage 1 model
  • Loss decreased from 0.32 → 0.23
  • Checkpoints saved every 200 steps
  • Automatic skipping of corrupted or invalid samples

Purpose: Improve generalization with a more diverse dataset.
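
Two of the bullets above map onto simple mechanisms. A sketch of the sample filter (names are illustrative; checkpointing every 200 steps corresponds to save_steps=200 in Seq2SeqTrainingArguments):

import torchaudio

def is_valid_sample(path: str) -> bool:
    # Drop files that fail to decode or contain no audio.
    try:
        audio, _ = torchaudio.load(path)
    except Exception:
        return False
    return audio.numel() > 0

all_files = ["a.wav", "b.wav"]                       # placeholder paths
clean_files = [f for f in all_files if is_valid_sample(f)]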


🧪 Inference Example

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_path = "yersultan-kv/whisper-kk-children-tiny"  # or a local checkpoint path
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def transcribe(audio_path):
    # Load audio, downmix to mono, and resample to the 16 kHz rate Whisper expects.
    audio, sr = torchaudio.load(audio_path)
    if audio.shape[0] > 1:
        audio = audio.mean(dim=0, keepdim=True)
    if sr != 16000:
        audio = torchaudio.transforms.Resample(sr, 16000)(audio)

    # Convert the waveform to log-mel input features.
    features = processor.feature_extractor(
        audio.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)

    # Force Kazakh transcription (rather than translation or language detection).
    forced_ids = processor.get_decoder_prompt_ids(
        language="kazakh",
        task="transcribe"
    )

    with torch.no_grad():
        pred_ids = model.generate(features, forced_decoder_ids=forced_ids)

    return processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
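
Example call (using the sample filename from the dataset section; any local WAV works):

print(transcribe("AUDIO-2024-03-22-20-11-05_қарлығаш.wav"))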

๐Ÿ” Error Analysis

Common errors:

  • Phonetic substitutions (р→л, ж→з, қ→к)
  • Vowel confusion (ө→о, ү→у)
  • Dropped words in long sequences
  • Noisy background affecting predictions

Strengths:

  • Very strong on single-word recognition (66.73%)
  • Good for short educational phrases
  • Stable on most common Kazakh phonemes

📉 Limitations

  • Poor performance on long phrases (6+ words)
  • Dataset mostly consists of isolated words
  • Only children's voices; not suitable for adult speech
  • Whisper-tiny's limited capacity brings the usual small-model trade-offs

🔧 Future Improvements

  • Fine-tune Whisper-small or Whisper-base
  • Increase dataset to 5000+ samples
  • Add noisy/augmented variants
  • Improve long-phrase recognition
  • Quantization (FP16/INT8) for mobile deployment (see the sketch after this list)
  • Real-time streaming support
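
On the quantization bullet: reusing the model object from the inference example, FP16 casting is a one-liner in PyTorch, and dynamic INT8 quantization is available for CPU inference (a sketch, not a tested deployment path for this model):

import torch

model_fp16 = model.half().to("cuda")     # FP16 weights; requires a GPU
model_int8 = torch.quantization.quantize_dynamic(
    model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
)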

👤 Author

Yersultan Kelesbayev
Kazakh National University (KazNU)

Faculty of Information Technology

Email: [email protected]

Special thanks to Duisenbekkyzy Zhansaya for providing the dataset.


📄 License

MIT License


🌟 Summary

This project provides one of the first fine-tuned Whisper models specifically for Kazakh children's speech. It substantially improves recognition accuracy over base Whisper-tiny and can serve as a foundation for:

  • educational apps
  • children's speech interfaces
  • Kazakh ASR research
  • low-resource language modeling
