
Trustt-IndicVoices-Whisper-Large-v3

Model Description

Trustt-IndicVoices-Whisper-Large-v3 is a spoken language identification model developed by Trustt.com. Given a short audio clip, it predicts which of 23 languages used in India is being spoken. It is built on OpenAI's Whisper-Large-v3 architecture and fine-tuned with LoRA on the AI4Bharat IndicVoices dataset, for use in multilingual speech-processing applications.

This model is part of Trustt.com's commitment to open-source innovation in speech technology, providing SaaS platforms and enterprises serving the Indian market with robust, production-ready language identification.

Supported Languages

The model classifies audio into 23 languages: the 22 scheduled languages of India plus English. Output index i corresponds to label_list[i]:

label_list = [
    "assamese",
    "bengali",
    "bodo",
    "dogri",
    "english",
    "gujarati",
    "hindi",
    "kannada",
    "kashmiri",
    "konkani",
    "maithili",
    "malayalam",
    "manipuri",
    "marathi",
    "nepali",
    "odia",
    "punjabi",
    "sanskrit",
    "santali",
    "sindhi",
    "tamil",
    "telugu",
    "urdu"
]
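Because predictions are decoded by indexing into this list (the usage example below does `label_list[predicted_language_idx]`), it can be convenient to build explicit index/label lookup tables. A minimal sketch; the names `id2label` and `label2id` are illustrative and are not part of the repository's API.

```python
# Illustrative index <-> label lookups.
# Assumption: model output index i corresponds to label_list[i].
label_list = [
    "assamese", "bengali", "bodo", "dogri", "english",
    "gujarati", "hindi", "kannada", "kashmiri", "konkani",
    "maithili", "malayalam", "manipuri", "marathi", "nepali",
    "odia", "punjabi", "sanskrit", "santali", "sindhi",
    "tamil", "telugu", "urdu",
]

id2label = dict(enumerate(label_list))                     # {0: "assamese", ...}
label2id = {label: idx for idx, label in id2label.items()}  # {"assamese": 0, ...}

print(len(label_list))    # 23
print(label2id["hindi"])  # 6
print(id2label[20])       # tamil
```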

Installation

Prerequisites

  • Python 3.8 or higher
  • PyTorch 1.9.0 or higher
  • CUDA-capable GPU (recommended for optimal performance)

Download Model

# Download the model repository into a local project directory
huggingface-cli download vignesh-trustt/whisper-v3-large-IndicVoices --local-dir Trustt-IndicVoices

Setup

# Create a virtual environment
conda create -n trustt_indic_whisper python=3.8
conda activate trustt_indic_whisper

# Navigate to the project directory
cd Trustt-IndicVoices

# Install the package and dependencies
pip install -e .

Usage

Model Loading

import torch
import torch.nn.functional as F
from src.model.dialect.whisper_dialect import WhisperWrapper

# Initialize device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load Trustt-IndicVoices-Whisper-Large-v3 from Hugging Face
model = WhisperWrapper.from_pretrained("vignesh-trustt/whisper-v3-large-IndicVoices").to(device)
model.eval()  # Switch to evaluation mode (disables dropout)

Language Classification

label_list = [
    "assamese", "bengali", "bodo", "dogri", "english",
    "gujarati", "hindi", "kannada", "kashmiri", "konkani",
    "maithili", "malayalam", "manipuri", "marathi", "nepali",
    "odia", "punjabi", "sanskrit", "santali", "sindhi",
    "tamil", "telugu", "urdu"
]

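# Prepare audio input: float tensor of shape [1, num_samples] (16 kHz, mono; 3-15 seconds recommended).
# The zeros tensor below is a 3-second silent placeholder; replace it with real audio.
max_audio_length = 15 * 16000  # 15 seconds at 16 kHz
audio_data = torch.zeros([1, 3 * 16000]).float().to(device)
audio_data = audio_data[:, :max_audio_length]  # keep at most the first 15 seconds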

# Perform inference
with torch.no_grad():
    logits, embeddings = model(audio_data, return_feature=True)

# Compute language probabilities
language_probs = F.softmax(logits, dim=1)
predicted_language_idx = torch.argmax(language_probs, dim=1)[0].item()  # first (only) clip in the batch
predicted_language = label_list[predicted_language_idx]

print(f"Predicted language: {predicted_language}")
print(f"Confidence: {language_probs[0, predicted_language_idx].item():.4f}")
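The example above reports only the single most likely language. When the top probability is low, it can help to inspect the runner-up languages as well. The helper below is a hypothetical sketch (not part of the repository) built on `torch.topk`, assuming `logits` has shape [batch, num_classes] as in the example.

```python
import torch
import torch.nn.functional as F

def top_k_languages(logits, labels, k=3):
    """Return, for each clip, the k most probable (label, probability) pairs.

    logits: tensor of shape [batch, num_classes], e.g. the logits returned
    by the model in the example above. labels: list of class names.
    """
    probs = F.softmax(logits, dim=-1)
    top_probs, top_idx = probs.topk(k, dim=-1)  # both [batch, k], sorted descending
    return [
        [(labels[i], p) for i, p in zip(idx_row.tolist(), prob_row.tolist())]
        for idx_row, prob_row in zip(top_idx, top_probs)
    ]

# Synthetic logits for three classes; "hindi" ranks first, then "marathi".
fake_logits = torch.tensor([[2.0, 1.0, 0.0]])
print(top_k_languages(fake_logits, ["hindi", "marathi", "urdu"], k=2))
```

With the real model, pass the model's logits and `label_list` instead of the synthetic values.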

Audio Preprocessing Requirements

For optimal model performance, ensure your audio input meets the following specifications:

  • Sample rate: 16 kHz
  • Channels: mono (single channel)
  • Duration: 3-15 seconds (recommended)
    • Audio shorter than 3 seconds may yield unreliable predictions
    • Audio longer than 15 seconds should be truncated to the first 15 seconds (the usage example above does this via max_audio_length)
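The requirements above can be applied in one preprocessing step. The sketch below is an assumption-based helper (`prepare_audio` is a hypothetical name, not part of the repository): it downmixes to mono by averaging channels, resamples to 16 kHz with `torchaudio.functional.resample` only when needed, keeps the first 15 seconds, and flags clips shorter than 3 seconds.

```python
import torch

TARGET_SR = 16000  # required sample rate
MIN_SECONDS = 3    # shorter clips may give unreliable predictions
MAX_SECONDS = 15   # longer clips are cut to the first 15 seconds

def prepare_audio(waveform, sample_rate):
    """Convert a [channels, samples] or [samples] waveform into a [1, samples]
    mono 16 kHz float tensor of at most 15 s.

    Returns (audio, long_enough), where long_enough is False for clips under 3 s.
    """
    if waveform.dim() == 1:
        waveform = waveform.unsqueeze(0)                    # [samples] -> [1, samples]
    waveform = waveform.float().mean(dim=0, keepdim=True)   # average channels -> mono

    if sample_rate != TARGET_SR:
        # Imported lazily: torchaudio is only required when resampling is needed.
        import torchaudio.functional as AF
        waveform = AF.resample(waveform, orig_freq=sample_rate, new_freq=TARGET_SR)

    waveform = waveform[:, : MAX_SECONDS * TARGET_SR]       # keep the first 15 s
    long_enough = waveform.shape[1] >= MIN_SECONDS * TARGET_SR
    return waveform, long_enough
```

For example, with `waveform, sample_rate = torchaudio.load("clip.wav")` (the file name is illustrative), call `audio_data, ok = prepare_audio(waveform, sample_rate)` and move `audio_data` to `device` before inference.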

Model Architecture

This model uses LoRA (Low-Rank Adaptation) fine-tuning on top of Whisper Large v3:

| Parameter        | Value                   |
|------------------|-------------------------|
| Base model       | openai/whisper-large-v3 |
| Fine-tune method | LoRA                    |
| LoRA rank        | 64                      |
| Output classes   | 23                      |
| Hidden dim       | 256                     |
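For readers new to LoRA: the pretrained weight matrix W is frozen, and training learns two small matrices A (r × d_in) and B (d_out × r), so the layer computes W·x + (α/r)·B·A·x. The sketch below is a generic LoRA linear layer with r = 64 as in the table, plus a possible classification head. Assumptions not stated in this card: the scaling α = 128, which Whisper layers are adapted, mean pooling over encoder frames, and the head layout (Whisper-large-v3 encoder width 1280 → hidden 256 → 23 classes). This is an illustration, not the repository's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = base(x) + (alpha / r) * B(A(x)), base frozen."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: float = 128.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B starts at zero, so the wrapped layer initially matches the base layer.
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


class LanguageHead(nn.Module):
    """Hypothetical head: mean-pool encoder states, then d_model -> 256 -> 23."""

    def __init__(self, d_model: int = 1280, hidden_dim: int = 256, num_classes: int = 23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, encoder_states):                # [batch, frames, d_model]
        return self.net(encoder_states.mean(dim=1))   # [batch, num_classes]


# Trainable parameters added to one 1280 x 1280 projection at rank 64.
layer = LoRALinear(nn.Linear(1280, 1280), r=64)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 163840
```

The count illustrates why LoRA is cheap: 64 × (1280 + 1280) = 163,840 trainable values per adapted matrix, versus about 1.6 million in the frozen 1280 × 1280 weight.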

Model Performance

The model has been trained and validated on diverse datasets including:

  • AI4Bharat IndicVoices
  • Mozilla Common Voice 11.0

Training targets classification accuracy across all supported languages; per-language evaluation figures are not published in this card.
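To measure performance on your own labelled clips, a simple check is overall accuracy plus accuracy broken down by reference language. A minimal plain-Python sketch; the function name and input format (parallel lists of reference and predicted label strings) are assumptions.

```python
from collections import Counter

def per_language_accuracy(references, predictions):
    """Return (overall_accuracy, {language: accuracy}) keyed by reference label."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    total = Counter(references)
    correct = Counter(ref for ref, pred in zip(references, predictions) if ref == pred)
    per_lang = {lang: correct[lang] / n for lang, n in total.items()}
    overall = sum(correct.values()) / len(references) if references else 0.0
    return overall, per_lang

refs = ["hindi", "hindi", "tamil", "urdu"]
preds = ["hindi", "urdu", "tamil", "urdu"]
print(per_language_accuracy(refs, preds))
# (0.75, {'hindi': 0.5, 'tamil': 1.0, 'urdu': 1.0})
```

Reporting the per-language breakdown alongside the overall figure makes it easier to spot weak performance on lower-resource languages that a single average can hide.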

Enterprise Integration

This model can be integrated into SaaS platforms and enterprise applications. For production deployments, consider:

  • Batch processing capabilities for high-throughput scenarios
  • GPU acceleration for real-time inference
  • Model quantization for resource-constrained environments
  • API wrapper implementation for microservices architecture
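For the batch-processing point above, one approach is to zero-pad clips of different lengths to a common length and run them through the model in fixed-size batches. This sketch assumes that the model accepts a [batch, samples] float tensor and returns (logits, embeddings) when called with return_feature=True, as in the single-clip example, and that trailing zero padding does not materially change predictions. Neither assumption is confirmed here, so validate on your own data. `pad_batch` and `batched_predict` are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

MAX_SAMPLES = 15 * 16000  # 15 s at 16 kHz, matching the usage example

def pad_batch(clips, max_samples=MAX_SAMPLES):
    """Stack 1-D waveforms into [batch, samples], truncating then zero-padding."""
    clips = [c[:max_samples] for c in clips]
    length = max(c.shape[0] for c in clips)
    return torch.stack([F.pad(c, (0, length - c.shape[0])) for c in clips])

@torch.no_grad()
def batched_predict(model, clips, labels, batch_size=8, device="cpu"):
    """Return one predicted label per clip, running batch_size clips at a time."""
    predictions = []
    for start in range(0, len(clips), batch_size):
        batch = pad_batch(clips[start:start + batch_size]).to(device)
        logits, _ = model(batch, return_feature=True)  # same call as the single-clip example
        predictions.extend(labels[i] for i in logits.argmax(dim=-1).tolist())
    return predictions
```

Larger batch sizes raise GPU throughput at the cost of memory; padding every clip in a batch to the longest one wastes some compute, so grouping clips of similar length together reduces that overhead.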

Responsible Use

Users are expected to:

  • Respect privacy and consent of data subjects
  • Comply with applicable data protection laws and regulations
  • Use the model in accordance with ethical AI practices
  • Ensure appropriate data handling and security measures

License

This model is released under the OpenRAIL license, enabling both research and commercial use while maintaining responsible AI principles.

About Trustt.com

Trustt.com is a leading SaaS platform committed to advancing speech technology and multilingual AI solutions. This model represents our ongoing contribution to the open-source community and our dedication to making advanced language technologies accessible to developers and enterprises.

Support

For technical support, feature requests, or commercial licensing inquiries, please visit Trustt.com or contact our support team.

Contributing

We welcome contributions from the community. Please refer to our contribution guidelines for more information on how to participate in improving this model.

Model size: 2B params (Safetensors weights, tensor type F32)