Dhee-NxtGen-Qwen3-Indic (4B)
Model Description
Dhee-NxtGen-Qwen3-Indic is a single, unified 4B-parameter multilingual large language model developed by DheeYantra in collaboration with NxtGen Cloud Technologies Pvt. Ltd.
Built on the Qwen3-4B architecture, the model is designed to support assistant-style conversations, reasoning, and function-calling–compatible workflows across 14 Indian (Indic) languages within one shared model.
The model is optimized for native-script generation, consistent multilingual behavior, and cross-lingual generalization.
Supported Languages
This single model supports the following Indic languages:
- Hindi (hi)
- Bengali (bn)
- Tamil (ta)
- Telugu (te)
- Malayalam (ml)
- Gujarati (gu)
- Kannada (kn)
- Marathi (mr)
- Odia (or)
- Punjabi (pa)
- Assamese (as)
- Maithili (mai)
- Sanskrit (sa)
- Sindhi (sd)
Best results are achieved when prompts are written entirely in the target language.
Key Features
- Single multilingual 4B model (no per-language checkpoints)
- Fluent, native-script text generation across 14 Indic languages
- Optimized for assistant-style and reasoning-based dialogue
- Supports summarization, Q&A, and long-form generation
- Compatible with function-calling style prompting
- Fully compatible with Hugging Face Transformers
- Ready for high-throughput inference using vLLM
Example Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"

# 2. Load Model and Tokenizer
print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",  # places the weights on the GPU when one is available
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# 3. Define the Prompt
# Using the ChatML format expected by Qwen-based architectures
prompt = """<|im_start|>system
You are a helpful multilingual assistant.<|im_end|>
<|im_start|>user
क्या आप मेरे लिए एक अपॉइंटमेंट बुक कर सकते हैं?
अगर हाँ, तो कृपया मुझसे ज़रूरी जानकारी जैसे तारीख, समय और उद्देश्य पूछिए।<|im_end|>
<|im_start|>assistant
"""

# 4. Process and Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print("Generating response...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# 5. Decode and Print
# Decode only the newly generated tokens (the assistant's reply).
# Slicing off the prompt is more robust than splitting the text on "assistant",
# which breaks whenever the reply itself contains that word.
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print("-" * 30)
print(f"Assistant: {response}")
print("-" * 30)
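The hand-written prompt above follows the ChatML layout. As a minimal sketch, the same string can be assembled from a list of role/content messages, which is easier to extend for multi-turn chats. `build_chatml_prompt` is a hypothetical helper name introduced here for illustration; it is not part of the model or of Transformers. If the repository ships a chat template, `tokenizer.apply_chat_template(messages, add_generation_prompt=True)` is likely the more reliable route.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Assemble a ChatML prompt string from role/content messages (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "क्या आप मेरे लिए एक अपॉइंटमेंट बुक कर सकते हैं?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The resulting `prompt` string can be passed to `tokenizer(prompt, return_tensors="pt")` exactly as in the example above.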
Function/Tool Calling Example Usage
import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- 1. MODEL SETUP ---
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# --- 2. TOOLS & SYSTEM PROMPT ---
tools = [{
    "name": "book_appointment",
    "description": "Book an appointment for the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
            "time": {"type": "string", "description": "Time in HH:MM (24h) format"},
            "purpose": {"type": "string", "description": "The medical reason or department"}
        },
        "required": ["date", "time", "purpose"]
    }
}]

# We provide the current date so the model can resolve "tomorrow" or "next Monday"
SYSTEM_PROMPT = f"""You are a helpful AI assistant.
Today's Date: 2026-01-08 (Thursday).
Available Tools:
{json.dumps(tools, indent=2)}
Rules:
1. If details (date, time, purpose) are missing, ask the user in Hindi.
2. If all details are present, output ONLY a <tool_call> JSON.
3. After a tool result is provided, confirm the booking to the user in Hindi."""

# --- 3. BACKEND FUNCTION ---
def execute_booking(date, time, purpose):
    # Simulated backend logic
    if "10:00" in time:  # Simulate a busy slot
        return {"status": "error", "message": "यह समय पहले से बुक है। कृपया कोई और समय चुनें।"}
    return {"status": "success", "id": "APP-9921", "doctor": "Dr. Verma"}

# --- 4. THE INTERACTION ENGINE ---
def run_conversation(user_input, history):
    # Add user input to history
    history.append({"role": "user", "content": user_input})

    # Construct ChatML prompt
    prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    for msg in history:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"

    # Generate (do_sample=True is set explicitly so the temperature takes effect)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)

    # Decode only the newly generated tokens; splitting on "assistant" would
    # break because the system prompt and replies can contain that word
    prompt_length = inputs["input_ids"].shape[-1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    # Check if the model wants to call a tool
    tool_match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if tool_match:
        print(f"\n[MODEL REQUESTED TOOL]: {response}")
        call_data = json.loads(tool_match.group(1).strip())

        # Execute the function
        result = execute_booking(**call_data['arguments'])
        print(f"[TOOL RESULT]: {result}")

        # Feed result back to model for final response
        # (ensure_ascii=False keeps Hindi messages readable instead of \u escapes)
        history.append({"role": "assistant", "content": response})
        history.append({"role": "system", "content": f"Tool Result: {json.dumps(result, ensure_ascii=False)}"})

        # Final confirmation generation
        return run_conversation("Please confirm the result to the user.", history)
    return response

# --- 5. TEST SCENARIO ---
chat_history = []
print("--- Chatbot Started (Today is 2026-01-08) ---")

# Step 1: User provides partial info
query_1 = "मेरे लिए डेंटिस्ट का अपॉइंटमेंट बुक करें।"
print(f"\nUser: {query_1}")
res_1 = run_conversation(query_1, chat_history)
chat_history.append({"role": "assistant", "content": res_1})
print(f"Assistant: {res_1}")

# Step 2: User provides the rest
query_2 = "कल दोपहर 2 बजे।"
print(f"\nUser: {query_2}")
res_2 = run_conversation(query_2, chat_history)
print(f"Assistant: {res_2}")
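Model output is not guaranteed to be well-formed, so the tool-call step above can fail on malformed JSON or missing arguments. The following is a minimal sketch of a more defensive parser. `parse_tool_call` and `REQUIRED_ARGS` are hypothetical names, and the `{"name": ..., "arguments": ...}` shape is an assumption: the example above only relies on the `arguments` key.

```python
import json
import re
from typing import Any, Dict, List, Optional

TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

# Hypothetical registry: tool name -> argument names that must be present
REQUIRED_ARGS: Dict[str, List[str]] = {"book_appointment": ["date", "time", "purpose"]}


def parse_tool_call(response: str, required: Dict[str, List[str]]) -> Optional[Dict[str, Any]]:
    """Return {"name", "arguments"} for a valid tool call, or None otherwise."""
    match = TOOL_CALL_PATTERN.search(response)
    if match is None:
        return None  # plain-text reply, no tool requested
    try:
        call = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None  # the model produced malformed JSON
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    arguments = call.get("arguments")
    if not isinstance(name, str) or name not in required or not isinstance(arguments, dict):
        return None  # unknown tool or wrong structure
    if any(key not in arguments for key in required[name]):
        return None  # a required argument is missing
    return {"name": name, "arguments": arguments}


sample = (
    '<tool_call>{"name": "book_appointment", "arguments": '
    '{"date": "2026-01-09", "time": "14:00", "purpose": "dentist"}}</tool_call>'
)
parsed = parse_tool_call(sample, REQUIRED_ARGS)
print(parsed)
```

When the function returns `None`, a caller could treat the response as ordinary text or ask the model to retry, rather than raising inside the conversation loop.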
Prompting Guidelines
- Use pure native-language prompts for best fluency
- Avoid heavy code-mixing (e.g., Hinglish-heavy inputs)
- Include a system prompt to stabilize multilingual behavior
- Ask explicitly for step-by-step reasoning when required
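One way to act on the code-mixing guideline is to estimate how much of a prompt is written in Latin script before sending it. The sketch below is a rough heuristic only: `latin_letter_ratio`, `is_heavily_code_mixed`, and the 0.3 threshold are illustrative assumptions, not values published for this model.

```python
import unicodedata


def latin_letter_ratio(text: str) -> float:
    """Fraction of alphabetic characters in `text` that belong to the Latin script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names of Latin letters contain the word "LATIN"
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


def is_heavily_code_mixed(text: str, threshold: float = 0.3) -> bool:
    """Flag prompts whose Latin-letter share exceeds an assumed threshold."""
    return latin_letter_ratio(text) > threshold


native_prompt = "मेरे लिए डेंटिस्ट का अपॉइंटमेंट बुक करें।"
mixed_prompt = "mujhe कल appointment चाहिए"
print(is_heavily_code_mixed(native_prompt), is_heavily_code_mixed(mixed_prompt))
```

A flagged prompt could be shown to the user with a suggestion to rephrase it in native script; the heuristic does not detect transliterated words written in the target script.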
Intended Uses & Limitations
Intended Uses
- Multilingual Indic chatbots and AI assistants
- Education, governance, and public-sector AI applications
- Content generation and summarization in Indian languages
- Cross-lingual conversational and reasoning systems
Limitations
- May occasionally hallucinate or produce inaccurate facts
- Performance may vary slightly across languages
- Not intended for medical, legal, or safety-critical use cases
- Code-mixed inputs may reduce output quality
vLLM / High-Performance Serving
Requirements
- NVIDIA GPU with compute capability ≥ 8.0 (A100 / H100 recommended)
- PyTorch 2.1+ with CUDA installed
- V100 (sm70) GPUs are not supported for vLLM GPU inference
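The compute-capability requirement can be checked before installing anything else. This is a minimal sketch that assumes PyTorch is already installed; `meets_requirement` and `detect_capability` are hypothetical helper names, and the 8.0 threshold is taken from the list above.

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (8, 0)  # threshold stated in the requirements above


def meets_requirement(capability: Tuple[int, int], minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Compare (major, minor) compute capability tuples."""
    return tuple(capability) >= tuple(minimum)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be determined."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA device detected (or PyTorch is not installed).")
else:
    status = "meets" if meets_requirement(capability) else "is below"
    print(f"GPU compute capability {capability[0]}.{capability[1]} {status} the 8.0 requirement.")
```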
Installation
pip install torch transformers vllm sentencepiece
Run vLLM Server
vllm serve dheeyantra/dhee-nxtgen-qwen3-indic --host 0.0.0.0 --port 8000
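Once the server is running, it exposes an OpenAI-compatible HTTP API. The sketch below calls the chat completions endpoint using only the Python standard library. It assumes the server is reachable at `localhost:8000`; `build_chat_request` and `chat` are hypothetical helper names, and the `max_tokens` and `temperature` values are illustrative.

```python
import json
import urllib.request
from typing import Dict, List

BASE_URL = "http://localhost:8000/v1"  # assumption: server runs locally on port 8000
MODEL_NAME = "dheeyantra/dhee-nxtgen-qwen3-indic"


def build_chat_request(messages: List[Dict[str, str]], max_tokens: int = 256,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages: List[Dict[str, str]]) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages), timeout=120) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "भारत की राजधानी क्या है?"}])
print(request.full_url)
# With a running server: print(chat([{"role": "user", "content": "भारत की राजधानी क्या है?"}]))
```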
License
Released under the Apache 2.0 License.
Developed by DheeYantra in collaboration with NxtGen Cloud Technologies Pvt. Ltd.