Libraries

How to use dheeyantra/dhee-nxtgen-qwen3-indic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dheeyantra/dhee-nxtgen-qwen3-indic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dheeyantra/dhee-nxtgen-qwen3-indic")
model = AutoModelForCausalLM.from_pretrained("dheeyantra/dhee-nxtgen-qwen3-indic")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use dheeyantra/dhee-nxtgen-qwen3-indic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dheeyantra/dhee-nxtgen-qwen3-indic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dheeyantra/dhee-nxtgen-qwen3-indic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/dheeyantra/dhee-nxtgen-qwen3-indic

SGLang

How to use dheeyantra/dhee-nxtgen-qwen3-indic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dheeyantra/dhee-nxtgen-qwen3-indic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dheeyantra/dhee-nxtgen-qwen3-indic",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dheeyantra/dhee-nxtgen-qwen3-indic" \
        --host 0.0.0.0 \
        --port 30000


Dhee-NxtGen-Qwen3-Indic (4B)

Model Description

Dhee-NxtGen-Qwen3-Indic is a single, unified 4B-parameter multilingual large language model developed by DheeYantra in collaboration with NxtGen Cloud Technologies Pvt. Ltd.

Built on the Qwen3-4B architecture, the model is designed to support assistant-style conversations, reasoning, and function-calling-compatible workflows across 14 Indian (Indic) languages within one shared model.

The model is optimized for native-script generation, consistent multilingual behavior, and cross-lingual generalization.

Supported Languages

This single model supports the following Indic languages:

Hindi (hi)
Bengali (bn)
Tamil (ta)
Telugu (te)
Malayalam (ml)
Gujarati (gu)
Kannada (kn)
Marathi (mr)
Odia (or)
Punjabi (pa)
Assamese (as)
Maithili (mai)
Sanskrit (sa)
Sindhi (sd)

Best results are achieved when prompts are written entirely in the target language.

Key Features

Single multilingual 4B model (no per-language checkpoints)
Fluent, native-script text generation across 14 Indic languages
Optimized for assistant-style and reasoning-based dialogue
Supports summarization, Q&A, and long-form generation
Compatible with function-calling style prompting
Fully compatible with Hugging Face Transformers
Ready for high-throughput inference using vLLM

Example Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
# Device placement is handled by device_map="auto" when the model is loaded.
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"

# 2. Load Model and Tokenizer
print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# 3. Define the Prompt
# Using the ChatML format expected by Qwen-based architectures
prompt = """<|im_start|>system
You are a helpful multilingual assistant.<|im_end|>
<|im_start|>user
क्या आप मेरे लिए एक अपॉइंटमेंट बुक कर सकते हैं? 
अगर हाँ, तो कृपया मुझसे ज़रूरी जानकारी जैसे तारीख, समय और उद्देश्य पूछिए।<|im_end|>
<|im_start|>assistant
"""

# 4. Process and Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("Generating response...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# 5. Decode and Print
# Slice off the prompt tokens so only the newly generated text
# (the assistant's reply) is decoded.
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

print("-" * 30)
print(f"Assistant: {response}")
print("-" * 30)
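The hand-written prompt in the example above follows the ChatML layout. As a minimal sketch of that layout, the helper below renders a list of role/content messages into the same string form. `to_chatml` is a hypothetical name introduced here for illustration; it is not part of Transformers or of this model.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages in the ChatML layout (illustrative helper)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "नमस्ते, आप कौन हैं?"},
]
print(to_chatml(messages))
```

In practice, `tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)` builds the prompt from the template shipped with the tokenizer, which may differ in detail from this sketch and is usually the safer choice.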

Function/Tool Calling Example Usage

import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- 1. MODEL SETUP ---
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# --- 2. TOOLS & SYSTEM PROMPT ---
tools = [{
    "name": "book_appointment",
    "description": "Book an appointment for the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
            "time": {"type": "string", "description": "Time in HH:MM (24h) format"},
            "purpose": {"type": "string", "description": "The medical reason or department"}
        },
        "required": ["date", "time", "purpose"]
    }
}]

# We provide the current date so the model can resolve "tomorrow" or "next Monday"
SYSTEM_PROMPT = f"""You are a helpful AI assistant.
Today's Date: 2026-01-08 (Thursday).

Available Tools:
{json.dumps(tools, indent=2)}

Rules:
1. If details (date, time, purpose) are missing, ask the user in Hindi.
2. If all details are present, output ONLY a <tool_call> JSON.
3. After a tool result is provided, confirm the booking to the user in Hindi."""

# --- 3. BACKEND FUNCTION ---
def execute_booking(date, time, purpose):
    # Simulated backend logic
    if "10:00" in time: # Simulate a busy slot
        return {"status": "error", "message": "यह समय पहले से बुक है। कृपया कोई और समय चुनें।"}
    return {"status": "success", "id": "APP-9921", "doctor": "Dr. Verma"}

# --- 4. THE INTERACTION ENGINE ---
def run_conversation(user_input, history):
    # Add user input to history
    history.append({"role": "user", "content": user_input})
    
    # Construct ChatML prompt
    prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    for msg in history:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"

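    # Generate (do_sample is set explicitly so the temperature setting is applied)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()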

    # Check if the model wants to call a tool
    tool_match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    
    if tool_match:
        print(f"\n[MODEL REQUESTED TOOL]: {response}")
        call_data = json.loads(tool_match.group(1).strip())
        
        # Execute the function
        result = execute_booking(**call_data['arguments'])
        print(f"[TOOL RESULT]: {result}")
        
        # Feed result back to model for final response
        history.append({"role": "assistant", "content": response})
        history.append({"role": "system", "content": f"Tool Result: {json.dumps(result)}"})
        
        # Final confirmation generation
        return run_conversation("Please confirm the result to the user.", history)
    
    return response

# --- 5. TEST SCENARIO ---
chat_history = []

print("--- Chatbot Started (Today is 2026-01-08) ---")

# Step 1: User provides partial info
query_1 = "मेरे लिए डेंटिस्ट का अपॉइंटमेंट बुक करें।"
print(f"\nUser: {query_1}")
res_1 = run_conversation(query_1, chat_history)
chat_history.append({"role": "assistant", "content": res_1})
print(f"Assistant: {res_1}")

# Step 2: User provides the rest
query_2 = "कल दोपहर 2 बजे।"
print(f"\nUser: {query_2}")
res_2 = run_conversation(query_2, chat_history)
print(f"Assistant: {res_2}")
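The example above passes the model output straight to `json.loads`, which raises an exception if the model emits malformed JSON. A more defensive parser might look like the sketch below. `parse_tool_call` is a hypothetical helper, and the assumption that each call carries a `name` string and an `arguments` object is inferred from the example, not from a documented output format.

```python
import json
import re
from typing import Any, Dict, Optional

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Return {"name": ..., "arguments": {...}}, or None if no valid call is found."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None  # Malformed JSON: treat the output as a plain text reply.
    if not isinstance(data, dict) or not isinstance(data.get("name"), str):
        return None
    arguments = data.get("arguments", {})
    if not isinstance(arguments, dict):
        return None
    return {"name": data["name"], "arguments": arguments}


sample = (
    '<tool_call>{"name": "book_appointment", "arguments": '
    '{"date": "2026-01-09", "time": "14:00", "purpose": "dentist"}}</tool_call>'
)
print(parse_tool_call(sample))
print(parse_tool_call("<tool_call>{not json}</tool_call>"))  # None
```

Returning `None` for anything unparseable lets the caller fall back to showing the raw reply to the user instead of crashing mid-conversation.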

Prompting Guidelines

Use pure native-language prompts for best fluency
Avoid heavy code-mixing (e.g., Hinglish-heavy inputs)
Include a system prompt to stabilize multilingual behavior
Ask explicitly for step-by-step reasoning when required
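
One rough way to act on the code-mixing guideline is to measure how much of a prompt is written in Latin script before sending it to the model. The sketch below is a heuristic only: `latin_letter_ratio` and `is_code_mixed` are hypothetical helpers, and the 0.2 threshold is an arbitrary assumption, not a value recommended by the model authors.

```python
import unicodedata


def latin_letter_ratio(text: str) -> float:
    """Fraction of alphabetic characters in `text` that are Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def is_code_mixed(text: str, threshold: float = 0.2) -> bool:
    """Flag prompts whose share of Latin letters exceeds `threshold` (assumed value)."""
    return latin_letter_ratio(text) > threshold


for prompt in ["मुझे अपॉइंटमेंट चाहिए", "मुझे appointment चाहिए"]:
    print(prompt, "->", "code-mixed" if is_code_mixed(prompt) else "native script")
```

This check only catches Latin-script mixing. English loanwords written in a native script, such as अपॉइंटमेंट, pass through unflagged.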

Intended Uses & Limitations

Intended Uses

Multilingual Indic chatbots and AI assistants
Education, governance, and public-sector AI applications
Content generation and summarization in Indian languages
Cross-lingual conversational and reasoning systems

Limitations

May occasionally hallucinate or produce inaccurate facts
Performance may vary slightly across languages
Not intended for medical, legal, or safety-critical use cases
Code-mixed inputs may reduce output quality

vLLM / High-Performance Serving

Requirements

NVIDIA GPU with compute capability ≥ 8.0 (A100 / H100 recommended)
PyTorch 2.1+ with CUDA installed
V100 (sm70) GPUs are not supported for vLLM GPU inference
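
The compute-capability requirement above can be checked from Python before installing anything else. This is a minimal sketch: `meets_requirement` and `local_gpu_capability` are hypothetical helper names, while `torch.cuda.get_device_capability` is the PyTorch call that reports the `(major, minor)` pair. Only the first GPU is inspected.

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (8, 0)  # From the requirement above (A100 is 8.0, H100 is 9.0).


def meets_requirement(
    capability: Tuple[int, int], minimum: Tuple[int, int] = MIN_CAPABILITY
) -> bool:
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


def local_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return the first GPU's compute capability, or None if unavailable."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = local_gpu_capability()
if cap is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
else:
    verdict = "OK" if meets_requirement(cap) else "below 8.0"
    print(f"Compute capability {cap[0]}.{cap[1]}: {verdict}")
```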

Installation

pip install torch transformers vllm sentencepiece

Run vLLM Server

vllm serve dheeyantra/dhee-nxtgen-qwen3-indic --host 0.0.0.0 --port 8000
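
Once the server is running, it exposes an OpenAI-compatible chat completions endpoint. The sketch below queries it using only the Python standard library. It assumes the server is reachable at `http://localhost:8000`; `build_payload` and `chat` are hypothetical helper names, and the `max_tokens` default is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # Assumption: server runs on the same machine.
MODEL = "dheeyantra/dhee-nxtgen-qwen3-indic"


def build_payload(user_text: str, max_tokens: int = 256) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def chat(user_text: str) -> str:
    """Send one user message and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, a call such as `print(chat("भारत की राजधानी क्या है?"))` prints the model's reply. Nothing is sent until `chat` is called.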

License

Released under the Apache 2.0 License.

Developed by DheeYantra in collaboration with NxtGen Cloud Technologies Pvt. Ltd.
