!!! This is the GGUF version of Sarvam-30B !!!
Download the original weights here!
Index
- Introduction
- Architecture
- Benchmarks
- Knowledge & Coding
- Reasoning & Math
- Agentic
- Inference
- Footnote
- Citation
Introduction
Sarvam-30B is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.
A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.
Sarvam-30B is open-sourced under the Apache License. For more details, see our blog.
Architecture
The 30B MoE model is designed for throughput and memory efficiency, which it achieves through fewer layers, grouped KV attention, and smaller experts. It uses 19 layers, a dense FFN intermediate_size of 8192, a moe_intermediate_size of 1024, top-6 routing, grouped KV heads (num_key_value_heads=4), and a very high rope_theta (8e6) for long-context stability without RoPE scaling. It has 128 experts plus a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
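To make the routing numbers concrete, here is a minimal sketch of top-6 selection over 128 experts with a routed scaling factor of 2.5. The gating function (sigmoid), the selection-only bias used for load balancing, and the normalization over the selected experts are assumptions for illustration; the model card does not specify them, and the shared expert is not modeled here.

```python
import math
import random

NUM_EXPERTS = 128      # routed experts (from the architecture description)
TOP_K = 6              # top-6 routing
ROUTED_SCALING = 2.5   # routed scaling factor


def route(logits, bias):
    """Pick TOP_K experts for one token and return (expert_ids, weights).

    Assumed, not confirmed by the model card: sigmoid gating, a per-expert
    bias that affects selection only, and weights normalized over the
    selected experts before scaling.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The bias nudges selection toward under-used experts without changing
    # the mixing weights (the idea behind auxiliary-loss-free balancing).
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:TOP_K]
    total = sum(scores[i] for i in chosen)
    weights = [ROUTED_SCALING * scores[i] / total for i in chosen]
    return chosen, weights


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # hypothetical router logits
bias = [0.0] * NUM_EXPERTS                                  # hypothetical balancing bias
experts, weights = route(logits, bias)
print(experts, weights)
```

Under these assumptions each token activates 6 of the 128 routed experts, and the mixing weights sum to the routed scaling factor.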
Benchmarks
Knowledge & Coding
| Benchmark | Sarvam-30B | Gemma 27B It | Mistral-3.2-24B | OLMo 3.1 32B Think | Nemotron-3-Nano-30B-A3B | Qwen3-30B-Thinking-2507 | GLM 4.7 Flash | GPT-OSS-20B |
|---|---|---|---|---|---|---|---|---|
| Math500 | 97.0 | 87.4 | 69.4 | 96.2 | 98.0 | 97.6 | 97.0 | 94.2 |
| HumanEval | 92.1 | 88.4 | 92.9 | 95.1 | 97.6 | 95.7 | 96.3 | 95.7 |
| MBPP | 92.7 | 81.8 | 78.3 | 58.7 | 91.9 | 94.3 | 91.8 | 95.3 |
| Live Code Bench v6 | 70.0 | 28.0 | 26.0 | 73.0 | 68.3 | 66.0 | 64.0 | 61.0 |
| MMLU | 85.1 | 81.2 | 80.5 | 86.4 | 84.0 | 88.4 | 86.9 | 85.3 |
| MMLU Pro | 80.0 | 68.1 | 69.1 | 72.0 | 78.3 | 80.9 | 73.6 | 75.0 |
| MILU | 76.8 | 69.2 | 67.9 | 69.9 | 64.8 | 82.6 | 75.6 | 73.7 |
| Arena Hard v2 | 49.0 | 50.1 | 43.1 | 42.0 | 67.7 | 72.1 | 58.1 | 62.9 |
| Writing Bench | 78.7 | 71.4 | 70.3 | 75.7 | 83.7 | 85.0 | 79.2 | 79.1 |
Reasoning & Math
| Benchmark | Sarvam-30B | OLMo 3.1 32B | Nemotron-3-Nano-30B | Qwen3-30B-Thinking-2507 | GLM 4.7 Flash | GPT-OSS-20B |
|---|---|---|---|---|---|---|
| GPQA Diamond | 66.5 | 57.5 | 73.0 | 73.4 | 75.2 | 71.5 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 78.1 (81.7) | 89.1 (99.2) | 85.0 (-) | 91.6 (-) | 91.7 (98.7) |
| HMMT (Feb 25) | 73.3 | 51.7 | 85.0 | 71.4 | 85.0 | 76.7 |
| HMMT (Nov 25) | 74.2 | 58.3 | 75.0 | 73.3 | 81.7 | 68.3 |
| Beyond AIME | 58.3 | 48.5 | 64.0 | 61.0 | 60.0 | 46.0 |
Agentic
| Benchmark | Sarvam-30B | Nemotron-3-Nano-30B | Qwen3-30B-Thinking-2507 | GLM 4.7 Flash | GPT-OSS-20B |
|---|---|---|---|---|---|
| BrowseComp | 35.5 | 23.8 | 2.9 | 42.8 | 28.3 |
| SWE Bench Verified | 34.0 | 38.8 | 22.0 | 59.2 | 34.0 |
| τ² Bench (avg.) | 45.7 | 49.0 | 47.7 | 79.5 | 48.7 |
See footnote for evaluation details.
Inference
Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
Download the model (all shards)
huggingface-cli download sarvamai/sarvam-30b-gguf --local-dir sarvam-30b-gguf
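Because the model is split into six files, it can help to confirm that every shard arrived before starting llama.cpp. The sketch below derives the expected file names from the first shard and reports any that are missing. It assumes the remaining shards follow the same numbering pattern as the first one; `expected_shards` and `missing_shards` are hypothetical helper names.

```python
from pathlib import Path


def expected_shards(first_shard: str) -> list[str]:
    """Derive all shard names from the first one ('...-00001-of-0000N.gguf')."""
    stem, suffix = first_shard.rsplit("-00001-of-", 1)
    count_text = suffix.removesuffix(".gguf")  # e.g. "00006"
    total = int(count_text)
    width = len(count_text)
    return [f"{stem}-{i:0{width}d}-of-{suffix}" for i in range(1, total + 1)]


def missing_shards(model_dir: str, first_shard: str) -> list[str]:
    """Return the expected shard names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_shards(first_shard) if not (root / name).is_file()]


FIRST = "sarvam-30b-Q4_K_M.gguf-00001-of-00006.gguf"
print(missing_shards("sarvam-30b-gguf", FIRST))
```

An empty list means all shards are in place; llama.cpp is then pointed at the first shard only.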
Run interactive chat
./build/bin/llama-cli \
-m sarvam-30b-gguf/sarvam-30b-Q4_K_M.gguf-00001-of-00006.gguf \
-c 4096 \
-n 512 \
-p "You are a helpful assistant." \
--conversation
OpenAI-compatible API server
./build/bin/llama-server \
-m sarvam-30b-gguf/sarvam-30b-Q4_K_M.gguf-00001-of-00006.gguf \
-c 4096 \
--host 0.0.0.0 \
--port 8080
Then query it:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
"temperature": 0.8,
"max_tokens": 512
}'
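The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the previous step is listening on localhost:8080 and returns the usual OpenAI-style response shape (`choices[0].message.content`); `build_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches --port 8080 in the server command


def build_request(prompt: str, temperature: float = 0.8, max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request equivalent to the curl call."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Explain quantum computing in simple terms.")` returns the model's reply as a string.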
Footnote
- General settings: All benchmarks are evaluated with a maximum context length of 65,536 tokens.
- Reasoning & Math benchmarks (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT, HumanEval, MBPP): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
- Coding & Knowledge benchmarks (Live Code Bench v6, Arena Hard v2, IF Eval): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
- Writing Bench: Responses generated using official Writing-Bench parameters: temperature=0.7, top_p=0.8, top_k=20, max_length=16000. Scoring performed using the official Writing-Bench critic model with: temperature=1.0, top_p=0.95, max_length=2048.
- Agentic benchmarks (BrowseComp, SWE Bench Verified, τ² Bench): Evaluated with temperature=0.5, top_p=1.0, max_new_tokens=32768.
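For anyone reproducing these runs, the generation settings above can be kept in a small lookup table. This is only a convenience sketch: the group keys and the `settings_for` helper are hypothetical names, the values are copied from the footnote, and the Writing-Bench critic settings are left out.

```python
# Generation settings per benchmark group, copied from the footnote.
GENERATION_SETTINGS = {
    "reasoning_math": {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "coding_knowledge": {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "writing_bench": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_length": 16000},
    "agentic": {"temperature": 0.5, "top_p": 1.0, "max_new_tokens": 32768},
}

# Benchmark names grouped exactly as the footnote groups them.
BENCHMARK_GROUPS = {
    "reasoning_math": ["Math500", "MMLU", "MMLU Pro", "GPQA Diamond", "AIME 25",
                       "Beyond AIME", "HMMT", "HumanEval", "MBPP"],
    "coding_knowledge": ["Live Code Bench v6", "Arena Hard v2", "IF Eval"],
    "writing_bench": ["Writing Bench"],
    "agentic": ["BrowseComp", "SWE Bench Verified", "τ² Bench"],
}


def settings_for(benchmark: str) -> dict:
    """Return a copy of the generation settings for a benchmark name."""
    for group, names in BENCHMARK_GROUPS.items():
        if benchmark in names:
            return dict(GENERATION_SETTINGS[group])
    raise KeyError(f"unknown benchmark: {benchmark}")


print(settings_for("SWE Bench Verified"))
```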
Citation
@misc{sarvam_sovereign_models,
title = {Introducing Sarvam's Sovereign Models},
author = {{Sarvam Foundation Models Team}},
year = {2026},
howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
note = {Accessed: 2026-03-03}
}
