sarvam-105b-AWQ

Model Overview

  • Model Architecture: sarvamai/sarvam-105b
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: AWQ
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
  • Version: 1.0
  • Model Developers: QuantTrio

This model was quantized with llm-compressor, using sarvamai/indivibe as the calibration dataset.
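For reference, llm-compressor runs are typically driven by a recipe. A representative sketch is below; the bit width, group size, and ignore list are illustrative assumptions, not the exact settings used for this checkpoint:

```yaml
# Illustrative llm-compressor AWQ recipe (assumed settings, not the ones
# actually used for this model): 4-bit asymmetric weight quantization of
# Linear layers, keeping lm_head in full precision.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: int
            symmetric: false
            group_size: 128
```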

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

1: Hot-patch (easy)

Run hotpatch_vllm.py. The script will:

  • install vllm==0.15.0
  • add two model entries to vLLM's registry.py
  • download the model executors for sarvam-105b

2: Run vLLM

export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/QuantTrio/sarvam-105b-AWQ \
    --served-model-name MY_MODEL \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768  \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
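Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal stdlib-only client sketch follows; the MY_MODEL name and the localhost:8000 endpoint mirror the --served-model-name, --host, and --port flags above:

```python
import json
import urllib.request

def build_request(prompt: str) -> urllib.request.Request:
    # Build an OpenAI-compatible chat completion request for the vLLM
    # server started above. "MY_MODEL" matches --served-model-name.
    payload = {
        "model": "MY_MODEL",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    # Send the request and extract the assistant's reply.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

The same endpoint also works with any OpenAI-compatible client library pointed at http://localhost:8000/v1.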

Model Files

File Size: 74GiB
Last Updated: 2026-03-12

Logs

2026-03-12
1. Initial commit
Safetensors

  • Model size: 19B params
  • Tensor types: F32, I64, I32
