
Sarvam-30B 4-Bit (BitsAndBytes)

This repository provides a 4-bit NF4 quantized version of the base model sarvamai/sarvam-30b, produced with bitsandbytes. Quantization significantly reduces GPU memory usage while largely preserving generation quality.

  • Base model: sarvamai/sarvam-30b
  • Architecture: SarvamMoEForCausalLM


Quantization Details

Quantization method: BitsAndBytes 4-bit (NF4)

Configuration used:

  • load_in_4bit = True
  • bnb_4bit_quant_type = nf4
  • bnb_4bit_compute_dtype = float16
  • bnb_4bit_use_double_quant = True

Approximate GPU memory usage:

  Model           GPU VRAM
  FP16 original   ~60 GB
  4-bit NF4       ~16-18 GB

This version is recommended for most users who want to run the model with reduced hardware requirements.
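A back-of-envelope check of the figures above, assuming roughly 30B parameters (2 bytes per parameter for FP16, about 0.5 bytes per parameter for 4-bit NF4 before quantization overhead):

```python
# Rough memory estimate; the per-parameter byte counts are the standard
# sizes for FP16 and 4-bit storage, not measurements from this repo.
params = 30e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param -> matches the ~60 GB figure
nf4_gb = params * 0.5 / 1e9   # 0.5 bytes/param -> ~15 GB; overhead brings it to ~16-18 GB
```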


Installation

Install the required libraries.

pip install transformers accelerate bitsandbytes torch safetensors

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint is already 4-bit; bitsandbytes must be installed to load it.
model = AutoModelForCausalLM.from_pretrained(
    "neuralnets/sarvam-30b-4bit",
    device_map="auto",        # let accelerate place layers across available devices
    trust_remote_code=True    # required by the Sarvam architecture
)

tokenizer = AutoTokenizer.from_pretrained(
    "neuralnets/sarvam-30b-4bit",
    trust_remote_code=True
)

Example Inference

prompt = "Explain mixture of experts in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200  # cap on generated tokens; decoding is greedy by default
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Recommended GPUs:

  • A100 40GB or 80GB
  • RTX 4090
  • RTX 3090 (with offloading)

CPU RAM recommendation:

  • 32 GB or higher
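On a 24 GB card such as the RTX 3090, part of the model can be offloaded to CPU RAM. A hedged sketch: the `max_memory` map below (the exact limits are illustrative, not tuned values) is passed to `from_pretrained` alongside `device_map="auto"` so accelerate spills overflow layers to the CPU:

```python
# Illustrative memory caps: GPU 0 is limited to 22 GiB, with up to
# 30 GiB of CPU RAM available for offloaded layers.
max_memory = {0: "22GiB", "cpu": "30GiB"}

# Used like this (not executed here, since it downloads the full checkpoint):
# model = AutoModelForCausalLM.from_pretrained(
#     "neuralnets/sarvam-30b-4bit",
#     device_map="auto",
#     max_memory=max_memory,
#     trust_remote_code=True,
# )
```

Offloaded layers are noticeably slower, so keep as much of the model on the GPU as the card allows.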

Notes

  • This model uses bitsandbytes quantization integrated into Hugging Face Transformers.
  • The Sarvam architecture requires trust_remote_code=True.
  • Designed primarily for inference workloads.

Base Model

Original model:

sarvamai/sarvam-30b

Please refer to the base repository for model training details and benchmarks.


License

This repository distributes a quantized derivative of the original model.

Users must follow the license of the upstream model:

sarvamai/sarvam-30b
