
Sarvam-30B 4-Bit (BitsAndBytes)

This repository provides a 4-bit NF4 quantized version of the base model sarvamai/sarvam-30b, produced with bitsandbytes. Quantization significantly reduces GPU memory usage while largely preserving generation quality.

  • Base model: sarvamai/sarvam-30b
  • Architecture: SarvamMoEForCausalLM


Quantization Details

Quantization method: BitsAndBytes 4-bit (NF4)

Configuration used:

  • load_in_4bit = True
  • bnb_4bit_quant_type = nf4
  • bnb_4bit_compute_dtype = float16
  • bnb_4bit_use_double_quant = True

Approximate GPU memory usage:

  Model           GPU VRAM
  FP16 original   ~60 GB
  4-bit NF4       ~16-18 GB

This version is recommended for most users who want to run the model with reduced hardware requirements.
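A back-of-envelope check of the figures above, assuming roughly 30B parameters (2 bytes per parameter for FP16, about 0.5 bytes per parameter for 4-bit NF4 before quantization overhead):

```python
# Rough memory estimate; the per-parameter byte counts are the standard
# sizes for FP16 and 4-bit storage, not measurements from this repo.
params = 30e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param -> matches the ~60 GB figure
nf4_gb = params * 0.5 / 1e9   # 0.5 bytes/param -> ~15 GB; overhead brings it to ~16-18 GB
```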


Installation

Install the required libraries.

pip install transformers accelerate bitsandbytes torch safetensors

Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# The checkpoint is already 4-bit; bitsandbytes must be installed to load it.
model = AutoModelForCausalLM.from_pretrained(
    "neuralnets/sarvam-30b-4bit",
    device_map="auto",        # let accelerate place layers across available devices
    trust_remote_code=True    # required by the Sarvam architecture
)

tokenizer = AutoTokenizer.from_pretrained(
    "neuralnets/sarvam-30b-4bit",
    trust_remote_code=True
)

Example Inference

prompt = "Explain mixture of experts in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200  # cap on generated tokens; decoding is greedy by default
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Recommended GPUs:

  • A100 40GB or 80GB
  • RTX 4090
  • RTX 3090 (with offloading)

CPU RAM recommendation:

  • 32 GB or higher
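On a 24 GB card such as the RTX 3090, part of the model can be offloaded to CPU RAM. A hedged sketch: the `max_memory` map below (the exact limits are illustrative, not tuned values) is passed to `from_pretrained` alongside `device_map="auto"` so accelerate spills overflow layers to the CPU:

```python
# Illustrative memory caps: GPU 0 is limited to 22 GiB, with up to
# 30 GiB of CPU RAM available for offloaded layers.
max_memory = {0: "22GiB", "cpu": "30GiB"}

# Used like this (not executed here, since it downloads the full checkpoint):
# model = AutoModelForCausalLM.from_pretrained(
#     "neuralnets/sarvam-30b-4bit",
#     device_map="auto",
#     max_memory=max_memory,
#     trust_remote_code=True,
# )
```

Offloaded layers are noticeably slower, so keep as much of the model on the GPU as the card allows.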

Notes

  • This model uses bitsandbytes quantization integrated into Hugging Face Transformers.
  • The Sarvam architecture requires trust_remote_code=True.
  • Designed primarily for inference workloads.

Base Model

Original model:

sarvamai/sarvam-30b

Please refer to the base repository for model training details and benchmarks.


License

This repository distributes a quantized derivative of the original model.

Users must follow the license of the upstream model:

sarvamai/sarvam-30b
