BanglaLLM-7B

The first native Bengali Large Language Model trained from scratch, built by undergraduate researchers at AIUB (American International University-Bangladesh).

Model Details

  • Parameters: 5,933,981,248 (5.93B)
  • Architecture: LLaMA-3 style decoder-only transformer (see the config sketch after this list)
    • 32 layers, 4096 hidden dim
    • GQA: 32 query heads / 8 KV heads (4:1 ratio)
    • SwiGLU activation, RMSNorm, RoPE
    • Flash Attention 2
    • Weight tying (embedding/LM head)
  • Context length: 4096 tokens
  • Vocabulary: 64,000 (custom Bengali BPE tokenizer)
  • Training data: ~1B Bengali tokens (Wikipedia + IndicCorpV2 + textbooks)
  • Training compute: NVIDIA H100 80GB, bfloat16, gradient checkpointing
  • Best training loss: 1.8665
  • Held-out perplexity: 10.21 (Bengali Wikipedia held-out)
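
For orientation, these hyperparameters map onto a standard Hugging Face LlamaConfig roughly as sketched below. This is illustrative only, not the repo's own Config class, and the SwiGLU intermediate size is not listed above, so 11008 (the usual LLaMA-7B-class value, approximately consistent with the quoted parameter count) is an assumption.

from transformers import LlamaConfig

llama_equiv = LlamaConfig(
    vocab_size=64_000,             # custom Bengali BPE tokenizer
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,        # GQA: 32 query heads...
    num_key_value_heads=8,         # ...sharing 8 KV heads (4:1 ratio)
    intermediate_size=11_008,      # ASSUMPTION: standard 7B-class SwiGLU width
    hidden_act="silu",             # SwiGLU MLP activation
    max_position_embeddings=4096,  # context length
    rms_norm_eps=1e-5,
    tie_word_embeddings=True,      # embedding/LM-head weight tying
)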

Limitations

This model is significantly undertrained relative to Chinchilla scaling laws: it saw ~1B training tokens, whereas the ~20-tokens-per-parameter heuristic suggests ~120B for 5.93B parameters. It has learned Bengali grammar, vocabulary, and syntactic patterns but lacks deep factual world knowledge. Future work: scale training to 50B+ tokens.
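
A quick back-of-the-envelope check of that figure (the ~20-tokens-per-parameter rule of thumb is an approximation from Hoffmann et al., 2022):

params = 5_933_981_248    # parameter count from Model Details
optimal = 20 * params     # rough Chinchilla rule of thumb
print(f"~{optimal / 1e9:.0f}B tokens")   # -> ~119B, vs ~1B actually trained on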

Authors

  • Avoy Mollick (23-50066-1)
  • Apurba (23-50067-1)
  • Arpon (23-50068-1)

Course: NLP CSC4233, AIUB, Spring 2025-2026
Supervisor: Dr. MD. Saef Ullah Miah
GitHub: github.com/avoymollick/BanglaLLM-7B

Usage

import torch
from bangla_llm import BanglaLLM, Config
from tokenizers import Tokenizer

# Build the 7B config and instantiate the model in bfloat16 on the GPU.
cfg = Config("7b")
model = BanglaLLM(cfg).to("cuda").to(torch.bfloat16)

# Load the pretrained weights (the checkpoint stores the state dict under "model").
ckpt = torch.load("base_checkpoint.pt", map_location="cuda", weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

# Load the custom Bengali BPE tokenizer.
tok = Tokenizer.from_file("tokenizer/tokenizer.json")

# Encode the prompt, prepending the BOS token (id 1).
prompt = "বাংলাদেশ একটি"  # "Bangladesh is a"
ids = torch.tensor([[1] + tok.encode(prompt, add_special_tokens=False).ids], device="cuda")

# Sample up to 100 new tokens with nucleus sampling; id 2 is EOS.
out = model.generate(ids, max_new=100, temp=0.3, top_p=0.9, eos=2)
print(tok.decode(out[0].tolist(), skip_special_tokens=True))
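
The held-out perplexity reported above can be estimated with a loop like this minimal sketch. It reuses model and tok from the snippet above, assumes the model's forward pass returns raw next-token logits of shape (batch, seq, vocab), and uses a placeholder file name for the held-out text.

import math
import torch.nn.functional as F

# Placeholder file name; any held-out Bengali text works.
text = open("heldout_bn.txt", encoding="utf-8").read()
ids = [1] + tok.encode(text, add_special_tokens=False).ids   # prepend BOS (id 1)

nll, n_tokens = 0.0, 0
with torch.no_grad():
    # Score non-overlapping 4096-token windows (4096 is the context length).
    for i in range(0, len(ids) - 1, 4096):
        chunk = torch.tensor([ids[i : i + 4097]], device="cuda")
        if chunk.size(1) < 2:
            break
        logits = model(chunk[:, :-1])   # assumed: forward returns (B, T, V) logits
        nll += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)).float(),
            chunk[:, 1:].reshape(-1),
            reduction="sum",
        ).item()
        n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(nll / n_tokens):.2f}")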

Citation

@misc{banglallm7b2026,
  title={BanglaLLM-7B: First Native Bengali Language Model Trained From Scratch},
  author={Mollick, Avoy and Apurba and Arpon},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/avoymollick/BanglaLLM-7B-base}}
}