⚠️ Research Proof of Concept — NOT Production Ready

A 586M-parameter unified multimodal model with a validated Multi-Head Latent Attention (MLA) architecture. Image generation quality is low, and the checkpoint has no native ASR; use Moonshine (included in the Toolkit) for audio understanding.

What Actually Works

| Component | Status | Evidence |
| --- | --- | --- |
| Architecture | ✅ Validated | 585.8M params, verified GQA/MLA layer assignment |
| Checkpoint loading | ✅ Verified | `SmolOmni.from_hub()` downloads and loads |
| Text understanding | ✅ Functional | Coherent next-token predictions from prompts |
| Image generation pipeline | ⚠️ Runs but quality near-random | CLIP score 0.11 (random ~0.15, good ~0.30–0.35) |
| VQA / Vision QA | ❌ Not trained | 0% on ChartQA (10 samples); no VQA tuning was ever done |
| Audio ASR (native) | ❌ Does NOT exist in checkpoint | Stage 3 deleted (144% WER failure) |
| Audio ASR (Moonshine) | ✅ Working via Toolkit | 27M params, 3.5% WER, 0.1s latency |
| Speed | ✅ Fast | 15,901 tok/s AR, 0.40s / 50 flow steps |
| KV cache | ✅ Compressed | 12,160 floats/token (-40.6% vs baseline) |
| Memory | ✅ Efficient | 1,239 MB peak VRAM |
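
The CLIP score quoted for image generation is a text-image similarity. The sketch below assumes it is the cosine similarity between a CLIP image embedding and a CLIP text embedding, averaged over prompt/image pairs; the card does not state which CLIP variant or scaling was used, and the vectors here are hypothetical placeholders rather than real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)

def mean_clip_score(image_embs, text_embs):
    """Average similarity over paired (image, caption) embeddings."""
    scores = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    return sum(scores) / len(scores)

# Hypothetical 3-d embeddings; real CLIP embeddings have hundreds of dimensions.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embs = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(mean_clip_score(image_embs, text_embs))
```

Read this way, a score of 0.11 sits below the ~0.15 random reference, which suggests the generated images carry essentially no information about the prompt.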

Production Readiness: What's Missing & Cost

🔴 Image Generation (Requires significant investment)

| Gap | What's needed | Estimated cost |
| --- | --- | --- |
| Training data | LAION-5B or COYO-700M (50M+ image-text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs our 2,000 on small subset) | $500–$5,000 on A100 |
| VAE latent normalization | Fix latent → VAE decoding scale mismatch | $0 (code fix, ~2 hours) |
| CFG + null conditioning | Train with 10% unconditional dropout | $50–$200 on A100 |
| Total for good images | | $550–$5,200 |
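
The "CFG + null conditioning" gap refers to classifier-free guidance. The sketch below shows both halves under the standard formulation: drop the prompt with 10% probability during training, then extrapolate from the unconditional to the conditional prediction at sampling time. `NULL_PROMPT`, `maybe_drop_condition`, and `cfg_combine` are hypothetical names and are not part of the Toolkit.

```python
import random

NULL_PROMPT = ""  # hypothetical stand-in for the learned null conditioning

def maybe_drop_condition(prompt, rng, p_uncond=0.10):
    """Training side: replace the prompt with the null prompt with probability p_uncond."""
    return NULL_PROMPT if rng.random() < p_uncond else prompt

def cfg_combine(v_uncond, v_cond, guidance_scale):
    """Sampling side: v = v_uncond + s * (v_cond - v_uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]

rng = random.Random(0)
prompts = ["a red bicycle"] * 10_000
dropped = sum(maybe_drop_condition(p, rng) == NULL_PROMPT for p in prompts)
print(f"unconditional fraction: {dropped / len(prompts):.3f}")

# With scale 1.0 the result equals the conditional prediction;
# larger scales push further away from the unconditional one.
print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=3.0))
```

Without the dropout step the model never learns an unconditional prediction, so there is nothing to extrapolate from at sampling time.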

🟡 VQA / Vision Understanding (Low cost, straightforward)

| Gap | What's needed | Estimated cost |
| --- | --- | --- |
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA + TextVQA | $0 (download) |
| Training | 2–5 epochs LoRA instruction tuning (rank 8–16) | $5–$20 on L4 |
| Evaluation | Standard VQA benchmarks | $0 (CPU) |
| Total for working VQA | | $5–$20 |
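
Part of why rank 8–16 LoRA tuning fits such a small budget is the parameter count. LoRA freezes each adapted weight matrix (d_out × d_in) and trains two low-rank factors of shapes d_out × r and r × d_in. The sketch below assumes a square 960 × 960 projection purely for illustration (960 is the hidden size of the SmolLM2-360M text backbone, which this card does not list) and a hypothetical number of adapted matrices.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

HIDDEN = 960      # assumed hidden size, not stated in the card
NUM_ADAPTED = 64  # hypothetical: e.g. two projections in each of 32 layers

for rank in (8, 16):
    per_matrix = lora_params(HIDDEN, HIDDEN, rank)
    total = per_matrix * NUM_ADAPTED
    full = HIDDEN * HIDDEN * NUM_ADAPTED
    print(f"rank {rank}: {per_matrix:,} per matrix, {total:,} total "
          f"({100 * total / full:.1f}% of the frozen weights)")
```

The actual count depends on which modules are adapted; the MLA layers use non-square projections, so this is an order-of-magnitude estimate only.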

🟢 Audio ASR (Recommended: Moonshine — already working)

| Approach | WER | Effort | Cost |
| --- | --- | --- | --- |
| Moonshine + VLM (current) | ~3.5% | Zero (already implemented) | $0 |
| Native ASR (Whisper encoder + projector) | ~2–5% | Requires training | $10–$50 |
| Native ASR via discrete tokens (our attempt) | 144% | ❌ Failed, not recommended | N/A |

Recommendation: Use Moonshine. It's better than Whisper-tiny, 4× faster, and requires zero training.
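
A WER above 100%, as in the failed native attempt, is possible because word error rate divides substitutions, deletions, and insertions by the number of reference words, so a hypothesis that emits many extra words is penalized past 100%. A minimal sketch using word-level edit distance follows; the example sentences are made up and unrelated to the actual evaluation data.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("turn the lights off", "turn the lights off"))
print(word_error_rate("turn the lights off",
                      "turn turn the the light light of of of"))
```

The second call exceeds 1.0 because the repeated words count as insertions on top of the two substitutions, which is the typical signature of a decoder that loops or hallucinates.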

🟢 Text Understanding (Already functional)

The model's text decoder was SVD-initialized from SmolVLM-500M and underwent Stage 1 KL distillation. It produces coherent text. For competitive chat quality, add instruction fine-tuning on UltraChat or OpenHermes (~$10 on L4).

What This Checkpoint Actually Does

| Capability | Status | Quality | Notes |
| --- | --- | --- | --- |
| Text → Text | ✅ Functional | Coherent | SVD-initialized + KL distilled from SmolVLM |
| Image → Text (VQA) | ❌ Not trained | N/A | Pipeline runs, answers are random |
| Text → Image | ⚠️ Pipeline runs | Near-random (CLIP 0.11) | Architecture valid, needs 10×–50× more data/steps |
| Audio → Text (native) | ❌ Deleted | N/A | Stage 3 failed; use Moonshine instead |
| Text → Speech | ❌ Not trained | N/A | Not present in this checkpoint |

Architecture Verified From Checkpoint Weights

Base: SmolVLM-500M-Instruct
32 layers (verified from 520 weight keys in state dict)
GQA layers 0–9, 30–31 (verified: q_proj.weight keys present)
MLA layers 10–29 (verified: q_a_proj.weight + kv_a_proj.weight keys present)
NoPE layers: 0, 4, 8, 12, 16, 20, 24, 28
SVD init: 312 of 520 weights copied from pretrained GQA via X-EcoMLA
Flow head: 8-layer DiT, adaLN-Zero, 165M params
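
The flow head produces an image latent by integrating a velocity field from noise to data over a fixed number of steps (50 in the benchmarks). The sketch below assumes plain fixed-step Euler integration from t = 0 to t = 1; the card does not specify the sampler, and `toy_velocity` is a hypothetical stand-in for the DiT, chosen so that the exact answer is known.

```python
def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

TARGET = [2.0, -1.0]  # hypothetical "data" point

def toy_velocity(x, t):
    """Straight-line field pointing at TARGET: v = (target - x) / (1 - t)."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]

print(euler_sample(toy_velocity, x0=[0.0, 0.0], num_steps=50))
```

For a perfectly straight path like this one, Euler integration lands on the target at any step count; a trained head generally has curved trajectories, which is why step count trades quality against the latency reported in the benchmarks.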

📊 Benchmarks (Verified on NVIDIA L4)

| Metric | Baseline | Tinman-MLA-500M | Change |
| --- | --- | --- | --- |
| KV cache / token (floats) | 20,480 | 12,160 | -40.6% |
| AR throughput | 2,100 tok/s | 15,901 tok/s | +657% |
| Peak VRAM | 5,800 MB | 1,239 MB | -79% |
| Image gen (50 steps) | — | 0.40 s | New |
| Parameters | 507.5M | 585.8M | +15% (flow head) |
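
The KV-cache row can be reproduced from the GQA/MLA layer split listed in the architecture section. This back-of-the-envelope sketch assumes the SmolLM2-360M attention shape for the GQA layers (5 KV heads of dimension 64, which this card does not state) and treats the per-layer MLA cache width as an unknown that is solved for from the reported total.

```python
NUM_LAYERS = 32
GQA_LAYERS = 12             # layers 0-9 and 30-31
MLA_LAYERS = 20             # layers 10-29
KV_HEADS, HEAD_DIM = 5, 64  # assumed GQA shape (SmolLM2-360M), not from the card

gqa_per_layer = 2 * KV_HEADS * HEAD_DIM  # keys + values
baseline = NUM_LAYERS * gqa_per_layer    # all-GQA model
reported_hybrid = 12_160                 # from the benchmark table

# Whatever remains after the GQA layers must be cached by the MLA layers.
mla_per_layer = (reported_hybrid - GQA_LAYERS * gqa_per_layer) / MLA_LAYERS
reduction = 100 * (baseline - reported_hybrid) / baseline

print(f"baseline: {baseline} floats/token")
print(f"implied MLA cache: {mla_per_layer:.0f} floats/layer/token")
print(f"reduction: {reduction:.1f}%")

# bf16 cache size (2 bytes per float) for a hypothetical 8,192-token context
for name, floats in (("baseline", baseline), ("hybrid", reported_hybrid)):
    print(f"{name}: {floats * 2 * 8192 / 2**20:.0f} MiB")
```

Under these assumptions the all-GQA baseline comes out at exactly 20,480 floats per token, matching the table, and each MLA layer would cache 224 floats per token instead of 640. The 224 figure is inferred from the reported totals, not read from the checkpoint config.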

📥 Download and Run

```shell
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# 1.1 GB, auto-downloads from Hub
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
    strict=False,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
```
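
The card does not document the model's forward signature, so the following is only a sketch of greedy next-token decoding written against a hypothetical `next_token_logits` callable (token ids in, one list of vocabulary logits out). Adapting it means wrapping whatever the Toolkit's `SmolOmni` forward call actually returns; the toy model below exists solely to make the loop runnable.

```python
def greedy_decode(next_token_logits, prompt_ids, max_new_tokens=16, eos_id=None):
    """Repeatedly append the highest-scoring next token until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids

def toy_logits(ids, vocab_size=5):
    """Hypothetical model: always prefers (last token + 1) mod vocab_size."""
    preferred = (ids[-1] + 1) % vocab_size
    return [1.0 if tok == preferred else 0.0 for tok in range(vocab_size)]

print(greedy_decode(toy_logits, prompt_ids=[0], max_new_tokens=10, eos_id=4))
```

With a real model, `prompt_ids` would come from `tokenizer(prompt)["input_ids"]` and the returned ids would be passed to `tokenizer.decode`.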

📁 Files

| File | Size | Purpose |
| --- | --- | --- |
| `stage2_final/model.pt` | 1.1 GB | Final Stage 2 checkpoint (research POC): text + image generation |
| `config.json` | 1.3 KB | Architecture metadata |

🔗 Links

Toolkit (code, training scripts, Moonshine integration): TinmanLabSL/SmolOmni-MLA-Toolkit
256M Model: TinmanLabSL/SmolOmni-MLA-256M
Base Model: SmolVLM-500M-Instruct
Moonshine ASR: UsefulSensors/moonshine-tiny

📝 Citation

@software{smolomni500m2025,
  title = {Tinman-SmolOmni-MLA-500M: Research POC — MLA Multimodal VLM},
  author = {TinmanLabSL},
  year = {2025},
  url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M},
  note = {Research proof of concept. Not production ready. Image generation quality is poor.}
}

License

Apache 2.0
