⚠️ Research Proof of Concept: NOT Production Ready
A 586M-parameter (585.8M) unified multimodal model with a validated Multi-Head Latent Attention (MLA) architecture. Image generation is low quality and there is no native ASR; use Moonshine (included in the Toolkit) for audio understanding.
What Actually Works
| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | 585.8M params, verified GQA/MLA layer assignment |
| Checkpoint loading | ✅ Verified | `SmolOmni.from_hub()` downloads and loads |
| Text understanding | ✅ Functional | Coherent next-token predictions from prompts |
| Image generation pipeline | ⚠️ Runs but quality near-random | CLIP score 0.11 (random ~0.15, good ~0.30-0.35) |
| VQA / Vision QA | ❌ Not trained | 0% on ChartQA (10 samples); no VQA tuning was ever done |
| Audio ASR (native) | ❌ Does NOT exist in checkpoint | Stage 3 deleted (144% WER failure) |
| Audio ASR (Moonshine) | ✅ Working via Toolkit | 27M params, 3.5% WER, 0.1s latency |
| Speed | ✅ Fast | 15,901 tok/s AR, 0.40s / 50 flow steps |
| KV cache | ✅ Compressed | 12,160 floats/token (-40.6% vs baseline) |
| Memory | ✅ Efficient | 1,239 MB peak VRAM |
Production Readiness: What's Missing & Cost
🔴 Image Generation (Requires significant investment)
| Gap | What's needed | Estimated cost |
|---|---|---|
| Training data | LAION-5B or COYO-700M (50M+ image-text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs. our 2,000 on a small subset) | $500–$5,000 on A100 |
| VAE latent normalization | Fix latent → VAE decoding scale mismatch | $0 (code fix, ~2 hours) |
| CFG + null conditioning | Train with 10% unconditional dropout | $50–$200 on A100 |
| Total for good images | | $550–$5,200 |
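The "CFG + null conditioning" row refers to training with conditioning dropout, so that one network learns both a conditional and an unconditional prediction that can be mixed at sampling time. The sketch below is a minimal plain-Python illustration, not the Toolkit's training code: `NULL_CAPTION`, `apply_condition_dropout`, and `cfg_combine` are hypothetical names, and using an empty string as the null condition is an assumption.

```python
import random

# Assumption: the "null" (unconditional) prompt is an empty caption.
NULL_CAPTION = ""


def apply_condition_dropout(captions, drop_prob=0.10, rng=None):
    """Replace each caption with the null caption with probability drop_prob."""
    rng = rng or random.Random()
    return [NULL_CAPTION if rng.random() < drop_prob else c for c in captions]


def cfg_combine(v_uncond, v_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


if __name__ == "__main__":
    batch = ["a red bicycle"] * 10
    print(apply_condition_dropout(batch, drop_prob=0.10, rng=random.Random(0)))
    print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=3.0))
```

With `guidance_scale=1.0` the combination reduces to the conditional prediction, and with `0.0` to the unconditional one.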
🟡 VQA / Vision Understanding (Low cost, straightforward)
| Gap | What's needed | Estimated cost |
|---|---|---|
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA + TextVQA | $0 (download) |
| Training | 2–5 epochs LoRA instruction tuning (rank 8–16) | $5–$20 on L4 |
| Evaluation | Standard VQA benchmarks | $0 (CPU) |
| Total for working VQA | | $5–$20 |
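A rough way to see why LoRA tuning at rank 8–16 is cheap: each adapted weight matrix of shape `d_out × d_in` gains only `rank × (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic. The 960×960 projection shape is a hypothetical example chosen for illustration and is not read from this checkpoint.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA parameters as a fraction of the frozen full matrix."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # Hypothetical projection shape, for illustration only.
    d_in = d_out = 960
    for rank in (8, 16):
        n = lora_params(d_in, d_out, rank)
        frac = trainable_fraction(d_in, d_out, rank)
        print(f"rank {rank}: {n:,} trainable params ({frac:.2%} of the full matrix)")
```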
🟢 Audio ASR (Recommended: Moonshine, already working)
| Approach | WER | Effort | Cost |
|---|---|---|---|
| Moonshine + VLM (current) | ~3.5% | Zero, already implemented | $0 |
| Native ASR (Whisper encoder + projector) | ~2-5% | Requires training | $10–$50 |
| Native ASR via discrete tokens (our attempt) | 144% | ❌ Failed, not recommended | N/A |
Recommendation: Use Moonshine. It's better than Whisper-tiny, 4× faster, and requires zero training.
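A WER above 100%, like the 144% reported for the failed native ASR attempt, is possible because word error rate divides the word-level edit distance by the reference length, and a hypothesis with many spurious insertions can contain more errors than the reference has words. The sketch below is a generic WER implementation for illustration; it is not the evaluation script used for the numbers in the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("turn the lights off", "turn the lights off"))
    # A long, wrong hypothesis against a short reference exceeds 1.0 (100%).
    print(word_error_rate("lights off", "please play some music right now"))
```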
🟢 Text Understanding (Already functional)
The model's text decoder was SVD-initialized from SmolVLM-500M and underwent Stage 1 KL distillation. It produces coherent text. For competitive chat quality, add instruction fine-tuning on UltraChat or OpenHermes (~$10 on L4).
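For readers unfamiliar with KL distillation, the idea is to train the student so that its next-token distribution matches the teacher's. The sketch below computes a forward KL loss, KL(teacher ‖ student), for a single position in plain Python. The direction of the KL term and the temperature are assumptions; the card does not specify how Stage 1 was configured.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL between teacher and student next-token distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    print(distillation_loss(teacher, teacher))            # identical logits
    print(distillation_loss(teacher, [0.0, 0.0, 0.0]))    # uniform student
```

The loss is zero when the two sets of logits agree and grows as the student distribution drifts away from the teacher's.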
What This Checkpoint Actually Does
| Capability | Status | Quality | Notes |
|---|---|---|---|
| Text → Text | ✅ Functional | Coherent | SVD-initialized + KL distilled from SmolVLM |
| Image → Text (VQA) | ❌ Not trained | N/A | Pipeline runs, answers are random |
| Text → Image | ⚠️ Pipeline runs | Near-random (CLIP 0.11) | Architecture valid, needs 10×–50× more data/steps |
| Audio → Text (native) | ❌ Deleted | N/A | Stage 3 failed; use Moonshine instead |
| Text → Speech | ❌ Not trained | N/A | Not present in this checkpoint |
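The Text → Image path samples with a fixed number of flow steps (50 in the reported timings). As a rough picture of what such a sampler does, the sketch below integrates a velocity field with Euler steps from t = 0 to t = 1. It is a toy under stated assumptions: `toy_velocity` stands in for the flow head and simply points along a straight line toward a fixed target, and the convention that noise sits at t = 0 and data at t = 1 is assumed, not taken from the Toolkit.

```python
def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a trained flow head: for a single target point,
# the straight-line (rectified-flow) velocity is (target - x) / (1 - t).
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    return [(target_i - xi) / (1.0 - t) for target_i, xi in zip(TARGET, x)]


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]  # would be Gaussian noise in a real sampler
    print(euler_sample(toy_velocity, start, num_steps=50))
```

Because the toy path is a straight line, Euler integration lands on the target up to floating-point error; a learned velocity field is only approximately straight, which is why the step count matters in practice.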
Architecture Verified From Checkpoint Weights
- Base: SmolVLM-500M-Instruct
- 32 layers (verified from 520 weight keys in state dict)
- GQA layers 0–9, 30–31 (verified: `q_proj.weight` keys present)
- MLA layers 10–29 (verified: `q_a_proj.weight` + `kv_a_proj.weight` keys present)
- NoPE layers: 0, 4, 8, 12, 16, 20, 24, 28
- SVD init: 312 of 520 weights copied from pretrained GQA via X-EcoMLA
- Flow head: 8-layer DiT, adaLN-Zero, 165M params
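The layer assignment listed above can be written down as a small table, which also makes it easy to sanity-check the KV-cache figures reported elsewhere in this card. The sketch below is illustrative only. Two things in it are assumptions and not read from this checkpoint: that non-NoPE layers use RoPE, and that each GQA layer caches 5 KV heads of dimension 64 (chosen because 32 × 2 × 5 × 64 reproduces the 20,480 baseline). The implied MLA width is derived arithmetic, not a config value.

```python
NUM_LAYERS = 32
GQA_LAYERS = set(range(0, 10)) | {30, 31}   # layers 0-9 and 30-31
MLA_LAYERS = set(range(10, 30))             # layers 10-29
NOPE_LAYERS = set(range(0, NUM_LAYERS, 4))  # 0, 4, 8, ..., 28


def describe_layer(idx: int):
    """Return (attention type, positional scheme) for a layer index."""
    attention = "GQA" if idx in GQA_LAYERS else "MLA"
    # Assumption: layers not listed as NoPE use rotary embeddings.
    position = "NoPE" if idx in NOPE_LAYERS else "RoPE"
    return attention, position


# Assumption: 5 KV heads x head_dim 64, with K and V both cached.
GQA_FLOATS_PER_LAYER = 2 * 5 * 64
BASELINE_FLOATS_PER_TOKEN = NUM_LAYERS * GQA_FLOATS_PER_LAYER

REPORTED_HYBRID_FLOATS_PER_TOKEN = 12_160
implied_mla_floats_per_layer = (
    REPORTED_HYBRID_FLOATS_PER_TOKEN - len(GQA_LAYERS) * GQA_FLOATS_PER_LAYER
) / len(MLA_LAYERS)

if __name__ == "__main__":
    for idx in range(NUM_LAYERS):
        print(idx, *describe_layer(idx))
    print("baseline floats/token:", BASELINE_FLOATS_PER_TOKEN)
    print("implied MLA floats/layer:", implied_mla_floats_per_layer)
```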
Benchmarks (Verified on NVIDIA L4)
| Metric | Baseline | Tinman-MLA-500M | Change |
|---|---|---|---|
| KV Cache / token | 20,480 | 12,160 | -40.6% |
| AR Throughput | 2,100 tok/s | 15,901 tok/s | +657% |
| Peak VRAM | 5,800 MB | 1,239 MB | -79% |
| Image Gen (50 steps) | N/A | 0.40 s | New |
| Parameters | 507.5M | 585.8M | +15% (flow head) |
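The "Change" column can be reproduced from the two value columns, and the floats-per-token figure can be turned into a memory estimate. The sketch below does both. The 2 bytes per float corresponds to bf16, and the 8,192-token context is an arbitrary illustrative length, not a figure from the benchmark run.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


def kv_cache_mib(floats_per_token: int, context_tokens: int, bytes_per_float: int = 2) -> float:
    """KV-cache size in MiB for one sequence (bf16 = 2 bytes per float)."""
    return floats_per_token * context_tokens * bytes_per_float / (1024 ** 2)


# (baseline, Tinman-MLA-500M) pairs copied from the table above.
ROWS = {
    "KV cache floats/token": (20_480, 12_160),
    "AR throughput (tok/s)": (2_100, 15_901),
    "Peak VRAM (MB)": (5_800, 1_239),
    "Parameters (M)": (507.5, 585.8),
}

if __name__ == "__main__":
    for name, (baseline, new) in ROWS.items():
        print(f"{name}: {pct_change(baseline, new):+.1f}%")

    # Illustrative context length; not taken from the benchmark setup.
    context = 8_192
    for name, floats in (("baseline", 20_480), ("MLA hybrid", 12_160)):
        print(f"{name}: {kv_cache_mib(floats, context):.1f} MiB at {context:,} tokens")
```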
📥 Download and Run

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# 1.1 GB, auto-downloads from Hub
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
    strict=False,
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
```
Files
| File | Size | Purpose |
|---|---|---|
| `stage2_final/model.pt` | 1.1 GB | Production checkpoint: text + image generation |
| `config.json` | 1.3 KB | Architecture metadata |
Links
- Toolkit (code, training scripts, Moonshine integration): TinmanLabSL/SmolOmni-MLA-Toolkit
- 256M Model: TinmanLabSL/SmolOmni-MLA-256M
- Base Model: SmolVLM-500M-Instruct
- Moonshine ASR: UsefulSensors/moonshine-tiny
Citation

```bibtex
@software{smolomni500m2025,
  title  = {Tinman-SmolOmni-MLA-500M: Research POC - MLA Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M},
  note   = {Research proof of concept. Not production ready. Image generation quality is poor.}
}
```
License
Apache 2.0