How to use nyu-visionx/Cambrian-S-7B with Transformers:
# Use a pipeline as a high-level helper
# Warning: Pipeline type "image-to-text" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# 'pip install "transformers<5.0.0"'
from transformers import pipeline
pipe = pipeline("image-to-text", model="nyu-visionx/Cambrian-S-7B")
# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("nyu-visionx/Cambrian-S-7B", dtype="auto")
Website | Paper | GitHub | Cambrian-S Family
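The warning in the snippet above says the "image-to-text" pipeline type is only usable with transformers v4.x. A minimal sketch of a guard that picks a loading path from a version string follows. The helper names `major_version` and `pipeline_task_available` are hypothetical and are not part of transformers.

```python
def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '4.57.1'."""
    return int(version.split(".", 1)[0])


def pipeline_task_available(version: str) -> bool:
    """Hypothetical helper: True when the "image-to-text" pipeline can be used.

    Assumes, per the warning above, that the task is unavailable from v5 on.
    """
    return major_version(version) < 5


# Example with literal version strings; in practice one would pass
# transformers.__version__ instead.
use_pipeline_on_v4 = pipeline_task_available("4.57.1")
use_pipeline_on_v5 = pipeline_task_available("5.0.0")
```

When the check returns False, the direct `AutoModelForCausalLM.from_pretrained` path shown above is the one to use.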
Authors: Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-7B is a spatially grounded multimodal large language model designed for spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
from cambrian.model.builder import load_pretrained_model
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.conversation import conv_templates
model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")
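# Build the conversation prompt; "<image>" marks where the visual tokens go
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# Process image/video
# NOTE: this step is not shown in the snippet. input_ids, image_tensor, and
# image_sizes must be prepared from the prompt and the visual input
# (presumably with the imported tokenizer_image_token and process_images)
# before calling generate.
# Generate
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes)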
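The prompt above carries an `<image>` placeholder that must be turned into token ids before generation. A minimal sketch of that idea follows, assuming the LLaVA-style convention of splicing a sentinel id in place of each placeholder. It is not the implementation of `tokenizer_image_token`. The sentinel value -200, the function `tokenize_with_image_placeholder`, and the `toy_encode` tokenizer are assumptions for illustration, and special tokens such as BOS are ignored.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200  # assumed sentinel id, following the LLaVA convention


def tokenize_with_image_placeholder(
    prompt: str,
    encode: Callable[[str], List[int]],
    image_token_index: int = IMAGE_TOKEN_INDEX,
) -> List[int]:
    """Encode the text around each "<image>" and splice in a sentinel id."""
    chunks = prompt.split(IMAGE_PLACEHOLDER)
    ids: List[int] = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel per placeholder that separated two chunks.
            ids.append(image_token_index)
        ids.extend(encode(chunk))
    return ids


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id (the code point) per character."""
    return [ord(ch) for ch in text]


ids = tokenize_with_image_placeholder("<image>\nHi", toy_encode)
print(ids)  # [-200, 10, 72, 105]
```

The model can later locate the sentinel positions and substitute the visual features there, which is why the sentinel is chosen outside the normal vocabulary range.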
@article{yang2025cambrian,
title={Cambrian-S: Towards Spatial Supersensing in Video},
author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
journal={arXiv preprint arXiv:2025},
year={2025}
}