Integrate with Sentence Transformers v5.4

This PR adds the configuration files needed to load this model directly as a SentenceTransformer. The pipeline is a feature-extraction Transformer module followed by lasttoken Pooling and Normalize modules, producing 4096-dimensional normalized embeddings (matching the shapes printed in the example below). The model accepts text, image, and video inputs, as well as multimodal inputs (any combination of those) via a structured message format.
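To make the Pooling and Normalize steps concrete, here is a minimal pure-Python sketch of last-token pooling followed by L2 normalization. It is an illustration under assumptions, not the Sentence Transformers implementation: the helper names `last_token_pool` and `l2_normalize` are hypothetical, and the toy batch uses a hidden size of 2 rather than the model's real dimension.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """For each sequence, return the hidden state of its last non-padding token."""
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Index of the last position where the attention mask is 1.
        last_index = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last_index])
    return pooled


def l2_normalize(vectors):
    """Scale each vector to unit length (zero vectors are left unchanged)."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normalized.append([x / norm for x in vec])
    return normalized


# Toy batch (hypothetical values): 2 sequences, 3 tokens each, hidden size 2.
# The second sequence is right-padded, so its last real token is at index 1.
hidden_states = [
    [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]],
    [[0.0, 5.0], [6.0, 8.0], [9.0, 9.0]],
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

embeddings = l2_normalize(last_token_pool(hidden_states, attention_mask))
print(embeddings)
```

Looking up the last position via the attention mask keeps the sketch independent of whether the tokenizer pads on the left or on the right.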

The sentence_bert_config.json configures each modality (text, image, video, message) to call the model's forward method and extract last_hidden_state; the message modality uses the "structured" format. unpad_inputs is set to false because Qwen3 does not handle flattened inputs well.

Added files:

modules.json: the module pipeline (Transformer, Pooling, Normalize)
sentence_bert_config.json: feature-extraction task, structured message format, multimodal config
config_sentence_transformers.json: default prompt ("Represent the user's input."), cosine similarity
1_Pooling/config.json: lasttoken pooling, 4096 embedding dimension
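Because the pipeline ends with a Normalize module and the configured similarity function is cosine similarity, the score between two embeddings reduces to a plain dot product. A small sketch of that relationship, using hypothetical helper names and toy two-dimensional vectors rather than real model outputs:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy unit-length vectors (hypothetical values, not model outputs).
query_vec = [0.6, 0.8]
doc_vec = [1.0, 0.0]

cos_score = cosine_similarity(query_vec, doc_vec)
dot_score = dot(query_vec, doc_vec)
print(cos_score, dot_score)
```

For unit-length inputs the two scores agree up to floating-point rounding.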

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B", revision="refs/pr/11")

# Text queries
queries = [
    "A woman playing with her dog on a beach at sunset.",
    "Pet owner training dog outdoors near water.",
    "Woman surfing on waves during a sunny day.",
    "City skyline view from a high-rise building at night.",
]

# Documents: text, image, and text+image
documents = [
    "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
]

# Encode queries and documents
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)
print(query_embeddings.shape, doc_embeddings.shape)
# (4, 4096) (3, 4096)

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7438, 0.6556, 0.6244],
#         [0.4430, 0.3323, 0.3929],
#         [0.3685, 0.2310, 0.2874],
#         [0.0602, -0.0162, 0.0167]])

After this PR is merged, the revision argument can be dropped.
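As a rough illustration of how the similarity matrix can be read, the sketch below copies the scores printed in the example above into a plain Python list and picks the highest-scoring document per query. The labels in `document_labels` and the variable names are hypothetical; they simply mirror the order of the documents list (text, image, text+image).

```python
# Scores copied from the printed output above: rows are queries, columns are documents.
similarities = [
    [0.7438, 0.6556, 0.6244],
    [0.4430, 0.3323, 0.3929],
    [0.3685, 0.2310, 0.2874],
    [0.0602, -0.0162, 0.0167],
]

# Hypothetical labels mirroring the order of the documents list.
document_labels = ["text", "image", "text+image"]

# For each query, find the document with the highest similarity score.
best_per_query = []
for row in similarities:
    best_index = max(range(len(row)), key=lambda i: row[i])
    best_per_query.append((document_labels[best_index], row[best_index]))

# Rank the queries against the image-only document (column 1), best first.
image_ranking = sorted(
    range(len(similarities)),
    key=lambda q: similarities[q][1],
    reverse=True,
)

for query_index, (label, score) in enumerate(best_per_query):
    print(f"query {query_index}: best document = {label} ({score:.4f})")
print("query ranking for the image document:", image_ranking)
```

With a real tensor from model.similarity, the same lookup could be done with tensor operations instead of Python loops.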

Note that none of the existing behaviour is changed. This PR only adds another way to run the model, in a familiar and common format.

If you are able to merge this before tomorrow's Sentence Transformers v5.4 release, I will be able to include this in my blogpost and documentation as a release model without the revision argument. Otherwise I'll document it with the revision argument and drop that later.

Tom Aarsen

tomaarsen changed pull request status to open 25 days ago

thenlper changed pull request status to merged 17 days ago
