Integrate with Sentence Transformers v5.4

#23
by tomaarsen

Hello!

Pull Request overview

  • Integrate this model with Sentence Transformers via the SentenceTransformer class

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with last-token Pooling and Normalize modules, producing 2048-dimensional normalized embeddings. The model supports text, image, video, and multimodal (any combination of these) inputs via a structured message format.

The sentence_bert_config.json configures each modality (text, image, video, message) to run the model's forward method and extract last_hidden_state, with the message modality using the "structured" format. unpad_inputs is set to false, as Qwen3 does not handle flattened (unpadded) inputs well.
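To make the pooling step concrete: last-token pooling selects the hidden state of each sequence's final non-padded token, and the Normalize module L2-normalizes it. This is not the library's implementation, just a minimal numpy sketch with illustrative array shapes (the real model produces 2048-dimensional hidden states):

```python
import numpy as np

def last_token_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick each sequence's last non-padded token state, then L2-normalize it
    (mirroring the Pooling + Normalize modules in the pipeline)."""
    # Index of the last attended token per sequence
    last_idx = attention_mask.sum(axis=1) - 1  # shape: (batch,)
    pooled = last_hidden_state[np.arange(last_hidden_state.shape[0]), last_idx]  # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / norms

# Toy batch: 2 sequences, length 4, hidden dim 8
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])  # first sequence ends in a pad token
emb = last_token_pool(hidden, mask)
print(emb.shape)  # (2, 8)
```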

Added files:

  • modules.json: pipeline: Transformer, Pooling & Normalize
  • sentence_bert_config.json: feature-extraction task, structured message format, multimodal config
  • config_sentence_transformers.json: default prompt ("Represent the user's input."), cosine similarity
  • 1_Pooling/config.json: lasttoken pooling, 2048 embedding dimension
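As an illustration of the pooling setup, a 1_Pooling/config.json for last-token pooling looks roughly like the fragment below. The values are sketched from the description above; the actual file in this PR is authoritative:

```json
{
  "word_embedding_dimension": 2048,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_lasttoken": true
}
```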

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# Text queries
queries = [
    "A woman playing with her dog on a beach at sunset.",
    "Pet owner training dog outdoors near water.",
    "Woman surfing on waves during a sunny day.",
    "City skyline view from a high-rise building at night.",
]

# Documents: text, image, and text+image
documents = [
    "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
]

# Encode queries and documents
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)
print(query_embeddings.shape, doc_embeddings.shape)
# (4, 2048) (3, 2048)

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.8160, 0.7155, 0.7054],
#         [0.5173, 0.3295, 0.4446],
#         [0.3863, 0.2987, 0.3312],
#         [0.1061, 0.0433, 0.0839]])
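Since config_sentence_transformers.json sets cosine similarity and the embeddings are already normalized, model.similarity reduces to a matrix product of unit vectors. A small sketch with random stand-ins for the encoded embeddings (shapes match the example above):

```python
import numpy as np

# Dummy, already-normalized embeddings standing in for model.encode() output
rng = np.random.default_rng(42)
q = rng.normal(size=(4, 2048))  # 4 query embeddings
d = rng.normal(size=(3, 2048))  # 3 document embeddings
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)

# Cosine similarity on unit vectors is just a dot product
similarities = q @ d.T
print(similarities.shape)  # (4, 3)
```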

And after merging, the revision argument can be dropped.

Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run the model in a familiar and common format.

If you are able to merge this before tomorrow's Sentence Transformers v5.4 release, I can feature this model in my blogpost and documentation without the revision argument. Otherwise I'll document it with revision and drop that later.

  • Tom Aarsen
tomaarsen changed pull request status to open

Hello @littlebird13 , @thenlper , and the rest of the Qwen team!

Sentence Transformers maintainer here. I wanted to let you know that Sentence Transformers went fully multimodal in yesterday's release. I published this blogpost to announce the new changes: https://huggingface.co/blog/multimodal-sentence-transformers
The response has been great, and your models are featured prominently alongside BAAI/BGE-VL, nvidia/llama-nemotron & omni-nemotron, and jina-reranker-m0. There's one difference, though: all of those other models work out of the box, while the Qwen models still require users to pass revision="refs/pr/...", which adds some unfortunate friction.

See e.g. these usage tables:

I've already done the integration work and created PRs for each model. You'll notice that they're written to leave the existing functionality completely untouched, so the PRs are purely additive:

Starting from this release, Sentence Transformers also supports the text-only rerankers, with PRs here:

These will all be very familiar to users who already use the Qwen3-Embedding models via Sentence Transformers. Once these are merged, I can update my blogpost and documentation to drop the revision workaround!

Let me know if you have any questions!

  • Tom Aarsen
thenlper changed pull request status to merged
