Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate this model using a Sentence Transformers SentenceTransformer
Details
This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer module followed by lasttoken Pooling and Normalize modules, producing 4096-dimensional normalized embeddings (matching the shapes printed in the usage example further down). It supports text, image, video, and multimodal (any combination of these) inputs via a structured message format.
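To make the lasttoken pooling and normalization steps concrete, here is a minimal pure-Python sketch on toy data. It is not the library's implementation: the helper names last_token_pool and l2_normalize are hypothetical, and the toy hidden states are 3-dimensional rather than the model's real embedding size.

```python
import math


def last_token_pool(token_embeddings, attention_mask):
    """Return the hidden state of the last attended (non-padding) token."""
    # Index of the last position whose attention mask value is 1.
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_embeddings[last_index]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy input: 4 token positions, 3-dimensional hidden states, right-padded.
hidden_states = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [3.0, 0.0, 4.0],  # last real token
    [9.0, 9.0, 9.0],  # padding position, ignored by the pooling step
]
mask = [1, 1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden_states, mask))
print(embedding)  # [0.6, 0.0, 0.8]
```

Because the output has unit length, cosine similarity between two such embeddings reduces to a plain dot product.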
The sentence_bert_config.json configures each modality (text, image, video, message) to call the model's forward method and extract last_hidden_state, with the message modality using the "structured" format. unpad_inputs is set to false because Qwen3 does not handle flattened (unpadded) inputs well.
Added files:
- modules.json: pipeline: Transformer, Pooling & Normalize
- sentence_bert_config.json: feature-extraction task, structured message format, multimodal config
- config_sentence_transformers.json: default prompt ("Represent the user's input."), cosine similarity
- 1_Pooling/config.json: lasttoken pooling, 4096 embedding dimension
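For readers unfamiliar with how modules.json chains the pipeline, the sketch below builds an illustrative three-module layout and prints it as JSON. This is an assumption, not a copy of the file in this PR: the "type" strings and the "2_Normalize" path follow the convention of earlier Sentence Transformers releases and may differ in v5.4.

```python
import json

# Illustrative layout only. The "type" strings and the "2_Normalize" path are
# assumptions based on earlier Sentence Transformers releases, not the
# contents of the file added in this PR.
modules = [
    {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
    {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
    {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"},
]

# Modules run in "idx" order: Transformer, then Pooling, then Normalize.
print(json.dumps(modules, indent=2))
```

Each entry's "path" points at the subfolder holding that module's own config, which is why the pooling settings live in 1_Pooling/config.json.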
Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B", revision="refs/pr/11")
# Text queries
queries = [
    "A woman playing with her dog on a beach at sunset.",
    "Pet owner training dog outdoors near water.",
    "Woman surfing on waves during a sunny day.",
    "City skyline view from a high-rise building at night.",
]
# Documents: text, image, and text+image
documents = [
    "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    {
        "text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.",
        "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    },
]
# Encode queries and documents
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)
print(query_embeddings.shape, doc_embeddings.shape)
# (4, 4096) (3, 4096)
# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7438, 0.6556, 0.6244],
#         [0.4430, 0.3323, 0.3929],
#         [0.3685, 0.2310, 0.2874],
#         [0.0602, -0.0162, 0.0167]])
And after merging, the revision argument can be dropped.
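As a rough picture of what the similarity step in the snippet computes under the cosine similarity setting, here is a small pure-Python sketch on toy 2-dimensional vectors. The helper names cosine_similarity and similarity_matrix are hypothetical and stand in for the library's own tensor-based implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, documents):
    """Return one row per query and one column per document."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]


# Toy embeddings standing in for the real high-dimensional ones.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_documents = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]

scores = similarity_matrix(toy_queries, toy_documents)
for row in scores:
    print([round(value, 4) for value in row])
```

The result has shape (number of queries, number of documents), which is why the 4 queries and 3 documents above yield a 4x3 tensor.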
Note that none of the existing behaviour is affected. This PR only adds an additional way to run the model in a familiar and common format.
If you are able to merge this before tomorrow's Sentence Transformers v5.4 release, then I will be able to include this in my blogpost and documentation as a release model without revision. Otherwise I'll document it with revision and I can drop that later.
- Tom Aarsen