Integrate with Sentence Transformers v5.4

#11
by tomaarsen HF Staff - opened

Hello!

Pull Request overview

  • Integrate this model using a Sentence Transformers SentenceTransformer

Details

This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer module followed by lasttoken Pooling and Normalize modules, producing 4096-dimensional normalized embeddings. It supports text, image, video, and multimodal (any combination of these) inputs via a structured message format.
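
To make the Pooling and Normalize steps concrete, here is a minimal pure-Python sketch of last-token pooling followed by L2 normalization on a toy sequence. The helper names (last_token_pool, l2_normalize) and the tiny hidden size are illustrative assumptions and not the library's implementation, which works on batched tensors.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of one sequence."""
    # Last position where the mask is 1; this works for left or right padding.
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]

def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy sequence (hypothetical values): 4 positions, hidden size 2, last position is padding.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

pooled = last_token_pool(hidden_states, attention_mask)  # picks [3.0, 4.0]
embedding = l2_normalize(pooled)
print(embedding)
```

Because the result has unit length, the dot product of two such embeddings equals their cosine similarity.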

The sentence_bert_config.json configures each modality (text, image, video, message) to use the forward method and extract last_hidden_state, with the message modality using the "structured" format. unpad_inputs is set to false because Qwen3 does not handle flattened (unpadded) inputs well.

Added files:

  • modules.json: pipeline: Transformer, Pooling & Normalize
  • sentence_bert_config.json: feature-extraction task, structured message format, multimodal config
  • config_sentence_transformers.json: default prompt ("Represent the user's input."), cosine similarity
  • 1_Pooling/config.json: lasttoken pooling, 4096 embedding dimension
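
For readers unfamiliar with modules.json, the sketch below builds a three-module pipeline description in the layout used by earlier Sentence Transformers releases. Everything except the 1_Pooling path is an assumption: the type strings, the 2_Normalize path, and the key names may differ in the file actually added by this PR.

```python
import json

# Assumed layout, modelled on modules.json files from earlier Sentence Transformers
# releases; the file in this PR may use different names, paths, or type strings.
modules = [
    {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
    {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
    {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"},
]

# Serialize the pipeline description the way it would appear on disk.
modules_json = json.dumps(modules, indent=2)
print(modules_json)
```

The modules run in idx order, so the Transformer output feeds Pooling, whose output feeds Normalize.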

Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-8B", revision="refs/pr/11")

# Text queries
queries = [
    "A woman playing with her dog on a beach at sunset.",
    "Pet owner training dog outdoors near water.",
    "Woman surfing on waves during a sunny day.",
    "City skyline view from a high-rise building at night.",
]

# Documents: text, image, and text+image
documents = [
    "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    {"text": "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset, as the dog offers its paw in a heartwarming display of companionship and trust.", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
]

# Encode queries and documents
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)
print(query_embeddings.shape, doc_embeddings.shape)
# (4, 4096) (3, 4096)

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.7438, 0.6556, 0.6244],
#         [0.4430, 0.3323, 0.3929],
#         [0.3685, 0.2310, 0.2874],
#         [0.0602, -0.0162, 0.0167]])

And after merging, the revision argument can be dropped.
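
The similarity matrix printed above has one row per query and one column per document. As a small illustration, the following sketch ranks the three documents for each query using the scores copied from that output; the rank_documents helper and the document_kinds labels are hypothetical names introduced here.

```python
# Scores copied from the printed output above (rows: queries, columns: documents).
similarities = [
    [0.7438, 0.6556, 0.6244],
    [0.4430, 0.3323, 0.3929],
    [0.3685, 0.2310, 0.2874],
    [0.0602, -0.0162, 0.0167],
]
# Labels for the three documents, in the order they were encoded.
document_kinds = ["text", "image", "text+image"]

def rank_documents(row):
    """Return document indices sorted from most to least similar."""
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)

rankings = [rank_documents(row) for row in similarities]
for query_index, order in enumerate(rankings):
    ranked = [document_kinds[j] for j in order]
    print(f"query {query_index}: {ranked}")
```

With these scores the text document ranks first for every query, and the last query (the city skyline) scores near zero against all three documents.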

Note that none of the existing behaviour changes: this PR only adds another way to run the model, in a familiar and widely used format.

If you are able to merge this before tomorrow's Sentence Transformers v5.4 release, then I will be able to include this in my blogpost and documentation as a release model without revision. Otherwise I'll document it with revision and I can drop that later.

- Tom Aarsen
tomaarsen changed pull request status to open
thenlper changed pull request status to merged
