Wiki Screenshot Embedding LoRA Checkpoints

LoRA / DoRA adapter checkpoints for Qwen3-VL-Embedding-2B, fine-tuned on Wikipedia screenshot tiles for visual document retrieval.

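Unless noted otherwise, all evals are on mini-v8 (400 queries, 7426 tiles); the lora_vit and hyper3 tables below also report the older v6 benchmark.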


Headline (Cross-reader / Cross-thinking comparison @ best ckpts)

ckpt R@1 R@3 Qwen3-VL-4B reader (no-think mt=200) Qwen3.5-4B no-think mt=200 Qwen3.5-4B think mt=8192
Base (no LoRA) 0.6875 0.830 0.7300 0.7925
hyper3/ckpt250 (LLM-only LoRA) 0.7000 0.840 0.7775 0.7700 0.8225
lora_vit/ckpt200 0.7575 0.8900 0.7875 0.7850 0.8375
lora_vit/ckpt250 0.7625 0.8925 0.7800 0.8375
lora_vit/ckpt300 0.7625 0.8925 0.7825
dora_ls005/ckpt150 0.7575 0.8825 0.7875 0.7875 0.8525
dora_ls005/ckpt225 0.7675 0.8825 0.7800 0.7725 0.8450
dora_ls005/ckpt250 0.7575 0.8850 0.7750 0.7750 0.8325

Best per metric:

  • R@1: dora_ls005/ckpt225 (0.7675)
  • R@3: lora_vit/ckpt250 / ckpt300 (0.8925)
  • Qwen3-VL no-think QA: dora_ls005/ckpt150 / lora_vit/ckpt200 tied (0.7875)
  • Qwen3.5 no-think QA: dora_ls005/ckpt150 (0.7875)
  • Qwen3.5 think QA (mt=8192): dora_ls005/ckpt150 (0.8525) ★ best QA score overall

dora_ls005/ckpt150 is the most well-rounded checkpoint: it wins or ties all 3 QA metrics and trails the best retrieval numbers only marginally (R@1 -0.0100 vs dora_ls005/ckpt225, R@3 -0.0100 vs lora_vit/ckpt250).


Training Configs

dora_ls005 ★ NEW — best QA across all readers

v8r-style + DoRA + label smoothing 0.05 + tokens=4096

  • Base model: Qwen/Qwen3-VL-Embedding-2B
  • Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
  • LoRA: rank 32, alpha 32, DoRA enabled (use_dora: true), targets LLM (q/k/v/o_proj) + ViT (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
  • Loss: InfoNCE + label smoothing = 0.05
  • LR: 7e-6, cosine schedule, warmup 20 steps (image) + 50 steps (text-only)
  • Batch size: 64 (single GPU, effective batch 64; smaller than lora_vit's 256)
  • Max steps: 350, save every 25
  • Visual tokens: 4096
  • Text warmup: 50 steps text-only before image training
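
The loss entry in the list above can be read as a softmax cross-entropy over similarity scores with a smoothed target. Below is a minimal NumPy sketch of that idea. It assumes cosine-similarity logits, a temperature of 0.05, and the common smoothing convention in which the target is (1 - eps) on the positive plus eps spread uniformly over all candidates. The temperature, the function name, and the candidate layout (in-batch positives followed by hard negatives) are illustrative assumptions, not the actual training script.

```python
import numpy as np

def info_nce_label_smoothing(q, d, eps=0.05, temperature=0.05):
    """q: (B, D) query embeddings; d: (N, D) candidates, where d[i] is the
    positive for q[i] and rows B..N-1 are extra (hard) negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / temperature                        # (B, N)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    B, N = logits.shape
    # Smoothed target: eps/N everywhere, plus (1 - eps) on the positive.
    target = np.full((B, N), eps / N)
    target[np.arange(B), np.arange(B)] += 1.0 - eps
    return float(-(target * log_probs).sum(axis=1).mean())

# Toy check: 4 queries, 2 hard negatives each (assumed layout).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
hard_negs = rng.normal(size=(8, 8))
aligned = info_nce_label_smoothing(q, np.vstack([q, hard_negs]))
shuffled = info_nce_label_smoothing(q, np.vstack([q[::-1], hard_negs]))
```

With correctly aligned positives the loss should be much lower than with shuffled positives; the smoothing term keeps it strictly above zero even for a perfect match.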

Eval results (mini-v8, 400 queries / 7426 tiles)

Step R@1 R@3 Qwen3-VL-4B (no-think) Qwen3.5-4B (no-think) Qwen3.5-4B (think mt=8192)
125 0.7425 0.8750 0.7800 0.8400
150 0.7575 0.8825 0.7875 0.7875 0.8525
175 0.7550 0.8750 0.7800 0.8375
200 0.7525 0.8750 0.7775 0.8325
225 0.7675 0.8825 0.7800 0.7725 0.8450
250 0.7575 0.8850 0.7750 0.7750 0.8325

Why thinking helps so much

The Qwen3.5-4B reader is a reasoning model. With enable_thinking=True and a large max_tokens budget (≥ 8192), QA accuracy rises from 0.7875 to 0.8525 for dora_ls005/ckpt150 and from 0.7300 to 0.7925 for the base model:

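| Reader config | Base (no LoRA) | Best ckpt (dora_ls005/ckpt150) |
|---|---|---|
| Qwen3.5 no-think mt=200 | 0.7300 | 0.7875 (+5.75 pts) |
| Qwen3.5 think mt=8192 | 0.7925 (+6.25 pts) | 0.8525 (+12.25 pts) |

Gains are absolute percentage points relative to the base model with thinking disabled (0.7300).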
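
The bracketed gains in the table appear to be absolute differences from the base model with thinking disabled (0.7300), expressed in percentage points rather than relative percent. A quick check of that reading, using only the four scores above:

```python
# QA accuracy on mini-v8 with the Qwen3.5-4B reader (values from the table above).
scores = {
    ("no-think", "base"): 0.7300,
    ("no-think", "dora_ls005/ckpt150"): 0.7875,
    ("think", "base"): 0.7925,
    ("think", "dora_ls005/ckpt150"): 0.8525,
}
reference = scores[("no-think", "base")]

# Absolute gain in percentage points over the no-think base model.
gains = {key: round((value - reference) * 100, 2) for key, value in scores.items()}

# Gain from thinking alone, holding the retriever (dora_ls005/ckpt150) fixed.
thinking_gain = round(
    (scores[("think", "dora_ls005/ckpt150")]
     - scores[("no-think", "dora_ls005/ckpt150")]) * 100,
    2,
)
```

Read this way, thinking and the adapter contribute gains of similar size, and the two roughly add up.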

max_tokens sensitivity for dora_ls005/ckpt150 + thinking:

  • mt=2048 → 0.8175
  • mt=4096 → 0.8325
  • mt=8192 → 0.8525 ← sweet spot
  • mt=16384 → 0.8500 (plateau)

Use max_tokens ≥ 8192 when thinking is enabled; with smaller budgets the reasoning gets truncated and accuracy drops (0.8175 at mt=2048, 0.8325 at mt=4096).


lora_vit (v8_r_warmup50_lr7e6_lora_vit_350) — original Best

  • Base model: Qwen/Qwen3-VL-Embedding-2B
  • Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
  • LoRA: rank 32, alpha 32, targets include ViT layers (attn.qkv, attn.proj, mlp.linear_fc1/fc2)
  • LR: 7e-6, cosine schedule, warmup 50 steps
  • Batch size: 256 effective
  • Max steps: 350
  • Visual tokens: 4096
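
For intuition about the schedule in the list above, here is a small sketch of a learning-rate curve with linear warmup followed by cosine decay. The linear warmup shape and the decay to zero at the final step are assumptions (the card only states "cosine schedule, warmup 50 steps"), and `lr_at` is a hypothetical helper, not part of the training code.

```python
import math

def lr_at(step, peak_lr=7e-6, warmup_steps=50, max_steps=350):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr at the end of warmup to 0 at max_steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Learning rate at the checkpoint steps evaluated in this card.
schedule = {step: lr_at(step) for step in (100, 150, 200, 250, 300, 350)}
```

Under these assumptions the best checkpoint (step 200) sits at the midpoint of the decay, where the learning rate is about half of its peak.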

Eval Results

| Step | v6 recall@1 | v6 recall@3 | v6 QA | v6 vs base | v8 recall@1 | v8 recall@3 | v8 QA | v8 vs base |
|---|---|---|---|---|---|---|---|---|
| base | 0.655 | 0.800 | 0.645 | | 0.690 | 0.828 | 0.708 | |
| 100 | 0.690 | 0.855 | 0.715 | +0.070 | 0.725 | 0.870 | 0.760 | +0.052 |
| 150 | 0.720 | 0.870 | 0.735 | +0.090 | 0.750 | 0.888 | 0.780 | +0.072 |
| 200 | 0.740 | 0.860 | 0.750 | +0.105 | 0.753 | 0.890 | 0.793 | +0.085 |

Best: ckpt200, with v8 QA = 0.793 (+8.5 pts over base) and v6 QA = 0.750 (+10.5 pts over base)

hyper3 (v8_i_warmup50_lr7e6_hardswitch_350)

  • Base model: Qwen/Qwen3-VL-Embedding-2B
  • Data: natural_filtered_v2 (104k pairs, 2 hard negatives)
  • LoRA: rank 32, alpha 32 (LLM layers only, no ViT)
  • LR: 7e-6, cosine schedule, warmup 50 steps
  • Batch size: 256 effective
  • Max steps: 350 (hard switch)
  • Visual tokens: 4096
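
The difference between this LLM-only adapter and the ViT-inclusive adapters comes down to which module names LoRA is attached to. The sketch below writes the target sets as keyword arguments that could be passed to `peft.LoraConfig`. The module names are taken from the dora_ls005 config; that hyper3 and lora_vit use the same LLM projections is an assumption, and `lora_kwargs` is a hypothetical helper.

```python
# Module-name suffixes, as listed in the dora_ls005 config.
LLM_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]
VIT_TARGETS = ["attn.qkv", "attn.proj", "mlp.linear_fc1", "mlp.linear_fc2"]

def lora_kwargs(include_vit, use_dora=False):
    """Keyword arguments for peft.LoraConfig (rank 32, alpha 32 in every run)."""
    targets = LLM_TARGETS + (VIT_TARGETS if include_vit else [])
    return {"r": 32, "lora_alpha": 32, "use_dora": use_dora, "target_modules": targets}

hyper3 = lora_kwargs(include_vit=False)                    # LLM-only LoRA
lora_vit = lora_kwargs(include_vit=True)                   # LLM + ViT LoRA (assumed LLM set)
dora_ls005 = lora_kwargs(include_vit=True, use_dora=True)  # LLM + ViT DoRA
# e.g. config = peft.LoraConfig(**dora_ls005)
```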

Eval Results

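| Step | v6 QA | v6 vs base | v8 QA | v8 vs base |
|---|---|---|---|---|
| base | 0.645 | | 0.708 | |
| 100 | 0.715 | +0.070 | 0.745 | +0.038 |
| 150 | 0.710 | +0.065 | 0.748 | +0.040 |
| 200 | 0.725 | +0.080 | 0.753 | +0.045 |
| 250 | 0.715 | +0.070 | 0.770 | +0.063 |
| 300 | 0.715 | +0.070 | 0.763 | +0.055 |
| 350 | 0.715 | +0.070 | 0.763 | +0.055 |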

Best: ckpt250 (v8 QA = 0.770, +6.3 pts over base)

Usage

from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")

# Best overall (DoRA + LS=0.05): wins QA on every reader
model = PeftModel.from_pretrained(
    base,
    "Chrisyichuan/wiki-screenshot-embedding-lora",
    subfolder="dora_ls005/ckpt150",
)

DoRA is auto-detected from adapter_config.json (use_dora: true) — no extra code needed; PEFT loads lora_A, lora_B, and lora_magnitude_vector automatically.
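
To confirm which variant a downloaded checkpoint folder contains, the adapter config can be inspected directly. This is a minimal sketch that reads a local adapter_config.json. `describe_adapter` is a hypothetical helper, the folder path is whatever you downloaded the checkpoint to, and the keys read are the ones PEFT writes for LoRA adapters (`r`, `lora_alpha`, `use_dora`, `target_modules`).

```python
import json
from pathlib import Path

def describe_adapter(checkpoint_dir):
    """Summarise a PEFT adapter folder from its adapter_config.json."""
    config = json.loads((Path(checkpoint_dir) / "adapter_config.json").read_text())
    targets = config.get("target_modules") or []
    if isinstance(targets, str):  # PEFT also allows a single regex string
        targets = [targets]
    return {
        "kind": "DoRA" if config.get("use_dora", False) else "LoRA",
        "rank": config.get("r"),
        "alpha": config.get("lora_alpha"),
        "targets": sorted(targets),
    }
```

For a local copy of dora_ls005/ckpt150 this should report the kind as DoRA, given the `use_dora: true` entry mentioned above.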

Eval Benchmarks

  • v6: 200 queries, 5291 tiles (hard-mini-v6)
  • v8: 400 queries, 7426 tiles (hard-mini-v8, preferred benchmark)
  • QA score pipeline: retrieval top-3 → VQA reader (Qwen3-VL-4B-Instruct or Qwen3.5-4B) → GPT-4.1 grader (correct/incorrect)
  • For Qwen3.5 reader: enable_thinking=True + max_tokens=8192 recommended; enable_thinking=False + max_tokens=200 is the fast/cheap baseline.
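
The retrieval half of the pipeline above (R@1, R@3, and the top-3 tiles handed to the reader) can be sketched with plain cosine similarity. This is an illustrative NumPy version that assumes one gold tile per query and precomputed embeddings. `top_k_tiles` and `recall_at_k` are hypothetical names, and the reader and grader stages are left out.

```python
import numpy as np

def top_k_tiles(query_emb, tile_emb, k=3):
    """Indices of the k most similar tiles per query, best first."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = tile_emb / np.linalg.norm(tile_emb, axis=1, keepdims=True)
    sims = q @ t.T                                   # (num_queries, num_tiles)
    return np.argsort(-sims, axis=1)[:, :k]

def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold tile index appears in the top k."""
    hits = (ranked[:, :k] == np.asarray(gold)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries, 4 tiles, 2-d embeddings (made-up values).
tiles = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]])
gold = [0, 1, 2]
ranked = top_k_tiles(queries, tiles, k=3)
r1, r3 = recall_at_k(ranked, gold, 1), recall_at_k(ranked, gold, 3)
```

In the real pipeline, each row of `ranked` would identify the three tiles passed to the VQA reader, so R@3 bounds how often the reader even sees the gold tile.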