Perhaps a known bug…?
What’s likely happening (specific to your code + model)
Your symptom matches a newly filed PEFT bug report for PEFT 0.18.1 + Qwen/Qwen3-4B-Instruct-2507 + device_map="auto" + LoRA on q/k/v/o taking ~3 minutes inside get_peft_model(). (GitHub)
In practice, get_peft_model() becomes “minutes slow” when adapter injection is interacting badly with Accelerate’s dispatch/offload behavior (which is what device_map="auto" uses under the hood).
Background: why device_map="auto" can make injection slow
device_map="auto" means Accelerate computes a placement for modules across devices. (Hugging Face)
- That placement may spread layers across GPUs and can offload layers to CPU or even disk. (Hugging Face)
- When some weights are offloaded, Accelerate uses hooks to move weights in/out as needed (great for inference that doesn’t fit in VRAM, but expensive if a process needs to “touch” lots of modules). (Hugging Face)
LoRA injection “touches” many modules (in your case: every q/k/v/o projection in every transformer block), so if any of those modules are on CPU/disk or heavily sharded, injection can devolve into repeated transfers/synchronization.
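A quick way to see how much of that dispatch machinery is actually attached: Accelerate stores the hook it adds to each managed module on an internal attribute. The check below is a rough, version-dependent diagnostic (the _hf_hook attribute is Accelerate-internal, not a public API), assuming model has already been loaded with device_map="auto":

```python
# Rough diagnostic: count modules that carry an Accelerate dispatch/offload hook.
# _hf_hook is an internal attribute set by Accelerate when it dispatches a model,
# so treat this as a debugging aid that may break across versions.
hooked = [name for name, module in model.named_modules() if hasattr(module, "_hf_hook")]
print(f"{len(hooked)} modules carry an Accelerate hook")
```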
Confirm it in 10 seconds: check your device map
Run this right after from_pretrained(...) and before get_peft_model(...):
```python
from collections import Counter

print("has hf_device_map:", hasattr(model, "hf_device_map"))
if hasattr(model, "hf_device_map"):
    print("hf_device_map counts:", Counter(model.hf_device_map.values()))
    print("unique placements:", sorted(set(model.hf_device_map.values())))
```
Interpretation
- If you see 'cpu' or 'disk' in the map, that is the most common reason get_peft_model() takes minutes (offload thrash). Accelerate explicitly supports CPU/disk offload as part of dispatch. (Hugging Face)
- If everything is only cuda:* and it’s still minutes, you may be hitting a PEFT/Qwen3-specific regression (consistent with the issue report). (GitHub)
Causes (ranked by likelihood)
1) CPU/disk offload from device_map="auto"
Most likely. Accelerate may place some layers on CPU or disk when VRAM is tight; dispatch can include CPU/disk offload. (Hugging Face)
LoRA injection then becomes slow because it must replace/wrap many modules that aren’t resident on GPU.
2) “Large-touch” injection: you target 4 linears per layer (q/k/v/o)
This increases the number of wrapped modules substantially. On its own it shouldn’t be minutes, but it amplifies any offload/sharding overhead.
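To put a number on it, here is a small sketch (assuming model is already loaded) that counts how many modules your target_modules setting will match; each match is a linear layer that get_peft_model() has to replace or wrap:

```python
# Count the modules LoRA will wrap for target_modules=["q_proj", "k_proj", "v_proj", "o_proj"].
# str.endswith accepts a tuple, so this matches any of the four projection names.
target_suffixes = ("q_proj", "k_proj", "v_proj", "o_proj")
touched = [name for name, _ in model.named_modules() if name.endswith(target_suffixes)]
print(f"LoRA will wrap {len(touched)} modules")  # 4 projections × number of decoder layers
```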
3) Expensive LoRA initialization modes (less likely with your config)
- In PEFT, init_lora_weights=True is the fast default (Kaiming-uniform for A, zeros for B). (Hugging Face)
- If you ever set init_lora_weights="pissa" (or similar), it may run SVD and can take minutes. (GitHub) See the debug sketch below for isolating initialization cost.
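If you want to rule initialization in or out explicitly, PEFT also accepts init_lora_weights=False, which skips weight initialization entirely; the docs flag this as intended for testing (the adapter is then no longer a no-op at the start of training), so treat the following as a debug-only sketch:

```python
from peft import LoraConfig

# Debug-only: skip LoRA weight initialization to check whether init (rather than
# module replacement) is where the minutes go. Not suitable for real training as-is.
debug_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=False,
)
```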
4) Loading adapters is also slow in this model (separate but related)
The same report also says PeftModel.from_pretrained(...) is extremely slow for this model. (GitHub)
That path can often be improved with low_cpu_mem_usage=True (details below). (Hugging Face)
Solutions / Workarounds
A) Fastest: inject LoRA with the model fully on GPU (no auto device map)
Transformers explicitly documents that device_map=0 places the whole model on GPU 0. (Hugging Face)
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=0,  # or {"": 0}
)
peft_model = get_peft_model(model, lora_config, adapter_name=adapter_name, autocast_adapter_dtype=False)
```
If this drops from minutes → seconds, your root cause was dispatch/offload interaction.
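To make the comparison concrete, wrap the call in a timer (a minimal sketch; it just repeats the get_peft_model() call from above with standard-library timing around it):

```python
import time

t0 = time.perf_counter()
peft_model = get_peft_model(model, lora_config, adapter_name=adapter_name, autocast_adapter_dtype=False)
print(f"get_peft_model() took {time.perf_counter() - t0:.1f}s")
```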
B) If you need "auto" (multi-GPU), prevent CPU/disk offload with max_memory
Give "auto" enough GPU headroom so it doesn’t spill to CPU/disk.
```python
max_memory = {
    0: "23GiB",
    1: "23GiB",
    # omit "cpu" to discourage CPU offload
}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory=max_memory,
)
```
Then re-check hf_device_map—you want only cuda:*. Accelerate’s device map can offload to CPU/disk, so avoiding that is the key. (Hugging Face)
C) Inject before dispatch/offload (advanced, but robust when VRAM is tight)
If you truly must offload, restructure so LoRA injection doesn’t happen on a fully dispatched/offloaded model.
Accelerate’s recommended big-model workflow uses:
- init_empty_weights() to build a “meta” skeleton, then
- load_checkpoint_and_dispatch() to load/dispatch weights. (Hugging Face)
Conceptually, you want the LoRA modules created before dispatch, so that the expensive device-movement hooks don’t trigger repeated paging during injection.
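A minimal sketch of that ordering, under the assumption that you can hold the fp16 weights in CPU RAM first (model_id, lora_config, and max_memory as defined elsewhere in this answer); it uses Accelerate’s infer_auto_device_map and dispatch_model rather than the full init_empty_weights flow:

```python
import torch
from accelerate import dispatch_model, infer_auto_device_map
from peft import get_peft_model
from transformers import AutoModelForCausalLM

# 1) Load with no device_map, so no dispatch/offload hooks are attached yet.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=None,
)
no_split = getattr(model, "_no_split_modules", None)  # internal attr: keeps decoder blocks whole

# 2) Inject LoRA while every module is an ordinary CPU tensor (no hook traffic).
peft_model = get_peft_model(model, lora_config)

# 3) Only now compute a placement and dispatch the wrapped model across devices.
#    (If the inferred map contains "disk", dispatch_model also needs offload_dir=...)
device_map = infer_auto_device_map(peft_model, max_memory=max_memory, no_split_module_classes=no_split)
peft_model = dispatch_model(peft_model, device_map=device_map)
```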
D) Reduce the number of targeted modules (diagnostic + workaround)
Try adapting fewer projections temporarily to see if time scales with number of replacements:
```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # fewer modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=True,
)
```
If this becomes much faster, you’re bottlenecked on “number of touched modules × device transfers”.
E) If your slow path includes loading adapters: use low_cpu_mem_usage=True
This applies to loading existing adapters (from_pretrained, load_adapter, or low-level injection), not to the “create fresh LoRA then train” path, but it matters here because you (and the issue report) also saw slow PeftModel.from_pretrained(...). (GitHub)
PEFT docs: low_cpu_mem_usage=True creates empty adapter weights on the meta device to speed loading. (Hugging Face)
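A minimal sketch of that loading path (the adapter path below is a placeholder for wherever your adapter is saved):

```python
from peft import PeftModel

# Loading an existing adapter: low_cpu_mem_usage=True creates the adapter weights
# on the meta device first, which PEFT documents as a way to speed up loading.
peft_model = PeftModel.from_pretrained(
    model,                          # base model, ideally already fully GPU-resident
    "path/to/your-lora-adapter",    # placeholder: local dir or Hub id of a saved adapter
    low_cpu_mem_usage=True,
)
```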
F) Track/try upstream fixes
Historically, PEFT has had performance work around adapter initialization (e.g., older issues about get_peft_model() slowness and initialization loops). (GitHub)
For your exact model/version combination, the best “upstream” move is to follow the active report and test newer PEFT revisions once a fix lands. (GitHub)
Recommended “decision tree” for your case
- Print Counter(model.hf_device_map.values()).
  - If any cpu/disk → do A or B first.
- A/B test: load with device_map=0 and time get_peft_model() vs device_map="auto".
- If GPU-only is still taking minutes: you’re likely in the same regression as the current issue; use A as a workaround (inject on one device first), or C if you must offload. (GitHub)