get_peft_model() is extremely slow

I am on PEFT version 0.18.1.

get_peft_model() is extremely slow: it takes around 3 minutes to apply a LoRA adapter on top of Qwen/Qwen3-4B-Instruct-2507.

Here is the code that is slow:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-4B-Instruct-2507"
adapter_name = "default"  # placeholder name

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=True,
)

peft_model = get_peft_model(model, lora_config, adapter_name=adapter_name, autocast_adapter_dtype=False)

Perhaps a known bug…?


What’s likely happening (specific to your code + model)

Your symptom matches a newly filed PEFT bug report for PEFT 0.18.1 + Qwen/Qwen3-4B-Instruct-2507 + device_map="auto" + LoRA on q/k/v/o taking ~3 minutes inside get_peft_model(). (GitHub)

In practice, get_peft_model() becomes "minutes slow" when adapter injection interacts badly with Accelerate's dispatch/offload behavior (which is what device_map="auto" uses under the hood).

Background: why device_map="auto" can make injection slow

  • device_map="auto" means Accelerate computes a placement for modules across devices. (Hugging Face)
  • That placement may spread layers across GPUs and can offload layers to CPU or even disk. (Hugging Face)
  • When some weights are offloaded, Accelerate uses hooks to move weights in/out as needed (great for inference that doesn’t fit in VRAM, but expensive if a process needs to “touch” lots of modules). (Hugging Face)

LoRA injection “touches” many modules (in your case: every q/k/v/o projection in every transformer block), so if any of those modules are on CPU/disk or heavily sharded, injection can devolve into repeated transfers/synchronization.
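
To gauge how much work injection will do, you can count the targeted projections and the modules that already carry Accelerate dispatch hooks. This is an illustrative check on the already-loaded model; the _hf_hook attribute is what Accelerate attaches to dispatched modules:

# Rough count of the linear layers LoRA will wrap for this target list.
targets = ("q_proj", "k_proj", "v_proj", "o_proj")
n_targets = sum(1 for name, _ in model.named_modules() if name.endswith(targets))

# Modules that already carry an Accelerate dispatch hook (per-module weight movement).
n_hooked = sum(1 for _, module in model.named_modules() if hasattr(module, "_hf_hook"))

print(f"{n_targets} modules will be wrapped; {n_hooked} modules have Accelerate hooks")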


Confirm it in 10 seconds: check your device map

Run this right after from_pretrained(...) and before get_peft_model(...):

from collections import Counter

print("has hf_device_map:", hasattr(model, "hf_device_map"))
if hasattr(model, "hf_device_map"):
    print("hf_device_map counts:", Counter(model.hf_device_map.values()))
    print("unique placements:", sorted(set(model.hf_device_map.values())))

Interpretation

  • If you see 'cpu' or 'disk' in the map, that is the most common reason get_peft_model() takes minutes (offload thrash). Accelerate explicitly supports CPU/disk offload as part of dispatch. (Hugging Face)
  • If everything is only cuda:* and it’s still minutes, you may be hitting a PEFT/Qwen3-specific regression (consistent with the issue report). (GitHub)
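
If the map does contain 'cpu' or 'disk', a quick follow-up (assuming hf_device_map is set) shows which submodules spilled over:

offloaded = {name: dev for name, dev in model.hf_device_map.items() if dev in ("cpu", "disk")}
print(f"{len(offloaded)} offloaded entries, first few:", list(offloaded.items())[:5])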

Causes (ranked by likelihood)

1) CPU/disk offload from device_map="auto"

Most likely. Accelerate may place some layers on CPU or disk when VRAM is tight; dispatch can include CPU/disk offload. (Hugging Face)
LoRA injection then becomes slow because it must replace/wrap many modules that aren’t resident on GPU.

2) “Large-touch” injection: you target 4 linears per layer (q/k/v/o)

This increases the number of wrapped modules substantially. On its own it shouldn’t be minutes, but it amplifies any offload/sharding overhead.

3) Expensive LoRA initialization modes (less likely with your config)

  • In PEFT, init_lora_weights=True is the fast default (Kaiming-uniform for A, zeros for B). (Hugging Face)
  • If you ever set init_lora_weights="pissa" (or similar), it may run SVD and can take minutes. (GitHub)
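
For reference, a minimal config sketch contrasting the fast default with a weight-aware init such as PiSSA; the rank and target list here are placeholders, and the extra cost shows up at injection time, not when the config object is built:

from peft import LoraConfig

# Fast default: Kaiming-uniform init for A, zeros for B (no extra compute at injection).
fast_cfg = LoraConfig(r=16, target_modules=["q_proj", "v_proj"], init_lora_weights=True)

# PiSSA runs an SVD on each targeted weight during injection and can add minutes on large models.
pissa_cfg = LoraConfig(r=16, target_modules=["q_proj", "v_proj"], init_lora_weights="pissa")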

4) Loading adapters is also slow in this model (separate but related)

The same report also says PeftModel.from_pretrained(...) is extremely slow for this model. (GitHub)
That path can often be improved with low_cpu_mem_usage=True (details below). (Hugging Face)


Solutions / Workarounds

A) Fastest: inject LoRA with the model fully on GPU (no auto device map)

Transformers explicitly documents that device_map=0 places the whole model on GPU 0. (Hugging Face)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=0,  # or {"": 0}
)

peft_model = get_peft_model(model, lora_config, adapter_name=adapter_name, autocast_adapter_dtype=False)

If this drops from minutes → seconds, your root cause was dispatch/offload interaction.
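
A quick way to run that comparison is to time the injection itself; a minimal sketch, assuming model_id and lora_config are defined as in your original snippet:

import time
import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map=0)

t0 = time.perf_counter()
peft_model = get_peft_model(model, lora_config)
print(f"get_peft_model took {time.perf_counter() - t0:.1f}s with device_map=0")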


B) If you need "auto" (multi-GPU), prevent CPU/disk offload with max_memory

Give "auto" enough GPU headroom so it doesn’t spill to CPU/disk.

max_memory = {
    0: "23GiB",
    1: "23GiB",
    # omit "cpu" to discourage CPU offload
}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory=max_memory,
)

Then re-check hf_device_map—you want only cuda:*. Accelerate’s device map can offload to CPU/disk, so avoiding that is the key. (Hugging Face)
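
A small sanity check after loading (assuming a device map was used, so hf_device_map exists):

placements = set(model.hf_device_map.values())
assert not placements & {"cpu", "disk"}, f"Offloaded placements found: {placements}"
print("GPU-only placement:", placements)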


C) Inject before dispatch/offload (advanced, but robust when VRAM is tight)

If you truly must offload, restructure so LoRA injection doesn’t happen on a fully dispatched/offloaded model.

Accelerate’s recommended big-model workflow uses:

  • init_empty_weights() to build a “meta” skeleton, then
  • load_checkpoint_and_dispatch() to load/dispatch weights. (Hugging Face)

Conceptually, you want the LoRA modules created before dispatch, so the device-movement hooks never get a chance to page weights back and forth during injection.
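
One hedged way to get that ordering without rewriting the loading path is to load the base model on CPU, inject LoRA there (no dispatch hooks involved yet), and only then dispatch the already-injected model. This is a sketch, not the canonical init_empty_weights()/load_checkpoint_and_dispatch() flow; it assumes enough CPU RAM for the fp16 weights, and the memory budgets are placeholders:

import torch
from accelerate import dispatch_model, infer_auto_device_map
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-4B-Instruct-2507"

# 1) Load fully on CPU: no dispatch hooks yet, so injection stays cheap.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 2) Inject LoRA while all weights live on one device.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
peft_model = get_peft_model(model, lora_config)

# 3) Compute a placement for the already-injected model, then dispatch it.
device_map = infer_auto_device_map(
    peft_model,
    max_memory={0: "23GiB", "cpu": "64GiB"},  # placeholder budgets; adjust to your hardware
    no_split_module_classes=getattr(model, "_no_split_modules", None),
)
peft_model = dispatch_model(peft_model, device_map=device_map)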


D) Reduce the number of targeted modules (diagnostic + workaround)

Try adapting fewer projections temporarily to see if time scales with number of replacements:

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # fewer modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=True,
)

If this becomes much faster, you’re bottlenecked on “number of touched modules × device transfers”.


E) If your slow path includes loading adapters: use low_cpu_mem_usage=True

This is for loading existing adapters (from_pretrained, load_adapter, or low-level injection), not for “create fresh LoRA then train”, but it matters because you (and the issue report) also saw slow PeftModel.from_pretrained(...). (GitHub)

PEFT docs: low_cpu_mem_usage=True creates empty adapter weights on the meta device to speed loading. (Hugging Face)
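
A minimal sketch of that loading path; base_model is assumed to be an already-loaded transformers model, and the adapter path is a placeholder:

from peft import PeftModel

peft_model = PeftModel.from_pretrained(
    base_model,                    # already-loaded base model (assumption)
    "path/to/your-lora-adapter",   # placeholder adapter directory or Hub repo
    low_cpu_mem_usage=True,        # create adapter weights on the meta device before loading
)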


F) Track/try upstream fixes

Historically, PEFT has had performance work around adapter initialization (e.g., older issues about get_peft_model() slowness and initialization loops). (GitHub)
For your exact model/version combination, the best “upstream” move is to follow the active report and test newer PEFT revisions once a fix lands. (GitHub)


Recommended “decision tree” for your case

  1. Print Counter(model.hf_device_map.values()).
    • If any cpu or disk placements appear → do A or B first.
  2. A/B test: load with device_map=0 and time get_peft_model() vs device_map="auto".
  3. If it is still minutes with GPU-only placement:
    • You are likely hitting the same regression as the current issue; use A as a workaround (inject on one device first), or C if you must offload. (GitHub)