Perhaps a known bug…?
What’s likely happening (specific to your code + model)
Your symptom matches a newly filed PEFT bug report for PEFT 0.18.1 + Qwen/Qwen3-4B-Instruct-2507 + device_map="auto" + LoRA on q/k/v/o taking ~3 minutes inside get_peft_model(). (GitHub)
In practice, get_peft_model() becomes “minutes slow” when adapter injection is interacting badly with Accelerate’s dispatch/offload behavior (which is what device_map="auto" uses under the hood).
Background: why device_map="auto" can make injection slow
device_map="auto" means Accelerate computes a placement for modules across devices. (Hugging Face)
- That placement may spread layers across GPUs and can offload layers to CPU or even disk. (Hugging Face)
- When some weights are offloaded, Accelerate uses hooks to move weights in/out as needed (great for inference that doesn’t fit in VRAM, but expensive if a process needs to “touch” lots of modules). (Hugging Face)
LoRA injection “touches” many modules (in your case: every q/k/v/o projection in every transformer block), so if any of those modules are on CPU/disk or heavily sharded, injection can devolve into repeated transfers/synchronization.
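A quick way to see how much of that dispatch machinery is actually attached: Accelerate stores the hook it adds to each managed module on an internal attribute. The check below is a rough, version-dependent diagnostic (the _hf_hook attribute is Accelerate-internal, not a public API), assuming model has already been loaded with device_map="auto":

```python
# Rough diagnostic: count modules that carry an Accelerate dispatch/offload hook.
# _hf_hook is an internal attribute set by Accelerate when it dispatches a model,
# so treat this as a debugging aid that may break across versions.
hooked = [name for name, module in model.named_modules() if hasattr(module, "_hf_hook")]
print(f"{len(hooked)} modules carry an Accelerate hook")
```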
Confirm it in 10 seconds: check your device map
Run this right after from_pretrained(...) and before get_peft_model(...):
```python
from collections import Counter

print("has hf_device_map:", hasattr(model, "hf_device_map"))
if hasattr(model, "hf_device_map"):
    print("hf_device_map counts:", Counter(model.hf_device_map.values()))
    print("unique placements:", sorted(set(model.hf_device_map.values())))
```
Interpretation
- If you see 'cpu' or 'disk' in the map, that is the most common reason get_peft_model() takes minutes (offload thrash). Accelerate explicitly supports CPU/disk offload as part of dispatch. (Hugging Face)
- If everything is only cuda:* and it’s still minutes, you may be hitting a PEFT/Qwen3-specific regression (consistent with the issue report). (GitHub)
Causes (ranked by likelihood)
1) CPU/disk offload from device_map="auto"
Most likely. Accelerate may place some layers on CPU or disk when VRAM is tight; dispatch can include CPU/disk offload. (Hugging Face)
LoRA injection then becomes slow because it must replace/wrap many modules that aren’t resident on GPU.
2) “Large-touch” injection: you target 4 linears per layer (q/k/v/o)
This increases the number of wrapped modules substantially. On its own it shouldn’t be minutes, but it amplifies any offload/sharding overhead.
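To put a number on it, here is a small sketch (assuming model is already loaded) that counts how many modules your target_modules setting will match; each match is a linear layer that get_peft_model() has to replace or wrap:

```python
# Count the modules LoRA will wrap for target_modules=["q_proj", "k_proj", "v_proj", "o_proj"].
# str.endswith accepts a tuple, so this matches any of the four projection names.
target_suffixes = ("q_proj", "k_proj", "v_proj", "o_proj")
touched = [name for name, _ in model.named_modules() if name.endswith(target_suffixes)]
print(f"LoRA will wrap {len(touched)} modules")  # 4 projections × number of decoder layers
```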
3) Expensive LoRA initialization modes (less likely with your config)
- In PEFT, init_lora_weights=True is the fast default (Kaiming-uniform for A, zeros for B). (Hugging Face)
- If you ever set init_lora_weights="pissa" (or similar), it may run SVD and can take minutes. (GitHub) See the debug sketch below for isolating initialization cost.
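If you want to rule initialization in or out explicitly, PEFT also accepts init_lora_weights=False, which skips weight initialization entirely; the docs flag this as intended for testing (the adapter is then no longer a no-op at the start of training), so treat the following as a debug-only sketch:

```python
from peft import LoraConfig

# Debug-only: skip LoRA weight initialization to check whether init (rather than
# module replacement) is where the minutes go. Not suitable for real training as-is.
debug_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=False,
)
```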
4) Loading adapters is also slow in this model (separate but related)
The same report also says PeftModel.from_pretrained(...) is extremely slow for this model. (GitHub)
That path can often be improved with low_cpu_mem_usage=True (details below). (Hugging Face)
Solutions / Workarounds
A) Fastest: inject LoRA with the model fully on GPU (no auto device map)
Transformers explicitly documents that device_map=0 places the whole model on GPU 0. (Hugging Face)
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=0,  # or {"": 0}
)
peft_model = get_peft_model(model, lora_config, adapter_name=adapter_name, autocast_adapter_dtype=False)
```
If this drops from minutes → seconds, your root cause was dispatch/offload interaction.
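To make the comparison concrete, wrap the call in a timer (a minimal sketch; it just repeats the get_peft_model() call from above with standard-library timing around it):

```python
import time

t0 = time.perf_counter()
peft_model = get_peft_model(model, lora_config, adapter_name=adapter_name, autocast_adapter_dtype=False)
print(f"get_peft_model() took {time.perf_counter() - t0:.1f}s")
```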
B) If you need "auto" (multi-GPU), prevent CPU/disk offload with max_memory
Give "auto" enough GPU headroom so it doesn’t spill to CPU/disk.
```python
max_memory = {
    0: "23GiB",
    1: "23GiB",
    # omit "cpu" to discourage CPU offload
}
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory=max_memory,
)
```
Then re-check hf_device_map—you want only cuda:*. Accelerate’s device map can offload to CPU/disk, so avoiding that is the key. (Hugging Face)
C) Inject before dispatch/offload (advanced, but robust when VRAM is tight)
If you truly must offload, restructure so LoRA injection doesn’t happen on a fully dispatched/offloaded model.
Accelerate’s recommended big-model workflow uses:
- init_empty_weights() to build a “meta” skeleton, then
- load_checkpoint_and_dispatch() to load/dispatch weights. (Hugging Face)
Conceptually, you want the LoRA modules created before dispatch, so that the expensive device-movement hooks don’t trigger repeated paging during injection.
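A minimal sketch of that ordering, under the assumption that you can hold the fp16 weights in CPU RAM first (model_id, lora_config, and max_memory as defined elsewhere in this answer); it uses Accelerate’s infer_auto_device_map and dispatch_model rather than the full init_empty_weights flow:

```python
import torch
from accelerate import dispatch_model, infer_auto_device_map
from peft import get_peft_model
from transformers import AutoModelForCausalLM

# 1) Load with no device_map, so no dispatch/offload hooks are attached yet.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=None,
)
no_split = getattr(model, "_no_split_modules", None)  # internal attr: keeps decoder blocks whole

# 2) Inject LoRA while every module is an ordinary CPU tensor (no hook traffic).
peft_model = get_peft_model(model, lora_config)

# 3) Only now compute a placement and dispatch the wrapped model across devices.
#    (If the inferred map contains "disk", dispatch_model also needs offload_dir=...)
device_map = infer_auto_device_map(peft_model, max_memory=max_memory, no_split_module_classes=no_split)
peft_model = dispatch_model(peft_model, device_map=device_map)
```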
D) Reduce the number of targeted modules (diagnostic + workaround)
Try adapting fewer projections temporarily to see if time scales with number of replacements:
```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # fewer modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    init_lora_weights=True,
)
```
If this becomes much faster, you’re bottlenecked on “number of touched modules × device transfers”.
E) If your slow path includes loading adapters: use low_cpu_mem_usage=True
This applies to loading existing adapters (from_pretrained, load_adapter, or low-level injection), not to the “create fresh LoRA then train” path, but it matters here because you (and the issue report) also saw slow PeftModel.from_pretrained(...). (GitHub)
PEFT docs: low_cpu_mem_usage=True creates empty adapter weights on the meta device to speed loading. (Hugging Face)
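A minimal sketch of that loading path (the adapter path below is a placeholder for wherever your adapter is saved):

```python
from peft import PeftModel

# Loading an existing adapter: low_cpu_mem_usage=True creates the adapter weights
# on the meta device first, which PEFT documents as a way to speed up loading.
peft_model = PeftModel.from_pretrained(
    model,                          # base model, ideally already fully GPU-resident
    "path/to/your-lora-adapter",    # placeholder: local dir or Hub id of a saved adapter
    low_cpu_mem_usage=True,
)
```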
F) Track/try upstream fixes
Historically, PEFT has had performance work around adapter initialization (e.g., older issues about get_peft_model() slowness and initialization loops). (GitHub)
For your exact model/version combination, the best “upstream” move is to follow the active report and test newer PEFT revisions once a fix lands. (GitHub)
Recommended “decision tree” for your case
- Print Counter(model.hf_device_map.values()).
  - If any cpu/disk → do A or B first.
- A/B test: load with device_map=0 and time get_peft_model() vs device_map="auto".
- If GPU-only is still taking minutes: you’re likely in the same regression as the current issue; use A as a workaround (inject on one device first), or C if you must offload. (GitHub)