Gradience: Principled LoRA Compression Through Spectral Auditing
LoRA adapters are routinely over‑provisioned. Rank is chosen by convention (“r=16 worked before”), by headroom (“I can afford it”), or by grid search. After training, most practitioners still can’t answer a basic question:
Did this adapter actually use the rank I paid for?
Gradience makes that question measurable—and makes compression claims defensible.
This post describes two pieces:
- Audit: spectral analysis of trained LoRA weights that estimates effective dimensionality and utilization.
- Bench: a protocol that turns audit output into testable compression hypotheses—retrain at suggested ranks, evaluate, aggregate across seeds, apply a safety policy.
We focus on a validation at scale: Mistral‑7B + GSM8K, where we observe 50% fewer LoRA adapter parameters while staying within a worst‑seed tolerance of −2.5% across three seeds under a fixed protocol.
Theoretical Foundation
The core idea isn’t exotic: constrained hypotheses often generalize better.
A model that fits training data can do so by learning transferable structure—or by memorizing quirks. Memorization typically needs more degrees of freedom: capacity to encode arbitrary associations. Learning structure admits more compact representation.
This intuition shows up across several formalisms:
- Minimum Description Length (MDL): better hypotheses compress the data more.
- PAC‑Bayes: generalization bounds include a complexity term measuring distance from a prior.
- Flat minima: solutions robust to perturbation tend to transfer better than brittle fits.
LoRA already operationalizes this principle. By restricting updates to a low‑rank subspace, it limits degrees of freedom during fine‑tuning. Rank is the knob controlling that constraint.
Gradience extends this by asking: given a trained adapter, how much of the allocated capacity was actually used? If a rank‑64 adapter concentrates its energy in far fewer directions, much of that rank is unused capacity. Tightening rank to match observed effective dimensionality is a principled regularization move—one that can also reduce serving cost.
Method: Spectral Auditing
For a LoRA update matrix ΔW (the BA product), Gradience computes spectral summaries quantifying how energy distributes across directions.
Stable rank
Stable rank measures energy concentration:
stable_rank(ΔW) = ||ΔW||_F^2 / σ_max(ΔW)^2
Low stable rank means energy is concentrated in a few directions; high stable rank means it's diffuse. Stable rank is bounded above by the true rank but is often substantially smaller.
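As a sketch of the formula above (not Gradience's actual implementation), stable rank is a few lines of NumPy; `delta_w` stands in for the merged BA update:

```python
import numpy as np

def stable_rank(delta_w: np.ndarray) -> float:
    """Squared Frobenius norm over squared spectral norm."""
    sigma = np.linalg.svd(delta_w, compute_uv=False)
    return float(np.sum(sigma**2) / sigma[0] ** 2)

# A rank-1 matrix has stable rank 1 regardless of its shape;
# the identity spreads energy evenly, so its stable rank equals its size.
rank_one = np.outer(np.arange(1.0, 5.0), np.ones(8))
print(stable_rank(rank_one))   # ≈ 1.0
print(stable_rank(np.eye(4)))  # ≈ 4.0
```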
Energy rank (k@90%)
Energy rank answers: how many singular directions capture 90% of the adapter’s energy?
k@τ = min { k : Σ_{i≤k} σ_i^2 ≥ τ · ||ΔW||_F^2 }
(We typically use τ = 0.90.)
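The same exercise for energy rank, again a NumPy sketch rather than Gradience's code; `tau` defaults to the 0.90 used throughout the post:

```python
import numpy as np

def energy_rank(delta_w: np.ndarray, tau: float = 0.90) -> int:
    """Smallest k such that the top-k singular directions hold tau of the energy."""
    sigma = np.linalg.svd(delta_w, compute_uv=False)
    cumulative = np.cumsum(sigma**2) / np.sum(sigma**2)
    return int(np.searchsorted(cumulative, tau) + 1)

# Spectrum [3, 2, 1]: the top-2 directions hold (9 + 4) / 14 ≈ 93% of the energy,
# so k@90% = 2.
print(energy_rank(np.diag([3.0, 2.0, 1.0])))  # 2
```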
Utilization
Utilization expresses “how much of the allocated rank is doing work”:
utilization = stable_rank(ΔW) / r_allocated
A utilization of 0.15 means “we trained a 64‑seat bus to carry about 10 passengers.” No moral judgment—sometimes that’s appropriate. But it’s measurable.
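Putting the pieces together, utilization can be sketched directly from the B and A factors. The construction below is illustrative: a hypothetical rank‑64 adapter whose update energy is dominated by roughly 10 directions, mirroring the bus analogy.

```python
import numpy as np

def utilization(b: np.ndarray, a: np.ndarray, r_allocated: int) -> float:
    """Stable rank of the BA update divided by the allocated rank."""
    sigma = np.linalg.svd(b @ a, compute_uv=False)
    return float(np.sum(sigma**2) / sigma[0] ** 2) / r_allocated

rng = np.random.default_rng(0)
r = 64
# Scale B's columns so the product's energy decays sharply after 10 directions.
scales = np.where(np.arange(r) < 10, 1.0, 1e-3)
b = rng.standard_normal((512, r)) * scales
a = rng.standard_normal((r, 512))

# Prints a value well below 1: most of the allocated rank carries no energy.
print(f"utilization: {utilization(b, a, r):.3f}")
```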
What Audit produces
- Layer‑level stable rank and energy rank summaries
- Global statistics (median, p90)
- Suggested ranks:
  - `suggested_r_global_median` (covers typical layers)
  - `suggested_r_global_90` (more conservative, covers the tail)
These are hypotheses, not guarantees. Which is why Bench exists.
Bench: From Hypothesis to Evidence
Audit tells you “this looks compressible.” Bench tests whether that impression survives evaluation.
For each seed:
- Probe: train an adapter at generous rank (baseline)
- Audit: compute spectral summaries, generate rank suggestions
- Compress: retrain at suggested ranks
  - `uniform_median` (global median suggestion)
  - `uniform_p90` (global p90 suggestion)
  - `per_layer` (heterogeneous rank pattern)
- Evaluate: measure performance on held‑out data
- Aggregate: combine across seeds, apply a safety policy
Safety policy:
PASS iff: (pass_rate ≥ 67%) AND (worst_seed_Δ ≥ −0.025)
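The policy is simple enough to express in a few lines of Python. One caveat: the post doesn't spell out the per‑seed pass criterion, so this sketch assumes a seed passes when its accuracy delta against the probe baseline is within the −2.5% tolerance:

```python
def policy_pass(seed_deltas, tol=-0.025, min_pass_rate=2 / 3):
    """PASS iff enough seeds pass AND the worst seed is within tolerance.

    Assumption: a seed "passes" when its accuracy delta vs. the probe
    baseline is >= tol. The actual per-seed criterion may differ.
    """
    pass_rate = sum(d >= tol for d in seed_deltas) / len(seed_deltas)
    return pass_rate >= min_pass_rate and min(seed_deltas) >= tol

print(policy_pass([0.01, -0.020, -0.025]))  # True: worst seed exactly at tolerance
print(policy_pass([0.01, 0.02, -0.030]))    # False: worst seed below tolerance
```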
Bench produces canonical artifacts: bench.json and bench.md per seed, plus bench_aggregate.json and bench_aggregate.md for the combined result.
When I say “certifiable” in this post, I mean: fixed protocol + multi‑seed aggregation + explicit safety policy + reproducible artifacts.
Validation: Mistral‑7B + GSM8K
We validated on mistralai/Mistral‑7B‑v0.1 fine‑tuned for GSM8K (mathematical reasoning). Evaluation uses deterministic generation and exact‑match accuracy (numerical answer extraction + match).
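The post doesn't show the extraction code, but GSM8K exact match is commonly implemented as last‑number extraction with comma normalization. A minimal, hypothetical sketch of that style of scoring:

```python
import re

def extract_answer(text: str):
    """Return the last number in the text, commas stripped; None if absent."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def exact_match(generated: str, reference: str) -> bool:
    return extract_answer(generated) == extract_answer(reference)

# GSM8K references end with "#### <answer>".
print(exact_match("... so the total is 1,260 apples.", "#### 1260"))  # True
```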
Configuration
| Parameter | Value |
|---|---|
| Probe rank | r=64 |
| Training steps | 1200 |
| Seeds | 3 (42, 123, 456) |
| Metric | GSM8K exact‑match accuracy |
| Validation level | 3‑seed aggregate + safety policy |
Results
Probe baseline (r=64): 0.285 ± 0.012 (range: 0.270–0.300)
| Variant | Pass Rate | Worst Δ | Mean Accuracy | LoRA Param Reduction |
|---|---|---|---|---|
| per_layer | 100% | −0.020 | 0.320 | 2.8% |
| uniform_median (r=32) | 100% | −0.025 | 0.287 | 50% |
| uniform_p90 (r=32) | 100% | −0.015 | 0.300 | 50% |
Important: the “50% reduction” here is LoRA adapter parameters (rank 64 → rank 32), not a 50% reduction in total model parameters.
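The arithmetic is worth making explicit: LoRA adds r·(d_in + d_out) parameters per adapted matrix, so halving r halves adapter parameters regardless of layer shapes. A sketch with illustrative shapes (not the exact set of matrices adapted in the run):

```python
def lora_params(shapes, r):
    """Total LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Illustrative: 64 adapted 4096x4096 projections; any shapes give the same ratio.
shapes = [(4096, 4096)] * 64
p64 = lora_params(shapes, 64)
p32 = lora_params(shapes, 32)
print(1 - p32 / p64)  # 0.5: rank 64 -> 32 halves adapter parameters
```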
Interpretation
- Uniform compression: Cutting LoRA parameters in half (r=64 → r=32) was policy‑compliant across all seeds.
- Boundary case: `uniform_median` hits exactly −2.5% on its worst seed, a pass but an informative one. The policy boundary corresponds to real run‑to‑run variance.
- Per‑layer behavior: `per_layer` barely compresses (2.8%), so its “win” is not efficiency but the mean accuracy improvement (+3.5 points). With three seeds, that improvement is suggestive rather than conclusive, but it’s consistent with a regularization/allocation story: targeted constraint may help generalization more than uniform constraint.
Cross‑scale picture
Much LoRA compression work focuses on small encoder classifiers where experiments are cheap. The Mistral/GSM8K result tests the same methodology on a 7B decoder doing generation.
| Model | Parameters | Task | LoRA Compression | Status |
|---|---|---|---|---|
| DistilBERT | 66M | SST‑2 (sentiment) | 61% | ✅ Validated |
| Mistral‑7B | 7B | GSM8K (reasoning) | 50% | ✅ Validated |
The point isn’t “this always works.” The point is: we have a reproducible protocol that produces evidence, not vibes.
Integration
Gradience integrates with Hugging Face Trainer:
```python
from transformers import Trainer
from gradience.vnext.integrations.hf import GradienceCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[GradienceCallback()],
)
trainer.train()
```
Post‑training analysis:
```bash
# Check compression headroom
gradience audit --peft-dir ./adapter --layers

# Summarize training dynamics
gradience monitor ./run.jsonl --verbose
```
For validated compression claims, run Bench:
```bash
python -m gradience.bench.run_bench \
  --config gradience/bench/configs/mistral_gsm8k_certifiable_seed42.yaml \
  --output bench_runs/mistral_gsm8k_seed42
```
Limitations
- Audit is not an oracle. Spectral metrics generate hypotheses; evaluation decides.
- Budgets are noisy. GSM8K shows meaningful variance at small step counts. Multi‑seed aggregation matters.
- QLoRA complicates interpretation. Under quantization, adapters may compensate for quantization error alongside task learning. The metrics remain useful, but claims require care.
Installation
```bash
git clone https://github.com/johntnanney/gradience.git
cd gradience
pip install -e ".[hf]"
```
For Bench with Hugging Face models:
```bash
pip install transformers peft datasets accelerate safetensors
```
Documentation: github.com/johntnanney/gradience
Citation
```bibtex
@software{gradience,
  title  = {Gradience: Spectral Auditing for LoRA Compression},
  author = {Nanney, John T.},
  year   = {2026},
  url    = {https://github.com/johntnanney/gradience}
}
```