Gradience: Principled LoRA Compression Through Spectral Auditing

Community Article Published January 22, 2026

LoRA adapters are routinely over‑provisioned. Rank is chosen by convention (“r=16 worked before”), by headroom (“I can afford it”), or by grid search. After training, most practitioners still can’t answer a basic question:

Did this adapter actually use the rank I paid for?

Gradience makes that question measurable—and makes compression claims defensible.

This post describes two pieces:

  • Audit: spectral analysis of trained LoRA weights that estimates effective dimensionality and utilization.
  • Bench: a protocol that turns audit output into testable compression hypotheses—retrain at suggested ranks, evaluate, aggregate across seeds, apply a safety policy.

We focus on a validation at scale: Mistral‑7B + GSM8K, where halving LoRA adapter parameters stays within a worst‑seed accuracy tolerance of −2.5% across three seeds under a fixed protocol.


Theoretical Foundation

The core idea isn’t exotic: constrained hypotheses often generalize better.

A model that fits training data can do so by learning transferable structure—or by memorizing quirks. Memorization typically needs more degrees of freedom: capacity to encode arbitrary associations. Learning structure admits more compact representation.

This intuition shows up across several formalisms:

  • Minimum Description Length (MDL): better hypotheses compress the data more.
  • PAC‑Bayes: generalization bounds include a complexity term measuring distance from a prior.
  • Flat minima: solutions robust to perturbation tend to transfer better than brittle fits.

LoRA already operationalizes this principle. By restricting updates to a low‑rank subspace, it limits degrees of freedom during fine‑tuning. Rank is the knob controlling that constraint.
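Concretely, rank is a single field in a standard PEFT LoRA configuration. A minimal sketch (illustrative values; `target_modules` names depend on the model architecture, and the `q_proj`/`v_proj` choice here is an assumption, not Gradience's recommendation):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,            # rank of the update: ΔW = B @ A, with B ∈ R^{d×r}, A ∈ R^{r×k}
    lora_alpha=32,   # scaling: the effective update is (lora_alpha / r) · B @ A
    target_modules=["q_proj", "v_proj"],  # illustrative; architecture-dependent
)
```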

Gradience extends this by asking: given a trained adapter, how much of the allocated capacity was actually used? If a rank‑64 adapter concentrates its energy in far fewer directions, much of that rank is unused capacity. Tightening rank to match observed effective dimensionality is a principled regularization move—one that can also reduce serving cost.


Method: Spectral Auditing

For a LoRA update matrix ΔW (the BA product), Gradience computes spectral summaries quantifying how energy distributes across directions.

Stable rank

Stable rank measures energy concentration:

stable_rank(ΔW) = ||ΔW||_F^2 / σ_max(ΔW)^2

Low stable rank means energy is concentrated in a few directions; high stable rank means it is diffuse. Stable rank is bounded above by the true rank but is often substantially smaller.
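The definition is a one-liner in NumPy. This is a sketch of the formula above, not Gradience's internal implementation:

```python
import numpy as np

def stable_rank(delta_w: np.ndarray) -> float:
    """Squared Frobenius norm over squared top singular value."""
    fro_sq = np.linalg.norm(delta_w, "fro") ** 2
    sigma_max = np.linalg.norm(delta_w, 2)  # spectral norm = largest singular value
    return fro_sq / sigma_max ** 2
```

Sanity checks: a rank‑1 matrix has stable rank 1, and the n×n identity (maximally diffuse energy) has stable rank n.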

Energy rank (k@90%)

Energy rank answers: how many singular directions capture 90% of the adapter’s energy?

k@τ = min { k : Σ_{i≤k} σ_i^2 ≥ τ · ||ΔW||_F^2 }

(We typically use τ = 0.90.)
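A direct translation of the definition, again as a sketch rather than the library's code:

```python
import numpy as np

def energy_rank(delta_w: np.ndarray, tau: float = 0.90) -> int:
    """Smallest k whose top-k singular values capture tau of the squared energy."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, sorted descending
    cumulative = np.cumsum(s ** 2)
    # first index where cumulative energy reaches tau of the total, 1-indexed
    return int(np.searchsorted(cumulative, tau * cumulative[-1]) + 1)
```

For example, a matrix with singular values (3, 1) has 90% of its energy (9 of 10) in the top direction, so k@90% = 1, while k@95% = 2.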

Utilization

Utilization expresses “how much of the allocated rank is doing work”:

utilization = stable_rank(ΔW) / r_allocated

A utilization of 0.15 means “we trained a 64‑seat bus to carry about 10 passengers.” No moral judgment—sometimes that’s appropriate. But it’s measurable.
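As a tiny worked example with illustrative numbers matching the bus analogy above:

```python
r_allocated = 64
stable_rank_value = 9.6  # illustrative; measured from the trained adapter
utilization = stable_rank_value / r_allocated
print(utilization)  # 0.15 -- a 64-seat bus carrying about 10 passengers
```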

What Audit produces

  • Layer‑level stable rank and energy rank summaries
  • Global statistics (median, p90)
  • Suggested ranks:
    • suggested_r_global_median (covers typical layers)
    • suggested_r_global_90 (more conservative, covers the tail)

These are hypotheses, not guarantees. Which is why Bench exists.


Bench: From Hypothesis to Evidence

Audit tells you “this looks compressible.” Bench tests whether that impression survives evaluation.

For each seed:

  1. Probe: train an adapter at generous rank (baseline)
  2. Audit: compute spectral summaries, generate rank suggestions
  3. Compress: retrain at suggested ranks
    • uniform_median (global median suggestion)
    • uniform_p90 (global p90 suggestion)
    • per_layer (heterogeneous rank pattern)
  4. Evaluate: measure performance on held‑out data
  5. Aggregate: combine across seeds, apply a safety policy

Safety policy:

PASS iff: (pass_rate ≥ 67%) AND (worst_seed_Δ ≥ −0.025)
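The aggregate gate can be sketched as follows. This encodes only the stated condition; how a single seed counts as a "pass" is determined by the bench configuration, so the `seed_passes` input here is an assumption about the interface, not Bench's actual API:

```python
def policy_pass(seed_passes: list[bool], seed_deltas: list[float],
                min_pass_rate: float = 2 / 3, worst_tol: float = -0.025) -> bool:
    """PASS iff pass_rate >= 67% AND the worst per-seed delta >= -0.025."""
    pass_rate = sum(seed_passes) / len(seed_passes)
    return pass_rate >= min_pass_rate and min(seed_deltas) >= worst_tol
```

Note that a worst seed sitting exactly at −0.025 still passes; this is precisely the boundary case discussed in the results below.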

Bench produces canonical artifacts: bench.json and bench.md per seed, plus bench_aggregate.json and bench_aggregate.md for the combined result.

When I say “certifiable” in this post, I mean: fixed protocol + multi‑seed aggregation + explicit safety policy + reproducible artifacts.


Validation: Mistral‑7B + GSM8K

We validated on mistralai/Mistral‑7B‑v0.1 fine‑tuned for GSM8K (mathematical reasoning). Evaluation uses deterministic generation and exact‑match accuracy (numerical answer extraction + match).

Configuration

| Parameter | Value |
|---|---|
| Probe rank | r=64 |
| Training steps | 1200 |
| Seeds | 3 (42, 123, 456) |
| Metric | GSM8K exact‑match accuracy |
| Validation level | 3‑seed aggregate + safety policy |

Results

Probe baseline (r=64): 0.285 ± 0.012 (range: 0.270–0.300)

| Variant | Pass Rate | Worst Δ | Mean Accuracy | LoRA Param Reduction |
|---|---|---|---|---|
| per_layer | 100% | −0.020 | 0.320 | 2.8% |
| uniform_median (r=32) | 100% | −0.025 | 0.287 | 50% |
| uniform_p90 (r=32) | 100% | −0.015 | 0.300 | 50% |

Important: the “50% reduction” here is LoRA adapter parameters (rank 64 → rank 32), not a 50% reduction in total model parameters.

Interpretation

  • Uniform compression: Cutting LoRA parameters in half (r=64 → r=32) was policy‑compliant across all seeds.
  • Boundary case: uniform_median hits exactly −2.5% on its worst seed—a pass, but informative. The policy boundary corresponds to real run‑to‑run variance.
  • Per‑layer behavior: per_layer barely compresses (2.8%), so its “win” is not efficiency—it’s the mean accuracy improvement (+3.5 points). With three seeds, that improvement is suggestive rather than conclusive, but it’s consistent with a regularization/allocation story: targeted constraint may help generalization more than uniform constraint.

Cross‑scale picture

Much LoRA compression work focuses on small encoder classifiers where experiments are cheap. The Mistral/GSM8K result tests the same methodology on a 7B decoder doing generation.

| Model | Parameters | Task | LoRA Compression | Status |
|---|---|---|---|---|
| DistilBERT | 66M | SST‑2 (sentiment) | 61% | ✅ Validated |
| Mistral‑7B | 7B | GSM8K (reasoning) | 50% | ✅ Validated |

The point isn’t “this always works.” The point is: we have a reproducible protocol that produces evidence, not vibes.


Integration

Gradience integrates with Hugging Face Trainer:

from transformers import Trainer
from gradience.vnext.integrations.hf import GradienceCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[GradienceCallback()],
)
trainer.train()

Post‑training analysis:

# Check compression headroom
gradience audit --peft-dir ./adapter --layers

# Summarize training dynamics
gradience monitor ./run.jsonl --verbose

For validated compression claims, run Bench:

python -m gradience.bench.run_bench \
  --config gradience/bench/configs/mistral_gsm8k_certifiable_seed42.yaml \
  --output bench_runs/mistral_gsm8k_seed42

Limitations

  • Audit is not an oracle. Spectral metrics generate hypotheses; evaluation decides.
  • Budgets are noisy. GSM8K shows meaningful variance at small step counts. Multi‑seed aggregation matters.
  • QLoRA complicates interpretation. Under quantization, adapters may compensate for quantization error alongside task learning. The metrics remain useful, but claims require care.

Installation

git clone https://github.com/johntnanney/gradience.git
cd gradience
pip install -e ".[hf]"

For Bench with Hugging Face models:

pip install transformers peft datasets accelerate safetensors

Documentation: github.com/johntnanney/gradience


Citation

@software{gradience,
  title  = {Gradience: Spectral Auditing for LoRA Compression},
  author = {Nanney, John T.},
  year   = {2026},
  url    = {https://github.com/johntnanney/gradience}
}
