Gradience: Principled LoRA Compression Through Spectral Auditing
LoRA adapters are routinely over‑provisioned. Rank is chosen by convention (“r=16 worked before”), by headroom (“I can afford it”), or by grid search. After training, most practitioners still can’t answer a basic question:
Did this adapter actually use the rank I paid for?
Gradience makes that question measurable—and makes compression claims defensible.
This post describes two pieces:
- Audit: spectral analysis of trained LoRA weights that estimates effective dimensionality and utilization.
- Bench: a protocol that turns audit output into testable compression hypotheses—retrain at suggested ranks, evaluate, aggregate across seeds, apply a safety policy.
We focus on a validation at scale: Mistral‑7B + GSM8K, where we observe 50% fewer LoRA adapter parameters while staying within a worst‑seed tolerance of −2.5% across three seeds under a fixed protocol.
Theoretical Foundation
The core idea isn’t exotic: constrained hypotheses often generalize better.
A model that fits training data can do so by learning transferable structure—or by memorizing quirks. Memorization typically needs more degrees of freedom: capacity to encode arbitrary associations. Learning structure admits more compact representation.
This intuition shows up across several formalisms:
- Minimum Description Length (MDL): better hypotheses compress the data more.
- PAC‑Bayes: generalization bounds include a complexity term measuring distance from a prior.
- Flat minima: solutions robust to perturbation tend to transfer better than brittle fits.
LoRA already operationalizes this principle. By restricting updates to a low‑rank subspace, it limits degrees of freedom during fine‑tuning. Rank is the knob controlling that constraint.
Gradience extends this by asking: given a trained adapter, how much of the allocated capacity was actually used? If a rank‑64 adapter concentrates its energy in far fewer directions, much of that rank is unused capacity. Tightening rank to match observed effective dimensionality is a principled regularization move—one that can also reduce serving cost.
Method: Spectral Auditing
For a LoRA update matrix ΔW (the BA product), Gradience computes spectral summaries quantifying how energy distributes across directions.
Stable rank
Stable rank measures energy concentration:
stable_rank(ΔW) = ||ΔW||_F^2 / σ_max(ΔW)^2
Low stable rank means energy is concentrated in a few directions; high stable rank means it's diffuse. Stable rank is bounded above by the true rank but is often substantially smaller.
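As a sketch of the formula above (not Gradience's actual implementation), stable rank is a few lines of NumPy; `delta_w` stands in for the merged BA update:

```python
import numpy as np

def stable_rank(delta_w: np.ndarray) -> float:
    """Squared Frobenius norm over squared spectral norm."""
    sigma = np.linalg.svd(delta_w, compute_uv=False)
    return float(np.sum(sigma**2) / sigma[0] ** 2)

# A rank-1 matrix has stable rank 1 regardless of its shape;
# the identity spreads energy evenly, so its stable rank equals its size.
rank_one = np.outer(np.arange(1.0, 5.0), np.ones(8))
print(stable_rank(rank_one))   # ≈ 1.0
print(stable_rank(np.eye(4)))  # ≈ 4.0
```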
Energy rank (k@90%)
Energy rank answers: how many singular directions capture 90% of the adapter’s energy?
k@τ = min { k : Σ_{i≤k} σ_i^2 ≥ τ · ||ΔW||_F^2 }
(We typically use τ = 0.90.)
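The same exercise for energy rank, again a NumPy sketch rather than Gradience's code; `tau` defaults to the 0.90 used throughout the post:

```python
import numpy as np

def energy_rank(delta_w: np.ndarray, tau: float = 0.90) -> int:
    """Smallest k such that the top-k singular directions hold tau of the energy."""
    sigma = np.linalg.svd(delta_w, compute_uv=False)
    cumulative = np.cumsum(sigma**2) / np.sum(sigma**2)
    return int(np.searchsorted(cumulative, tau) + 1)

# Spectrum [3, 2, 1]: the top-2 directions hold (9 + 4) / 14 ≈ 93% of the energy,
# so k@90% = 2.
print(energy_rank(np.diag([3.0, 2.0, 1.0])))  # 2
```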
Utilization
Utilization expresses “how much of the allocated rank is doing work”:
utilization = stable_rank(ΔW) / r_allocated
A utilization of 0.15 means “we trained a 64‑seat bus to carry about 10 passengers.” No moral judgment—sometimes that’s appropriate. But it’s measurable.
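Putting the pieces together, utilization can be sketched directly from the B and A factors. The construction below is illustrative: a hypothetical rank‑64 adapter whose update energy is dominated by roughly 10 directions, mirroring the bus analogy.

```python
import numpy as np

def utilization(b: np.ndarray, a: np.ndarray, r_allocated: int) -> float:
    """Stable rank of the BA update divided by the allocated rank."""
    sigma = np.linalg.svd(b @ a, compute_uv=False)
    return float(np.sum(sigma**2) / sigma[0] ** 2) / r_allocated

rng = np.random.default_rng(0)
r = 64
# Scale B's columns so the product's energy decays sharply after 10 directions.
scales = np.where(np.arange(r) < 10, 1.0, 1e-3)
b = rng.standard_normal((512, r)) * scales
a = rng.standard_normal((r, 512))

# Prints a value well below 1: most of the allocated rank carries no energy.
print(f"utilization: {utilization(b, a, r):.3f}")
```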
What Audit produces
- Layer‑level stable rank and energy rank summaries
- Global statistics (median, p90)
- Suggested ranks:
  - `suggested_r_global_median` (covers typical layers)
  - `suggested_r_global_90` (more conservative, covers the tail)
These are hypotheses, not guarantees. Which is why Bench exists.
Bench: From Hypothesis to Evidence
Audit tells you “this looks compressible.” Bench tests whether that impression survives evaluation.
For each seed:
- Probe: train an adapter at generous rank (baseline)
- Audit: compute spectral summaries, generate rank suggestions
- Compress: retrain at suggested ranks
  - `uniform_median` (global median suggestion)
  - `uniform_p90` (global p90 suggestion)
  - `per_layer` (heterogeneous rank pattern)
- Evaluate: measure performance on held‑out data
- Aggregate: combine across seeds, apply a safety policy
Safety policy:
PASS iff: (pass_rate ≥ 67%) AND (worst_seed_Δ ≥ −0.025)
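The policy is simple enough to express in a few lines of Python. One caveat: the post doesn't spell out the per‑seed pass criterion, so this sketch assumes a seed passes when its accuracy delta against the probe baseline is within the −2.5% tolerance:

```python
def policy_pass(seed_deltas, tol=-0.025, min_pass_rate=2 / 3):
    """PASS iff enough seeds pass AND the worst seed is within tolerance.

    Assumption: a seed "passes" when its accuracy delta vs. the probe
    baseline is >= tol. The actual per-seed criterion may differ.
    """
    pass_rate = sum(d >= tol for d in seed_deltas) / len(seed_deltas)
    return pass_rate >= min_pass_rate and min(seed_deltas) >= tol

print(policy_pass([0.01, -0.020, -0.025]))  # True: worst seed exactly at tolerance
print(policy_pass([0.01, 0.02, -0.030]))    # False: worst seed below tolerance
```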
Bench produces canonical artifacts: bench.json and bench.md per seed, plus bench_aggregate.json and bench_aggregate.md for the combined result.
When I say “certifiable” in this post, I mean: fixed protocol + multi‑seed aggregation + explicit safety policy + reproducible artifacts.
Validation: Mistral‑7B + GSM8K
We validated on mistralai/Mistral‑7B‑v0.1 fine‑tuned for GSM8K (mathematical reasoning). Evaluation uses deterministic generation and exact‑match accuracy (numerical answer extraction + match).
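The post doesn't show the extraction code, but GSM8K exact match is commonly implemented as last‑number extraction with comma normalization. A minimal, hypothetical sketch of that style of scoring:

```python
import re

def extract_answer(text: str):
    """Return the last number in the text, commas stripped; None if absent."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def exact_match(generated: str, reference: str) -> bool:
    return extract_answer(generated) == extract_answer(reference)

# GSM8K references end with "#### <answer>".
print(exact_match("... so the total is 1,260 apples.", "#### 1260"))  # True
```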
Configuration
| Parameter | Value |
|---|---|
| Probe rank | r=64 |
| Training steps | 1200 |
| Seeds | 3 (42, 123, 456) |
| Metric | GSM8K exact‑match accuracy |
| Validation level | 3‑seed aggregate + safety policy |
Results
Probe baseline (r=64): 0.285 ± 0.012 (range: 0.270–0.300)
| Variant | Pass Rate | Worst Δ | Mean Accuracy | LoRA Param Reduction |
|---|---|---|---|---|
| per_layer | 100% | −0.020 | 0.320 | 2.8% |
| uniform_median (r=32) | 100% | −0.025 | 0.287 | 50% |
| uniform_p90 (r=32) | 100% | −0.015 | 0.300 | 50% |
Important: the “50% reduction” here is LoRA adapter parameters (rank 64 → rank 32), not a 50% reduction in total model parameters.
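The arithmetic is worth making explicit: LoRA adds r·(d_in + d_out) parameters per adapted matrix, so halving r halves adapter parameters regardless of layer shapes. A sketch with illustrative shapes (not the exact set of matrices adapted in the run):

```python
def lora_params(shapes, r):
    """Total LoRA parameters: r * (d_in + d_out) per adapted matrix."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# Illustrative: 64 adapted 4096x4096 projections; any shapes give the same ratio.
shapes = [(4096, 4096)] * 64
p64 = lora_params(shapes, 64)
p32 = lora_params(shapes, 32)
print(1 - p32 / p64)  # 0.5: rank 64 -> 32 halves adapter parameters
```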
Interpretation
- Uniform compression: Cutting LoRA parameters in half (r=64 → r=32) was policy‑compliant across all seeds.
- Boundary case: `uniform_median` hits exactly −2.5% on its worst seed, a pass but an informative one. The policy boundary corresponds to real run‑to‑run variance.
- Per‑layer behavior: `per_layer` barely compresses (2.8%), so its “win” is not efficiency but the mean accuracy improvement (+3.5 points). With three seeds, that improvement is suggestive rather than conclusive, but it’s consistent with a regularization/allocation story: targeted constraint may help generalization more than uniform constraint.
Cross‑scale picture
Much LoRA compression work focuses on small encoder classifiers where experiments are cheap. The Mistral/GSM8K result tests the same methodology on a 7B decoder doing generation.
| Model | Parameters | Task | LoRA Compression | Status |
|---|---|---|---|---|
| DistilBERT | 66M | SST‑2 (sentiment) | 61% | ✅ Validated |
| Mistral‑7B | 7B | GSM8K (reasoning) | 50% | ✅ Validated |
The point isn’t “this always works.” The point is: we have a reproducible protocol that produces evidence, not vibes.
Integration
Gradience integrates with Hugging Face Trainer:
```python
from transformers import Trainer
from gradience.vnext.integrations.hf import GradienceCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[GradienceCallback()],
)
trainer.train()
```
Post‑training analysis:
```bash
# Check compression headroom
gradience audit --peft-dir ./adapter --layers

# Summarize training dynamics
gradience monitor ./run.jsonl --verbose
```
For validated compression claims, run Bench:
```bash
python -m gradience.bench.run_bench \
  --config gradience/bench/configs/mistral_gsm8k_certifiable_seed42.yaml \
  --output bench_runs/mistral_gsm8k_seed42
```
Limitations
- Audit is not an oracle. Spectral metrics generate hypotheses; evaluation decides.
- Budgets are noisy. GSM8K shows meaningful variance at small step counts. Multi‑seed aggregation matters.
- QLoRA complicates interpretation. Under quantization, adapters may compensate for quantization error alongside task learning. The metrics remain useful, but claims require care.
Installation
```bash
git clone https://github.com/johntnanney/gradience.git
cd gradience
pip install -e ".[hf]"
```
For Bench with Hugging Face models:
```bash
pip install transformers peft datasets accelerate safetensors
```
Documentation: github.com/johntnanney/gradience
Citation
```bibtex
@software{gradience,
  title  = {Gradience: Spectral Auditing for LoRA Compression},
  author = {Nanney, John T.},
  year   = {2026},
  url    = {https://github.com/johntnanney/gradience}
}
```