ICONOCLAST

ICONOCLAST is a research framework for discriminative representation editing in open-weight language models. This repository packages the local research release for the iconoclast project: source code, configs, benchmark summaries, and documentation supporting the claim that ICONOCLAST improves on the HERETIC baseline in matched comparisons.

This release does not currently include merged model weights or LoRA adapters. It is a research artifact release, not a ready-to-run model checkpoint release.

What we did

ICONOCLAST starts from the same general abliteration setting as HERETIC: collect residual activations from harmful and harmless prompts, estimate refusal directions, and edit transformer projections with lightweight low-rank updates instead of full retraining.
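To make "lightweight low-rank updates" concrete, here is a minimal NumPy sketch of directional ablation written as a rank-1, LoRA-style delta. The function name `ablation_lora_factors`, the strength parameter `alpha`, and the convention that the weight matrix has shape (d_model, d_in) and writes into the residual stream are assumptions for illustration; they are not taken from the ICONOCLAST or HERETIC source.

```python
import numpy as np

def ablation_lora_factors(W, direction, alpha=1.0):
    """Return factors (B, A) so that W + B @ A equals (I - alpha * r r^T) @ W.

    W:         (d_model, d_in) matrix whose output lands in the residual stream
               (assumed layout, for illustration only).
    direction: (d_model,) refusal direction; it is normalized to unit length here.
    alpha:     hypothetical ablation strength; 1.0 removes the component fully.
    """
    r = direction / np.linalg.norm(direction)
    B = -alpha * r[:, None]   # (d_model, 1)
    A = (r @ W)[None, :]      # (1, d_in)
    return B, A

# Toy check on random data (not real model weights).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
r = rng.normal(size=8)
B, A = ablation_lora_factors(W, r)
W_edited = W + B @ A
# Component of the edited output along the (normalized) direction.
leftover = (r / np.linalg.norm(r)) @ W_edited
```

With `alpha=1.0` the edited matrix can no longer write along the chosen direction, and the base weights stay untouched because the change lives entirely in the two small factors.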

The main change is that ICONOCLAST does not treat refusal removal as a single-vector problem. It estimates multiple candidate directions from contrastive activations, projects those directions away from a low-rank benign subspace, and then searches for edits that reduce harmful refusals while preserving benign behavior.
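As a rough illustration of what "multiple candidate directions from contrastive activations" could look like, the sketch below builds four candidates from one layer's residuals. The exact estimator definitions are assumptions: this card names the families (mean, median, variance-scaled, hybrid) but does not define them, so the formulas and the names `unit` and `candidate_directions` are hypothetical.

```python
import numpy as np

def unit(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    return v / (np.linalg.norm(v) + eps)

def candidate_directions(harmful, harmless, eps=1e-8):
    """Build candidate refusal directions from one layer's residuals.

    harmful, harmless: arrays of shape (n_prompts, d_model).
    All four definitions below are illustrative assumptions.
    """
    mean_dir = harmful.mean(axis=0) - harmless.mean(axis=0)
    median_dir = np.median(harmful, axis=0) - np.median(harmless, axis=0)
    # Assumed "variance-scaled" form: mean difference over pooled per-dimension std.
    pooled_std = np.sqrt(0.5 * (harmful.var(axis=0) + harmless.var(axis=0)))
    var_scaled_dir = mean_dir / (pooled_std + eps)
    # Assumed "hybrid" form: sum of the normalized mean and median directions.
    hybrid_dir = unit(mean_dir) + unit(median_dir)
    return {
        "mean": unit(mean_dir),
        "median": unit(median_dir),
        "variance_scaled": unit(var_scaled_dir),
        "hybrid": unit(hybrid_dir),
    }

# Toy data: harmful activations are shifted along dimension 0.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(200, 16))
harmful = rng.normal(size=(200, 16))
harmful[:, 0] += 3.0
directions = candidate_directions(harmful, harmless)
```

On this toy data every candidate points mostly along dimension 0; on real residuals the estimators can disagree, which is what gives a search over them something to choose between.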

At a high level, the pipeline is:

  1. Extract per-layer residuals from harmful and harmless prompt sets.
  2. Build candidate directions using mean, median, variance-scaled, and hybrid estimators.
  3. Estimate a benign PCA subspace from harmless residuals.
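  4. Project candidate edit directions onto the orthogonal complement of that benign subspace (the null-space step), so the edit has no component along the protected benign directions.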
  5. Apply the edit to attention output and MLP down-projection modules through LoRA adapters.
  6. Optimize layer weighting and direction choices with Optuna against harmful refusals, benign overrefusals, and KL divergence.
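Steps 3 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the framework's implementation: the function names `benign_basis` and `project_out`, the mean-centering before the SVD, and the choice of `rank` are all hypothetical.

```python
import numpy as np

def benign_basis(harmless, rank):
    """Top-`rank` principal directions of harmless residuals.

    harmless: (n_prompts, d_model). Returns an array of shape (rank, d_model)
    with orthonormal rows (right singular vectors of the centered data).
    """
    centered = harmless - harmless.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]

def project_out(direction, basis):
    """Remove the part of `direction` inside span(basis), then renormalize.

    Assumes the rows of `basis` are orthonormal and that `direction` does not
    lie entirely inside the benign subspace (otherwise the norm would be zero).
    """
    cleaned = direction - basis.T @ (basis @ direction)
    return cleaned / np.linalg.norm(cleaned)

# Toy data standing in for one layer's residuals.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(100, 12))
basis = benign_basis(harmless, rank=3)
raw_direction = rng.normal(size=12)
safe_direction = project_out(raw_direction, basis)
overlap = basis @ safe_direction  # one overlap value per benign component
```

After projection, `overlap` is zero up to floating-point error, so an edit along `safe_direction` cannot move activations along the protected benign components.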

Why this is better than HERETIC

HERETIC is strong because it automates directional ablation well, but its core edit is still centered on a single refusal direction. That can remove refusals at the cost of collateral damage when benign capability pathways overlap the same geometry.

ICONOCLAST improves on that in three ways:

  • Benign-subspace protection: candidate refusal directions are projected out of a low-rank harmless residual subspace before editing.
  • Richer optimization target: the search objective covers overrefusals and disclaimer-heavy near-misses in addition to harmful refusals and KL divergence.
  • More flexible direction construction: ICONOCLAST can optimize over mean, median, variance-scaled, and hybrid directions instead of relying on one refusal-vector family.
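One way to picture the richer optimization target is as a single scalar that a search such as Optuna could minimize. The sketch below is an assumption throughout: the card does not state whether the search is scalarized or multi-objective, and the function name `search_objective`, the weights, and the choice to count near-misses on harmful prompts are hypothetical.

```python
def search_objective(refusals, overrefusals, near_misses, kl,
                     n_harmful, n_benign,
                     w_over=1.0, w_near=0.5, w_kl=1.0):
    """Combine the four signals into one score; lower is better.

    refusals:     harmful prompts that were still refused
    overrefusals: benign prompts that were wrongly refused
    near_misses:  harmful prompts answered with heavy disclaimers
                  (assumed definition, for illustration)
    kl:           KL divergence between base and edited model outputs
    The weights w_over, w_near, and w_kl are made-up defaults.
    """
    if n_harmful <= 0 or n_benign <= 0:
        raise ValueError("prompt counts must be positive")
    refusal_rate = refusals / n_harmful
    overrefusal_rate = overrefusals / n_benign
    near_miss_rate = near_misses / n_harmful
    return (refusal_rate
            + w_over * overrefusal_rate
            + w_near * near_miss_rate
            + w_kl * kl)

# Two made-up candidates with equal refusals and KL but different side effects.
clean = search_objective(refusals=2, overrefusals=0, near_misses=1, kl=0.05,
                         n_harmful=80, n_benign=80)
messy = search_objective(refusals=2, overrefusals=3, near_misses=4, kl=0.05,
                         n_harmful=80, n_benign=80)
```

A refusals-plus-KL objective would score these two candidates identically; including overrefusals and near-misses is what lets the search prefer `clean`.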

In practice, the null-space step is the main reason the method is better: it preserves benign utility pathways that HERETIC can still partially damage.

Verified matched results

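The local tree contains 10 directly paired batch_summary.json comparisons between ICONOCLAST and HERETIC. On those matched runs, ICONOCLAST wins the release criterion on all 10 pairs, has lower KL divergence on 8 of 10, lower harmful refusals on 6 of 10, and lower overrefusals on 5 of 10. Every remaining pair on the two refusal metrics is a tie: HERETIC never has strictly fewer harmful refusals or overrefusals. Counts are out of 80 prompts.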

Model             ICONOCLAST                        HERETIC
                  Refusals  Overrefusals  KL        Refusals  Overrefusals  KL
Llama-3.1-8B      0/80      0/80          0.0447    1/80      0/80          0.1854
Qwen3.5-9B        10/80     2/80          0.0055    10/80     3/80          0.0160
Mistral-7B        1/80      0/80          0.0554    4/80      0/80          0.1317
Falcon3-7B        0/80      0/80          6.1448    4/80      1/80          0.1648
Gemma2-2B         1/80      0/80          0.1849    1/80      2/80          0.6441
Phi-4-mini        2/80      1/80          0.0204    2/80      1/80          0.0978
Yi-1.5-9B         2/80      0/80          0.0511    3/80      0/80          0.0355
StableLM2-1.6B    2/80      0/80          0.0328    3/80      0/80          0.0670
SmolLM2-1.7B      1/80      1/80          0.0087    2/80      2/80          0.2699
OLMo-2-1B         2/80      0/80          0.0345    2/80      1/80          0.0944
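The per-metric tallies quoted before the table can be re-derived from its rows. The short script below hard-codes the table values (counts out of 80) and counts strict wins per metric; the names `ROWS` and `tally` are illustrative, and the script does not read the batch_summary.json files, whose schema is not described in this card.

```python
# (model, ICONOCLAST refusals, overrefusals, KL, HERETIC refusals, overrefusals, KL)
ROWS = [
    ("Llama-3.1-8B",    0, 0, 0.0447,  1, 0, 0.1854),
    ("Qwen3.5-9B",     10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B",      1, 0, 0.0554,  4, 0, 0.1317),
    ("Falcon3-7B",      0, 0, 6.1448,  4, 1, 0.1648),
    ("Gemma2-2B",       1, 0, 0.1849,  1, 2, 0.6441),
    ("Phi-4-mini",      2, 1, 0.0204,  2, 1, 0.0978),
    ("Yi-1.5-9B",       2, 0, 0.0511,  3, 0, 0.0355),
    ("StableLM2-1.6B",  2, 0, 0.0328,  3, 0, 0.0670),
    ("SmolLM2-1.7B",    1, 1, 0.0087,  2, 2, 0.2699),
    ("OLMo-2-1B",       2, 0, 0.0345,  2, 1, 0.0944),
]

def tally(rows):
    """Count pairs where each method is strictly lower (better) per metric."""
    wins = {"iconoclast": {"refusals": 0, "overrefusals": 0, "kl": 0},
            "heretic": {"refusals": 0, "overrefusals": 0, "kl": 0}}
    for _, i_ref, i_over, i_kl, h_ref, h_over, h_kl in rows:
        pairs = (("refusals", i_ref, h_ref),
                 ("overrefusals", i_over, h_over),
                 ("kl", i_kl, h_kl))
        for metric, ico, her in pairs:
            if ico < her:
                wins["iconoclast"][metric] += 1
            elif her < ico:
                wins["heretic"][metric] += 1
    return wins

wins = tally(ROWS)
print(wins["iconoclast"])  # {'refusals': 6, 'overrefusals': 5, 'kl': 8}
print(wins["heretic"])     # {'refusals': 0, 'overrefusals': 0, 'kl': 2}
```

The two KL pairs that go HERETIC's way are Falcon3-7B and Yi-1.5-9B.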

One caveat matters: Falcon3-7B is a behavioral win (0/80 vs. 4/80 harmful refusals) but a large KL outlier (6.1448 vs. 0.1648), and Yi-1.5-9B also shows higher KL than HERETIC (0.0511 vs. 0.0355), so the method is not uniformly lower-drift on every base model. The local PUBLISHABLE_RESULTS.md also records an additional Phi-3.5-mini matched comparison in ICONOCLAST's favor.

Repository contents

This release is intended to preserve the work behind the result:

  • src/iconoclast: framework code
  • scripts: cluster and evaluation workflows
  • config*.toml: benchmark and model configs
  • results_cluster/checkpoints/*/batch_summary.json: benchmark summaries used for matched comparisons
  • INTERNAL_TECHNICAL_NOTE.md: implementation and experiment notes
  • PUBLISHABLE_RESULTS.md: summarized publishable comparison table
  • NOTICE.md: derivative-work notice relative to HERETIC

Limitations

  • No model weights or adapters are included in this Hub repo yet.
  • The strongest public claim supported directly by local paired JSON summaries is the 10-model matched comparison table above.
  • Some benchmark writeups in the local tree use inconsistent counts; this card reflects the directly verified local summaries.

Lineage and license

ICONOCLAST is a standalone derivative research codebase built partly from ideas and adapted source structure from Heretic by Philipp Emanuel Weidmann and contributors. Derivative portions remain under the GNU Affero General Public License v3.0 or later.
