# Internal Technical Note: ICONOCLAST

## What

ICONOCLAST is a command-line research framework for discriminative representation editing of open-weight language models. The implemented system evolved from the proposal's privacy-motivated abliteration plan into a broader alignment and utility-preservation benchmark: it estimates behavior directions from contrastive prompt activations, edits model output projections through LoRA adapters, and searches for edits that reduce refusal behavior while minimizing collateral change on benign prompts.

The core architecture has five first-party modules:

- `src/iconoclast/config.py`: Pydantic settings for model loading, datasets, direction construction, objectives, benchmarks, and checkpointing.
- `src/iconoclast/model.py`: Hugging Face model/tokenizer loading, dtype fallback, 4-bit quantization support, PEFT LoRA setup, residual/logprob extraction, generation, merging, and chat.
- `src/iconoclast/direction.py`: tensor routines for mean, median, variance-scaled, hybrid, orthogonalized, and benign-subspace-projected direction estimates.
- `src/iconoclast/evaluator.py`: marker-based refusal evaluation, benign overrefusal counts, disclaimer counts, heuristic compliance scoring, first-token KL divergence, and optional per-axis harmful metrics.
- `src/iconoclast/main.py`: the end-to-end pipeline, including dataset loading, residual extraction, Optuna optimization, LoRA abliteration, Pareto selection, batch summaries, and optional export/benchmark/chat actions.

## How

The pipeline first loads harmful and harmless prompt datasets, typically `JailbreakBench/JBB-Behaviors` for harmful behavior and `mlabonne/harmless_alpaca` for benign behavior in the benchmark configs. For each prompt, the system performs one-token generation with hidden-state collection and stacks per-layer residual vectors. Given benign residuals `G_l` and harmful residuals `B_l`, it builds candidate directions per layer:

- mean difference: normalized `mean(B_l) - mean(G_l)`;
- median difference: normalized `median(B_l) - median(G_l)`;
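- variance-scaled difference: normalized `(mean(B_l)-mean(G_l)) / sqrt(var_pool + eps)`, where `var_pool` is the variance pooled over both residual sets and `eps` is a small stabilizing constant;
- hybrid: a linear interpolation between the mean-difference and variance-scaled directions.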
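
The four candidates above can be sketched for a single layer as follows. This is a minimal NumPy illustration, not the code in `direction.py`: the function names, the interpolation weight `alpha`, and the sample-size-weighted form of the pooled variance are assumptions made for the example.

```python
import numpy as np

def unit(x, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    return x / (np.linalg.norm(x) + eps)

def candidate_directions(G, B, alpha=0.5, eps=1e-6):
    """Build candidate directions for one layer.

    G: benign residuals, shape (n_benign, d)
    B: harmful residuals, shape (n_harmful, d)
    alpha: hypothetical interpolation weight for the hybrid direction
    """
    diff = B.mean(axis=0) - G.mean(axis=0)
    mean_dir = unit(diff)
    median_dir = unit(np.median(B, axis=0) - np.median(G, axis=0))

    # Assumption: pooled variance is the sample-size-weighted average of the
    # per-dimension variances of the two sets.
    n_g, n_b = len(G), len(B)
    var_pool = (n_g * G.var(axis=0) + n_b * B.var(axis=0)) / (n_g + n_b)
    var_dir = unit(diff / np.sqrt(var_pool + eps))

    # Hybrid: linear interpolation between the two, renormalized.
    hybrid_dir = unit((1.0 - alpha) * mean_dir + alpha * var_dir)

    return {
        "mean": mean_dir,
        "median": median_dir,
        "variance": var_dir,
        "hybrid": hybrid_dir,
    }
```

Calling `candidate_directions(G_l, B_l)` once per layer yields one unit vector per estimator; which estimator to use for a given run is then a configuration or search choice.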

Utility preservation is implemented by two geometric filters. First, directions can be orthogonalized against the benign mean residual. Second, ICONOCLAST can compute a low-rank benign PCA basis from harmless residuals and project candidate edit directions out of that subspace. The benchmark configs usually set `orthogonalize_direction = true`, `row_normalization = "pre"`, and `benign_subspace_rank = 8`; the HERETIC baseline disables the benign subspace and orthogonalization in its generated runs.
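
A minimal NumPy sketch of the two filters, assuming the benign basis is taken from an SVD of mean-centered harmless residuals. The function names are hypothetical and the centering choice is an assumption; the actual routines in `direction.py` may differ.

```python
import numpy as np

def orthogonalize(v, u, eps=1e-8):
    """Remove from v its component along u (e.g. the benign mean residual)."""
    u_hat = u / (np.linalg.norm(u) + eps)
    w = v - (v @ u_hat) * u_hat
    return w / (np.linalg.norm(w) + eps)

def benign_basis(G, rank=8):
    """Top-`rank` principal directions of benign residuals G, shape (n, d).

    Assumption: residuals are mean-centered before the SVD.
    Returns an array of shape (rank, d) with orthonormal rows.
    """
    centered = G - G.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]

def project_out(v, basis, eps=1e-8):
    """Remove from v everything inside the span of `basis`, then renormalize."""
    w = v - basis.T @ (basis @ v)
    return w / (np.linalg.norm(w) + eps)
```

With `benign_subspace_rank = 8`, a candidate direction would pass through `project_out(v, benign_basis(G_l, rank=8))`, so the resulting edit direction has no component along the eight strongest axes of benign variation.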

The model edit is applied to attention output and MLP down-projection modules discovered dynamically across architectures. For a direction `v` and matrix `W`, the basic LoRA edit encodes the rank-one update

`Delta W = -lambda v (v^T W)`.

`lora_A` stores `v^T W`; `lora_B` stores `-lambda v`. Layer and component strength are sampled through `AbliterationParameters`: `max_weight`, `max_weight_position`, `min_weight`, and `min_weight_distance`. A `pre` row-normalization mode scales the edit by original row norms; `full` approximates norm-preserving biprojected abliteration by constructing the normalized adjusted matrix, restoring row norms, subtracting the original matrix, and compressing the delta through low-rank SVD.
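
The factorization and the `full` mode can be illustrated with a short NumPy sketch. It assumes `W` has shape `(d_out, d_in)` with `v` living in the output (residual) space, and that the adapter delta is `lora_B @ lora_A` as in PEFT. The function names, the `rank` default, and the exact order of operations in `norm_preserving_lora` are assumptions based on the description above, not a copy of `model.py`.

```python
import numpy as np

def rank_one_lora(W, v, lam):
    """Factors such that lora_B @ lora_A == -lam * v (v^T W)."""
    v = v / np.linalg.norm(v)
    lora_A = (v @ W)[None, :]       # shape (1, d_in), stores v^T W
    lora_B = (-lam * v)[:, None]    # shape (d_out, 1), stores -lambda v
    return lora_A, lora_B

def norm_preserving_lora(W, v, lam, rank=4, eps=1e-8):
    """Sketch of a `full`-style edit compressed to a rank-`rank` adapter."""
    v = v / np.linalg.norm(v)
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)

    # Ablate the direction, then normalize each row of the adjusted matrix.
    adjusted = W - lam * np.outer(v, v @ W)
    adjusted = adjusted / (np.linalg.norm(adjusted, axis=1, keepdims=True) + eps)

    # Restore the original row norms and take the delta against W.
    delta = adjusted * row_norms - W

    # Compress the delta with a truncated SVD into LoRA factors.
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    lora_B = U[:, :rank] * S[:rank]  # shape (d_out, rank)
    lora_A = Vt[:rank]               # shape (rank, d_in)
    return lora_A, lora_B
```

With `lam = 1.0`, the rank-one edit removes the `v` component of the module's output entirely (`v^T (W + Delta W) = 0`). The norm-preserving variant is generally not rank-one after renormalization, which is why it needs the SVD step and a truncation rank.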

Optimization uses Optuna's TPE sampler with a two-objective minimization. The evaluator records harmful refusals, benign overrefusals, disclaimer marker hits, heuristic harmful compliance, and KL divergence between edited and base first-token log-probability distributions on benign prompts. Completed trials are ordered lexicographically: fewer harmful refusals first, then fewer overrefusals, then lower KL divergence.
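
A minimal standard-library sketch of three pieces described above: marker-based refusal detection, a first-token KL divergence, and the refusal/overrefusal/KL ordering of trials. The marker list, the `TrialResult` fields, and the KL direction (base relative to edited) are assumptions for illustration; `evaluator.py` and `main.py` may differ in all three.

```python
import math
from dataclasses import dataclass

# Hypothetical marker list; the real evaluator's markers are not shown here.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai")

def is_refusal(response, markers=REFUSAL_MARKERS):
    """Flag a response as a refusal if it contains any marker substring."""
    text = response.lower()
    return any(marker in text for marker in markers)

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def first_token_kl(base_logits, edited_logits):
    """KL(base || edited) over first-token distributions (direction assumed)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

@dataclass
class TrialResult:
    name: str
    refusals: int
    overrefusals: int
    kl: float

def rank_trials(trials):
    """Order trials by refusals, then overrefusals, then KL (all ascending)."""
    return sorted(trials, key=lambda t: (t.refusals, t.overrefusals, t.kl))
```

Because the sort key is a tuple, KL only breaks ties between trials with identical refusal and overrefusal counts; a trial with one fewer harmful refusal always ranks ahead regardless of its KL.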

Cluster execution is Slurm-based. Scripts stage source into per-job directories, isolate Hugging Face and dataset caches, run batch mode with `ICONOCLAST_EXIT_AFTER_OPTIMIZATION=true`, write `batch_summary.json`, and clean caches. Sequential Slurm jobs were introduced to manage Rutgers iLabs disk quota pressure from concurrent model downloads.

## Why

The proposal framed the problem as PII unlearning: identify a privacy-related activation direction and remove it without retraining. The implemented codebase generalizes that idea into a reusable representation-editing system for removing refusal behavior. The rationale is the same geometric premise: some model behaviors are mediated by low-dimensional directions in residual space, so behavior can be attenuated through projection-like edits rather than full retraining.

The key design choice is benign-subspace protection. Standard single-direction abliteration can remove a target behavior but may also damage nearby useful representations. ICONOCLAST therefore estimates the principal harmless-prompt subspace and removes the benign component from candidate refusal directions before editing weights. This trades a small amount of editing freedom for lower overrefusal and lower semantic drift, the latter measured by KL divergence on benign prompts.

## Development Timeline

The timeline is reconstructed from shallow git history plus filesystem birth/modification times. Birth times can be distorted by copies, rsync, checkout, and history rewrites, so exact chronology should be treated as best-effort.

- `2026-02-19 23:48:21 EST`: proposal PDF internal metadata indicates creation/modification of the privacy-oriented proposal.
- `2026-03-23 21:00:04 -0400`: local filesystem birth time for the proposal PDF.
- `2026-03-25 09:16:48 -0400`: earliest core project files and tests appear locally, including source modules, license, and initial configs.
- `2026-03-25 14:23` to `2026-03-26 16:17 -0400`: cached Hugging Face datasets and early Llama/Qwen result checkpoints appear under `results_cluster`.
- `2026-04-02`: Qwen3-1.7B configs and `HANDOVER_ILABS.md` begin, marking the move toward Rutgers iLabs scaling.
- `2026-04-05`: benchmark configs and tests expand for Qwen2.5, Qwen3-4B, and Phi-3.5; summary tooling appears.
- `2026-04-21`: major benchmark expansion adds nullspace configs and results for Gemma, Llama, Mistral, Falcon, OLMo, StableLM, SmolLM, Yi, and Qwen variants.
- `2026-04-22 18:08:27 -0400`: the only git commit, `725af9b`, records the initial tracked ICONOCLAST benchmark suite with 82 files and 9,361 insertions.
- `2026-04-23`: `ACL_REPORT_DRAFT.md` appears and is modified.

## Empirical Snapshot

The strongest matched results are the 10-model `batch_summary.json` comparisons between ICONOCLAST and the HERETIC baseline. Most use 20 harmful JBB holdout prompts and 64 harmless holdout prompts. ICONOCLAST improves the lexicographic refusal/overrefusal/KL criterion on all 10 matched rows and has lower KL in 8 of 10 rows. Two caveats matter: Falcon3 achieves zero refusals but has a large KL outlier, and Yi-1.5 has fewer refusals under ICONOCLAST but lower KL under HERETIC.

No completed large-N evaluator JSON outputs were found in the local tree.