
Internal Technical Note: ICONOCLAST

What

ICONOCLAST is a command-line research framework for discriminative representation editing of open-weight language models. The implemented system evolved from the proposal's privacy-motivated abliteration plan into a broader alignment and utility-preservation benchmark: it estimates behavior directions from contrastive prompt activations, edits model output projections through LoRA adapters, and searches for edits that reduce refusal behavior while minimizing collateral change on benign prompts.

The core architecture has five first-party modules:

  • src/iconoclast/config.py: Pydantic settings for model loading, datasets, direction construction, objectives, benchmarks, and checkpointing.
  • src/iconoclast/model.py: Hugging Face model/tokenizer loading, dtype fallback, 4-bit quantization support, PEFT LoRA setup, residual/logprob extraction, generation, merging, and chat.
  • src/iconoclast/direction.py: tensor routines for mean, median, variance-scaled, hybrid, orthogonalized, and benign-subspace-projected direction estimates.
  • src/iconoclast/evaluator.py: marker-based refusal evaluation, benign overrefusal counts, disclaimer counts, heuristic compliance scoring, first-token KL divergence, and optional per-axis harmful metrics.
  • src/iconoclast/main.py: the end-to-end pipeline, including dataset loading, residual extraction, Optuna optimization, LoRA abliteration, Pareto selection, batch summaries, and optional export/benchmark/chat actions.

How

The pipeline first loads harmful and harmless prompt datasets, typically JailbreakBench/JBB-Behaviors for harmful behavior and mlabonne/harmless_alpaca for benign behavior in the benchmark configs. For each prompt, the system performs one-token generation with hidden-state collection and stacks per-layer residual vectors. Given benign residuals G_l and harmful residuals B_l, it builds candidate directions per layer:

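  • mean difference: normalize(mean(B_l) - mean(G_l)), where normalize(x) rescales x to unit norm;
  • median difference: normalize(median(B_l) - median(G_l));
  • variance-scaled difference: normalize((mean(B_l) - mean(G_l)) / sqrt(var_pool + eps)), where var_pool is the pooled variance of the two groups and eps is a small stabilizer;
  • hybrid: a linear interpolation between the mean-difference and variance-scaled directions.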
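
The four estimators above can be sketched for a single layer as follows. This is a minimal NumPy illustration, not the code in src/iconoclast/direction.py: the function names, the sample-size-weighted per-dimension pooled variance, and the interpolation weight alpha are all assumptions made for the example.

```python
import numpy as np


def unit(x, eps=1e-8):
    """Rescale a vector to unit L2 norm; eps guards against division by zero."""
    return x / (np.linalg.norm(x) + eps)


def candidate_directions(G, B, alpha=0.5, eps=1e-6):
    """Candidate directions for one layer.

    G: benign residuals, shape (n_benign, d)
    B: harmful residuals, shape (n_harmful, d)
    alpha: hypothetical interpolation weight for the hybrid direction
    """
    diff = B.mean(axis=0) - G.mean(axis=0)
    mean_dir = unit(diff)
    median_dir = unit(np.median(B, axis=0) - np.median(G, axis=0))

    # Assumption: var_pool is a per-dimension variance pooled over both
    # groups, weighted by sample count.
    n_g, n_b = len(G), len(B)
    var_pool = (n_g * G.var(axis=0) + n_b * B.var(axis=0)) / (n_g + n_b)
    var_dir = unit(diff / np.sqrt(var_pool + eps))

    # Linear interpolation between the two unit directions, renormalized.
    hybrid_dir = unit((1.0 - alpha) * mean_dir + alpha * var_dir)

    return {
        "mean": mean_dir,
        "median": median_dir,
        "variance": var_dir,
        "hybrid": hybrid_dir,
    }
```

With alpha = 0 the hybrid reduces to the mean-difference direction, and with alpha = 1 it reduces to the variance-scaled direction, which gives an optimizer a single continuous knob between the two.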

Utility preservation is implemented by two geometric filters. First, directions can be orthogonalized against the benign mean residual. Second, ICONOCLAST can compute a low-rank benign PCA basis from harmless residuals and project candidate edit directions out of that subspace. The benchmark configs usually set orthogonalize_direction = true, row_normalization = "pre", and benign_subspace_rank = 8; the HERETIC baseline disables the benign subspace and orthogonalization in its generated runs.
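
A rough sketch of the two filters, assuming unit-norm output directions and a PCA basis computed by SVD of the mean-centered harmless residuals. The helper names are hypothetical, and whether the project centers the residuals before PCA is not stated in this note.

```python
import numpy as np


def orthogonalize(v, G):
    """Remove the component of v along the benign mean residual, then renormalize."""
    g = G.mean(axis=0)
    g = g / np.linalg.norm(g)
    v = v - (v @ g) * g
    return v / np.linalg.norm(v)


def benign_basis(G, rank=8):
    """Top-`rank` principal directions of the harmless residuals, shape (rank, d).

    Assumption: residuals are mean-centered before the SVD.
    """
    centered = G - G.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def project_out(v, basis):
    """Remove the part of v that lies in the span of the orthonormal basis rows."""
    v = v - basis.T @ (basis @ v)
    return v / np.linalg.norm(v)
```

After project_out, the direction has no component along any of the retained benign principal axes, so a rank-one edit along it leaves activations in that subspace untouched to first order.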

The model edit is applied to attention output and MLP down-projection modules discovered dynamically across architectures. For a direction v and matrix W, the basic LoRA edit encodes the rank-one update

Delta W = -lambda v (v^T W).

lora_A stores v^T W and lora_B stores -lambda v, so the product lora_B lora_A reproduces Delta W. Per-layer and per-component edit strengths are sampled through AbliterationParameters: max_weight, max_weight_position, min_weight, and min_weight_distance. The pre row-normalization mode scales the edit by the original row norms. The full mode approximates norm-preserving biprojected abliteration: it constructs the normalized adjusted matrix, restores row norms, subtracts the original matrix, and compresses the resulting delta through a low-rank SVD.
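
The factorization can be checked numerically. The sketch below assumes W has shape (d_out, d_in) with d_out equal to the residual width, and follows the usual LoRA layout where lora_A is (r, d_in) and lora_B is (d_out, r). The layer_weight function is only a guess at how the four AbliterationParameters fields might combine (a linear ramp from max_weight down to min_weight, with no edit beyond min_weight_distance); the note does not spell out the schedule.

```python
import numpy as np


def rank_one_lora(W, v, lam):
    """Return (lora_A, lora_B) such that lora_B @ lora_A == -lam * v (v^T W)."""
    v = v / np.linalg.norm(v)
    lora_A = (v @ W)[None, :]      # shape (1, d_in), stores v^T W
    lora_B = (-lam * v)[:, None]   # shape (d_out, 1), stores -lambda v
    return lora_A, lora_B


def layer_weight(layer, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    """Hypothetical per-layer strength: linear ramp around max_weight_position."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # assumed: layers outside the window are left unedited
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))
    v = rng.normal(size=6)
    A, B = rank_one_lora(W, v, lam=1.0)
    W_edited = W + B @ A
    u = v / np.linalg.norm(v)
    # With lambda = 1 the edited matrix writes nothing along v.
    print(np.abs(u @ W_edited).max())
```

With lambda = 1 and unit v, the edited matrix satisfies v^T (W + Delta W) = 0, meaning the module can no longer write into the residual stream along v. Smaller lambda values attenuate that component rather than removing it.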

Optimization uses Optuna's TPE sampler in a two-objective minimization. For each trial the evaluator records harmful refusals, benign overrefusals, disclaimer marker hits, heuristic harmful compliance, and the KL divergence between the edited and base models' first-token log-probability distributions on benign prompts. Completed trials are then ordered lexicographically: fewer harmful refusals first, then fewer overrefusals, then lower KL divergence.
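
As an illustration of the drift metric and the trial ordering, the sketch below computes a mean first-token KL divergence from log-probabilities and sorts trial records by the three criteria. The KL direction (base relative to edited), the averaging over prompts, and the dictionary keys are assumptions for the example, not the evaluator's actual interface.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def first_token_kl(base_logprobs, edited_logprobs):
    """Mean KL(base || edited) over prompts.

    Both inputs have shape (n_prompts, vocab_size) and hold first-token
    log-probabilities. The direction of the divergence is an assumption.
    """
    p = np.exp(base_logprobs)
    per_prompt = np.sum(p * (base_logprobs - edited_logprobs), axis=-1)
    return float(per_prompt.mean())


def rank_trials(trials):
    """Order trials by refusals, then overrefusals, then KL (all lower-is-better).

    Each trial is a dict with hypothetical keys "refusals", "overrefusals", "kl".
    """
    return sorted(trials, key=lambda t: (t["refusals"], t["overrefusals"], t["kl"]))
```

Because Python compares tuples element by element, the sort key implements the lexicographic rule directly: KL divergence only breaks ties between trials that already match on both refusal counts.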

Cluster execution is Slurm-based. Scripts stage source into per-job directories, isolate Hugging Face and dataset caches, run batch mode with ICONOCLAST_EXIT_AFTER_OPTIMIZATION=true, write batch_summary.json, and clean caches. Sequential Slurm jobs were introduced to manage Rutgers iLabs disk quota pressure from concurrent model downloads.

Why

The proposal framed the problem as PII unlearning: identify a privacy-related activation direction and remove it without retraining. The implemented codebase generalizes that idea into a reusable representation-editing system for removing refusal behavior. The rationale is the same geometric premise: some model behaviors are mediated by low-dimensional directions in residual space, so behavior can be attenuated through projection-like edits rather than full retraining.

The key design choice is benign-subspace protection. Standard single-direction abliteration can remove a target behavior but may also damage nearby useful representations. ICONOCLAST therefore estimates the principal harmless-prompt subspace and removes the benign component from candidate refusal directions before editing weights. This trades a small amount of editing freedom for lower overrefusal and lower semantic drift, measured by KL divergence.

Development Timeline

The timeline is reconstructed from shallow git history plus filesystem birth/modification times. Birth times can be distorted by copies, rsync, checkout, and history rewrites, so exact chronology should be treated as best-effort.

  • 2026-02-19 23:48:21 EST: proposal PDF internal metadata indicates creation/modification of the privacy-oriented proposal.
  • 2026-03-23 21:00:04 -0400: local filesystem birth time for the proposal PDF.
  • 2026-03-25 09:16:48 -0400: earliest core project files and tests appear locally, including source modules, license, and initial configs.
  • 2026-03-25 14:23 to 2026-03-26 16:17 -0400: cached Hugging Face datasets and early Llama/Qwen result checkpoints appear under results_cluster.
  • 2026-04-02: Qwen3-1.7B configs and HANDOVER_ILABS.md begin, marking the move toward Rutgers iLabs scaling.
  • 2026-04-05: benchmark configs and tests expand for Qwen2.5, Qwen3-4B, and Phi-3.5; summary tooling appears.
  • 2026-04-21: major benchmark expansion adds nullspace configs and results for Gemma, Llama, Mistral, Falcon, OLMo, StableLM, SmolLM, Yi, and Qwen variants.
  • 2026-04-22 18:08:27 -0400: the only git commit, 725af9b, records the initial tracked ICONOCLAST benchmark suite with 82 files and 9,361 insertions.
  • 2026-04-23: ACL_REPORT_DRAFT.md appears and is modified.

Empirical Snapshot

The strongest matched results are the 10-model batch_summary.json comparisons. Most use 20 harmful JBB holdout prompts and 64 harmless holdout prompts. ICONOCLAST improves the lexicographic refusal/overrefusal/KL criterion on all 10 matched rows; it has lower KL in 8 of 10 rows. Two caveats matter: Falcon3 achieves zero refusals but has a large KL outlier, and Yi-1.5 has fewer refusals under ICONOCLAST but lower KL under HERETIC.

No completed large-N evaluator JSON outputs were found in the local tree.