@@ -28,10 +28,9 @@ embeddings.
 This repository hosts the trained **PLASMA** heads for every (task, backbone)
 combination from the paper, plus instructions for the parameter-free
-**PLASMA-PF** baseline (which has no learned weights). PLASMA was published at
-**ICLR 2026**.
-- **Paper:** <https://arxiv.org/abs/2510.11752> (ICLR 2026)
 - **Code:** <https://github.com/ZW471/PLASMA-Protein-Local-Alignment>
 - **License:** MIT
@@ -61,10 +60,20 @@ All heads share the same architecture: a small `LRL` non-linearity
 parameter-free Sinkhorn iteration (`temperature=0.1`, `n_iters=20`). The
 checkpoint files are ~3 MB each.
-## Quickstart
-Install the PLASMA package from source (the model class is shipped with the
-GitHub repo):
 ```bash
 git clone https://github.com/ZW471/PLASMA-Protein-Local-Alignment
@@ -72,83 +81,150 @@ cd PLASMA-Protein-Local-Alignment
 uv sync
 ```
-Then load any trained head with the high-level helper:
 ```python
-import torch
 from alignment import load_plasma
-model = load_plasma(task="active_site", backbone="prot_bert")
 model.eval()
-# Feed pre-computed AA-level embeddings from the matching backbone.
-# H_q / H_c are residue-level embeddings; batch_q / batch_c assign each
-# residue to a sample (use zeros if you only have one pair).
-H_q = torch.randn(120, 1024)            # query: 120 residues, ProtBERT dim
-H_c = torch.randn(180, 1024)            # candidate: 180 residues
-batch_q = torch.zeros(120, dtype=torch.long)
-batch_c = torch.zeros(180, dtype=torch.long)
 with torch.no_grad():
-    alignment_matrix = model(H_q, H_c, batch_q, batch_c)  # (120, 180)
 ```
-The output is a doubly-stochastic transport plan describing the residue-level
-correspondence between the two substructures. To reduce it to a similarity
-score, reuse `utils.alignment_score` from the GitHub repo (it applies the
-diagonal convolution + threshold described in the paper).
 ## PLASMA-PF (parameter-free)
-PLASMA-PF is a hinge / Sinkhorn baseline with **no learned weights**. There is
-nothing to download — just instantiate it from the same `Alignment` class:
 ```python
 from alignment import load_plasma_pf
-model = load_plasma_pf()  # Alignment(eta='hinge', omega='sinkhorn', ...)
 ```
-It accepts the same forward signature as the trained heads above.
 ## Available variants & evaluation results
-Numbers below are 3-seed averages (mean ± std) reported in the paper. The seven
-backbone columns correspond to the seven subfolders under each task.
 ### Interpolation (in-distribution test split)
 | Task | Metric | Ankh | ESM-2 | ProstT5 | ProtBERT | ProtSSN | ProtT5 | TM-Vec |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| **Motif** | ROC-AUC | .925 ± .002 | .933 ± .005 | .954 ± .002 | .854 ± .003 | .922 ± .002 | **.972 ± .001** | .910 ± .003 |
-|  | F1-Max | .885 ± .002 | .877 ± .005 | .885 ± .003 | .784 ± .002 | .866 ± .002 | **.918 ± .003** | .853 ± .003 |
-|  | PR-AUC | .921 ± .002 | .931 ± .004 | .953 ± .003 | .872 ± .003 | .920 ± .002 | **.971 ± .002** | .914 ± .003 |
-|  | Label Match Score | .921 ± .004 | .890 ± .008 | .929 ± .001 | .746 ± .007 | .767 ± .008 | **.937 ± .001** | .792 ± .008 |
-| **Binding Site** | ROC-AUC | **.995 ± .000** | .992 ± .000 | .993 ± .001 | .981 ± .001 | .992 ± .001 | .993 ± .000 | .980 ± .001 |
-|  | F1-Max | .987 ± .001 | .986 ± .001 | .983 ± .001 | .948 ± .002 | .982 ± .001 | **.988 ± .001** | .970 ± .001 |
-|  | PR-AUC | **.996 ± .001** | .994 ± .001 | .995 ± .001 | .985 ± .001 | .993 ± .001 | .995 ± .000 | .984 ± .001 |
-|  | Label Match Score | **.951 ± .002** | .950 ± .002 | **.951 ± .002** | .880 ± .008 | .872 ± .005 | **.951 ± .001** | .900 ± .004 |
-| **Active Site** | ROC-AUC | **.994 ± .001** | .991 ± .001 | .993 ± .001 | .986 ± .001 | .992 ± .001 | **.994 ± .001** | .991 ± .001 |
-|  | F1-Max | **.989 ± .001** | .985 ± .001 | .987 ± .001 | .967 ± .001 | .987 ± .001 | .987 ± .001 | .982 ± .001 |
-|  | PR-AUC | **.994 ± .001** | .992 ± .001 | **.994 ± .001** | .988 ± .001 | **.994 ± .001** | **.994 ± .001** | .992 ± .001 |
-|  | Label Match Score | **.975 ± .001** | .969 ± .002 | **.975 ± .001** | .904 ± .003 | .885 ± .013 | .972 ± .001 | .938 ± .001 |
 ### Extrapolation (held-out hard test split)
 | Task | Metric | Ankh | ESM-2 | ProstT5 | ProtBERT | ProtSSN | ProtT5 | TM-Vec |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
-| **Motif** | ROC-AUC | .960 ± .011 | .972 ± .010 | **.975 ± .009** | .870 ± .030 | .949 ± .013 | .968 ± .012 | .954 ± .013 |
-|  | F1-Max | .915 ± .021 | **.931 ± .016** | .926 ± .020 | .799 ± .039 | .896 ± .023 | .922 ± .023 | .903 ± .026 |
-|  | PR-AUC | .948 ± .020 | **.970 ± .010** | .969 ± .016 | .873 ± .036 | .940 ± .020 | .962 ± .018 | .944 ± .022 |
-|  | Label Match Score | **.842 ± .025** | .786 ± .032 | .801 ± .022 | .541 ± .060 | .537 ± .025 | .738 ± .028 | .704 ± .020 |
-| **Binding Site** | ROC-AUC | .995 ± .005 | **.999 ± .001** | .993 ± .005 | .951 ± .014 | **.999 ± .001** | **.999 ± .001** | .990 ± .008 |
-|  | F1-Max | .992 ± .005 | .991 ± .005 | .985 ± .009 | .896 ± .019 | .988 ± .006 | **.996 ± .003** | .983 ± .011 |
-|  | PR-AUC | .997 ± .003 | **.999 ± .001** | .995 ± .003 | .958 ± .012 | .998 ± .001 | **.999 ± .000** | .992 ± .006 |
-|  | Label Match Score | .894 ± .026 | .851 ± .031 | .891 ± .029 | .603 ± .041 | .753 ± .041 | **.902 ± .019** | .824 ± .031 |
-| **Active Site** | ROC-AUC | .995 ± .002 | .996 ± .003 | .996 ± .003 | .980 ± .004 | .997 ± .001 | **.999 ± .000** | .995 ± .002 |
-|  | F1-Max | **.992 ± .002** | .986 ± .004 | .991 ± .004 | .950 ± .005 | .991 ± .003 | .991 ± .002 | .985 ± .003 |
-|  | PR-AUC | .995 ± .003 | .997 ± .002 | .997 ± .002 | .984 ± .003 | .998 ± .001 | **.999 ± .000** | .996 ± .002 |
-|  | Label Match Score | **.938 ± .014** | .882 ± .027 | .931 ± .026 | .697 ± .019 | .737 ± .011 | .893 ± .017 | .880 ± .023 |
 Each subfolder also contains a `metadata.json` with the full hyperparameter
 config in machine-readable form.

 This repository hosts the trained **PLASMA** heads for every (task, backbone)
 combination from the paper, plus instructions for the parameter-free
+**PLASMA-PF** baseline (which has no learned weights).
+- **Paper:** <https://arxiv.org/abs/2510.11752>
 - **Code:** <https://github.com/ZW471/PLASMA-Protein-Local-Alignment>
 - **License:** MIT
 parameter-free Sinkhorn iteration (`temperature=0.1`, `n_iters=20`). The
 checkpoint files are ~3 MB each.
+## How to use
+PLASMA is a *head*: it consumes per-residue embeddings from a frozen protein
+language model and returns a soft alignment matrix between two
+sub-structures. The end-to-end pipeline is therefore three steps:
+1. Embed each protein with the backbone the head was trained on (one of the
+   seven listed above).
+2. Run the PLASMA head on the two per-residue embedding matrices to get a
+   soft alignment matrix `M ∈ [0, 1]^{n_q × n_c}`.
+3. Optionally reduce `M` to a scalar similarity score with
+   `utils.alignment_score`.
+### 1. Install
 ```bash
 git clone https://github.com/ZW471/PLASMA-Protein-Local-Alignment
 uv sync
 ```
+The `Alignment` class and the `load_plasma` helper live in the `alignment`
+package shipped by that repo.
+### 2. Load a trained head
 ```python
 from alignment import load_plasma
+# task ∈ {"active_site", "binding_site", "motif"}
+# backbone is the PLM whose embeddings the head was trained on
+model = load_plasma(task="active_site", backbone="esm2_t33_650M_UR50D")
 model.eval()
+```
+`load_plasma` downloads the matching `config.json` + `model.safetensors` from
+this repo via `huggingface_hub` and rebuilds the `Alignment` module.
+### 3. Compute embeddings with the matching backbone
+PLASMA does not embed sequences itself. The example below shows how to do it
+with **ESM-2** via `transformers`; the same pattern works for any other
+backbone (`Ankh`, `ProstT5`, `ProtBERT`, `ProtT5`, `TM-Vec`, `ProtSSN` —
+their loaders are documented in `embed.py` in the GitHub repo).
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+device = "cuda" if torch.cuda.is_available() else "cpu"
+tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
+backbone = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D").to(device).eval()
+@torch.no_grad()
+def embed(sequence: str) -> torch.Tensor:
+    """Return per-residue embeddings of shape (L, 1280) — no special tokens."""
+    tokens = tokenizer(sequence, return_tensors="pt", add_special_tokens=True).to(device)
+    h = backbone(**tokens).last_hidden_state[0]   # (L+2, 1280): <cls> ... <eos>
+    return h[1:-1].cpu()                          # drop <cls> and <eos>
+seq_q = "MKTAYIAKQRQISFVKSHFSRQDILDLWIYHTQGYFP"
+seq_c = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNI"
+H_q = embed(seq_q)            # (n_q, 1280)
+H_c = embed(seq_c)            # (n_c, 1280)
+```
+### 4. Run PLASMA and read the alignment matrix
+```python
+# `batch_q` / `batch_c` assign each residue to a sample. Use zeros for a
+# single pair; use [0, 0, ..., 1, 1, ...] to score multiple pairs in one batch.
+batch_q = torch.zeros(H_q.size(0), dtype=torch.long)
+batch_c = torch.zeros(H_c.size(0), dtype=torch.long)
 with torch.no_grad():
+    M = model(H_q, H_c, batch_q, batch_c)        # (n_q, n_c) in [0, 1]
+# Hard residue-residue assignment (argmax over each row / column of the transport plan)
+q_to_c = M.argmax(dim=1)        # for each query residue, the best candidate residue
+c_to_q = M.argmax(dim=0)        # for each candidate residue, the best query residue
 ```
+`M` is a (near-)doubly-stochastic transport plan: rows and columns each sum
+to ~1, so `M[i, j]` is the soft probability that query residue `i` aligns to
+candidate residue `j`. Thresholding at `0.5` gives a sparse local alignment;
+plotting `M` as a heatmap gives the canonical PLASMA visualisation (the
+diagonal stripe in the visual abstract above).
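The thresholding and argmax readouts described above can be sketched without any dependency on the PLASMA package. The helper names `sparse_alignment` and `mutual_best_matches` are hypothetical (they are not part of the repo), and the sketch assumes `M` has been converted to nested Python lists, for example with `M.tolist()`.

```python
def sparse_alignment(M, threshold=0.5):
    """Return (i, j, weight) triples for every entry with M[i][j] >= threshold."""
    pairs = []
    for i, row in enumerate(M):
        for j, weight in enumerate(row):
            if weight >= threshold:
                pairs.append((i, j, weight))
    return pairs


def mutual_best_matches(M):
    """Return (i, j) pairs where j is the argmax of row i AND i is the argmax of column j."""
    n_q, n_c = len(M), len(M[0])
    q_to_c = [max(range(n_c), key=lambda j: M[i][j]) for i in range(n_q)]
    c_to_q = [max(range(n_q), key=lambda i: M[i][j]) for j in range(n_c)]
    return [(i, j) for i, j in enumerate(q_to_c) if c_to_q[j] == i]


# Toy 2 x 3 matrix standing in for a real PLASMA output (illustrative values only).
M_toy = [
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
]
print(sparse_alignment(M_toy))
print(mutual_best_matches(M_toy))
```

Keeping only mutual best matches is one conservative way to avoid assigning several query residues to the same candidate residue; it is an illustrative choice here, not something the paper or repo prescribes.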
+### 5. Reduce to a similarity score
+To collapse the alignment matrix into a single number per protein pair (the
+quantity used to compute ROC-AUC / F1-Max in the tables below), use
+`utils.alignment_score` from the GitHub repo. It applies the diagonal
+convolution + thresholding described in the paper:
+```python
+from utils.alignment_utils import alignment_score
+score = alignment_score(
+    H_q, H_c, M, batch_c,
+    threshold=0.5,           # gating on max-row / max-col residues
+    K=10,                    # diagonal-convolution window
+)                            # -> shape (num_pairs_in_batch,), here (1,)
+print(float(score))
+```
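When several pairs are scored in one batch, the `batch_q` / `batch_c` vectors from step 4 need one sample id per residue. A minimal sketch of how such index vectors could be built from sequence lengths follows; `batch_index` is a hypothetical helper, and the lengths are made-up values.

```python
def batch_index(lengths):
    """Map per-sample residue counts to one sample id per residue.

    Example: [3, 2] -> [0, 0, 0, 1, 1]
    """
    return [sample_id for sample_id, n in enumerate(lengths) for _ in range(n)]


# Two pairs: query lengths and candidate lengths are listed in the same pair order.
query_lengths = [37, 52]
candidate_lengths = [53, 41]

batch_q = batch_index(query_lengths)        # 37 zeros followed by 52 ones
batch_c = batch_index(candidate_lengths)    # 53 zeros followed by 41 ones
print(len(batch_q), len(batch_c))
```

To pass these to the model, the lists would be wrapped with `torch.tensor(..., dtype=torch.long)`, and the per-pair embeddings would presumably be concatenated along the residue axis (for example with `torch.cat`) in the same order. That layout is an assumption inferred from the comment in step 4, so check it against the repo before relying on it.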
 ## PLASMA-PF (parameter-free)
+PLASMA-PF is a hinge / Sinkhorn baseline with **no learned weights**. Use it
+when you want a strong zero-training baseline on top of any backbone — there
+is nothing to download:
 ```python
 from alignment import load_plasma_pf
+model = load_plasma_pf()         # Alignment(eta='hinge', omega='sinkhorn', ...)
+with torch.no_grad():
+    M_pf = model(H_q, H_c, batch_q, batch_c)
 ```
+It accepts the same forward signature as the trained heads above and pairs
+with any of the seven supported backbones.
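For intuition about the Sinkhorn step (the architecture note mentions `temperature=0.1` and `n_iters=20`), here is a minimal pure-Python sketch of textbook Sinkhorn normalisation. It is not the repo's implementation: the function name `sinkhorn`, the input layout (a nested list of similarity scores), and the per-row max subtraction are all assumptions made for illustration.

```python
import math


def sinkhorn(scores, temperature=0.1, n_iters=20):
    """Alternately normalise rows and columns of exp(scores / temperature)."""
    # Subtract each row's max before exponentiating for numerical stability;
    # the first row normalisation cancels this per-row shift.
    plan = []
    for row in scores:
        top = max(row)
        plan.append([math.exp((s - top) / temperature) for s in row])

    n_cols = len(plan[0])
    for _ in range(n_iters):
        # Row step: rescale so that every row sums to 1.
        normalised = []
        for row in plan:
            total = sum(row)
            normalised.append([v / total for v in row])
        # Column step: rescale so that every column sums to 1.
        col_sums = [sum(row[j] for row in normalised) for j in range(n_cols)]
        plan = [[row[j] / col_sums[j] for j in range(n_cols)] for row in normalised]
    return plan


# Toy 3 x 3 similarity matrix (illustrative values only).
S = [
    [1.0, 0.2, 0.1],
    [0.3, 0.9, 0.2],
    [0.1, 0.4, 0.8],
]
P = sinkhorn(S)
print([round(sum(row), 3) for row in P])                              # row sums
print([round(sum(P[i][j] for i in range(3)), 3) for j in range(3)])   # column sums
```

Because the loop ends on a column step, columns sum to 1 up to floating-point error while rows are only approximately normalised after a finite number of iterations. A low temperature sharpens the plan towards a hard assignment.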
 ## Available variants & evaluation results
+Numbers below are 3-seed averages reported in the paper. The seven backbone
+columns correspond to the seven subfolders under each task. **Bold** marks the
+best backbone for each row.
 ### Interpolation (in-distribution test split)
 | Task | Metric | Ankh | ESM-2 | ProstT5 | ProtBERT | ProtSSN | ProtT5 | TM-Vec |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| **Motif** | ROC-AUC | .925 | .933 | .954 | .854 | .922 | **.972** | .910 |
+|  | F1-Max | .885 | .877 | .885 | .784 | .866 | **.918** | .853 |
+|  | PR-AUC | .921 | .931 | .953 | .872 | .920 | **.971** | .914 |
+|  | Label Match Score | .921 | .890 | .929 | .746 | .767 | **.937** | .792 |
+| **Binding Site** | ROC-AUC | **.995** | .992 | .993 | .981 | .992 | .993 | .980 |
+|  | F1-Max | .987 | .986 | .983 | .948 | .982 | **.988** | .970 |
+|  | PR-AUC | **.996** | .994 | .995 | .985 | .993 | .995 | .984 |
+|  | Label Match Score | **.951** | .950 | **.951** | .880 | .872 | **.951** | .900 |
+| **Active Site** | ROC-AUC | **.994** | .991 | .993 | .986 | .992 | **.994** | .991 |
+|  | F1-Max | **.989** | .985 | .987 | .967 | .987 | .987 | .982 |
+|  | PR-AUC | **.994** | .992 | **.994** | .988 | **.994** | **.994** | .992 |
+|  | Label Match Score | **.975** | .969 | **.975** | .904 | .885 | .972 | .938 |
 ### Extrapolation (held-out hard test split)
 | Task | Metric | Ankh | ESM-2 | ProstT5 | ProtBERT | ProtSSN | ProtT5 | TM-Vec |
 | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| **Motif** | ROC-AUC | .960 | .972 | **.975** | .870 | .949 | .968 | .954 |
+|  | F1-Max | .915 | **.931** | .926 | .799 | .896 | .922 | .903 |
+|  | PR-AUC | .948 | **.970** | .969 | .873 | .940 | .962 | .944 |
+|  | Label Match Score | **.842** | .786 | .801 | .541 | .537 | .738 | .704 |
+| **Binding Site** | ROC-AUC | .995 | **.999** | .993 | .951 | **.999** | **.999** | .990 |
+|  | F1-Max | .992 | .991 | .985 | .896 | .988 | **.996** | .983 |
+|  | PR-AUC | .997 | **.999** | .995 | .958 | .998 | **.999** | .992 |
+|  | Label Match Score | .894 | .851 | .891 | .603 | .753 | **.902** | .824 |
+| **Active Site** | ROC-AUC | .995 | .996 | .996 | .980 | .997 | **.999** | .995 |
+|  | F1-Max | **.992** | .986 | .991 | .950 | .991 | .991 | .985 |
+|  | PR-AUC | .995 | .997 | .997 | .984 | .998 | **.999** | .996 |
+|  | Label Match Score | **.938** | .882 | .931 | .697 | .737 | .893 | .880 |
 Each subfolder also contains a `metadata.json` with the full hyperparameter
 config in machine-readable form.