Paper: *CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization* (2603.18571)
A lightweight, CPU-friendly protein subcellular localization classifier built for the OmniBiMol bioinformatics platform. Trained entirely on free-tier Hugging Face compute and public datasets.
This repository contains classifiers trained on the `mila-intel/ProtST-SubcellularLocalization` dataset (DeepLoc-based, 10 classes):

| Variant | Method | Accuracy | Macro-F1 |
|---|---|---|---|
| A | Heuristic features + Logistic Regression | 48.0% | 0.315 |
| A2 | Heuristic features + Random Forest | 53.7% | 0.361 |
| B | ESM-2-8M embeddings + Logistic Regression | 60.5% | 0.469 |
| B2 | ESM-2-8M embeddings + Random Forest | 57.7% | 0.394 |
| C | Evidence-Filtered Ensemble (Best) | 62.0% | 0.470 |
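The exact heuristic features behind Variants A/A2 are not spelled out here; as an illustration only (the feature choices and function name are assumptions, not the repo's recipe), a cheap, CPU-friendly featurizer of the same flavor — amino-acid composition plus log-length — that a scikit-learn classifier could consume looks like:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def heuristic_features(seq: str) -> np.ndarray:
    # 20-dim amino-acid composition (fraction of each residue)
    counts = np.array([seq.count(a) for a in AA], dtype=float)
    comp = counts / max(len(seq), 1)
    # plus log-scaled sequence length as a 21st feature
    return np.concatenate([comp, [np.log1p(len(seq))]])
```

Stacking these vectors into a matrix gives the `X` for a `sklearn.linear_model.LogisticRegression` or `RandomForestClassifier` fit.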
- Hardware: CPU-only (2 vCPU Hugging Face sandbox)
- Total runtime: ~4 minutes (features) + ~4 minutes (embeddings)
- Backbone: `facebook/esm2_t6_8M_UR50D` (8M parameters, 29 MB)
```python
import pickle

import numpy as np
import torch
from transformers import EsmTokenizer, EsmModel

# Load the ESM-2 backbone and the trained classifier head
tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
with open("esm_lr.pkl", "rb") as f:
    classifier = pickle.load(f)

# Embed the sequence: mean-pooled last hidden state, truncated to 128 tokens
seq = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
inputs = tokenizer(seq, return_tensors="pt", max_length=128, truncation=True)
with torch.no_grad():
    emb = esm(**inputs).last_hidden_state.mean(dim=1).numpy()

# Predict the localization class
proba = classifier.predict_proba(emb)[0]
pred = proba.argmax()
labels = ["Cytoplasm", "Nucleus", "Extracellular", "Cell membrane",
          "Mitochondrion", "Plastid", "Endoplasmic reticulum",
          "Lysosome/Vacuole", "Golgi apparatus", "Peroxisome"]
print(f"Predicted: {labels[pred]}, Confidence: {proba.max():.3f}")
```
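The quick start embeds one sequence at a time, where `mean(dim=1)` is safe. If you batch several sequences with padding, plain mean pooling averages over pad tokens as well; a mask-aware pooling sketch (helper name is illustrative) that uses the `attention_mask` the tokenizer already returns:

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padded positions before averaging, so a padded batch
    # yields the same embeddings as single-sequence inference.
    mask = attention_mask.unsqueeze(-1).float()      # (B, L, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid divide-by-zero
    return summed / counts
```

Call it as `masked_mean_pool(out.last_hidden_state, inputs["attention_mask"])` after a batched, padded tokenizer call.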
Use the membrane vs non-membrane classifier (binary_esm_lr.pkl) to prioritize proteins for expression:
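A sketch of what that prioritization could look like (the function name, the 0.5 threshold, and the assumption that `predict_proba` column 1 is the membrane class are all illustrative; the binary head would be unpickled the same way as `esm_lr.pkl` above):

```python
import numpy as np

def prioritize_for_expression(embeddings: np.ndarray, clf,
                              threshold: float = 0.5) -> list:
    # clf is the unpickled binary head, e.g.:
    #   with open("binary_esm_lr.pkl", "rb") as f: clf = pickle.load(f)
    # Rank proteins by P(membrane), keep those above the threshold,
    # highest-confidence first.
    p_membrane = clf.predict_proba(embeddings)[:, 1]
    order = np.argsort(-p_membrane)
    return [int(i) for i in order if p_membrane[i] >= threshold]
```

Check `clf.classes_` to confirm which column corresponds to the membrane label before relying on column 1.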
See `inference.py` for the OmniBiMol backend integration wrapper.