OmniBiMol Protein Subcellular Localization Classifier

A lightweight, CPU-friendly protein subcellular localization classifier built for the OmniBiMol bioinformatics platform. Trained entirely on free-tier Hugging Face compute and public datasets.

Model Description

This repository contains:

  • Frozen ESM-2-8M sequence embeddings + Logistic Regression classifier
  • Heuristic feature baseline (AA composition + physicochemical properties; sketched below)
  • Ensemble model combining both approaches with confidence-based evidence filtering
  • Membrane vs. Non-membrane binary classifier for wet-lab prioritization
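
For reference, a minimal sketch of the heuristic feature extractor. The exact feature set is an assumption (20-dim amino-acid composition plus a few physicochemical summaries); only the Kyte-Doolittle hydropathy scale below is a known quantity.

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydropathy scale
KD = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
      "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
      "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
      "W": -0.9, "Y": -1.3}

def heuristic_features(seq: str) -> np.ndarray:
    """AA composition + mean hydropathy + net charge + log-length (illustrative)."""
    seq = seq.upper()
    n = max(len(seq), 1)
    comp = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float) / n
    hydropathy = float(np.mean([KD.get(a, 0.0) for a in seq])) if seq else 0.0
    net_charge = (seq.count("K") + seq.count("R") - seq.count("D") - seq.count("E")) / n
    return np.concatenate([comp, [hydropathy, net_charge, np.log1p(len(seq))]])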

Dataset

  • Source: mila-intel/ProtST-SubcellularLocalization (DeepLoc-based 10-class; loading sketch below)
  • Size: 8,420 train / 2,811 validation / 2,773 test
  • Classes: Cytoplasm, Nucleus, Extracellular, Cell membrane, Mitochondrion, Plastid, Endoplasmic reticulum, Lysosome/Vacuole, Golgi apparatus, Peroxisome
  • License: Public dataset, free use
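
The splits load directly from the Hub with the datasets library (a minimal sketch; the split and column names are assumptions based on the standard convention):

from datasets import load_dataset

# DeepLoc-based 10-class localization dataset
ds = load_dataset("mila-intel/ProtST-SubcellularLocalization")
print(ds)  # expect train/validation/test splits with sequence and label columns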

Training Recipe

Variant  Method                                       Accuracy  Macro-F1
A        Heuristic features + Logistic Regression     48.0%     0.315
A2       Heuristic features + Random Forest           53.7%     0.361
B        ESM-2-8M embeddings + Logistic Regression    60.5%     0.469
B2       ESM-2-8M embeddings + Random Forest          57.7%     0.394
C        Evidence-Filtered Ensemble (Best)            62.0%     0.470

  • Hardware: CPU-only (2 vCPU Hugging Face sandbox)
  • Total runtime: ~4 minutes (features) + ~4 minutes (embeddings)
  • Backbone: facebook/esm2_t6_8M_UR50D (8M parameters, 29 MB)
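
The evidence-filtering rule in variant C is not spelled out above, so the sketch below shows one plausible implementation: heuristic probabilities are blended in only where that model is confident, otherwise the ESM-2 prediction stands alone. The threshold and weight values are assumptions, not the trained settings.

import numpy as np

def ensemble_predict(p_esm, p_heur, threshold=0.6, weight=0.7):
    """Confidence-gated blend of two predict_proba outputs (illustrative).

    p_esm, p_heur: (n_samples, n_classes) probability arrays.
    Heuristic evidence is used only where its top-class confidence
    clears `threshold`; elsewhere the ESM-2 probabilities pass through.
    """
    confident = p_heur.max(axis=1, keepdims=True) >= threshold
    blended = np.where(confident, weight * p_esm + (1 - weight) * p_heur, p_esm)
    return blended.argmax(axis=1), blended.max(axis=1)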

Key Citations

  • ESM-2: Lin et al., "Language models of protein sequences at the scale of evolution enable accurate structure prediction", bioRxiv 2022
  • DeepLoc: Almagro Armenteros et al., "DeepLoc: prediction of protein subcellular localization using deep learning", Bioinformatics 2017
  • Ankh: Elnaggar et al., "Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling", arXiv:2301.06568
  • CAPSUL: Hu et al., arXiv:2603.18571 (structure-aware localization)

Usage

from transformers import EsmTokenizer, EsmModel
import pickle
import torch

# Load the frozen ESM-2-8M backbone and the fitted classifier
tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm.eval()  # inference only; the backbone stays frozen
with open("esm_lr.pkl", "rb") as f:
    classifier = pickle.load(f)

# Embed one sequence: mean-pool the last hidden states into a single vector
seq = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
inputs = tokenizer(seq, return_tensors="pt", max_length=128, truncation=True)
with torch.no_grad():
    emb = esm(**inputs).last_hidden_state.mean(dim=1).numpy()

# Classify the pooled embedding
proba = classifier.predict_proba(emb)[0]
pred = proba.argmax()
labels = ["Cytoplasm", "Nucleus", "Extracellular", "Cell membrane", "Mitochondrion",
          "Plastid", "Endoplasmic reticulum", "Lysosome/Vacuole", "Golgi apparatus",
          "Peroxisome"]
print(f"Predicted: {labels[pred]}, Confidence: {proba.max():.3f}")

Wet-Lab Prioritization

Use the membrane vs. non-membrane classifier (binary_esm_lr.pkl) to prioritize proteins for expression (usage sketch after the list):

  • Non-membrane proteins: Standard E. coli expression, soluble tags
  • Membrane proteins: Specialized detergents, nanodiscs, or cell-free systems
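
A minimal sketch, continuing from the Usage snippet above (pickle and emb already in scope) and assuming binary_esm_lr.pkl is a scikit-learn classifier over the same mean-pooled embeddings with class 1 = membrane:

with open("binary_esm_lr.pkl", "rb") as f:
    membrane_clf = pickle.load(f)

p_membrane = membrane_clf.predict_proba(emb)[0, 1]  # class order is an assumption
route = ("membrane pipeline: detergents, nanodiscs, or cell-free"
         if p_membrane >= 0.5 else "standard E. coli expression, soluble tags")
print(f"P(membrane) = {p_membrane:.3f} -> {route}")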

Integration

See inference.py for the OmniBiMol backend integration wrapper.
