Paper: *CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization* (2603.18571)
A lightweight, CPU-friendly protein subcellular localization classifier built for the OmniBiMol bioinformatics platform. Trained entirely on free-tier Hugging Face compute and public datasets.
This repository contains classifiers trained on the `mila-intel/ProtST-SubcellularLocalization` dataset (DeepLoc-based, 10 classes):

| Variant | Method | Accuracy | Macro-F1 |
|---|---|---|---|
| A | Heuristic features + Logistic Regression | 48.0% | 0.315 |
| A2 | Heuristic features + Random Forest | 53.7% | 0.361 |
| B | ESM-2-8M embeddings + Logistic Regression | 60.5% | 0.469 |
| B2 | ESM-2-8M embeddings + Random Forest | 57.7% | 0.394 |
| C | Evidence-Filtered Ensemble (Best) | 62.0% | 0.470 |
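The exact heuristic features behind Variants A/A2 are not spelled out here; as an illustration only (the feature choices and function name are assumptions, not the repo's recipe), a cheap, CPU-friendly featurizer of the same flavor — amino-acid composition plus log-length — that a scikit-learn classifier could consume looks like:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def heuristic_features(seq: str) -> np.ndarray:
    # 20-dim amino-acid composition (fraction of each residue)
    counts = np.array([seq.count(a) for a in AA], dtype=float)
    comp = counts / max(len(seq), 1)
    # plus log-scaled sequence length as a 21st feature
    return np.concatenate([comp, [np.log1p(len(seq))]])
```

Stacking these vectors into a matrix gives the `X` for a `sklearn.linear_model.LogisticRegression` or `RandomForestClassifier` fit.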
- Hardware: CPU-only (2 vCPU Hugging Face sandbox)
- Total runtime: ~4 minutes (features) + ~4 minutes (embeddings)
- Backbone: `facebook/esm2_t6_8M_UR50D` (8M parameters, 29 MB)
```python
import pickle

import numpy as np
import torch
from transformers import EsmTokenizer, EsmModel

# Load the ESM-2 backbone and the trained classifier head
tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
with open("esm_lr.pkl", "rb") as f:
    classifier = pickle.load(f)

# Embed the sequence: mean-pooled last hidden state, truncated to 128 tokens
seq = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
inputs = tokenizer(seq, return_tensors="pt", max_length=128, truncation=True)
with torch.no_grad():
    emb = esm(**inputs).last_hidden_state.mean(dim=1).numpy()

# Predict the localization class
proba = classifier.predict_proba(emb)[0]
pred = proba.argmax()
labels = ["Cytoplasm", "Nucleus", "Extracellular", "Cell membrane",
          "Mitochondrion", "Plastid", "Endoplasmic reticulum",
          "Lysosome/Vacuole", "Golgi apparatus", "Peroxisome"]
print(f"Predicted: {labels[pred]}, Confidence: {proba.max():.3f}")
```
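The quick start embeds one sequence at a time, where `mean(dim=1)` is safe. If you batch several sequences with padding, plain mean pooling averages over pad tokens as well; a mask-aware pooling sketch (helper name is illustrative) that uses the `attention_mask` the tokenizer already returns:

```python
import torch

def masked_mean_pool(last_hidden_state: torch.Tensor,
                     attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padded positions before averaging, so a padded batch
    # yields the same embeddings as single-sequence inference.
    mask = attention_mask.unsqueeze(-1).float()      # (B, L, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid divide-by-zero
    return summed / counts
```

Call it as `masked_mean_pool(out.last_hidden_state, inputs["attention_mask"])` after a batched, padded tokenizer call.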
Use the membrane vs non-membrane classifier (binary_esm_lr.pkl) to prioritize proteins for expression:
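A sketch of what that prioritization could look like (the function name, the 0.5 threshold, and the assumption that `predict_proba` column 1 is the membrane class are all illustrative; the binary head would be unpickled the same way as `esm_lr.pkl` above):

```python
import numpy as np

def prioritize_for_expression(embeddings: np.ndarray, clf,
                              threshold: float = 0.5) -> list:
    # clf is the unpickled binary head, e.g.:
    #   with open("binary_esm_lr.pkl", "rb") as f: clf = pickle.load(f)
    # Rank proteins by P(membrane), keep those above the threshold,
    # highest-confidence first.
    p_membrane = clf.predict_proba(embeddings)[:, 1]
    order = np.argsort(-p_membrane)
    return [int(i) for i in order if p_membrane[i] >= threshold]
```

Check `clf.classes_` to confirm which column corresponds to the membrane label before relying on column 1.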
See `inference.py` for the OmniBiMol backend integration wrapper.