---
license: mit
language:
- en
- zh
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- google-bert/bert-base-uncased
new_version: Noora68/PLiCat-0.4B
tags:
- biology
- protein
- protein classification
- lipid binding
- lipid binding site
- recognition
---
# 💜 PLiCat (Protein–Lipid interaction Categorization tool)
We present PLiCat, a robust tool for predicting the lipid categories that interact with a protein, using the protein sequence as the only input. Built on a combined architecture that fuses the ESM C and BERT models, PLiCat delivers accurate and interpretable predictions, distinguishing lipid-binding signatures among the eight major lipid categories defined by LIPID MAPS. PLiCat is intended as a powerful tool for exploring lipid-binding specificity and supporting rational protein design.
---
- **Paper**: [https://academic.oup.com/bib/article/26/6/bbaf665/8377155](https://academic.oup.com/bib/article/26/6/bbaf665/8377155)
- **GitHub Repository**: [https://github.com/Noora68/PLiCat](https://github.com/Noora68/PLiCat)
- **Online Demo**: [https://colab.research.google.com/drive/1wGSZsy7KyYoJf2PiXzP4SVLXonl-cWb9?usp=sharing](https://colab.research.google.com/drive/1wGSZsy7KyYoJf2PiXzP4SVLXonl-cWb9?usp=sharing)
- **Datasets**: [https://huggingface.co/datasets/Noora68/PLiCat-0.1.0](https://huggingface.co/datasets/Noora68/PLiCat-0.1.0)
---
## Overall schematic framework of PLiCat

## Model Details
- **Architecture**: ESM Cambrian + BERT + classification head (see the illustrative sketch below)
- **Task**: Multi-label protein-lipid binding prediction
- **Fine-tuned from**: `ESMC_300m` + `bert-base-uncased`
- **Developed by**: Noora68
- **Framework**: PyTorch + HuggingFace Transformers
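
The fusion logic lives inside the `plicat_model` package; the snippet below is only a minimal illustrative sketch of a multi-label classification head over pooled encoder embeddings (the class name, the hidden size of 960, and the mean-pooling choice are assumptions for illustration, not the actual PLiCat internals):

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Illustrative multi-label classification head (not the actual PLiCat code)."""
    def __init__(self, hidden_dim: int = 960, num_labels: int = 9):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings over real (non-padding) positions
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # One logit per lipid category; a sigmoid later gives independent probabilities
        return self.classifier(pooled)
```

Each of the nine outputs (NotLipidType plus the eight LIPID MAPS categories) is scored independently, which is why the usage example below applies a sigmoid and a probability threshold rather than taking an argmax.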
---
**Model usage workflow:**
1. Load the model and tokenizer
2. Process the input sequence (tokenize → batch → pad → mask)
3. Run inference to obtain logits → probabilities
4. Output the results and mark high-confidence categories
---
## Install the latest version:
```bash
pip install plicat_model
```
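The usage example below also relies on PyTorch and the EvolutionaryScale `esm` package (which provides `EsmSequenceTokenizer`). If these are not already pulled in as dependencies of `plicat_model`, they can be installed separately, e.g. with `pip install torch esm`.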
---
## Usage:
```python
from plicat_model import PLiCat
import torch
from torch.nn.utils.rnn import pad_sequence
from esm.tokenization import EsmSequenceTokenizer
# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = EsmSequenceTokenizer()
# Default lipid type dictionary
default_dict = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)"
}
# Load pretrained PLiCat model
model = PLiCat.from_pretrained("Noora68/PLiCat-0.4B").to(device)
# Example protein sequence
sequence = "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"
# Tokenize the sequence -> input_ids
input_ids = torch.tensor(tokenizer.encode(sequence))
# Add batch dimension: (batch_size=1, length)
input_ids = input_ids.unsqueeze(0)
# Pad to the longest sequence in the batch
input_ids_padded = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
# Build attention mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long()
# Move tensors to the same device as model
input_ids_padded = input_ids_padded.to(device)
attention_mask = attention_mask.to(device)
# Forward pass (no gradient needed during inference)
with torch.no_grad():
    outputs = model(input_ids_padded, attention_mask)
# Convert logits to probabilities using sigmoid
probs = torch.sigmoid(outputs['logits'])
# Convert to CPU and numpy array
probs = probs.squeeze().detach().cpu().numpy()
# Print results: add a check mark if probability > 0.6
for i, p in enumerate(probs):
    mark = " √" if p > 0.6 else ""
    print(f"{default_dict[str(i)]:<25}: {p:.4f}{mark}")
```
## Output of the above example:
```
NotLipidType : 0.0007
Fatty Acyl (FA) : 0.1092
Prenol Lipid (PR) : 0.9178 √
Glycerophospholipid (GP) : 0.6059 √
Sterol Lipid (ST) : 0.0083
Polyketide (PK) : 0.0026
Glycerolipid (GL) : 0.0771
Sphingolipid (SP) : 0.0002
Saccharolipid (SL) : 0.0000
```
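
To score several proteins in one forward pass, the same tokenize → pad → mask steps can be applied to a list of sequences. This is a minimal sketch that reuses the `model`, `tokenizer`, `device`, and `default_dict` objects defined in the Usage section (the second sequence is an arbitrary placeholder, not taken from the PLiCat dataset):

```python
sequences = [
    "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ",  # placeholder example
]

# Tokenize each sequence into a 1-D tensor of token ids
batch = [torch.tensor(tokenizer.encode(s)) for s in sequences]

# Pad to the longest sequence in the batch and build the attention mask
input_ids_padded = pad_sequence(batch, batch_first=True, padding_value=tokenizer.pad_token_id).to(device)
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long().to(device)

# One forward pass for the whole batch
with torch.no_grad():
    probs = torch.sigmoid(model(input_ids_padded, attention_mask)['logits'])

# probs has shape (num_sequences, num_categories); threshold as in the single-sequence example
for seq, row in zip(sequences, probs.cpu().numpy()):
    hits = [default_dict[str(i)] for i, p in enumerate(row) if p > 0.6]
    print(seq[:20] + "...", "->", hits if hits else ["no category above 0.6"])
```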
---
## Limitations
* The model is trained only on lipid-binding protein data and may not generalize to other protein functions.
* Performance is best for sequences shorter than 500 residues.
* The dataset is small compared to large-scale protein corpora.
* The model may reflect biases present in the training data (e.g., under-representation of certain lipid types).
---
## Citing this work
If you use the code or data in this package, please cite:
```bibtex
@article{PLiCat2025,
  author  = {Feitong Dong and Jingrou Wu},
  journal = {Briefings in Bioinformatics},
  title   = {PLiCat: Decoding protein-lipid interactions by large language model},
  year    = {2025},
  volume  = {26},
  number  = {6},
  doi     = {10.1093/bib/bbaf665},
  url     = {https://academic.oup.com/bib/article/26/6/bbaf665/8377155}
}
```
---
## License
This project is licensed under the MIT License.
---