---
license: mit
language:
- en
- zh
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- google-bert/bert-base-uncased
new_version: Noora68/PLiCat-0.4B
tags:
- biology
- protein
- protein classification
- lipid binding
- lipid binding site
- recognition
---

# 💜 PLiCat (Protein–Lipid Interaction Categorization Tool)

We present a robust prediction tool, PLiCat (Protein–Lipid interaction Categorization Tool), for predicting the lipid categories that interact with proteins, using protein sequences as the only input. Built on a combined architecture that fuses the ESM C and BERT models, our method enables accurate and interpretable prediction, distinguishing lipid-binding signatures among the 8 major lipid categories defined by LIPID MAPS. PLiCat serves as a powerful tool for exploring lipid-binding specificity and for rational protein design.

---

- **Paper**: [https://academic.oup.com/bib/article/26/6/bbaf665/8377155](https://academic.oup.com/bib/article/26/6/bbaf665/8377155)
- **GitHub Repository**: [https://github.com/Noora68/PLiCat](https://github.com/Noora68/PLiCat)
- **Online Demo**: [https://colab.research.google.com/drive/1wGSZsy7KyYoJf2PiXzP4SVLXonl-cWb9?usp=sharing](https://colab.research.google.com/drive/1wGSZsy7KyYoJf2PiXzP4SVLXonl-cWb9?usp=sharing)
- **Datasets**: [https://huggingface.co/datasets/Noora68/PLiCat-0.1.0](https://huggingface.co/datasets/Noora68/PLiCat-0.1.0)

---

## Overall schematic framework of PLiCat:

![Plicat-Architecture.jpg](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F6892b516dceae2d25089c45b%2FBNvkExfATbnzD1JOKuDVL.jpeg%3C%2Fspan%3E)

## Model Details

- **Architecture**: ESM Cambrian + BERT + classification head
- **Task**: Multi-label protein–lipid binding prediction
- **Fine-tuned from**: `ESMC_300m` + `bert-base-uncased`
- **Developed by**: Noora68
- **Framework**: PyTorch + Hugging Face Transformers

---

**Model usage workflow:**

1. Load the model and tokenizer
2. Process the input sequence (tokenize → batch → pad → mask)
3. Run inference to obtain logits → probabilities
4. Output the results and mark high-confidence categories

---

## Install the latest version:

```bash
pip install plicat_model
```

---

## Usage:

```python
from plicat_model import PLiCat
import torch
from torch.nn.utils.rnn import pad_sequence
from esm.tokenization import EsmSequenceTokenizer

# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = EsmSequenceTokenizer()

# Default lipid type dictionary
default_dict = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)",
}

# Load the pretrained PLiCat model
model = PLiCat.from_pretrained("Noora68/PLiCat-0.4B").to(device)

# Example protein sequence
sequence = "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"

# Tokenize the sequence -> input_ids
input_ids = torch.tensor(tokenizer.encode(sequence))

# Add batch dimension: (batch_size=1, length)
input_ids = input_ids.unsqueeze(0)

# Pad to the longest sequence in the batch
input_ids_padded = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)

# Build attention mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long()

# Move tensors to the same device as the model
input_ids_padded = input_ids_padded.to(device)
attention_mask = attention_mask.to(device)

# Forward pass (no gradient needed during inference)
with torch.no_grad():
    outputs = model(input_ids_padded, attention_mask)

# Convert logits to probabilities using sigmoid
probs = torch.sigmoid(outputs['logits'])

# Move to CPU and convert to a numpy array
probs = probs.squeeze().detach().cpu().numpy()

# Print results: add a check mark if probability > 0.6
for i, p in enumerate(probs):
    mark = " √" if p > 0.6 else ""
    print(f"{default_dict[str(i)]:<25}: {p:.4f}{mark}")
```

## Output of the above example:

```
NotLipidType             : 0.0007
Fatty Acyl (FA)          : 0.1092
Prenol Lipid (PR)        : 0.9178 √
Glycerophospholipid (GP) : 0.6059 √
Sterol Lipid (ST)        : 0.0083
Polyketide (PK)          : 0.0026
Glycerolipid (GL)        : 0.0771
Sphingolipid (SP)        : 0.0002
Saccharolipid (SL)       : 0.0000
```

---

## Limitations

* Trained only on lipid-binding protein data; may not generalize to other protein functions.
* Performance is best for sequences shorter than 500 residues.
* The training dataset is small compared to large-scale protein corpora.
* The model may reflect biases in the training data (e.g., under-representation of certain lipid categories).

---

## Citing this work

If you use the code or data in this package, please cite:

```bibtex
@article{PLiCat2025,
  author  = {Feitong Dong and Jingrou Wu},
  journal = {Briefings in Bioinformatics},
  title   = {PLiCat: Decoding protein-lipid interactions by large language model},
  year    = {2025},
  volume  = {26},
  number  = {6},
  doi     = {10.1093/bib/bbaf665},
  url     = {https://academic.oup.com/bib/article/26/6/bbaf665/8377155}
}
```

---

## License

This project is licensed under the MIT License.

---
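As a side note, the thresholding step at the end of the usage example can be factored into a small standalone helper, which is convenient when screening many sequences. This is a minimal sketch: the function name `summarize_predictions` and the fixed label list are illustrative assumptions, not part of the `plicat_model` package, and the probabilities reused below are simply the ones from the example output.

```python
# Hypothetical helper (not part of the plicat_model package): pair each
# PLiCat probability with its lipid-category label and flag categories
# whose probability exceeds a confidence threshold (0.6, as in the README).
LIPID_LABELS = [
    "NotLipidType", "Fatty Acyl (FA)", "Prenol Lipid (PR)",
    "Glycerophospholipid (GP)", "Sterol Lipid (ST)", "Polyketide (PK)",
    "Glycerolipid (GL)", "Sphingolipid (SP)", "Saccharolipid (SL)",
]

def summarize_predictions(probs, labels=LIPID_LABELS, threshold=0.6):
    """Return (label, probability, is_high_confidence) tuples."""
    return [(label, float(p), float(p) > threshold)
            for label, p in zip(labels, probs)]

# Probabilities taken from the example output above:
example_probs = [0.0007, 0.1092, 0.9178, 0.6059, 0.0083,
                 0.0026, 0.0771, 0.0002, 0.0000]
hits = [label for label, p, high in summarize_predictions(example_probs) if high]
print(hits)  # → ['Prenol Lipid (PR)', 'Glycerophospholipid (GP)']
```

Keeping the formatting logic separate from the forward pass makes it easy to change the threshold or export the results (e.g., to CSV) without touching the inference code.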