SafeGuardX-v1
Labels
| ID | Label | Meaning |
|---|---|---|
| 0 | SAFE | No violation |
| 1 | VIOLATION | Prompt injection / policy violation |
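The label table can be mirrored in code as a small lookup. This is a minimal sketch: `ID2LABEL`, `softmax`, and `label_from_logits` are illustrative names, not part of the model repository, and the mapping assumes the model's output index order follows the table above.

```python
import math

# Assumed mapping, copied from the label table above.
ID2LABEL = {0: "SAFE", 1: "VIOLATION"}


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

The function takes a plain list of two floats, so a tensor of logits would need converting first (for example with `.tolist()`).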
Challenge I/O
Input: {"conversation": [{"role": "...", "content": "..."}]}
Output: {"violation": true, "confidence": 0.97}
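One way to handle the input and output shapes shown above is sketched below. The helper names `parse_request` and `build_response` are hypothetical, the validation rules are an assumption about what a well-formed request looks like, and `confidence` is taken to be the probability of the VIOLATION class, as in the quick usage code.

```python
import json


def parse_request(raw):
    """Parse a request string and return its list of conversation turns."""
    payload = json.loads(raw)
    conversation = payload["conversation"]
    if not isinstance(conversation, list):
        raise ValueError("'conversation' must be a list of turns")
    for turn in conversation:
        # Assumed rule: every turn is an object with 'role' and 'content'.
        if not isinstance(turn, dict) or not {"role", "content"} <= turn.keys():
            raise ValueError("each turn needs 'role' and 'content'")
    return conversation


def build_response(p_violation, threshold=0.5):
    """Format a violation probability as the output JSON string."""
    return json.dumps({
        "violation": p_violation > threshold,
        "confidence": round(p_violation, 4),
    })
```

The 0.5 threshold is only a default taken from the quick usage code; a stricter or looser cut-off can be passed in.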
Quick usage
import json

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("achrafness/SafeGuardX-v1")
model = AutoModelForSequenceClassification.from_pretrained("achrafness/SafeGuardX-v1")
model.eval()  # inference mode: disables dropout

def classify(conversation):
    # Serialize the turns as JSON; this must match the training format exactly
    text = json.dumps(conversation)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of class 1 (VIOLATION)
    conf = float(torch.softmax(logits, dim=-1)[0][1])
    return {"violation": conf > 0.5, "confidence": round(conf, 4)}
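Because `classify` tokenizes with `truncation=True` and `max_length=512`, anything past the first 512 tokens of a long conversation is dropped silently. The sketch below shows one way to detect that case before classifying. `would_truncate` is a hypothetical helper, and it assumes the tokenizer returns a mapping with an `input_ids` list when called on a string without `return_tensors`, as Hugging Face tokenizers do.

```python
import json


def would_truncate(conversation, tokenizer, max_length=512):
    """Return True if the serialized conversation exceeds max_length tokens."""
    text = json.dumps(conversation)  # same serialization as classify()
    # Count tokens without truncation; special tokens are included by default.
    n_tokens = len(tokenizer(text)["input_ids"])
    return n_tokens > max_length
```

A caller could log a warning or shorten the conversation when this returns `True`. How the model behaves on shortened input is not something the source states.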