SafeGuardX-v1

Labels

ID Label Meaning
0 SAFE No violation
1 VIOLATION Prompt injection / policy violation

Challenge I/O

Input: {"conversation": [{"role": "...", "content": "..."}]} Output: {"violation": true, "confidence": 0.97}

Quick usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("achrafness/SafeGuardX-v1")
model = AutoModelForSequenceClassification.from_pretrained("achrafness/SafeGuardX-v1")
model.eval()

def classify(conversation):
    import json
    text = json.dumps(conversation)  # must match training format exactly
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    conf = float(torch.softmax(logits, dim=-1)[0][1])
    return {"violation": conf > 0.5, "confidence": round(conf, 4)}
Downloads last month
4
Safetensors
Model size
0.3B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support