GLiHR — Multi-Label HR Conversation Classifier

GLiHR is a fine-tuned GLiNER2-large model for multi-label classification of HR workplace conversations across 18 intent categories. It supports English, French, German, Spanish, and Italian.

Benchmark

Evaluated on xwjzds/hr_multiwoz_tod_sgd as a single-label classification task: 550 HR dialogues, each with one gold label, covering 8 of the 18 categories.

| Model | Accuracy | Macro F1 | Precision | Recall |
|---|---|---|---|---|
| GLiHR (fine-tuned) | 69.3% | 0.807 | 0.954 | 0.716 |
| GLiNER2-large (base, no fine-tuning) | 42.9% | 0.535 | 0.728 | 0.459 |

At threshold 0.3, GLiHR outperforms the untuned base model by 26.4 points of accuracy (69.3% vs. 42.9%) and about 27 points of Macro F1 (0.807 vs. 0.535).

Per-label F1 on benchmark:

| Label | GLiHR | Base | Delta |
|---|---|---|---|
| Harassment | 0.925 | 0.800 | +0.125 |
| Leave & Absence | 0.892 | 0.787 | +0.105 |
| Benefits | 0.868 | 0.764 | +0.104 |
| Mobility | 0.838 | 0.000 | +0.838 |
| Health & Safety | 0.817 | 0.682 | +0.135 |
| Training & Development | 0.807 | 0.563 | +0.244 |
| Performance Management | 0.775 | 0.206 | +0.569 |
| IT & Equipment | 0.530 | 0.478 | +0.052 |
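
The reported Macro F1 can be cross-checked against the per-label scores. The sketch below assumes Macro F1 is the unweighted mean of the eight per-label F1 values (the usual definition; the card does not state how it was computed). Under that assumption the per-label numbers average to roughly 0.807 for GLiHR and 0.535 for the base model, matching the summary figures up to rounding.

```python
# Per-label F1 copied from the benchmark table: label -> (GLiHR, base).
per_label_f1 = {
    "Harassment": (0.925, 0.800),
    "Leave & Absence": (0.892, 0.787),
    "Benefits": (0.868, 0.764),
    "Mobility": (0.838, 0.000),
    "Health & Safety": (0.817, 0.682),
    "Training & Development": (0.807, 0.563),
    "Performance Management": (0.775, 0.206),
    "IT & Equipment": (0.530, 0.478),
}


def macro_f1(scores):
    """Unweighted mean of per-label F1 scores (assumed definition)."""
    scores = list(scores)
    return sum(scores) / len(scores)


glihr_macro = macro_f1(g for g, _ in per_label_f1.values())
base_macro = macro_f1(b for _, b in per_label_f1.values())

print(f"GLiHR macro F1: {glihr_macro:.4f}")
print(f"Base macro F1:  {base_macro:.4f}")
print(f"Difference:     {glihr_macro - base_macro:.4f}")
```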

Supported Labels (18 categories)

Benefits · Compliance & Legal · Contracts · DEI · Expense Management · Harassment · Health & Safety · IT & Equipment · Leave & Absence · Mobility · Offboarding · Onboarding · Payroll · Performance Management · Recruitment · Timetracking · Training & Development · Work Arrangements

Usage

from gliner import GLiNER

model = GLiNER.from_pretrained("AurelPx/glihr")

labels = [
    "Benefits", "Compliance & Legal", "Contracts", "DEI", "Expense Management",
    "Harassment", "Health & Safety", "IT & Equipment", "Leave & Absence", "Mobility",
    "Offboarding", "Onboarding", "Payroll", "Performance Management", "Recruitment",
    "Timetracking", "Training & Development", "Work Arrangements"
]

text = """USER: I have not received my payslip for March yet.
AGENT: I checked the payroll system, it was generated on the 28th.
USER: Also, I would like to understand my pension contributions."""

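# Each predicted span carries a "label"; collect the distinct labels for the conversation.
entities = model.predict_entities(text, labels, threshold=0.3)
# Sort so the result order is deterministic (set iteration order is not).
predicted_labels = sorted({e["label"] for e in entities})
# -> ["Benefits", "Payroll"]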
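
For single-label use, as in the benchmark, the span-level output may need to be collapsed to one label per conversation. The following is a minimal sketch of one way to do that, not the author's evaluation code. It assumes each entity dict carries "label" and "score" keys, as GLiNER's predict_entities output does; the helper names label_scores and top_label and the example scores are hypothetical.

```python
from collections import defaultdict


def label_scores(entities):
    """Map each label to the highest score among its predicted spans."""
    best = defaultdict(float)
    for ent in entities:
        best[ent["label"]] = max(best[ent["label"]], ent["score"])
    return dict(best)


def top_label(entities, default=None):
    """Return the single highest-scoring label, or `default` if nothing was predicted."""
    scores = label_scores(entities)
    if not scores:
        return default
    return max(scores, key=scores.get)


# Hypothetical output, shaped like the result of predict_entities.
example_entities = [
    {"text": "payslip", "label": "Payroll", "score": 0.91},
    {"text": "payroll system", "label": "Payroll", "score": 0.74},
    {"text": "pension contributions", "label": "Benefits", "score": 0.62},
]

print(label_scores(example_entities))  # {'Payroll': 0.91, 'Benefits': 0.62}
print(top_label(example_entities))     # Payroll
```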

Recommended Threshold

  • 0.3: best accuracy (69.3%) and Macro F1 (0.807) on the benchmark; recommended for most use cases
  • 0.5: higher precision at the cost of slightly lower recall; precision is already high at 0.3 (0.954)
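
To choose a threshold for a different dataset, the precision/recall trade-off can be measured directly. The sketch below is illustrative only: it computes micro-averaged precision and recall over label sets, which is not necessarily the protocol behind the numbers above, and the scored and gold values are made-up placeholders rather than benchmark data.

```python
def precision_recall(scored, gold, threshold):
    """Micro-averaged precision and recall when keeping labels scoring >= threshold."""
    tp = fp = fn = 0
    for scores, truth in zip(scored, gold):
        predicted = {label for label, s in scores.items() if s >= threshold}
        tp += len(predicted & truth)
        fp += len(predicted - truth)
        fn += len(truth - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical per-conversation label scores and gold label sets.
scored = [
    {"Payroll": 0.91, "Benefits": 0.42},
    {"Harassment": 0.55, "Compliance & Legal": 0.35},
    {"Mobility": 0.33},
]
gold = [{"Payroll", "Benefits"}, {"Harassment"}, {"Mobility"}]

for t in (0.3, 0.5):
    p, r = precision_recall(scored, gold, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
# threshold=0.3: precision=0.80 recall=1.00
# threshold=0.5: precision=1.00 recall=0.50
```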

Training Data

Fine-tuned on 2,559 synthetic multi-turn HR conversations (2–5 turns, USER: ... AGENT: ... format) covering all 18 categories, generated in English, French, German, Spanish, and Italian. Labels were assigned during generation — no post-hoc keyword matching.
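
Because the model was trained on the USER: ... AGENT: ... format, inputs at inference time probably work best in the same shape. The helper below is a hypothetical sketch for rendering a list of turns that way; the name format_conversation and the one-turn-per-line layout (taken from the usage example) are assumptions, not part of the released code.

```python
def format_conversation(turns):
    """Render (speaker, utterance) pairs as 'SPEAKER: utterance' lines."""
    lines = []
    for speaker, utterance in turns:
        speaker = speaker.upper()
        if speaker not in {"USER", "AGENT"}:
            raise ValueError(f"unexpected speaker: {speaker!r}")
        lines.append(f"{speaker}: {utterance.strip()}")
    # One turn per line, as in the usage example (assumed layout).
    return "\n".join(lines)


turns = [
    ("user", "I have not received my payslip for March yet."),
    ("agent", "I checked the payroll system, it was generated on the 28th."),
]
conversation_text = format_conversation(turns)
print(conversation_text)
```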

Citation

@misc{glihr2025,
  title  = {GLiHR: Multi-Label HR Conversation Classification with GLiNER2},
  year   = {2025},
  note   = {Fine-tuned GLiNER2-large on synthetic multilingual HR dialogues. https://huggingface.co/AurelPx/glihr}
}