GLiHR — Multi-Label HR Conversation Classifier

GLiHR is a fine-tuned GLiNER2-large model for multi-label classification of HR workplace conversations across 18 intent categories. It supports English, French, German, Spanish, and Italian.

Benchmark

Evaluated on xwjzds/hr_multiwoz_tod_sgd (550 single-label HR dialogues, 8 active categories) as a single-label classification task:

Model	Accuracy	Macro F1	Precision	Recall
GLiHR (fine-tuned)	69.3%	0.807	0.954	0.716
GLiNER2-large (base, no fine-tuning)	42.9%	0.535	0.728	0.459

At threshold 0.3, GLiHR outperforms the untuned base model by 26.4 accuracy points (69.3% vs. 42.9%) and by 0.272 macro F1 (0.807 vs. 0.535).

Per-label F1 on benchmark:

Label	GLiHR	Base	Delta
Harassment	0.925	0.800	+0.125
Leave & Absence	0.892	0.787	+0.105
Benefits	0.868	0.764	+0.104
Mobility	0.838	0.000	+0.838
Health & Safety	0.817	0.682	+0.135
Training & Development	0.807	0.563	+0.244
Performance Management	0.775	0.206	+0.569
IT & Equipment	0.530	0.478	+0.052
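
As a consistency check, the headline macro F1 can be recomputed from the per-label table. The sketch below assumes macro F1 is the unweighted mean over the 8 active benchmark labels; the model card does not state the averaging scheme, so treat that as an assumption. Under it, both means land within rounding of the reported values (0.807 and 0.535).

```python
# Per-label F1 copied from the table above, as (GLiHR, base) pairs.
per_label_f1 = {
    "Harassment": (0.925, 0.800),
    "Leave & Absence": (0.892, 0.787),
    "Benefits": (0.868, 0.764),
    "Mobility": (0.838, 0.000),
    "Training & Development": (0.807, 0.563),
    "Health & Safety": (0.817, 0.682),
    "Performance Management": (0.775, 0.206),
    "IT & Equipment": (0.530, 0.478),
}


def macro_f1(scores):
    """Unweighted mean of per-label F1 scores (assumed averaging scheme)."""
    scores = list(scores)
    return sum(scores) / len(scores)


glihr_macro = macro_f1(g for g, _ in per_label_f1.values())
base_macro = macro_f1(b for _, b in per_label_f1.values())

print(f"GLiHR macro F1: {glihr_macro:.4f}")
print(f"Base macro F1:  {base_macro:.4f}")
print(f"Difference:     {glihr_macro - base_macro:.4f}")
```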

Supported Labels (18 categories)

Benefits · Compliance & Legal · Contracts · DEI · Expense Management · Harassment · Health & Safety · IT & Equipment · Leave & Absence · Mobility · Offboarding · Onboarding · Payroll · Performance Management · Recruitment · Timetracking · Training & Development · Work Arrangements

Usage

from gliner import GLiNER

model = GLiNER.from_pretrained("AurelPx/glihr")

labels = [
    "Benefits", "Compliance & Legal", "Contracts", "DEI", "Expense Management",
    "Harassment", "Health & Safety", "IT & Equipment", "Leave & Absence", "Mobility",
    "Offboarding", "Onboarding", "Payroll", "Performance Management", "Recruitment",
    "Timetracking", "Training & Development", "Work Arrangements"
]

text = """USER: I have not received my payslip for March yet.
AGENT: I checked the payroll system, it was generated on the 28th.
USER: Also, I would like to understand my pension contributions."""

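# Each predicted span carries a label; the distinct labels form the multi-label output.
entities = model.predict_entities(text, labels, threshold=0.3)
# Sort so the result order is stable (a set has no guaranteed order).
predicted_labels = sorted({e["label"] for e in entities})
# -> ["Benefits", "Payroll"]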
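
The benchmark above is single-label, while the usage example returns every label that fires. One possible way to reduce the output to a single label is to keep the label with the highest span score. This is a minimal sketch, not necessarily how the benchmark was scored: `top_label` is a hypothetical helper, and it assumes each returned dict carries a `score` field alongside `label`, as entity dicts from the `gliner` library do. The example entities are made up for illustration.

```python
def top_label(entities):
    """Return the label with the highest span score, or None if nothing was predicted."""
    best = {}
    for e in entities:
        # Track the best score seen so far for each label.
        best[e["label"]] = max(best.get(e["label"], 0.0), e["score"])
    if not best:
        return None
    return max(best, key=best.get)


# Hypothetical entities with the same shape as predict_entities results.
example_entities = [
    {"text": "payslip", "label": "Payroll", "score": 0.91},
    {"text": "pension contributions", "label": "Benefits", "score": 0.64},
    {"text": "payroll system", "label": "Payroll", "score": 0.72},
]

print(top_label(example_entities))  # Payroll
print(top_label([]))                # None
```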

Recommended Threshold

0.3: best accuracy (69.3%) and macro F1 (0.807) on the benchmark; recommended for most use cases
0.5: higher precision at the cost of slightly lower recall; note that precision is already 0.954 at 0.3
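
To compare thresholds on your own data without re-running the model, one option is to predict once at the lower threshold and filter the scored results afterwards. The sketch below assumes each entity dict has a `score` field and that post-filtering gives the same labels as passing the higher threshold to `predict_entities`; that equivalence has not been verified against the library's decoding. `labels_at` and the sample scores are hypothetical.

```python
def labels_at(entities, threshold):
    """Distinct labels with at least one span scoring at or above the threshold."""
    return sorted({e["label"] for e in entities if e["score"] >= threshold})


# Hypothetical scored predictions, as if obtained once at threshold 0.3.
scored = [
    {"label": "Payroll", "score": 0.91},
    {"label": "Benefits", "score": 0.42},
    {"label": "Contracts", "score": 0.31},
]

for t in (0.3, 0.5):
    print(t, labels_at(scored, t))
```

With these made-up scores, raising the threshold from 0.3 to 0.5 drops the two lower-confidence labels and keeps only the strongest one, which is the precision/recall trade-off described above.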

Training Data

Fine-tuned on 2,559 synthetic multi-turn HR conversations (2–5 turns, USER: ... AGENT: ... format) covering all 18 categories, generated in English, French, German, Spanish, and Italian. Labels were assigned during generation — no post-hoc keyword matching.
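
Since the model was trained on the USER: ... AGENT: ... layout, inputs at inference time should probably follow the same layout. A minimal sketch of a formatter is shown below; `format_conversation` is a hypothetical helper, and the one-turn-per-line layout is inferred from the usage example rather than documented.

```python
def format_conversation(turns):
    """Join (speaker, utterance) pairs into 'USER: ...' / 'AGENT: ...' lines."""
    allowed = {"USER", "AGENT"}
    lines = []
    for speaker, utterance in turns:
        speaker = speaker.upper()
        if speaker not in allowed:
            raise ValueError(f"unexpected speaker: {speaker!r}")
        lines.append(f"{speaker}: {utterance.strip()}")
    # One turn per line, matching the usage example.
    return "\n".join(lines)


text = format_conversation([
    ("user", "I have not received my payslip for March yet."),
    ("agent", "I checked the payroll system, it was generated on the 28th."),
    ("user", "Also, I would like to understand my pension contributions."),
])
print(text)
```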

Citation

@misc{glihr2025,
  title  = {GLiHR: Multi-Label HR Conversation Classification with GLiNER2},
  year   = {2025},
  note   = {Fine-tuned GLiNER2-large on synthetic multilingual HR dialogues. https://huggingface.co/AurelPx/glihr}
}

Base Model

urchade/gliner_large-v2.1