Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents
This is the model card for our paper, accepted at IEEE Transactions on Audio, Speech and Language Processing.
Overview
Large Language Models (LLMs) have transformed conversational AI, yet their susceptibility to dialogue breakdowns (irrelevant, contradictory, or incoherent responses) poses significant challenges to deployment reliability and user trust.
We introduce the DEE Framework (Detect, Explain, Escalate), a resource-efficient approach to managing dialogue breakdowns in LLM-powered agents:
Key Components
- Detect & Explain: A fine-tuned 8B-parameter model serves as a real-time breakdown detector, providing classification and interpretable justifications
- Escalate: A hierarchical architecture that defers to larger models only when necessary, reducing costs by 54% (see the sketch below)
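To make the escalation flow concrete, here is a minimal sketch expressed in terms of the `dee` API shown under Usage below; the `threshold` default and the `superior_model.repair(...)` call are illustrative assumptions, not the exact interface.

```python
from dee import DialogueMonitor

def detect_explain_escalate(dialogue_history, candidate_response, superior_model, threshold=0.5):
    """Sketch of the Detect-Explain-Escalate loop: run the 8B monitor first,
    and only call the larger model when a breakdown is flagged with enough confidence."""
    monitor = DialogueMonitor()
    result = monitor.detect(dialogue_history, candidate_response)

    if result["breakdown"] and result["confidence"] >= threshold:
        # Escalate: ask the superior model for a repaired response, passing the
        # monitor's justification as context. `superior_model.repair(...)` is a
        # hypothetical interface used here for illustration.
        return superior_model.repair(dialogue_history, result["justification"])

    # No breakdown (or low confidence): keep the original response.
    return candidate_response
```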
Key Contributions
- State-of-the-art performance on DBDC5 English (85.5% accuracy) and Japanese (89.0% accuracy) tracks
- Comprehensive evaluation of frontier LLMs (GPT-4, Claude, Llama, DeepSeek, Mixtral) on dialogue breakdown detection
- Novel prompting strategies including few-shot, chain-of-thought, and curriculum learning with analogical reasoning
- Resource-efficient deployment architecture achieving 54% cost reduction
- End-to-end repair capability resolving 97% of detected breakdowns
Model
Our fine-tuned Dialogue Disruption Monitor is available on HuggingFace:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter = "aghassel/dialogue_disruption_monitor"

# Load the base model and tokenizer, then attach the fine-tuned LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
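A minimal inference sketch with the loaded adapter follows; the prompt wording is an assumption for illustration, and the exact template used during fine-tuning may differ.

```python
# Hedged sketch: format a detection prompt and generate a verdict with justification.
prompt = (
    "Given the dialogue context and the latest system response, decide whether the "
    "response causes a dialogue breakdown and briefly justify your decision.\n\n"
    "Context:\nUser: I agree. That's nice.\nSystem: Shopping takes time.\n\n"
    "Response: It's fun to go shopping with somebody.\n\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```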
Usage
Basic Breakdown Detection
```python
from dee import DialogueMonitor

monitor = DialogueMonitor()

dialogue_history = [
    {"role": "assistant", "content": "It's nice to go shopping alone."},
    {"role": "user", "content": "I agree. That's nice."},
    {"role": "assistant", "content": "Shopping takes time."},
    {"role": "user", "content": "Window shopping is also fun."},
]
response = "It's fun to go shopping with somebody."

result = monitor.detect(dialogue_history, response)
# Returns: {
#   "breakdown": True,
#   "confidence": 0.87,
#   "justification": "The response contradicts the earlier statement..."
# }
```
Hierarchical Deployment with Escalation
```python
from dee import DEEPipeline

pipeline = DEEPipeline(
    monitor_model="aghassel/dialogue_disruption_monitor",
    superior_model="claude-3-5-sonnet",  # or "gpt-4o", "llama-3.1-405b"
    escalation_threshold=0.5,
)

# Automatically monitors and escalates when needed;
# dialogue_history is as defined above and user_input is the latest user turn.
safe_response = pipeline.generate(dialogue_history, user_input)
```
Prompting Strategies
```python
from dee import DialogueMonitor, PromptStrategies

# `dialogue` is the dialogue context plus candidate response, as in the example above
detector = DialogueMonitor()

# Zero-shot
result = detector.detect(dialogue, strategy="zero-shot")

# Few-shot with hard examples
result = detector.detect(dialogue, strategy="few-shot", difficulty="hard", n_shots=2)

# Chain-of-thought
result = detector.detect(dialogue, strategy="cot")

# Analogical reasoning
result = detector.detect(dialogue, strategy="analogical")

# Curriculum learning + analogical reasoning
result = detector.detect(dialogue, strategy="cl+ar")
```
Datasets
We evaluate on three benchmarks:
| Dataset | Language | Type | Annotation Level |
|---|---|---|---|
| DBDC5 | English | Open-domain | Utterance |
| DBDC5 | Japanese | Open-domain | Utterance |
| BETOLD | English | Task-oriented | Conversation |
Results
DBDC5 Performance
| Model | English Acc. | English F1_B | Japanese Acc. | Japanese F1_B |
|---|---|---|---|---|
| Previous SOTA (S2T2) | 77.9 | 82.4 | 76.7 | 75.4 |
| Claude-3.5 Sonnet (AR) | 85.5 | 89.8 | 88.0 | 91.7 |
| Claude-3.5 Sonnet (CL+AR) | 83.5 | 88.5 | 89.0 | 92.4 |
| Llama-3.3 70B (CL+AR) | 85.5 | 89.5 | 77.0 | 83.1 |
| Ours (8B Monitor) | 81.5 | 86.2 | 67.9 | 68.8 |
Cost Efficiency
Our hierarchical architecture achieves 54% cost reduction compared to running all queries on large models, while maintaining high accuracy.
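For intuition, here is a back-of-the-envelope sketch of the hierarchical cost model; the per-query prices and escalation rate below are illustrative placeholders, not the figures measured in the paper.

```python
# Illustrative placeholders only; the paper's 54% figure comes from its own
# measured escalation rate and provider pricing.
cost_small = 0.10       # relative cost of one query to the 8B monitor
cost_large = 1.00       # relative cost of one query to the superior model
escalation_rate = 0.40  # fraction of turns the monitor escalates (hypothetical)

hierarchical = cost_small + escalation_rate * cost_large  # every turn hits the monitor,
always_large = cost_large                                  # only some reach the large model

savings = 1 - hierarchical / always_large
print(f"Relative savings: {savings:.0%}")  # 50% with these placeholder numbers
```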
Training
Fine-tune the Dialogue Disruption Monitor
```bash
python train.py \
  --base_model meta-llama/Llama-3.1-8B-Instruct \
  --dataset dbdc5 \
  --output_dir ./checkpoints \
  --lora_rank 16 \
  --learning_rate 2e-4 \
  --num_epochs 3 \
  --batch_size 8
```
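For reference, a minimal sketch of the fine-tuning setup that `train.py` might configure with `peft` and `transformers`; the hyperparameters mirror the flags above, while `lora_alpha`, `target_modules`, and other details are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
)

# LoRA rank mirrors --lora_rank 16; alpha and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
# A Trainer (or SFTTrainer) would then be constructed with the DBDC5 training data.
```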
Generate Teacher Reasoning Traces
```bash
python generate_traces.py \
  --teacher_model meta-llama/Llama-3.3-70B-Instruct \
  --dataset dbdc5 \
  --output_path ./data/reasoning_traces.json
```
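A rough sketch of what trace generation could look like with the `transformers` text-generation pipeline; the prompt wording, the `dbdc5_examples` iterable, and the record format are assumptions for illustration.

```python
import json
from transformers import pipeline

# Loading a 70B teacher requires multiple GPUs or quantization; shown for illustration.
teacher = pipeline(
    "text-generation", model="meta-llama/Llama-3.3-70B-Instruct", device_map="auto"
)

traces = []
for example in dbdc5_examples:  # hypothetical iterable of {context, response, label} records
    prompt = (
        "Explain step by step whether the following system response causes a "
        f"dialogue breakdown.\n\nContext: {example['context']}\n"
        f"Response: {example['response']}\nLabel: {example['label']}\nReasoning:"
    )
    reasoning = teacher(
        prompt, max_new_tokens=256, do_sample=False, return_full_text=False
    )[0]["generated_text"]
    traces.append({**example, "reasoning": reasoning})

with open("./data/reasoning_traces.json", "w") as f:
    json.dump(traces, f, indent=2)
```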
Citation
If you find this work useful, please cite our paper:
```bibtex
@article{ghassel2025dee,
  title={Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents},
  author={Ghassel, Abdellah and Li, Xianzhi and Zhu, Xiaodan},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2025},
  publisher={IEEE}
}
```
Contact
For questions or issues:
- Email: abdellah.ghassel@queensu.ca