Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents
This is the model card for our paper, accepted at IEEE Transactions on Audio, Speech and Language Processing.
Overview
Large Language Models (LLMs) have transformed conversational AI, yet their susceptibility to dialogue breakdowns (irrelevant, contradictory, or incoherent responses) poses significant challenges to deployment reliability and user trust.
We introduce the DEE Framework (Detect, Explain, Escalate), a resource-efficient approach to managing dialogue breakdowns in LLM-powered agents:
Key Components
- Detect & Explain: A fine-tuned 8B-parameter model serves as a real-time breakdown detector, providing classification and interpretable justifications
- Escalate: A hierarchical architecture that defers to larger models only when necessary, reducing costs by 54% (see the sketch below)
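To make the escalation flow concrete, here is a minimal sketch expressed in terms of the `dee` API shown under Usage below; the `threshold` default and the `superior_model.repair(...)` call are illustrative assumptions, not the exact interface.

```python
from dee import DialogueMonitor

def detect_explain_escalate(dialogue_history, candidate_response, superior_model, threshold=0.5):
    """Sketch of the Detect-Explain-Escalate loop: run the 8B monitor first,
    and only call the larger model when a breakdown is flagged with enough confidence."""
    monitor = DialogueMonitor()
    result = monitor.detect(dialogue_history, candidate_response)

    if result["breakdown"] and result["confidence"] >= threshold:
        # Escalate: ask the superior model for a repaired response, passing the
        # monitor's justification as context. `superior_model.repair(...)` is a
        # hypothetical interface used here for illustration.
        return superior_model.repair(dialogue_history, result["justification"])

    # No breakdown (or low confidence): keep the original response.
    return candidate_response
```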
Key Contributions
- State-of-the-art performance on DBDC5 English (85.5% accuracy) and Japanese (89.0% accuracy) tracks
- Comprehensive evaluation of frontier LLMs (GPT-4, Claude, Llama, DeepSeek, Mixtral) on dialogue breakdown detection
- Novel prompting strategies including few-shot, chain-of-thought, and curriculum learning with analogical reasoning
- Resource-efficient deployment architecture achieving 54% cost reduction
- End-to-end repair capability resolving 97% of detected breakdowns
Model
Our fine-tuned Dialogue Disruption Monitor is available on HuggingFace:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter = "aghassel/dialogue_disruption_monitor"

# Load the base model and tokenizer, then attach the fine-tuned LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
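A minimal inference sketch with the loaded adapter follows; the prompt wording is an assumption for illustration, and the exact template used during fine-tuning may differ.

```python
# Hedged sketch: format a detection prompt and generate a verdict with justification.
prompt = (
    "Given the dialogue context and the latest system response, decide whether the "
    "response causes a dialogue breakdown and briefly justify your decision.\n\n"
    "Context:\nUser: I agree. That's nice.\nSystem: Shopping takes time.\n\n"
    "Response: It's fun to go shopping with somebody.\n\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```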
Usage
Basic Breakdown Detection
```python
from dee import DialogueMonitor

monitor = DialogueMonitor()

dialogue_history = [
    {"role": "assistant", "content": "It's nice to go shopping alone."},
    {"role": "user", "content": "I agree. That's nice."},
    {"role": "assistant", "content": "Shopping takes time."},
    {"role": "user", "content": "Window shopping is also fun."},
]
response = "It's fun to go shopping with somebody."

result = monitor.detect(dialogue_history, response)
# Returns: {
#   "breakdown": True,
#   "confidence": 0.87,
#   "justification": "The response contradicts the earlier statement..."
# }
```
Hierarchical Deployment with Escalation
```python
from dee import DEEPipeline

pipeline = DEEPipeline(
    monitor_model="aghassel/dialogue_disruption_monitor",
    superior_model="claude-3-5-sonnet",  # or "gpt-4o", "llama-3.1-405b"
    escalation_threshold=0.5,
)

# Automatically monitors and escalates when needed;
# dialogue_history is as defined above and user_input is the latest user turn.
safe_response = pipeline.generate(dialogue_history, user_input)
```
Prompting Strategies
```python
from dee import DialogueMonitor, PromptStrategies

# `dialogue` is the dialogue context plus candidate response, as in the example above
detector = DialogueMonitor()

# Zero-shot
result = detector.detect(dialogue, strategy="zero-shot")

# Few-shot with hard examples
result = detector.detect(dialogue, strategy="few-shot", difficulty="hard", n_shots=2)

# Chain-of-thought
result = detector.detect(dialogue, strategy="cot")

# Analogical reasoning
result = detector.detect(dialogue, strategy="analogical")

# Curriculum learning + analogical reasoning
result = detector.detect(dialogue, strategy="cl+ar")
```
Datasets
We evaluate on three benchmarks:
| Dataset | Language | Type | Annotation Level |
|---|---|---|---|
| DBDC5 | English | Open-domain | Utterance |
| DBDC5 | Japanese | Open-domain | Utterance |
| BETOLD | English | Task-oriented | Conversation |
Results
DBDC5 Performance
| Model | English Acc. | English F1_B | Japanese Acc. | Japanese F1_B |
|---|---|---|---|---|
| Previous SOTA (S2T2) | 77.9 | 82.4 | 76.7 | 75.4 |
| Claude-3.5 Sonnet (AR) | 85.5 | 89.8 | 88.0 | 91.7 |
| Claude-3.5 Sonnet (CL+AR) | 83.5 | 88.5 | 89.0 | 92.4 |
| Llama-3.3 70B (CL+AR) | 85.5 | 89.5 | 77.0 | 83.1 |
| Ours (8B Monitor) | 81.5 | 86.2 | 67.9 | 68.8 |
Cost Efficiency
Our hierarchical architecture achieves 54% cost reduction compared to running all queries on large models, while maintaining high accuracy.
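For intuition, here is a back-of-the-envelope sketch of the hierarchical cost model; the per-query prices and escalation rate below are illustrative placeholders, not the figures measured in the paper.

```python
# Illustrative placeholders only; the paper's 54% figure comes from its own
# measured escalation rate and provider pricing.
cost_small = 0.10       # relative cost of one query to the 8B monitor
cost_large = 1.00       # relative cost of one query to the superior model
escalation_rate = 0.40  # fraction of turns the monitor escalates (hypothetical)

hierarchical = cost_small + escalation_rate * cost_large  # every turn hits the monitor,
always_large = cost_large                                  # only some reach the large model

savings = 1 - hierarchical / always_large
print(f"Relative savings: {savings:.0%}")  # 50% with these placeholder numbers
```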
Training
Fine-tune the Dialogue Disruption Monitor
```bash
python train.py \
  --base_model meta-llama/Llama-3.1-8B-Instruct \
  --dataset dbdc5 \
  --output_dir ./checkpoints \
  --lora_rank 16 \
  --learning_rate 2e-4 \
  --num_epochs 3 \
  --batch_size 8
```
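For reference, a minimal sketch of the fine-tuning setup that `train.py` might configure with `peft` and `transformers`; the hyperparameters mirror the flags above, while `lora_alpha`, `target_modules`, and other details are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
)

# LoRA rank mirrors --lora_rank 16; alpha and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
# A Trainer (or SFTTrainer) would then be constructed with the DBDC5 training data.
```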
Generate Teacher Reasoning Traces
```bash
python generate_traces.py \
  --teacher_model meta-llama/Llama-3.3-70B-Instruct \
  --dataset dbdc5 \
  --output_path ./data/reasoning_traces.json
```
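A rough sketch of what trace generation could look like with the `transformers` text-generation pipeline; the prompt wording, the `dbdc5_examples` iterable, and the record format are assumptions for illustration.

```python
import json
from transformers import pipeline

# Loading a 70B teacher requires multiple GPUs or quantization; shown for illustration.
teacher = pipeline(
    "text-generation", model="meta-llama/Llama-3.3-70B-Instruct", device_map="auto"
)

traces = []
for example in dbdc5_examples:  # hypothetical iterable of {context, response, label} records
    prompt = (
        "Explain step by step whether the following system response causes a "
        f"dialogue breakdown.\n\nContext: {example['context']}\n"
        f"Response: {example['response']}\nLabel: {example['label']}\nReasoning:"
    )
    reasoning = teacher(
        prompt, max_new_tokens=256, do_sample=False, return_full_text=False
    )[0]["generated_text"]
    traces.append({**example, "reasoning": reasoning})

with open("./data/reasoning_traces.json", "w") as f:
    json.dump(traces, f, indent=2)
```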
Citation
If you find this work useful, please cite our paper:
```bibtex
@article{ghassel2025dee,
  title={Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents},
  author={Ghassel, Abdellah and Li, Xianzhi and Zhu, Xiaodan},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2025},
  publisher={IEEE}
}
```
Contact
For questions or issues:
- Email: abdellah.ghassel@queensu.ca