
Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents


This repository contains the model card for our paper, accepted for publication in IEEE Transactions on Audio, Speech and Language Processing.


📋 Overview

Large Language Models (LLMs) have transformed conversational AI, yet their susceptibility to dialogue breakdowns (irrelevant, contradictory, or incoherent responses) poses significant challenges to deployment reliability and user trust.

We introduce the DEE Framework (Detect, Explain, Escalate), a resource-efficient approach to managing dialogue breakdowns in LLM-powered agents:

Key Components

  1. Detect & Explain: A fine-tuned 8B-parameter model serves as a real-time breakdown detector, providing classification and interpretable justifications
  2. Escalate: A hierarchical architecture that defers to larger models only when necessary, reducing costs by 54%

🎯 Key Contributions

  • State-of-the-art performance on DBDC5 English (85.5% accuracy) and Japanese (89.0% accuracy) tracks
  • Comprehensive evaluation of frontier LLMs (GPT-4, Claude, Llama, DeepSeek, Mixtral) on dialogue breakdown detection
  • Novel prompting strategies including few-shot, chain-of-thought, and curriculum learning with analogical reasoning
  • Resource-efficient deployment architecture achieving 54% cost reduction
  • End-to-end repair capability resolving 97% of detected breakdowns

📦 Model

Our fine-tuned Dialogue Disruption Monitor is available on HuggingFace:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter = "aghassel/dialogue_disruption_monitor"

# Load the base model first, then attach the fine-tuned LoRA adapter on top
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
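Once the adapter is attached, the detector can be queried directly through transformers. The prompt below is only an illustrative sketch, not the exact template the monitor was fine-tuned on:

import torch

# Illustrative prompt; the actual fine-tuning template may differ
prompt = (
    "You are a dialogue breakdown detector. Given the dialogue context and the "
    "latest system response, decide whether the response causes a breakdown "
    "and briefly justify your decision.\n\n"
    "Context:\nSystem: It's nice to go shopping alone.\nUser: I agree. That's nice.\n\n"
    "Response: It's fun to go shopping with somebody.\n\nVerdict:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens (the verdict and justification)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))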

💻 Usage

Basic Breakdown Detection

from dee import DialogueMonitor

monitor = DialogueMonitor()

dialogue_history = [
    {"role": "assistant", "content": "It's nice to go shopping alone."},
    {"role": "user", "content": "I agree. That's nice."},
    {"role": "assistant", "content": "Shopping takes time."},
    {"role": "user", "content": "Window shopping is also fun."},
]

response = "It's fun to go shopping with somebody."

result = monitor.detect(dialogue_history, response)
# Returns: {
#   "breakdown": True,
#   "confidence": 0.87,
#   "justification": "The response contradicts the earlier statement..."
# }

Hierarchical Deployment with Escalation

from dee import DEEPipeline

pipeline = DEEPipeline(
    monitor_model="aghassel/dialogue_disruption_monitor",
    superior_model="claude-3-5-sonnet",  # or "gpt-4o", "llama-3.1-405b"
    escalation_threshold=0.5
)

# Automatically monitors and escalates when needed
safe_response = pipeline.generate(dialogue_history, user_input)
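Conceptually, the escalation step is a confidence-gated router: the 8B monitor screens every turn, and only uncertain or flagged cases are deferred to the superior model. A minimal sketch of that routing logic follows; monitor.detect and superior.review are stand-ins, not the pipeline's actual internals:

def generate_with_escalation(monitor, superior, history, response, threshold=0.5):
    """Illustrative DEE routing: accept cheap verdicts, escalate uncertain ones."""
    verdict = monitor.detect(history, response)

    # Confident non-breakdown: accept the candidate response at monitor cost only
    if not verdict["breakdown"] and verdict["confidence"] >= threshold:
        return response

    # Otherwise defer to the larger model to review and, if needed, repair
    return superior.review(history, response, justification=verdict["justification"])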

Prompting Strategies

from dee import DialogueMonitor

detector = DialogueMonitor()

# Zero-shot
result = detector.detect(dialogue_history, response, strategy="zero-shot")

# Few-shot with hard examples
result = detector.detect(dialogue_history, response, strategy="few-shot", difficulty="hard", n_shots=2)

# Chain-of-thought
result = detector.detect(dialogue_history, response, strategy="cot")

# Analogical reasoning
result = detector.detect(dialogue_history, response, strategy="analogical")

# Curriculum learning + analogical reasoning
result = detector.detect(dialogue_history, response, strategy="cl+ar")

📊 Datasets

We evaluate on three benchmarks:

| Dataset | Language | Type | Annotation Level |
|---------|----------|------|------------------|
| DBDC5 | English | Open-domain | Utterance |
| DBDC5 | Japanese | Open-domain | Utterance |
| BETOLD | English | Task-oriented | Conversation |
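For reference, DBDC utterances are labeled by multiple annotators as O (no breakdown), T (possible breakdown), or X (breakdown), and binary gold labels are typically derived by vote. The helper below sketches one common mapping, not necessarily the exact protocol used in our evaluation:

from collections import Counter

def binary_label(annotations):
    """Collapse per-annotator DBDC labels (O/T/X) into a binary breakdown label.

    Treats T and X as breakdown votes; the majority decides. This is one common
    convention, not necessarily the mapping used in the paper.
    """
    counts = Counter(annotations)
    return counts["T"] + counts["X"] > counts["O"]

print(binary_label(["O", "X", "X", "T", "O"]))  # True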

📈 Results

DBDC5 Performance

| Model | English Acc. | English F1_B | Japanese Acc. | Japanese F1_B |
|-------|--------------|--------------|---------------|---------------|
| Previous SOTA (S2T2) | 77.9 | 82.4 | 76.7 | 75.4 |
| Claude-3.5 Sonnet (AR) | 85.5 | 89.8 | 88.0 | 91.7 |
| Claude-3.5 Sonnet (CL+AR) | 83.5 | 88.5 | 89.0 | 92.4 |
| Llama-3.3 70B (CL+AR) | 85.5 | 89.5 | 77.0 | 83.1 |
| Ours (8B Monitor) | 81.5 | 86.2 | 67.9 | 68.8 |

F1_B denotes F1 on the breakdown class; AR = analogical reasoning, CL = curriculum learning.

Cost Efficiency

Our hierarchical architecture achieves 54% cost reduction compared to running all queries on large models, while maintaining high accuracy.
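The saving follows from simple blended-cost arithmetic: the monitor runs on every turn, while the large model is billed only for escalated queries. With purely hypothetical per-query prices:

# Hypothetical per-query costs; real prices depend on the provider and model
cost_small = 0.002      # 8B monitor, runs on every turn
cost_large = 0.030      # superior model, runs only on escalation
escalation_rate = 0.40  # fraction of turns deferred upward (illustrative)

blended = cost_small + escalation_rate * cost_large
baseline = cost_large  # all queries on the large model

print(f"cost reduction: {1 - blended / baseline:.0%}")  # ~53% with these numbers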


🔧 Training

Fine-tune the Dialogue Disruption Monitor

python train.py \
    --base_model meta-llama/Llama-3.1-8B-Instruct \
    --dataset dbdc5 \
    --output_dir ./checkpoints \
    --lora_rank 16 \
    --learning_rate 2e-4 \
    --num_epochs 3 \
    --batch_size 8
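The flags above correspond roughly to a PEFT LoRA configuration like the sketch below; the alpha, dropout, and target modules are assumptions, since only the rank is specified here:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # matches --lora_rank 16
    lora_alpha=32,             # assumed; not specified above
    lora_dropout=0.05,         # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)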

Generate Teacher Reasoning Traces

python generate_traces.py \
    --teacher_model meta-llama/Llama-3.3-70B-Instruct \
    --dataset dbdc5 \
    --output_path ./data/reasoning_traces.json
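Each generated trace pairs a dialogue and its label with the teacher model's rationale. A hypothetical record shape follows; the actual schema of reasoning_traces.json may differ:

# Hypothetical shape of one record in reasoning_traces.json (illustrative only)
trace = {
    "dialogue_id": "dbdc5-en-0421",  # made-up identifier
    "context": [
        "System: Shopping takes time.",
        "User: Window shopping is also fun.",
    ],
    "response": "It's fun to go shopping with somebody.",
    "label": "breakdown",
    "teacher_rationale": "The response contradicts the earlier claim that "
                         "shopping alone is nice, so it breaks coherence.",
}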

📖 Citation

If you find this work useful, please cite our paper:

@article{ghassel2025dee,
  title={Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents},
  author={Ghassel, Abdellah and Li, Xianzhi and Zhu, Xiaodan},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2025},
  publisher={IEEE}
}

📧 Contact

For questions or issues, please contact the authors.
