---
license: apache-2.0
base_model: ettin-encoder-32M-TR
tags:
- turkish
- hallucination-detection
- rag
- token-classification
- lettucedetect
pipeline_tag: token-classification
---

# ettin-encoder-32M-TR-HD

## Model Description

`ettin-encoder-32M-TR-HD` is an ultra-efficient 32M-parameter model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Despite its minimal size, the model achieves competitive performance, outperforming large language models such as GPT-4.1 and Mistral Small on balanced metrics while offering substantial computational-efficiency advantages.

This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework to Turkish. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.

## Model Details

- **Model Type**: Encoder-based transformer for token classification
- **Parameters**: 32M
- **Language**: Turkish
- **Task**: Hallucination detection (token-level binary classification)
- **Framework**: LettuceDetect
- **Base Model**: ettin-encoder-32M-TR
- **Fine-tuned on**: RAGTruth-TR dataset

## Performance Highlights

### Example-Level Performance (Whole Dataset)

- **F1-Score**: 61.35% (outperforms GPT-4.1's 53.97%)
- **Precision**: 61.38% (vs. GPT-4.1's 37.09%)
- **Recall**: 61.32% (balanced, avoiding false-positive overload)
- **AUROC**: 78.24% (vs. GPT-4.1's 54.45%)

### Task-Specific Performance

**Data2txt Task:**

- F1-Score: 75.75%
- Precision: 85.96%
- Recall: 67.70%
- AUROC: 82.11%

**QA Task:**

- F1-Score: 52.51%
- Precision: 41.37%
- Recall: 71.88%
- AUROC: 81.98%

**Summary Task:**

- F1-Score: 34.31%
- Precision: 33.98%
- Recall: 34.65%
- AUROC: 64.06%

### Token-Level Performance (Whole Dataset)

- **F1-Score**: 38.33%
- **Precision**: 40.65%
- **Recall**: 36.27%
- **AUROC**: 67.00%

**Token-Level Task Performance:**

- **QA**: AUROC 76.32% (strongest token-level performance)
- **Data2txt**: F1 37.87%, AUROC 64.10%
- **Summary**: F1 16.47%, AUROC 55.99%

## Key Advantages

1. **Ultra-Efficient**: 32M parameters enable deployment on resource-constrained devices
2. **Balanced Performance**: Avoids the extreme precision-recall imbalance of LLMs
3. **Production-Ready**: Fast inference (30-60 examples/second) suitable for real-time RAG pipelines
4. **Cost-Effective**: Minimal computational requirements reduce operational costs
5. **Superior to LLMs**: Outperforms GPT-4.1 and Mistral Small on balanced metrics despite using <0.01% of their parameters

## Intended Use

This model is designed for:

- **Turkish RAG Systems**: Detecting hallucinations in generated Turkish text
- **Production Deployment**: Real-time hallucination detection in high-throughput pipelines
- **Resource-Constrained Environments**: Edge devices and cost-sensitive applications
- **Token-Level Analysis**: Fine-grained identification of unsupported claims

## Training Data

The model was fine-tuned on **RAGTruth-TR**, a Turkish translation of the RAGTruth benchmark dataset:

- **Training Samples**: 17,790 examples
- **Test Samples**: 2,700 examples
- **Task Types**: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
- **Annotation**: Token-level hallucination labels preserved during translation

## Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

- **Summary**: 900
examples
- **Data2txt**: 900 examples
- **QA**: 900 examples
- **Whole Dataset**: 2,700 examples

## Limitations

1. **Token-Level Performance**: While competitive at the example level, token-level F1-scores (38.33%) are lower than those of larger specialized models (71-78%)
2. **Summary Task**: Lower performance on summarization (34.31% F1) compared to Data2txt and QA
3. **Language Specificity**: Trained specifically for Turkish; performance on other languages has not been evaluated
4. **Domain Specificity**: Optimized for RAG scenarios; may not generalize to other hallucination-detection contexts

## Recommendations

**Use this model when:**

- Maximum efficiency and minimal resource usage are priorities
- Moderate performance (61% F1) is acceptable for your use case
- Edge deployment or mobile applications are required
- Cost optimization is critical
- Data2txt or QA tasks are the primary use cases

**Consider larger models (ettin-encoder-150M-TR, ModernBERT) when:**

- Maximum accuracy is required
- Token-level precision is critical
- Summary tasks are the primary use case
- Computational resources are abundant

## How to Use

```python
from lettucedetect import TransformerDetector

# Load the model
detector = TransformerDetector.from_pretrained(
    "newmindai/ettin-encoder-32M-TR-HD"
)

# Detect hallucinations in an answer given its source context
context = "Your source document text..."
question = "Your question..."
answer = "Generated answer text..."

result = detector.detect(
    context=context,
    question=question,
    answer=answer,
)

# Access token-level predictions
hallucinated_tokens = result.get_hallucinated_spans()
```

## Citation

If you use this model, please cite:

```bibtex
@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671},
}
```

## References

- [LettuceDetect Framework](https://arxiv.org/abs/2502.17125)

## Model Card Contact

For questions or issues, please open an issue on the project repository.
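As a supplementary illustration of working with token-level output: the detected spans can be rendered inline for quick inspection with a small, library-free helper. This is a minimal sketch only; the assumption that spans arrive as `(start, end)` character offsets into the answer is ours for illustration, not a documented part of the lettucedetect API.

```python
def highlight_spans(answer: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) character span of `answer` in [[ ]] markers
    so flagged (potentially hallucinated) segments stand out on inspection."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        pieces.append(answer[cursor:start])              # supported text, unchanged
        pieces.append("[[" + answer[start:end] + "]]")   # flagged span, marked
        cursor = end
    pieces.append(answer[cursor:])                       # trailing supported text
    return "".join(pieces)

answer = "Ankara is the capital of Turkey and has 20 million residents."
# Suppose the detector flagged the unsupported population claim:
print(highlight_spans(answer, [(40, 60)]))
# → Ankara is the capital of Turkey and has [[20 million residents]].
```

A helper like this is mainly useful for spot-checking detector output during development, before wiring span offsets into a production UI.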