---
license: apache-2.0
base_model: ettin-encoder-32M-TR
tags:
- turkish
- hallucination-detection
- rag
- token-classification
- lettucedetect
pipeline_tag: token-classification
---

# ettin-encoder-32M-TR-HD

## Model Description

`ettin-encoder-32M-TR-HD` is an ultra-efficient 32M-parameter model fine-tuned for Turkish hallucination detection in Retrieval-Augmented Generation (RAG) systems. Despite its minimal size, the model achieves competitive performance, outperforming large language models such as GPT-4.1 and Mistral Small on balanced metrics while offering substantial computational-efficiency advantages.

This model is part of the Turk-LettuceDetect project, which adapts the LettuceDetect framework to Turkish. It performs token-level binary classification to identify unsupported claims in Turkish text generated by RAG systems.

## Model Details

- **Model Type**: Encoder-based transformer for token classification
- **Parameters**: 32M
- **Language**: Turkish
- **Task**: Hallucination detection (token-level binary classification)
- **Framework**: LettuceDetect
- **Base Model**: ettin-encoder-32M-TR
- **Fine-tuned on**: RAGTruth-TR dataset

## Performance Highlights

### Example-Level Performance (Whole Dataset)

- **F1-Score**: 61.35% (outperforms GPT-4.1's 53.97%)
- **Precision**: 61.38% (vs. GPT-4.1's 37.09%)
- **Recall**: 61.32% (balanced, avoiding false-positive overload)
- **AUROC**: 78.24% (vs. GPT-4.1's 54.45%)

### Task-Specific Performance

**Data2txt Task:**

- F1-Score: 75.75%
- Precision: 85.96%
- Recall: 67.70%
- AUROC: 82.11%

**QA Task:**

- F1-Score: 52.51%
- Precision: 41.37%
- Recall: 71.88%
- AUROC: 81.98%

**Summary Task:**

- F1-Score: 34.31%
- Precision: 33.98%
- Recall: 34.65%
- AUROC: 64.06%

### Token-Level Performance (Whole Dataset)

- **F1-Score**: 38.33%
- **Precision**: 40.65%
- **Recall**: 36.27%
- **AUROC**: 67.00%

**Token-Level Task Performance:**

- **QA**: AUROC 76.32% (strongest token-level performance)
- **Data2txt**: F1 37.87%, AUROC 64.10%
- **Summary**: F1 16.47%, AUROC 55.99%

## Key Advantages

1. **Ultra-Efficient**: 32M parameters enable deployment on resource-constrained devices
2. **Balanced Performance**: Avoids the extreme precision-recall imbalance of LLMs
3. **Production-Ready**: Fast inference (30-60 examples/second) suitable for real-time RAG pipelines
4. **Cost-Effective**: Minimal computational requirements reduce operational costs
5. **Superior to LLMs**: Outperforms GPT-4.1 and Mistral Small on balanced metrics despite using <0.01% of their parameters

## Intended Use

This model is designed for:

- **Turkish RAG Systems**: Detecting hallucinations in generated Turkish text
- **Production Deployment**: Real-time hallucination detection in high-throughput pipelines
- **Resource-Constrained Environments**: Edge devices and cost-sensitive applications
- **Token-Level Analysis**: Fine-grained identification of unsupported claims

## Training Data

The model was fine-tuned on **RAGTruth-TR**, a Turkish translation of the RAGTruth benchmark dataset:

- **Training Samples**: 17,790 examples
- **Test Samples**: 2,700 examples
- **Task Types**: Question Answering (MS MARCO), Data-to-Text (Yelp reviews), Summarization (CNN/Daily Mail)
- **Annotation**: Token-level hallucination labels preserved during translation

## Evaluation Data

The model was evaluated on the RAGTruth-TR test set across three task types:

- **Summary**: 900
examples
- **Data2txt**: 900 examples
- **QA**: 900 examples
- **Whole Dataset**: 2,700 examples

## Limitations

1. **Token-Level Performance**: While competitive at the example level, token-level F1-scores (38.33%) are lower than those of larger specialized models (71-78%)
2. **Summary Task**: Lower performance on summarization (34.31% F1) compared to Data2txt and QA
3. **Language Specificity**: Trained specifically for Turkish; performance on other languages has not been evaluated
4. **Domain Specificity**: Optimized for RAG scenarios; may not generalize to other hallucination-detection contexts

## Recommendations

**Use this model when:**

- Maximum efficiency and minimal resource usage are priorities
- Moderate performance (61% F1) is acceptable for your use case
- Edge deployment or mobile applications are required
- Cost optimization is critical
- Data2txt or QA tasks are the primary use cases

**Consider larger models (ettin-encoder-150M-TR, ModernBERT) when:**

- Maximum accuracy is required
- Token-level precision is critical
- Summary tasks are the primary use case
- Computational resources are abundant

## How to Use

```python
from lettucedetect import TransformerDetector

# Load the model
detector = TransformerDetector.from_pretrained(
    "newmindai/ettin-encoder-32M-TR-HD"
)

# Detect hallucinations in an answer given its source context
context = "Your source document text..."
question = "Your question..."
answer = "Generated answer text..."

result = detector.detect(
    context=context,
    question=question,
    answer=answer,
)

# Access token-level predictions
hallucinated_tokens = result.get_hallucinated_spans()
```

## Citation

If you use this model, please cite:

```bibtex
@misc{taş2025turklettucedetecthallucinationdetectionmodels,
      title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
      author={Selva Taş and Mahmut El Huseyni and Özay Ezerceli and Reyhan Bayraktar and Fatma Betül Terzioğlu},
      year={2025},
      eprint={2509.17671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.17671},
}
```

## References

- [LettuceDetect Framework](https://arxiv.org/abs/2502.17125)

## Model Card Contact

For questions or issues, please open an issue on the project repository.
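As a supplementary illustration of working with token-level output: the detected spans can be rendered inline for quick inspection with a small, library-free helper. This is a minimal sketch only; the assumption that spans arrive as `(start, end)` character offsets into the answer is ours for illustration, not a documented part of the lettucedetect API.

```python
def highlight_spans(answer: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) character span of `answer` in [[ ]] markers
    so flagged (potentially hallucinated) segments stand out on inspection."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        pieces.append(answer[cursor:start])              # supported text, unchanged
        pieces.append("[[" + answer[start:end] + "]]")   # flagged span, marked
        cursor = end
    pieces.append(answer[cursor:])                       # trailing supported text
    return "".join(pieces)

answer = "Ankara is the capital of Turkey and has 20 million residents."
# Suppose the detector flagged the unsupported population claim:
print(highlight_spans(answer, [(40, 60)]))
# → Ankara is the capital of Turkey and has [[20 million residents]].
```

A helper like this is mainly useful for spot-checking detector output during development, before wiring span offsets into a production UI.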