---
base_model: CohereForAI/aya-expanse-8b
library_name: peft
license: apache-2.0
datasets:
- LingoIITGN/COMI-LINGUA
language:
- hi
- en
pipeline_tag: token-classification
tags:
- code-mixing
- Hinglish
- pos-tagging
metrics:
- precision
- recall
- f1
---

# Model Card for Aya-Expanse-8B Hinglish POS Tagging

### Model Description

This is a fine-tuned version of aya-expanse-8b for **Part-of-Speech (POS) Tagging** on Hinglish (Hindi-English code-mixed) text. It assigns a grammatical category to each token using a language-agnostic Universal POS tagset suited to code-mixed content in both Roman and Devanagari scripts.

Supported tags: NOUN, PROPN, VERB, ADJ, ADV, ADP, PRON, DET, CONJ, PART, PRON_WH, PART_NEG, NUM, X (for typos, punctuation, abbreviations, and foreign elements).

The model achieves **88.61 macro F1** on the COMI-LINGUA POS test set (5K instances), competitive with and slightly ahead of specialized traditional tools (codeswitch: 88.2 F1 zero-shot), and well above strong zero- and one-shot LLM baselines.

- **Model type:** LoRA-adapted Transformer LLM (8B parameters, ~32M trainable)
- **License:** apache-2.0
- **Finetuned from model:** CohereForAI/aya-expanse-8b

### Model Sources

- **Paper:** [COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing](https://aclanthology.org/2025.findings-emnlp.422.pdf)
- **Demo:** Integrated into the [COMI-LINGUA demo portal](https://lingo.iitgn.ac.in/comi-lingua/)

## Uses

- POS tagging in Hinglish pipelines, e.g. syntactic analysis and downstream tasks such as dependency parsing, sentiment analysis, and machine translation on code-mixed social media or news text.
- Preprocessing for structure-sensitive NLP on mixed-language content.

**Example inference prompt:**

```
Assign Part-of-Speech (POS) tags to each token in the sentence given as: "मीराबाई चानू ने 21 st Commonwealth Games में India के लिए first Gold medal जीता था।"

Output: [{'मीराबाई': 'PROPN'}, {'चानू': 'PROPN'}, {'ने': 'PART'}, {'21': 'NUM'}, {'st': 'X'}, {'Commonwealth': 'PROPN'}, {'Games': 'PROPN'}, {'में': 'ADP'}, {'India': 'PROPN'}, {'के': 'ADP'}, {'लिए': 'ADP'}, {'first': 'ADJ'}, {'Gold': 'NOUN'}, {'medal': 'NOUN'}, {'जीता': 'VERB'}, {'था': 'VERB'}, {'।': 'X'}]
```
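Below is a minimal inference sketch using Transformers and PEFT. The adapter repo ID is hypothetical and the instruction template is assumed from the example prompt above (the card does not publish the training template verbatim); adjust both to match the released checkpoint.

```python
# Minimal inference sketch; ADAPTER_ID is a hypothetical repo ID and the prompt
# template is assumed from the example above — adjust both to the released adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "CohereForAI/aya-expanse-8b"
ADAPTER_ID = "LingoIITGN/aya-expanse-8b-pos-hinglish"  # hypothetical adapter repo ID

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_ID)  # attach the LoRA adapter

sentence = "मीराबाई चानू ने 21 st Commonwealth Games में India के लिए first Gold medal जीता था।"
prompt = (
    "Assign Part-of-Speech (POS) tags to each token in the sentence given as: "
    f'"{sentence}"'
)

# Aya Expanse is a chat model, so wrap the instruction in its chat template.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```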
## Training Details

### Training Data

The model was fine-tuned on the POS task of COMI-LINGUA; see the [COMI-LINGUA Dataset Card](https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA).

### Training Procedure

#### Preprocessing

Tokenized with the base model's tokenizer, using instruction templates with few-shot examples. Instances were filtered to those with at least 5 tokens, excluding hateful and non-Hinglish text, to keep the focus on code-mixed content.

#### Training Hyperparameters

- **Regime:** PEFT LoRA (rank = 32, alpha = 64, dropout = 0.1)
- **Epochs:** 3
- **Batch size:** 4 (gradient accumulation = 8, effective batch = 32)
- **Learning rate:** 2e-4 (cosine schedule, warmup ratio = 0.1)
- **Weight decay:** 0.01

A configuration sketch corresponding to these settings is included at the end of this card.

## Evaluation

### Testing Data

COMI-LINGUA POS test set (5K instances).

### Metrics

Macro precision / recall / F1 (token-level).

### Results

| Setting | P | R | F1 |
|---------|------|------|------|
| Zero-shot | 76.92 | 29.50 | 40.55 |
| One-shot | 55.29 | 48.70 | 48.20 |
| **Fine-tuned** | **88.97** | **88.55** | **88.61** |

**Summary:** The fine-tuned model achieves competitive, state-of-the-art performance among open-weight models and edges out specialized traditional tools (codeswitch) on this high-quality benchmark; fine-tuning closes the gap with closed LLMs and handles script variability and code-mixing effectively.

## Bias, Risks, and Limitations

This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.

## Model Card Contact

[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/)

Mail at: [lingo@iitgn.ac.in](mailto:lingo@iitgn.ac.in)

## Citation

If you use this model, please cite the following work:

```
@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee and Beniwal, Himanshu and Singh, Mayank",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}
```
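For reproducibility, here is a minimal sketch of the fine-tuning configuration described under Training Hyperparameters. The target modules and output path are assumptions (the card specifies neither); the target modules were chosen to be consistent with the ~32M trainable parameters.

```python
# Sketch of the LoRA fine-tuning configuration; target_modules and output_dir
# are assumptions — the card specifies neither.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,                # LoRA rank
    lora_alpha=64,
    lora_dropout=0.1,
    # Assumed attention projections; consistent with ~32M trainable parameters.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="aya-expanse-8b-pos-hinglish",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,             # effective batch size = 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)
```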