NagaNLP POS Tagger (XLM-RoBERTa)

NagaNLP-POS is a Part-of-Speech (POS) tagging model fine-tuned for Nagamese (Naga Pidgin). It is built on XLM-RoBERTa Base and achieves an F1-score of 0.91 on a held-out test set.

This model is part of the NagaNLP project, aimed at developing foundational resources for the low-resource languages of Nagaland.

Model Details

  • Developer: Agniva Maiti
  • Base Architecture: XLM-RoBERTa Base
  • Task: Token Classification (POS Tagging)
  • Language: Nagamese (nag)
  • Training Data: ~214 annotated sentences (CoNLL format).
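The exact column layout of the training file is not specified beyond "CoNLL format"; as a rough illustration, a minimal reader for the common two-column variant (token, tab, tag; blank line between sentences) could look like this:

```python
# Sketch of a CoNLL-style reader, assuming a simple token<TAB>tag layout
# (the actual corpus columns may differ).
def read_conll(lines):
    """Parse lines into a list of (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")[:2]
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences

sample = ["moi\tPRON", "jai\tVERB", ".\tPUNCT", ""]
print(read_conll(sample))  # [(['moi', 'jai', '.'], ['PRON', 'VERB', 'PUNCT'])]
```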

Performance

The model was evaluated on a held-out test set (10% split):

  • F1 Score: 0.9127
  • Accuracy: ~99% (on validation set)
  • Validation Loss: 0.77

How to Use

You can use this model directly with the Hugging Face pipeline:

from transformers import pipeline

# Load the pipeline
# Note: Aggregation strategy 'simple' merges sub-tokens into words
pos_pipeline = pipeline(
    "token-classification",
    model="agnivamaiti/naganlp-pos-tagger",
    aggregation_strategy="simple"
)

# Inference
text = "moi etiya school jai ase."
results = pos_pipeline(text)

# Print results
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Output:

moi: PRON (0.22)
etiya: ADV (0.52)
school: NOUN (0.92)
jai ase: VERB (0.95)
.: PUNCT (0.95)
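Note that some scores are low ("moi" at 0.22), consistent with the small training corpus. A simple post-processing step, sketched here against the dict shape the token-classification pipeline returns with aggregation_strategy="simple", can drop low-confidence tags:

```python
# Sketch: filter pipeline results to confident (word, tag) pairs.
# The dicts mirror the Hugging Face token-classification pipeline output
# shown above; the 0.5 threshold is an arbitrary example value.
def confident_tags(results, threshold=0.5):
    """Keep only word/tag pairs whose score meets the threshold."""
    return [(r["word"], r["entity_group"])
            for r in results if r["score"] >= threshold]

results = [
    {"word": "moi", "entity_group": "PRON", "score": 0.22},
    {"word": "etiya", "entity_group": "ADV", "score": 0.52},
    {"word": "school", "entity_group": "NOUN", "score": 0.92},
    {"word": "jai ase", "entity_group": "VERB", "score": 0.95},
    {"word": ".", "entity_group": "PUNCT", "score": 0.95},
]
print(confident_tags(results))
# [('etiya', 'ADV'), ('school', 'NOUN'), ('jai ase', 'VERB'), ('.', 'PUNCT')]
```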

Training Details

  • Epochs: 10
  • Batch Size: 16
  • Learning Rate: 3e-5
  • Optimizer: AdamW
  • Precision: Float32
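The original training script is not included; the hyperparameters above can be reproduced with the Hugging Face Trainer roughly as follows (dataset objects, output directory, and label count are placeholders, not values from the actual run):

```python
# Sketch only: Trainer setup matching the listed hyperparameters.
# num_labels, output_dir, and the datasets are illustrative placeholders.
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=17  # e.g. the 17 UPOS tags; actual tagset may differ
)

args = TrainingArguments(
    output_dir="naganlp-pos",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=3e-5,  # Trainer uses AdamW by default
    fp16=False,          # Float32 precision, as listed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: tokenized, label-aligned dataset
    eval_dataset=eval_dataset,    # placeholder: 10% held-out split
)
# trainer.train()
```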

Limitations

  • Data Scarcity: Trained on a small corpus (~200 sentences). While it performs well on simple sentences, it may struggle with complex grammatical structures or rare vocabulary.
  • Code-Switching: Nagamese text frequently mixes in English and Assamese words. This model is optimized for standard Nagamese and may be less reliable on heavily code-switched input.

Citation

If you use this model, please cite the associated NagaNLP research paper. Citation details to be added.
