Hassaniya Punctuation Restoration Model (الحسانية)

This is the first open-source model dedicated to restoring punctuation in the Hassaniya Arabic dialect, which is spoken in Mauritania, southern Morocco, and neighboring regions.

It restores the following punctuation marks:

Period (.)
Comma (،)
Question Mark (؟)
Exclamation Mark (!)
Colon (:)

Model Details

Base Model: UBC-NLP/MARBERTv2 (a BERT model pretrained on Modern Standard Arabic and dialectal Arabic text).
Training Data: ~10,000 sentences of Hassaniya dialogue and narrative text.
Task: Token classification (the model tags each word with the punctuation mark, if any, that should follow it).
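
To make the tagging scheme concrete, here is a minimal sketch of how punctuated text maps to per-word labels and back. It does not load the model. The label names are taken from the punctuation map in the usage example on this card; the "O" label for "no punctuation" and the helper names to_word_labels and from_word_labels are assumptions made for illustration, not the author's training code.

```python
# Label names follow the punct_map used in this card. "O" (no punctuation)
# is an assumed name for the empty class.
MARK_TO_LABEL = {".": "PERIOD", "،": "COMMA", "؟": "QUESTION", "!": "EXCLAM", ":": "COLON"}
LABEL_TO_MARK = {label: mark for mark, label in MARK_TO_LABEL.items()}


def to_word_labels(punctuated_text):
    """Split punctuated text into (word, label) pairs, one label per word."""
    pairs = []
    for token in punctuated_text.split():
        label = "O"
        # A trailing mark becomes the label of the word it follows.
        if token[-1] in MARK_TO_LABEL:
            label = MARK_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that consisted only of a punctuation mark
            pairs.append((token, label))
    return pairs


def from_word_labels(pairs):
    """Rebuild punctuated text from (word, label) pairs."""
    return " ".join(word + LABEL_TO_MARK.get(label, "") for word, label in pairs)
```

For example, to_word_labels("منهو هذا الراجل؟ جا منين؟") tags the third and fifth words as QUESTION and the others as "O", and from_word_labels reverses the mapping. Only one trailing mark per word is handled in this sketch.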

How to use

You can use this model directly with the Hugging Face pipeline:

from transformers import pipeline

# Load model
model_name = "Emin009/hassaniya-punctuation-restoration"
punct_pipe = pipeline("token-classification", model=model_name, aggregation_strategy="simple")

# Define punctuation map
punct_map = {"PERIOD": ".", "COMMA": "،", "QUESTION": "؟", "EXCLAM": "!", "COLON": ":"}

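def restore_punct(text):
    """Return text with the predicted punctuation inserted after each tagged span."""
    results = punct_pipe(text)
    output_text = ""
    last_idx = 0
    for res in results:
        # Copy the input up to the end of the tagged span (character offsets).
        output_text += text[last_idx:res['end']]
        # Append the mark for this label; labels missing from punct_map add nothing.
        output_text += punct_map.get(res['entity_group'], "")
        last_idx = res['end']
    # Copy whatever follows the last tagged span.
    output_text += text[last_idx:]
    return output_text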

# Test
text = "منهو هذا الراجل جا منين"
print(restore_punct(text))
# Output: منهو هذا الراجل؟ جا منين؟
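
BERT-style encoders accept a limited number of tokens per input (typically 512), so long transcripts may need to be split before punctuation is restored. The sketch below is one possible approach, not part of the model's API: restore_long and max_words are hypothetical names, and the default of 100 words is an assumed, conservative chunk size. The restoration function is passed in as an argument so that any callable taking and returning a string can be used.

```python
def restore_long(text, restore_fn, max_words=100):
    """Apply restore_fn to consecutive chunks of at most max_words words.

    restore_fn is any callable that takes a string and returns a string.
    max_words is an assumed chunk size, not a documented limit of the model.
    """
    words = text.split()
    # Group the words into fixed-size chunks and re-join each chunk with spaces.
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    # Restore each chunk independently, then join the results.
    return " ".join(restore_fn(chunk) for chunk in chunks)
```

With the function defined above, a call would look like restore_long(transcript, restore_punct). Because chunks are cut at fixed word counts, the model sees no context across a boundary, so predictions near chunk edges may be less reliable than those in the middle of a chunk.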

Model size: 0.2B params (Safetensors, tensor type F32)
