Hassaniya Punctuation Restoration Model (الحسانية)

This is the first open-source model dedicated to restoring punctuation for the Hassaniya Arabic dialect (spoken in Mauritania, southern Morocco, and neighboring regions).

It restores the following punctuation marks:

  • Period (.)
  • Comma (،)
  • Question Mark (؟)
  • Exclamation Mark (!)
  • Colon (:)

Model Details

  • Base Model: UBC-NLP/MARBERTv2 (a BERT-based model pretrained on both dialectal and Modern Standard Arabic).
  • Training Data: ~10,000 sentences of Hassaniya dialogue and narrative text.
  • Task: Token classification (the model tags each word with the punctuation mark, if any, that should follow it).
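
The tagging scheme described above can be illustrated with a small sketch that turns an already punctuated sentence into (word, label) pairs, which is the kind of target a token classifier learns. The label names mirror those used in this card's usage code. The "O" label for "no punctuation" and the helper name to_word_labels are assumptions made for illustration and are not taken from the model's training code.

```python
# Hypothetical helper: derive per-word punctuation labels from punctuated text.
# Label names mirror this card's usage code; "O" (no punctuation) is an assumption.
MARK_TO_LABEL = {
    ".": "PERIOD",
    "\u060C": "COMMA",     # Arabic comma
    "\u061F": "QUESTION",  # Arabic question mark
    "!": "EXCLAM",
    ":": "COLON",
}

def to_word_labels(punctuated_text):
    """Return (word, label) pairs; the label names the mark that follows the word."""
    pairs = []
    for token in punctuated_text.split():
        label = "O"
        # Peel trailing marks off the token; the mark closest to the word wins.
        while token and token[-1] in MARK_TO_LABEL:
            label = MARK_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that consisted only of punctuation
            pairs.append((token, label))
    return pairs

for word, label in to_word_labels("منهو هذا الراجل\u061F جا منين\u061F"):
    print(word, label)
```

Stripping the marks from such pairs gives the unpunctuated input, and the labels give the per-word targets; how subword tokens are aligned to words is left out of this sketch.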

How to use

You can use this model directly with the Hugging Face pipeline:

from transformers import pipeline

# Load model
model_name = "Emin009/hassaniya-punctuation-restoration"
punct_pipe = pipeline("token-classification", model=model_name, aggregation_strategy="simple")

# Map predicted labels to punctuation marks (Arabic comma and question mark)
punct_map = {"PERIOD": ".", "COMMA": "،", "QUESTION": "؟", "EXCLAM": "!", "COLON": ":"}

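def restore_punct(text):
    """Insert the predicted punctuation mark after each labeled span of text."""
    results = punct_pipe(text)
    output_text = ""
    last_idx = 0
    for res in results:
        # Copy the input up to the end (character offset) of the labeled span
        output_text += text[last_idx:res['end']]
        # Append the mark for this label, if it is one we know how to render
        if res['entity_group'] in punct_map:
            output_text += punct_map[res['entity_group']]
        last_idx = res['end']
    # Copy whatever remains after the last labeled span
    output_text += text[last_idx:]
    return output_text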

# Test
text = "منهو هذا الراجل جا منين"
print(restore_punct(text))
# Output: منهو هذا الراجل؟ جا منين؟
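
To see how the character offsets drive reinsertion without downloading the model, the same loop can be exercised on hand-written entries shaped like the pipeline's aggregated output (dicts with entity_group and end keys). The entries below are made up for illustration and are not real model predictions; insert_marks is a hypothetical name, not part of the Transformers library.

```python
# Illustrative only: mock entries stand in for real model predictions.
PUNCT_MAP = {
    "PERIOD": ".",
    "COMMA": "\u060C",     # Arabic comma
    "QUESTION": "\u061F",  # Arabic question mark
    "EXCLAM": "!",
    "COLON": ":",
}

def insert_marks(text, entities):
    """Insert a mark after each labeled span, using its `end` character offset."""
    pieces = []
    last_idx = 0
    # Process spans left to right so the slices never overlap.
    for ent in sorted(entities, key=lambda e: e["end"]):
        pieces.append(text[last_idx:ent["end"]])
        # Unknown labels contribute no mark.
        pieces.append(PUNCT_MAP.get(ent["entity_group"], ""))
        last_idx = ent["end"]
    pieces.append(text[last_idx:])
    return "".join(pieces)

text = "منهو هذا الراجل جا منين"
third_word = "الراجل"
mock_entities = [
    # End offset of the third word, computed rather than hard-coded.
    {"entity_group": "QUESTION", "end": text.index(third_word) + len(third_word)},
    # End offset of the final word is the end of the string.
    {"entity_group": "QUESTION", "end": len(text)},
]
print(insert_marks(text, mock_entities))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which matters little for single sentences but keeps the loop linear on longer inputs.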