Hassaniya Punctuation Restoration Model (الحسانية)
This is the first open-source model dedicated to restoring punctuation for the Hassaniya Arabic dialect (spoken in Mauritania, Southern Morocco, etc.).
It restores the following punctuation marks:
- Period (.)
- Comma (،)
- Question Mark (؟)
- Exclamation Mark (!)
- Colon (:)
Model Details
- Base Model: UBC-NLP/MARBERTv2 (state-of-the-art for Arabic dialects).
- Training Data: ~10,000 sentences of Hassaniya dialogue and narrative text.
- Task: Token Classification (The model tags each word with the punctuation that should follow it).
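To make the Task entry above concrete, here is a minimal sketch of how per-word labels turn into punctuated text. It does not load the model. The label names are taken from the punctuation map used in this card, the "O" label for "no punctuation" is an assumption, and `apply_labels` is a hypothetical helper with placeholder words.

```python
# Label names follow the punctuation map in this card.
# "O" (no punctuation after the word) is an assumed label name.
PUNCT = {"PERIOD": ".", "COMMA": "،", "QUESTION": "؟", "EXCLAM": "!", "COLON": ":"}


def apply_labels(words, labels):
    """Append to each word the punctuation mark named by its label."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    # Unknown labels (including "O") add nothing after the word.
    return " ".join(word + PUNCT.get(label, "") for word, label in zip(words, labels))


# Placeholder tokens stand in for real Hassaniya words.
words = ["w1", "w2", "w3"]
labels = ["O", "COMMA", "QUESTION"]
print(apply_labels(words, labels))
```

The real pipeline works on subword tokens and character offsets rather than whole words, so this only illustrates the labeling scheme, not the model's internals.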
How to use
You can use this model directly with the Hugging Face pipeline:
from transformers import pipeline
# Load model
model_name = "Emin009/hassaniya-punctuation-restoration"
punct_pipe = pipeline("token-classification", model=model_name, aggregation_strategy="simple")
# Define punctuation map
punct_map = {"PERIOD": ".", "COMMA": "،", "QUESTION": "؟", "EXCLAM": "!", "COLON": ":"}
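def restore_punct(text):
    # Each result is a labeled span with character offsets into `text`
    results = punct_pipe(text)
    output_text = ""
    last_idx = 0
    for res in results:
        # Copy the original text up to the end of the labeled span
        output_text += text[last_idx:res['end']]
        # Append the punctuation mark predicted to follow this span
        if res['entity_group'] in punct_map:
            output_text += punct_map[res['entity_group']]
        last_idx = res['end']
    # Copy whatever remains after the last labeled span
    output_text += text[last_idx:]
    return output_text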
# Test
text = "ู
ููู ูุฐุง ุงูุฑุงุฌู ุฌุง ู
ููู"
print(restore_punct(text))
# Output: ู
ููู ูุฐุง ุงูุฑุงุฌูุ ุฌุง ู
ูููุ
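For longer or already-punctuated input, a small wrapper around the function above may help. The sketch below is an assumption-laden illustration, not part of the model: `strip_punct` and `restore_long` are hypothetical helpers, the 100-word chunk size is an arbitrary choice meant to stay well inside the input length limit that BERT-style encoders usually have, and `restore_fn` is whatever restoration function you pass in (for example `restore_punct`).

```python
import re

# Marks restored by this model, plus their ASCII comma and question-mark variants.
_PUNCT_RE = re.compile(r"[.،؟!:,?]")


def strip_punct(text):
    """Remove existing punctuation marks and collapse runs of whitespace."""
    return " ".join(_PUNCT_RE.sub(" ", text).split())


def restore_long(text, restore_fn, max_words=100):
    """Strip punctuation, split into word chunks, restore each chunk, and rejoin.

    max_words is an assumed, untuned chunk size.
    """
    words = strip_punct(text).split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    return " ".join(restore_fn(chunk) for chunk in chunks)


# Stand-in restore function so the sketch runs without downloading the model.
print(restore_long("a, b. c? d e", lambda chunk: chunk.upper(), max_words=2))
```

With the model loaded, the call would look like `restore_long(long_text, restore_punct)`. Fixed-size chunks can cut a sentence in the middle, so predictions near chunk boundaries may be less reliable.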