---
language: fa
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- roberta
- masked-lm
- persian
- farsi
- ner
- relation-extraction
model-index:
- name: persian_roberta_opt_tokenizer
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition (NER)
    dataset:
      name: ARMAN + PEYMA (merged)
      type: ner
      config: fa
    metrics:
    - type: precision
      value: 93.4
    - type: recall
      value: 94.8
    - type: f1
      value: 94.08
  - task:
      type: relation-classification
      name: Relation Extraction
    dataset:
      name: PERLEX
      type: relation-extraction
      config: fa
    metrics:
    - type: f1
      value: 90.0
---

# persian_roberta_opt_tokenizer

A compact RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi). We trained a Persian BPE tokenizer on a mixed corpus that combines formal text with social-media and chat data. The model is pre-trained with this tokenizer, optimized for Persian script, and evaluated on two downstream tasks:

- **NER** on a **merged ARMAN + PEYMA** corpus
- **Relation Extraction** on **PERLEX**

Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons.

---

## 1) Model Description

- **Architecture:** RoBERTa-style Transformer for Masked LM
- **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning
- **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)
- **Max sequence length:** 256

> The repository name on the Hub should be: `selfms/persian_roberta_opt_tokenizer`.

---

## 2) Architecture and Training Setup

**Backbone (example config):**

- hidden size: 256
- layers: 6
- attention heads: 4
- intermediate size: 1024
- activation: GELU
- dropout: 0.1
- positional embeddings: 514

> Adjust the numbers above to match your final `config.json` if they differ. All baselines used **the same parameter budget**.

**Pretraining objective:** Masked Language Modeling

**Fine-tuning hyperparameters (shared across all compared models):**

```text
epochs = 3
batch_size = 8
learning_rate = 3e-5
weight_decay = 0.01
max_tokens = 128
optimizer = AdamW
scheduler = linear with warmup (recommended 10% warmup)
seed = 42
```

---

## 3) Data and Tasks

### NER

- **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified tag set (BIO or BILOU; pick one consistently)
- **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, label alignment with wordpieces

### Relation Extraction

- **Dataset:** **PERLEX** (Persian Relation Extraction)
- **Entity marking:** special entity markers in the text (recommended) or span pooling; we used a simple `[CLS]`-style pooling baseline (a minimal sketch follows the results tables below)

---

## 4) Quantitative Results

### 4.1 NER (ARMAN + PEYMA, merged)

| Model | Precision | Recall | F1-Score |
|--------------------------:|----------:|-------:|---------:|
| **Proposed (this model)** | **93.4** | **94.8** | **94.08** |
| TooKaBERT-base | 94.9 | 96.2 | 95.5 |
| FABERT | 94.1 | 95.3 | 94.7 |

### 4.2 Relation Extraction (PERLEX)

| Model | F1-score (%) |
|--------------------------:|-------------:|
| **Proposed (this model)** | **90** |
| TooKaBERT-base | 91 |
| FABERT | 88 |

> All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.
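As referenced in Section 3, a minimal sketch of the Relation Extraction baseline is shown below. It is illustrative only, not the exact training code: the marker strings `[E1]`/`[/E1]`/`[E2]`/`[/E2]`, the example sentence, and `num_relations` are hypothetical, and `AutoModelForSequenceClassification` attaches a freshly initialized classification head over the first (`<s>`) token, so the model must still be fine-tuned on PERLEX before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "selfms/persian_roberta_opt_tokenizer"
num_relations = 8  # hypothetical: set to the number of PERLEX relation types you use

tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=num_relations)

# Hypothetical entity markers; register them so BPE does not split them
tok.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])
model.resize_token_embeddings(len(tok))

# Illustrative PERLEX-style sentence with both entities marked
text = "[E1] تهران [/E1] پایتخت [E2] ایران [/E2] است"
x = tok(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**x).logits  # RoBERTa's classification head pools the first (<s>) token
print(logits.shape)  # (1, num_relations); fine-tune before trusting argmax
```

Fine-tuning then proceeds with the shared hyperparameters from Section 2 (3 epochs, batch size 8, learning rate 3e-5, weight decay 0.01).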
---

## 5) Usage

### 5.1 Fill-Mask Inference (simple)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

path = "selfms/persian_roberta_opt_tokenizer"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path)
model.eval()

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)

# The input must contain exactly one mask token for the fill-mask pipeline.
print(fill(f"سلام کسی تحلیل دقیقی از این {tokenizer.mask_token} کی میخواد حرکت کنه"))
```

### 5.2 Text-Embedding Inference (simple)

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModel.from_pretrained(path).eval()

def embed(text):
    with torch.no_grad():
        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
        h = mdl(**x).last_hidden_state                       # (1, seq_len, hidden)
        a = x["attention_mask"].unsqueeze(-1)                # (1, seq_len, 1)
        v = (h * a).sum(1) / a.sum(1).clamp(min=1)           # mean pooling over non-padding tokens
        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # L2-normalized 1D vector

text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
vec = embed(text)
print(len(vec))
```

### 5.3 Tokenizer Inference (simple)

```python
from transformers import AutoTokenizer

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)

text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"
enc = tok(text, return_tensors="pt")

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
print("Tokens:", tokens)
print("IDs   :", enc["input_ids"][0].tolist())
```

---

## 6) Comparison with Other Models

Under identical parameter budgets and training settings:

- **NER (ARMAN + PEYMA):** TooKaBERT achieves the highest F1 (95.5); our model is competitive (94.08) and close to FABERT (94.7), though slightly lower on F1.
- **Relation Extraction (PERLEX):** Our model (F1 = 90) surpasses FABERT (88) and is slightly below TooKaBERT (91).

These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone.

---

## 7) Limitations, Bias, and Ethical Considerations

- **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.
- **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality (a minimal normalization sketch follows this section).
- **Sequence length:** Experiments are reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory.
- **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.
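The Persian normalization referred to in Sections 3 and 7 (digits, punctuation, ZWNJ) can be approximated with a small helper like the one below. This is a hedged sketch of common Persian preprocessing conventions, not the exact pipeline used during training; extend the character mappings as your data requires.

```python
import re

# Common Persian normalization (illustrative only): unify Arabic/Persian letter
# variants, map Arabic-Indic digits to Persian digits, and tidy ZWNJ/whitespace.
ARABIC_TO_PERSIAN = str.maketrans({
    "ي": "ی", "ك": "ک",  # Arabic yeh/kaf -> Persian yeh/kaf
    "٠": "۰", "١": "۱", "٢": "۲", "٣": "۳", "٤": "۴",
    "٥": "۵", "٦": "۶", "٧": "۷", "٨": "۸", "٩": "۹",
})

ZWNJ = "\u200c"

def normalize_fa(text: str) -> str:
    text = text.translate(ARABIC_TO_PERSIAN)
    text = re.sub(f"{ZWNJ}+", ZWNJ, text)          # collapse repeated ZWNJ
    text = re.sub(rf"\s*{ZWNJ}\s*", ZWNJ, text)    # remove spaces padding a ZWNJ
    text = re.sub(r"\s+", " ", text).strip()       # collapse whitespace
    return text

print(normalize_fa("كتاب هاي  درسي"))  # -> "کتاب های درسی" (letters unified, spaces collapsed)
```

Applying the same normalization at inference time as at training time helps the tokenizer produce consistent sub-words.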
---

## 8) How to Reproduce

1) Pretrain or load the MLM checkpoint:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
```

2) Fine-tune for NER/RE with the shared hyperparameters:

```
epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
```

3) Evaluate:

- NER: token-level Precision/Recall/F1 (micro or macro; report your choice consistently)
- RE: relation-level micro-F1 on PERLEX

---

## 9) Files in the Repository

- `config.json`
- `model.safetensors` or `pytorch_model.bin`
- `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json`
- `vocab.json`, `merges.txt` (BPE)
- `README.md`, `LICENSE`, `.gitattributes`

> Ensure `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present so the Hub widget works out-of-the-box.

---

## 10) Citation

If you use this model, please cite:

```bibtex
@misc{persian_roberta_opt_tokenizer_2025,
  title        = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
  author       = {selfms},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
  note         = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
}
```

---

## 11) License

Apache-2.0 (recommended). Please verify dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.

## Metrics & Evaluation Notes

- **NER:** entity-level micro-F1 under the **BIO** tagging scheme.
- **Relation Extraction (RE):** micro-F1 at relation level.
- **Sequence length:** the model supports up to **512** tokens (RoBERTa uses 514 position embeddings including special tokens). Evaluations in this report used **256** for efficiency.

## Model Config Summary

- **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**).
- **Max positions:** 514 (effective input up to 512 tokens).
- **Dropout:** hidden 0.1, attention 0.1.
- **Vocab size:** 48,000 (BPE).
- **Special tokens:** `<s>` = 0, `<pad>` = 1, `</s>` = 2, `<mask>` as the mask token.
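A quick way to verify that a downloaded checkpoint matches the summary above is to print its tokenizer and config fields. This assumes the files listed in Section 9 are available at `selfms/persian_roberta_opt_tokenizer`; compare the printed values against the config summary.

```python
from transformers import AutoConfig, AutoTokenizer

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
cfg = AutoConfig.from_pretrained(path)

print("mask token     :", tok.mask_token, "->", tok.mask_token_id)
print("special tokens :", tok.all_special_tokens)
print("vocab size     :", cfg.vocab_size)
print("max positions  :", cfg.max_position_embeddings)
print("hidden size    :", cfg.hidden_size,
      "| layers:", cfg.num_hidden_layers,
      "| heads:", cfg.num_attention_heads)
```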