This repository contains an encoder–decoder Transformer (BART-style) trained from scratch for Spanish-to-Portuguese translation on the es-pt subset of the Helsinki-NLP/tatoeba dataset.
| Component | Value |
|---|---|
| vocab size | 32,000 |
| d_model | 512 |
| encoder layers | 6 |
| decoder layers | 6 |
| attention heads | 8 |
| FFN dim | 2048 |
| dropout | 0.1 |
| parameters | ~61.6M |
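As a sanity check on the parameter figure, the count can be approximated from the table values alone. The sketch below assumes a standard BART layout: token embeddings shared between encoder, decoder, and output head; learned positional embeddings; a LayerNorm after each embedding block. It also assumes 1024 positions with an offset of 2, which is the default in Hugging Face's `BartConfig` and is not stated in this card. The function name `bart_param_count` is illustrative only.

```python
def bart_param_count(vocab=32_000, d_model=512, enc_layers=6, dec_layers=6,
                     ffn_dim=2048, max_positions=1024, pos_offset=2):
    """Rough parameter count for a BART-style model with tied embeddings.

    max_positions and pos_offset are assumptions (not given in the card).
    The number of attention heads does not change the count.
    """
    def linear(n_in, n_out):
        return n_in * n_out + n_out          # weight matrix + bias

    layer_norm = 2 * d_model                 # scale + shift
    attention = 4 * linear(d_model, d_model) # q, k, v, and output projections
    ffn = linear(d_model, ffn_dim) + linear(ffn_dim, d_model)

    enc_layer = attention + layer_norm + ffn + layer_norm
    # Decoder layers add a cross-attention block with its own LayerNorm.
    dec_layer = 2 * (attention + layer_norm) + ffn + layer_norm

    token_emb = vocab * d_model              # counted once (shared)
    pos_emb = (max_positions + pos_offset) * d_model
    per_stack_extra = pos_emb + layer_norm   # positions + embedding LayerNorm

    return (token_emb
            + enc_layers * enc_layer
            + dec_layers * dec_layer
            + 2 * per_stack_extra)


total = bart_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands at about 61.6M, in line with the table; a different maximum sequence length would shift the total slightly.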
- Dataset: Helsinki-NLP/tatoeba
- Language pair: es-pt
- Splits: train, validation, test
- Dataset link: https://huggingface.co/datasets/Helsinki-NLP/tatoeba
Metric: chrF (generation-based)
| Split | chrF |
|---|---|
| Validation | 70.6691 |
| Test | 70.4862 |
Note: chrF is a character n-gram F-score, so it gives partial credit for near-matching word forms. This makes it suitable for evaluating adequacy in translation between closely related Romance languages such as ES↔PT.
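To illustrate what the metric measures, here is a simplified sentence-level chrF in plain Python. It is a sketch only: it ignores whitespace, uses character n-grams of order 1 to 6 with β = 2 (the commonly used defaults), and is not the scoring code used for the numbers above, which this card does not specify. The names `char_ngrams` and `simple_chrf` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale; recall is weighted beta^2 times precision."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# A near-miss translation still earns partial credit at the character level.
print(simple_chrf("As pessoas dizem que estou louco.",
                  "As pessoas dizem que eu estou louco."))
```

For reported results, an established implementation such as the `CHRF` metric in sacreBLEU is preferable to a hand-rolled version like this one.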
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "liansheng06/bart-tatoeba-es-pt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Spanish source sentence to translate
text = "Las personas dicen que estoy loco."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Decode with beam search and print the Portuguese translation
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Try the interactive Gradio demo here:
https://huggingface.co/spaces/liansheng06/ATA-Assignment2