BART (From Scratch) Spanish to Portuguese (ES to PT)

This repository contains an encoder–decoder Transformer (BART-style) trained from scratch for Spanish to Portuguese translation on the es-pt configuration of the Helsinki-NLP/tatoeba dataset.

Model details

Task: Machine Translation (ES to PT)
Architecture: BART-style encoder–decoder Transformer (trained from scratch)
Tokenizer: Subword BPE (32k vocab)
Max sequence length: 256 (source) / 256 (target)
Decoding used for evaluation: Beam search (beam=5)
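
Beam search with beam=5 keeps the five highest-scoring partial translations at each decoding step instead of committing to the single most likely token. The sketch below illustrates the idea on a toy next-token table. The names `beam_search`, `toy_step`, and `TABLE` are hypothetical and exist only for this illustration; the sketch omits length normalization and batching, so it is not the decoding code used by this model.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=5, max_len=10):
    """Toy beam search. step_fn(prefix) returns {token: probability}."""
    beams = [([bos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, prob in step_fn(tokens).items():
                if prob > 0:
                    candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the beam_size best expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda c: c[1])[0]

# Hypothetical next-token distributions, keyed by the last token only
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 1.0},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}

def toy_step(prefix):
    return TABLE[prefix[-1]]

greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
wider = beam_search(toy_step, "<s>", "</s>", beam_size=2)
print(greedy)  # ['<s>', 'a', 'x', '</s>'], total probability 0.3
print(wider)   # ['<s>', 'b', 'z', '</s>'], total probability 0.4
```

In this toy case a beam of 1 (greedy) follows the locally best first token and ends with a less probable sequence, while a wider beam recovers the better one.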

Architecture summary

Component	Value
vocab size	32,000
d_model	512
encoder layers	6
decoder layers	6
attention heads	8
FFN dim	2048
dropout	0.1
parameters	~61.6M
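
As a sanity check on the parameter count, the figures in the table can be combined by hand. The sketch below assumes a Hugging Face-style BART layout: one embedding matrix shared by encoder, decoder, and output layer; learned positional embeddings with 1024 positions plus an offset of 2; biases on every linear layer; and a layer norm after the embeddings. None of these details are stated in this card, and `estimate_bart_params` is a hypothetical helper.

```python
def estimate_bart_params(vocab=32_000, d_model=512, enc_layers=6, dec_layers=6,
                         ffn_dim=2048, max_positions=1024):
    """Rough parameter count for a BART-style model (assumed layout)."""
    layer_norm = 2 * d_model                      # weight + bias
    attention = 4 * (d_model * d_model + d_model)  # q, k, v, out projections
    ffn = (d_model * ffn_dim + ffn_dim) + (ffn_dim * d_model + d_model)

    # Encoder layer: self-attention + FFN, each followed by a layer norm
    enc_layer = attention + ffn + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + FFN, three layer norms
    dec_layer = 2 * attention + ffn + 3 * layer_norm

    shared_embeddings = vocab * d_model            # assumed tied everywhere
    # Assumed: learned positions with an offset of 2, one table per stack
    positions = 2 * (max_positions + 2) * d_model
    embedding_norms = 2 * layer_norm

    return (shared_embeddings + positions + embedding_norms
            + enc_layers * enc_layer + dec_layers * dec_layer)

total = estimate_bart_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to 61,575,168, which is consistent with the ~61.6M reported above. Most of the weight sits in the decoder stack (about 25.2M) and the encoder stack (about 18.9M), with the shared embeddings contributing about 16.4M.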

Dataset

Dataset: Helsinki-NLP/tatoeba
Config / language pair: es-pt
Splits used: official train, validation, test
Train/Val/Test sizes: 63,716 / 1,998 / 1,999
Leakage prevention: tokenizer trained on training split only; duplicate (src,tgt) pairs removed via hashing.
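
The hashing-based deduplication mentioned above could look roughly like the following. This is a minimal sketch, not the script used for this model: `pair_key` and `dedupe` are hypothetical helpers, the choice of SHA-256 and whitespace stripping is an assumption, and the card does not say whether duplicates were removed within each split, across splits, or both. The sketch supports either by letting the caller pass in hashes already seen.

```python
import hashlib

def pair_key(src, tgt):
    """Stable hash of a (source, target) pair."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc")
    payload = f"{src.strip()}\x1f{tgt.strip()}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def dedupe(pairs, seen=None):
    """Drop pairs whose hash is already in `seen`; return (unique, seen)."""
    seen = set() if seen is None else seen
    unique = []
    for src, tgt in pairs:
        key = pair_key(src, tgt)
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique, seen

train_raw = [("Hola.", "Olá."), ("Hola.", "Olá."), ("Gracias.", "Obrigado.")]
val_raw = [("Gracias.", "Obrigado."), ("Adiós.", "Adeus.")]

train, seen = dedupe(train_raw)      # removes the repeated training pair
val, seen = dedupe(val_raw, seen)    # also drops pairs already in train
print(len(train), len(val))
```

Passing the training hashes into the validation and test passes is what prevents an evaluation pair from also appearing in the training data.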

Dataset link: https://huggingface.co/datasets/Helsinki-NLP/tatoeba

Evaluation

Metric: chrF (generation-based)

Split	chrF
Validation	70.6691
Test	70.4862

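Note: chrF is an F-score over character n-grams shared by the system output and the reference. Because it gives partial credit to words that differ only in inflection or spelling, it is a reasonable fit for closely related Romance language pairs such as ES↔PT.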
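
To make the metric concrete, here is a simplified sentence-level chrF in plain Python. It is an illustration only: the defaults of character n-grams up to order 6 and beta=2 are assumptions rather than settings stated in this card, whitespace is simply dropped, and the function `chrf` is hypothetical, so its numbers will not necessarily match the scores reported above or those from an established scoring tool.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # string too short for this n-gram order
        matches = sum((hyp & ref).values())  # clipped overlap counts
        precisions.append(matches / hyp_total)
        recalls.append(matches / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2  # beta=2 weights recall more heavily than precision
    return 100 * (1 + b2) * p * r / (b2 * p + r)

print(chrf("As pessoas dizem que estou louco.", "As pessoas dizem que estou louco."))
print(chrf("As pessoas dizem que sou louco.", "As pessoas dizem que estou louco."))
```

An exact match scores 100, while a near miss such as "sou" for "estou" still earns most of the credit, which is the behaviour that makes character-level scoring forgiving of small morphological differences.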

How to use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model weights from the Hugging Face Hub
model_id = "liansheng06/bart-tatoeba-es-pt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Tokenize the Spanish source, truncating at the 256-token source limit
text = "Las personas dicen que estoy loco."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Generate with beam search (beam=5), the setting used for evaluation
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Demo

Try the interactive Gradio demo here:
https://huggingface.co/spaces/liansheng06/ATA-Assignment2
