Maracatu-20M
Brazilian Portuguese language model trained from scratch. First checkpoint of the Maracatu family. Open weights, Apache 2.0.
Maracatu-20M is a 17M-parameter causal language model trained from scratch on Brazilian Portuguese Wikipedia. It is the first public checkpoint of the Maracatu AI project — an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.
This is a base model (completion, not instruct). It continues text; it is not a chat assistant.
Architecture
Llama-style decoder-only transformer. The state_dict is compatible with transformers.LlamaForCausalLM and loads via AutoModelForCausalLM.from_pretrained without any conversion script.
| Hyperparameter | Value |
|---|---|
| Total parameters | 17M (16.77M) |
| Non-embedding parameters | 10.62M |
| num_hidden_layers | 6 |
| hidden_size | 384 |
| num_attention_heads | 6 |
| num_key_value_heads | 6 |
| intermediate_size (SwiGLU) | 1024 |
| max_position_embeddings | 512 |
| vocab_size | 16000 |
| rope_theta | 10000 |
| Normalization | RMSNorm |
| Positional encoding | RoPE (rotate-half) |
| Activation | SwiGLU |
| Bias in nn.Linear | No |
| Weight tying (embed ↔ lm_head) | Yes |
| Tokenizer | SentencePiece BPE, nmt_nfkc_cf (lowercase), split_digits, byte fallback |
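The parameter counts in the table can be re-derived from the hyperparameters. A quick sanity check (assuming no linear biases, tied embeddings, and one RMSNorm weight vector per norm, as listed above):

```python
# Re-derive Maracatu-20M parameter counts from the architecture table.
# Assumptions from the table: no biases, weight tying, RMSNorm (2 per layer + 1 final).
hidden, layers, ffn, vocab = 384, 6, 1024, 16000

attn = 4 * hidden * hidden        # q, k, v, o projections (6 kv heads = full MHA, no GQA)
mlp = 3 * hidden * ffn            # SwiGLU: gate, up, down
norms = 2 * hidden                # input + post-attention RMSNorm weights
per_layer = attn + mlp + norms

non_embedding = layers * per_layer + hidden   # + final RMSNorm
embedding = vocab * hidden                    # shared with lm_head via weight tying
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e6:.2f}M")  # 10.62M
print(f"total:         {total / 1e6:.2f}M")          # 16.77M
```

Both figures match the table, confirming the "17M (16.77M)" total is dominated by the tied 6.14M-parameter embedding matrix.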
Training Data
| Property | Value |
|---|---|
| Source | wikimedia/wikipedia, config 20231101.pt |
| License | CC BY-SA 4.0 |
| Articles (after filters + dedup) | 979,492 |
| Lines | 9.7M |
| Corpus size | 2.28 GB |
| Tokens (BPE) | ~550M |
Sources deliberately excluded from v1: OSCAR, BrWaC, Common Crawl. Exclusion is a legal-caution decision, not a quality judgment. Corpus expansion is planned for Maracatu-80M (next scale in the ladder).
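The card does not publish the filtering and deduplication code itself; purely as an illustration, exact deduplication of article texts after Unicode normalization can be sketched as follows (the actual Maracatu pipeline may differ):

```python
import hashlib
import unicodedata

def dedup_articles(texts):
    """Exact-match dedup after Unicode normalization.

    Illustrative sketch only -- the real Maracatu filtering pipeline is
    not published in this card and may use different criteria.
    """
    seen, kept = set(), []
    for text in texts:
        key = hashlib.md5(
            unicodedata.normalize("NFKC", text.strip()).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

articles = [
    "O Recife é a capital de Pernambuco.",
    "O Recife é a capital de Pernambuco. ",   # whitespace-only duplicate
    "São Paulo é a maior cidade do Brasil.",
]
print(len(dedup_articles(articles)))  # 2
```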
Training Configuration
| Item | Value |
|---|---|
| Framework | PyTorch |
| Hardware | Kaggle T4 (single GPU, 15.6 GB VRAM) |
| Total iterations | 50,000 |
| Tokens seen | ~410M (50,000 iterations × 16 × 512 = 409.6M) |
| Batch size | 16 |
| Context length | 512 tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) |
| Gradient clipping | 1.0 |
| Learning rate | 3e-4 → 3e-5 (linear warmup 1,000 steps + cosine decay) |
| Throughput | ~20,000 tok/s |
| Total training time | 5h 45min |
| Seed | 42 |
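The exact scheduler implementation is not shown here; a common warmup-plus-cosine schedule consistent with the table (1,000 linear warmup steps, cosine decay from 3e-4 to 3e-5 over 50,000 iterations) would look like:

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 1_000, 50_000

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR.

    A standard formulation matching the table's values; the project's
    actual scheduler code may differ in details.
    """
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Sanity checks against the table:
assert abs(lr_at(1_000) - 3e-4) < 1e-9   # peak LR at end of warmup
assert abs(lr_at(50_000) - 3e-5) < 1e-9  # final LR

# Tokens seen over the run: iterations x batch x context
tokens_seen = 50_000 * 16 * 512  # 409,600,000 (~410M)
```

At the reported ~20,000 tok/s, those ~410M tokens take roughly 5.7 hours, consistent with the 5h 45min total in the table.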
Evaluation
Training metrics
| Metric | Value | Step |
|---|---|---|
| Best validation loss | 3.1703 | 43,500 |
| Best validation perplexity | 23.81 | 43,500 |
| Final validation loss | 3.22 | 50,000 |
| Train/val gap | ~0.05 | — |
The train/val gap of ~0.05 indicates no measurable overfitting at this scale. The slight degradation from step 43,500 to 50,000 is attributed to an aggressive cosine LR schedule in the final phase.
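Perplexity here is just the exponential of the validation cross-entropy loss, so the loss and perplexity rows in the table are two views of the same measurement:

```python
import math

# Perplexity = exp(cross-entropy loss), using the table's values.
best_val_loss = 3.1703
print(round(math.exp(best_val_loss), 2))  # 23.81, matching the table

final_val_loss = 3.22
print(round(math.exp(final_val_loss), 2))  # 25.03: the ~0.05 loss increase
                                           # costs ~1.2 perplexity points
```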
Benchmark comparisons (lm-evaluation-harness, seed=42, 0-shot)
All numbers below were produced with lm-evaluation-harness v0.4.12, greedy decoding (T=0), batch size 8, seed 42. Baselines run on the same hardware under identical conditions.
Important caveat for base models: A 17M base model without instruction tuning is not expected to score above random chance on MCQ benchmarks. These numbers are an honest baseline — not a measure of failure. Expect meaningful benchmark improvement only at 500M+ scale with instruction tuning.
| Model | Params | Belebele PT (4-choice) | ASSIN Entailment (2-choice) | ASSIN Paraphrase (2-choice) | ENEM (5-choice) |
|---|---|---|---|---|---|
| Random chance | — | 25.00% | 50.00% | 50.00% | 20.00% |
| Maracatu-20M | 17M | 23.78% ± 1.42% | 27.67% ± 0.71% | 60.52% ± 0.77% | 19.22% ± 1.04% |
| Tucano-160M | 160M | 22.56% ± 1.39% | 31.35% ± 0.73% | 56.37% ± 0.78% | 21.10% ± 1.08% |
| Tucano-630M | 630M | 22.89% ± 1.40% | 30.40% ± 0.73% | 54.93% ± 0.79% | 21.10% ± 1.08% |
Notes:
- All three models score near or below random chance on Belebele and ENEM — this is expected for base models at this scale.
- Maracatu-20M outperforms Tucano-160M on ASSIN Paraphrase (+4.15 pp) despite having 9x fewer parameters. This may reflect coherent lexical representations from Wikipedia training.
- Tucano-630M does not outperform Tucano-160M on these benchmarks — scale alone does not guarantee improvement without instruction tuning.
- Perplexity comparison between Maracatu-20M (23.81, SentencePiece 16k) and Tucano-160M (~30, reported in paper) is not directly comparable due to different vocabularies and corpora.
- Evaluation report: docs/experiments/2026-04-20-maracatu-20m-benchmarks.md
Qualitative Samples
Generated with temperature=0.8, top_k=50, seed=42, reproducible locally with scripts/sample.py.
Prompt: "O Brasil é"
Output: "o brasil é o méxico e o canadá é o único dos territórios menores que se estende da europa oriental. a região tem uma área de 8,32 km2. a região está dividida em 3 partes e 6 regiões, que se estendem até ao planalto central, que inclui o extremo oeste do país."
Prompt: "Machado de Assis escreveu"
Output: "machado de assis escreveu o livro \"a obra-prima do autor da obra-prima do autor\", que foi escrito em 1998 e publicado em 2001. no brasil, em 2003, o livro foi publicada pela universidade federal de minas gerais."
Prompt: "A capital de Pernambuco é"
Output: "a capital de pernambuco é a capital do estado. o município possui vários estabelecimentos de serviços (faa) e de estabelecimentos de saúde (fisfa) e empresas de transporte civil, gás e gás."
These samples illustrate both what the model learned and what it does not: Portuguese syntax, common collocations, and encyclopedic register are present, but factual retrieval fails systematically. Sample 1 conflates Brazil with Mexico and Canada and places the region in eastern Europe. Sample 2 invents a book title with recursive self-reference and gives dates that place Machado de Assis (who died in 1908) in the late 1990s. Sample 3 tautologically says "the capital is the capital" and uses invented acronyms (FAA, FISFA). Repetition artifacts ("gás e gás", "obra-prima do autor da obra-prima do autor") are characteristic of small base models.
How to Use
HuggingFace transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-20m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-20m")
model.eval()

inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=60,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
Ollama (local inference)
```bash
ollama run whereisanzi/maracatu-20m "O Brasil é"
```
Available quantizations: Q4_K_M (default), Q5_K_M, Q8_0.
PyTorch (native checkpoint)
```python
import torch
import sentencepiece as spm
from maracatu.model import Maracatu, ModelConfig

ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
model = Maracatu(ModelConfig(**ckpt["model_config"]))
model.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
model.eval()

sp = spm.SentencePieceProcessor(model_file="maracatu.model")
prompt_ids = torch.tensor([sp.encode("O Brasil é")])
out = model.generate(prompt_ids, max_new_tokens=60, temperature=0.8, top_k=50)
print(sp.decode(out[0].tolist()))
```
Limitations
- Scale: 17M parameters is small. Factual recall is unreliable; hallucination is expected and frequent.
- Lowercase only: The tokenizer normalizes all text to lowercase (nmt_nfkc_cf). The model does not generate uppercase characters.
- Digit splitting: Numbers are tokenized digit-by-digit (split_digits). Arithmetic, dates, and numeric reasoning are not reliable.
- Encyclopedic style: Trained exclusively on Wikipedia. Output tends toward neutral, encyclopedic prose; informal registers are underrepresented.
- Wikipedia biases: Inherits notability and coverage biases of the Portuguese Wikipedia — topics with sparse PT coverage are likely to produce lower-quality output.
- No safety fine-tuning: This is an unfiltered base model. It has not been evaluated for safety and may generate biased, incorrect, or offensive content.
- No instruction following: This model completes text. It is not a chat assistant.
- Short context window: 512 tokens. Long ENEM questions are left-truncated by the evaluation harness.
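The nmt_nfkc_cf normalizer corresponds roughly to Unicode NFKC normalization plus case folding, so you can preview what the tokenizer will see. This is an approximation of SentencePiece's rule set, not its exact implementation:

```python
import unicodedata

def approx_nmt_nfkc_cf(text: str) -> str:
    """Approximate SentencePiece's nmt_nfkc_cf: NFKC normalization + case folding.

    Illustration only -- the real nmt_nfkc_cf also applies NMT-specific
    whitespace rules that this sketch omits.
    """
    return unicodedata.normalize("NFKC", text).casefold()

print(approx_nmt_nfkc_cf("O Brasil É GRANDE"))  # "o brasil é grande"
# With split_digits enabled, a year like "1908" is then tokenized as
# four separate digit tokens: 1 9 0 8.
```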
Citation
@misc{maracatu2026,
author = {Anzileiro, Anderson},
title = {Maracatu-20M: A Brazilian Portuguese Language Model Trained from Scratch},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-20m}},
}
License
Code and weights are released under the Apache License 2.0.
Training data (Wikipedia PT) is licensed under CC BY-SA 4.0 by its respective contributors.
Acknowledgments
- Andrej Karpathy — nanoGPT, the pedagogical foundation for this project.
- Maritaca AI — Sabiá paper, reference for Portuguese LLM training decisions.
- WideLabs — Tucano paper, the primary baseline for Portuguese benchmark comparisons.
- Brazilian AI community (LNCC, USP, Unicamp, MCTI/PBIA) for the broader ecosystem this project aims to contribute to.