Maracatu-20M

Brazilian Portuguese language model trained from scratch. First checkpoint of the Maracatu family. Open weights, Apache 2.0.

Maracatu-20M is a 17M-parameter causal language model trained from scratch on Brazilian Portuguese Wikipedia. It is the first public checkpoint of the Maracatu AI project — an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.

This is a base model (completion, not instruct). It continues text; it is not a chat assistant.


Architecture

Llama-style decoder-only transformer. The state_dict is compatible with transformers.LlamaForCausalLM and loads via AutoModelForCausalLM.from_pretrained without any conversion script.

| Hyperparameter | Value |
|---|---|
| Total parameters | 17M (16.77M) |
| Non-embedding parameters | 10.62M |
| num_hidden_layers | 6 |
| hidden_size | 384 |
| num_attention_heads | 6 |
| num_key_value_heads | 6 |
| intermediate_size (SwiGLU) | 1024 |
| max_position_embeddings | 512 |
| vocab_size | 16000 |
| rope_theta | 10000 |
| Normalization | RMSNorm |
| Positional encoding | RoPE (rotate-half) |
| Activation | SwiGLU |
| Bias in nn.Linear | No |
| Weight tying (embed ↔ lm_head) | Yes |
| Tokenizer | SentencePiece BPE, nmt_nfkc_cf (lowercase), split_digits, byte fallback |
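The parameter totals above can be reproduced from the hyperparameters alone. A minimal sketch in plain Python (no framework needed), assuming the standard Llama layout: four attention projections and three SwiGLU matrices per layer, two RMSNorms per layer plus a final norm, no biases, and embeddings counted once because the lm_head is tied:

```python
# Recompute Maracatu-20M's parameter counts from the config table.
# Assumes the standard Llama layout; no biases anywhere.

hidden, layers, inter, vocab = 384, 6, 1024, 16000

attn = 4 * hidden * hidden       # q, k, v, o projections (no GQA: kv heads == heads)
mlp = 3 * hidden * inter         # SwiGLU: gate, up, down
norms = 2 * hidden               # input + post-attention RMSNorm
per_layer = attn + mlp + norms

non_embedding = layers * per_layer + hidden   # + final RMSNorm
embedding = vocab * hidden                    # counted once: lm_head is tied

print(f"non-embedding: {non_embedding / 1e6:.2f}M")              # 10.62M
print(f"total:         {(non_embedding + embedding) / 1e6:.2f}M")  # 16.77M
```

Both results match the table: 10.62M non-embedding and 16.77M total.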

Training Data

| Property | Value |
|---|---|
| Source | wikimedia/wikipedia, config 20231101.pt |
| License | CC BY-SA 4.0 |
| Articles (after filters + dedup) | 979,492 |
| Lines | 9.7M |
| Corpus size | 2.28 GB |
| Tokens (BPE) | ~550M |

Sources deliberately excluded from v1: OSCAR, BrWaC, Common Crawl. Exclusion is a legal-caution decision, not a quality judgment. Corpus expansion is planned for Maracatu-80M (next scale in the ladder).
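The article count above is reported after filtering and deduplication, but the card does not specify the method. As a purely illustrative, hypothetical sketch (not the actual Maracatu pipeline), exact-match dedup by normalized content hash could look like:

```python
import hashlib

def dedup_exact(articles):
    """Drop exact duplicates by hashing whitespace-/case-normalized text.

    Hypothetical sketch only: the actual Maracatu filter + dedup
    pipeline is not documented in this card.
    """
    seen, kept = set(), []
    for text in articles:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

sample = ["O Brasil é grande.", "o brasil  é grande.", "Outro artigo."]
print(len(dedup_exact(sample)))  # 2 — first two collapse to one entry
```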


Training Configuration

| Item | Value |
|---|---|
| Framework | PyTorch |
| Hardware | Kaggle T4 (single GPU, 15.6 GB VRAM) |
| Total iterations | 50,000 |
| Tokens seen | 410M (0.75 epoch) |
| Batch size | 16 |
| Context length | 512 tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) |
| Gradient clipping | 1.0 |
| Learning rate | 3e-4 → 3e-5 (linear warmup 1,000 steps + cosine decay) |
| Throughput | ~20,000 tok/s |
| Total training time | 5h 45min |
| Seed | 42 |
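Two rows of the table cross-check directly: tokens seen = 50,000 steps × 16 sequences × 512 tokens ≈ 410M, i.e. ~0.75 of the ~550M-token corpus. The schedule itself is also simple to sketch, assuming linear warmup to the peak LR over 1,000 steps followed by cosine decay to the floor (the exact training code is not shown in this card):

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP, TOTAL = 1_000, 50_000

def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR (sketch)."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)  # 0 -> 1 over the decay phase
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Cross-check the "tokens seen" row: steps x batch size x context length
tokens_seen = 50_000 * 16 * 512
print(f"{tokens_seen / 1e6:.0f}M tokens")                       # 410M
print(f"{lr_at(0):.1e} -> {lr_at(WARMUP):.1e} -> {lr_at(TOTAL):.1e}")
```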

Evaluation

Training metrics

| Metric | Value | Step |
|---|---|---|
| Best validation loss | 3.1703 | 43,500 |
| Best validation perplexity | 23.81 | 43,500 |
| Final validation loss | 3.22 | 50,000 |
| Train/val gap | ~0.05 | |

The train/val gap of ~0.05 indicates no measurable overfitting at this scale. The slight degradation from step 43,500 to 50,000 is attributed to an aggressive cosine LR schedule in the final phase.
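The loss and perplexity rows are consistent with each other: perplexity is just the exponential of the per-token cross-entropy loss (in nats).

```python
import math

best_val_loss = 3.1703
print(f"{math.exp(best_val_loss):.2f}")  # 23.81, matching the table
```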

Benchmark comparisons (lm-evaluation-harness, seed=42, 0-shot)

All numbers below were produced with lm-evaluation-harness v0.4.12, greedy decoding (T=0), batch size 8, seed 42. Baselines run on the same hardware under identical conditions.

Important caveat for base models: A 17M base model without instruction tuning is not expected to score above random chance on MCQ benchmarks. These numbers are an honest baseline — not a measure of failure. Expect meaningful benchmark improvement only at 500M+ scale with instruction tuning.

| Model | Params | Belebele PT (4-choice) | ASSIN Entailment (2-choice) | ASSIN Paraphrase (2-choice) | ENEM (5-choice) |
|---|---|---|---|---|---|
| Random chance | n/a | 25.00% | 50.00% | 50.00% | 20.00% |
| Maracatu-20M | 17M | 23.78% ± 1.42% | 27.67% ± 0.71% | 60.52% ± 0.77% | 19.22% ± 1.04% |
| Tucano-160M | 160M | 22.56% ± 1.39% | 31.35% ± 0.73% | 56.37% ± 0.78% | 21.10% ± 1.08% |
| Tucano-630M | 630M | 22.89% ± 1.40% | 30.40% ± 0.73% | 54.93% ± 0.79% | 21.10% ± 1.08% |

Notes:

  • All three models score near or below random chance on Belebele and ENEM — this is expected for base models at this scale.
  • Maracatu-20M outperforms Tucano-160M on ASSIN Paraphrase (+4.15 pp) despite having 9x fewer parameters. This may reflect coherent lexical representations from Wikipedia training.
  • Tucano-630M does not outperform Tucano-160M on these benchmarks — scale alone does not guarantee improvement without instruction tuning.
  • Perplexities are not directly comparable between Maracatu-20M (23.81, SentencePiece 16k) and Tucano-160M (~30, as reported in its paper), since the two models use different vocabularies and corpora.
  • Evaluation report: docs/experiments/2026-04-20-maracatu-20m-benchmarks.md

Qualitative Samples

Generated with temperature=0.8, top_k=50, seed=42, reproducible locally with scripts/sample.py.

```
Prompt:  "O Brasil é"
Output:  "o brasil é o méxico e o canadá é o único dos territórios menores que se estende
          da europa oriental. a região tem uma área de 8,32 km2. a região está dividida em
          3 partes e 6 regiões, que se estendem até ao planalto central, que inclui o
          extremo oeste do país."

Prompt:  "Machado de Assis escreveu"
Output:  "machado de assis escreveu o livro \"a obra-prima do autor da obra-prima do
          autor\", que foi escrito em 1998 e publicado em 2001. no brasil, em 2003, o
          livro foi publicada pela universidade federal de minas gerais."

Prompt:  "A capital de Pernambuco é"
Output:  "a capital de pernambuco é a capital do estado. o município possui vários
          estabelecimentos de serviços (faa) e de estabelecimentos de saúde (fisfa) e
          empresas de transporte civil, gás e gás."
```

These samples illustrate both what the model learned and what it does not: Portuguese syntax, common collocations, and encyclopedic register are present, but factual retrieval fails systematically. Sample 1 conflates Brazil with Mexico and Canada and places the region in eastern Europe. Sample 2 invents a book title with recursive self-reference and gives dates that place Machado de Assis (who died in 1908) in the late 1990s. Sample 3 tautologically says "the capital is the capital" and uses invented acronyms (FAA, FISFA). Repetition artifacts ("gás e gás", "obra-prima do autor da obra-prima do autor") are characteristic of small base models.


How to Use

HuggingFace transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# use_fast=False: the repo ships a SentencePiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-20m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-20m")
model.eval()

inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=60,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Ollama (local inference)

```shell
ollama run whereisanzi/maracatu-20m "O Brasil é"
```

Available quantizations: Q4_K_M (default), Q5_K_M, Q8_0.

PyTorch (native checkpoint)

```python
import torch
import sentencepiece as spm
from maracatu.model import Maracatu, ModelConfig

ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
model = Maracatu(ModelConfig(**ckpt["model_config"]))
# Strip the "_orig_mod." prefix that torch.compile adds to state_dict keys
model.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
model.eval()

sp = spm.SentencePieceProcessor(model_file="maracatu.model")
prompt_ids = torch.tensor([sp.encode("O Brasil é")])
out = model.generate(prompt_ids, max_new_tokens=60, temperature=0.8, top_k=50)
print(sp.decode(out[0].tolist()))
```

Limitations

  • Scale: 17M parameters is small. Factual recall is unreliable; hallucination is expected and frequent.
  • Lowercase only: The tokenizer normalizes all text to lowercase (nmt_nfkc_cf). The model does not generate uppercase characters.
  • Digit splitting: Numbers are tokenized digit-by-digit (split_digits). Arithmetic, dates, and numeric reasoning are not reliable.
  • Encyclopedic style: Trained exclusively on Wikipedia. Output tends toward neutral, encyclopedic prose; informal registers are underrepresented.
  • Wikipedia biases: Inherits notability and coverage biases of the Portuguese Wikipedia — topics with sparse PT coverage are likely to produce lower-quality output.
  • No safety fine-tuning: This is an unfiltered base model. It has not been evaluated for safety and may generate biased, incorrect, or offensive content.
  • No instruction following: This model completes text. It is not a chat assistant.
  • Short context window: 512 tokens. Long ENEM questions are left-truncated by the evaluation harness.
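The lowercase and digit-splitting limitations both trace back to the tokenizer's nmt_nfkc_cf normalizer, which applies NFKC normalization plus Unicode case folding before tokenization. A rough approximation in plain Python (the real rule set is SentencePiece's, which additionally rewrites NMT-specific control and whitespace characters; this sketch covers only the NFKC + casefold part):

```python
import unicodedata

def approx_nmt_nfkc_cf(text: str) -> str:
    """Rough stand-in for SentencePiece's nmt_nfkc_cf normalizer:
    NFKC normalization followed by Unicode case folding.
    (The real normalizer also handles NMT-style control/whitespace chars.)
    """
    return unicodedata.normalize("NFKC", text).casefold()

# Uppercase is folded away; compatibility characters (e.g. fullwidth digits)
# are mapped to their ASCII forms before BPE ever sees them.
print(approx_nmt_nfkc_cf("O Brasil tem ５ regiões"))  # "o brasil tem 5 regiões"
```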

Citation

```bibtex
@misc{maracatu2026,
  author    = {Anzileiro, Anderson},
  title     = {Maracatu-20M: A Brazilian Portuguese Language Model Trained from Scratch},
  year      = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-20m}},
}
```

License

Code and weights are released under the Apache License 2.0.

The training data (Portuguese Wikipedia) is licensed under CC BY-SA 4.0 by its contributors.


Acknowledgments

  • Andrej Karpathy — nanoGPT, the pedagogical foundation for this project.
  • Maritaca AI — Sabiá paper, reference for Portuguese LLM training decisions.
  • WideLabs — Tucano paper, the primary baseline for Portuguese benchmark comparisons.
  • Brazilian AI community (LNCC, USP, Unicamp, MCTI/PBIA) for the broader ecosystem this project aims to contribute to.