Maracatu-20M

Brazilian Portuguese language model trained from scratch. First checkpoint of the Maracatu family. Open weights, Apache 2.0.

Maracatu-20M is a 17M-parameter causal language model trained from scratch on Brazilian Portuguese Wikipedia. It is the first public checkpoint of the Maracatu AI project — an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training.

This is a base model (completion, not instruct). It continues text; it is not a chat assistant.


Architecture

Llama-style decoder-only transformer. The state_dict is compatible with transformers.LlamaForCausalLM and loads via AutoModelForCausalLM.from_pretrained without any conversion script.

| Hyperparameter | Value |
|---|---|
| Total parameters | 17M (16.77M) |
| Non-embedding parameters | 10.62M |
| num_hidden_layers | 6 |
| hidden_size | 384 |
| num_attention_heads | 6 |
| num_key_value_heads | 6 |
| intermediate_size (SwiGLU) | 1024 |
| max_position_embeddings | 512 |
| vocab_size | 16000 |
| rope_theta | 10000 |
| Normalization | RMSNorm |
| Positional encoding | RoPE (rotate-half) |
| Activation | SwiGLU |
| Bias in nn.Linear | No |
| Weight tying (embed ↔ lm_head) | Yes |
| Tokenizer | SentencePiece BPE, nmt_nfkc_cf (lowercase), split_digits, byte fallback |
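The parameter totals above can be reproduced from the hyperparameters alone. A minimal sketch in plain Python (no framework needed), assuming the standard Llama layout: four attention projections and three SwiGLU matrices per layer, two RMSNorms per layer plus a final norm, no biases, and embeddings counted once because the lm_head is tied:

```python
# Recompute Maracatu-20M's parameter counts from the config table.
# Assumes the standard Llama layout; no biases anywhere.

hidden, layers, inter, vocab = 384, 6, 1024, 16000

attn = 4 * hidden * hidden       # q, k, v, o projections (no GQA: kv heads == heads)
mlp = 3 * hidden * inter         # SwiGLU: gate, up, down
norms = 2 * hidden               # input + post-attention RMSNorm
per_layer = attn + mlp + norms

non_embedding = layers * per_layer + hidden   # + final RMSNorm
embedding = vocab * hidden                    # counted once: lm_head is tied

print(f"non-embedding: {non_embedding / 1e6:.2f}M")              # 10.62M
print(f"total:         {(non_embedding + embedding) / 1e6:.2f}M")  # 16.77M
```

Both results match the table: 10.62M non-embedding and 16.77M total.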

Training Data

| Property | Value |
|---|---|
| Source | wikimedia/wikipedia, config 20231101.pt |
| License | CC BY-SA 4.0 |
| Articles (after filters + dedup) | 979,492 |
| Lines | 9.7M |
| Corpus size | 2.28 GB |
| Tokens (BPE) | ~550M |

Sources deliberately excluded from v1: OSCAR, BrWaC, Common Crawl. Exclusion is a legal-caution decision, not a quality judgment. Corpus expansion is planned for Maracatu-80M (next scale in the ladder).
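The article count above is reported after filtering and deduplication, but the card does not specify the method. As a purely illustrative, hypothetical sketch (not the actual Maracatu pipeline), exact-match dedup by normalized content hash could look like:

```python
import hashlib

def dedup_exact(articles):
    """Drop exact duplicates by hashing whitespace-/case-normalized text.

    Hypothetical sketch only: the actual Maracatu filter + dedup
    pipeline is not documented in this card.
    """
    seen, kept = set(), []
    for text in articles:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

sample = ["O Brasil é grande.", "o brasil  é grande.", "Outro artigo."]
print(len(dedup_exact(sample)))  # 2 — first two collapse to one entry
```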


Training Configuration

| Item | Value |
|---|---|
| Framework | PyTorch |
| Hardware | Kaggle T4 (single GPU, 15.6 GB VRAM) |
| Total iterations | 50,000 |
| Tokens seen | 410M (0.75 epoch) |
| Batch size | 16 |
| Context length | 512 tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1) |
| Gradient clipping | 1.0 |
| Learning rate | 3e-4 → 3e-5 (linear warmup 1,000 steps + cosine decay) |
| Throughput | ~20,000 tok/s |
| Total training time | 5h 45min |
| Seed | 42 |
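Two rows of the table cross-check directly: tokens seen = 50,000 steps × 16 sequences × 512 tokens ≈ 410M, i.e. ~0.75 of the ~550M-token corpus. The schedule itself is also simple to sketch, assuming linear warmup to the peak LR over 1,000 steps followed by cosine decay to the floor (the exact training code is not shown in this card):

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP, TOTAL = 1_000, 50_000

def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR (sketch)."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)  # 0 -> 1 over the decay phase
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Cross-check the "tokens seen" row: steps x batch size x context length
tokens_seen = 50_000 * 16 * 512
print(f"{tokens_seen / 1e6:.0f}M tokens")                       # 410M
print(f"{lr_at(0):.1e} -> {lr_at(WARMUP):.1e} -> {lr_at(TOTAL):.1e}")
```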

Evaluation

Training metrics

| Metric | Value | Step |
|---|---|---|
| Best validation loss | 3.1703 | 43,500 |
| Best validation perplexity | 23.81 | 43,500 |
| Final validation loss | 3.22 | 50,000 |
| Train/val gap | ~0.05 | |

The train/val gap of ~0.05 indicates no measurable overfitting at this scale. The slight degradation from step 43,500 to 50,000 is attributed to an aggressive cosine LR schedule in the final phase.
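The loss and perplexity rows are consistent with each other: perplexity is just the exponential of the per-token cross-entropy loss (in nats).

```python
import math

best_val_loss = 3.1703
print(f"{math.exp(best_val_loss):.2f}")  # 23.81, matching the table
```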

Benchmark comparisons (lm-evaluation-harness, seed=42, 0-shot)

All numbers below were produced with lm-evaluation-harness v0.4.12, greedy decoding (T=0), batch size 8, seed 42. Baselines run on the same hardware under identical conditions.

Important caveat for base models: A 17M base model without instruction tuning is not expected to score above random chance on MCQ benchmarks. These numbers are an honest baseline — not a measure of failure. Expect meaningful benchmark improvement only at 500M+ scale with instruction tuning.

| Model | Params | Belebele PT (4-choice) | ASSIN Entailment (2-choice) | ASSIN Paraphrase (2-choice) | ENEM (5-choice) |
|---|---|---|---|---|---|
| Random chance | n/a | 25.00% | 50.00% | 50.00% | 20.00% |
| Maracatu-20M | 17M | 23.78% ± 1.42% | 27.67% ± 0.71% | 60.52% ± 0.77% | 19.22% ± 1.04% |
| Tucano-160M | 160M | 22.56% ± 1.39% | 31.35% ± 0.73% | 56.37% ± 0.78% | 21.10% ± 1.08% |
| Tucano-630M | 630M | 22.89% ± 1.40% | 30.40% ± 0.73% | 54.93% ± 0.79% | 21.10% ± 1.08% |

Notes:

  • All three models score near or below random chance on Belebele and ENEM — this is expected for base models at this scale.
  • Maracatu-20M outperforms Tucano-160M on ASSIN Paraphrase (+4.15 pp) despite having 9x fewer parameters. This may reflect coherent lexical representations from Wikipedia training.
  • Tucano-630M does not outperform Tucano-160M on these benchmarks — scale alone does not guarantee improvement without instruction tuning.
  • Perplexities are not directly comparable between Maracatu-20M (23.81, SentencePiece 16k) and Tucano-160M (~30, as reported in its paper), since the two models use different vocabularies and corpora.
  • Evaluation report: docs/experiments/2026-04-20-maracatu-20m-benchmarks.md

Qualitative Samples

Generated with temperature=0.8, top_k=50, seed=42, reproducible locally with scripts/sample.py.

```
Prompt:  "O Brasil é"
Output:  "o brasil é o méxico e o canadá é o único dos territórios menores que se estende
          da europa oriental. a região tem uma área de 8,32 km2. a região está dividida em
          3 partes e 6 regiões, que se estendem até ao planalto central, que inclui o
          extremo oeste do país."

Prompt:  "Machado de Assis escreveu"
Output:  "machado de assis escreveu o livro \"a obra-prima do autor da obra-prima do
          autor\", que foi escrito em 1998 e publicado em 2001. no brasil, em 2003, o
          livro foi publicada pela universidade federal de minas gerais."

Prompt:  "A capital de Pernambuco é"
Output:  "a capital de pernambuco é a capital do estado. o município possui vários
          estabelecimentos de serviços (faa) e de estabelecimentos de saúde (fisfa) e
          empresas de transporte civil, gás e gás."
```

These samples illustrate both what the model learned and what it does not: Portuguese syntax, common collocations, and encyclopedic register are present, but factual retrieval fails systematically. Sample 1 conflates Brazil with Mexico and Canada and places the region in eastern Europe. Sample 2 invents a book title with recursive self-reference and gives dates that place Machado de Assis (who died in 1908) in the late 1990s. Sample 3 tautologically says "the capital is the capital" and uses invented acronyms (FAA, FISFA). Repetition artifacts ("gás e gás", "obra-prima do autor da obra-prima do autor") are characteristic of small base models.


How to Use

HuggingFace transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# use_fast=False: the repo ships a SentencePiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-20m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-20m")
model.eval()

inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=60,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Ollama (local inference)

```shell
ollama run whereisanzi/maracatu-20m "O Brasil é"
```

Available quantizations: Q4_K_M (default), Q5_K_M, Q8_0.

PyTorch (native checkpoint)

```python
import torch
import sentencepiece as spm
from maracatu.model import Maracatu, ModelConfig

ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
model = Maracatu(ModelConfig(**ckpt["model_config"]))
# Strip the "_orig_mod." prefix that torch.compile adds to state_dict keys
model.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
model.eval()

sp = spm.SentencePieceProcessor(model_file="maracatu.model")
prompt_ids = torch.tensor([sp.encode("O Brasil é")])
out = model.generate(prompt_ids, max_new_tokens=60, temperature=0.8, top_k=50)
print(sp.decode(out[0].tolist()))
```

Limitations

  • Scale: 17M parameters is small. Factual recall is unreliable; hallucination is expected and frequent.
  • Lowercase only: The tokenizer normalizes all text to lowercase (nmt_nfkc_cf). The model does not generate uppercase characters.
  • Digit splitting: Numbers are tokenized digit-by-digit (split_digits). Arithmetic, dates, and numeric reasoning are not reliable.
  • Encyclopedic style: Trained exclusively on Wikipedia. Output tends toward neutral, encyclopedic prose; informal registers are underrepresented.
  • Wikipedia biases: Inherits notability and coverage biases of the Portuguese Wikipedia — topics with sparse PT coverage are likely to produce lower-quality output.
  • No safety fine-tuning: This is an unfiltered base model. It has not been evaluated for safety and may generate biased, incorrect, or offensive content.
  • No instruction following: This model completes text. It is not a chat assistant.
  • Short context window: 512 tokens. Long ENEM questions are left-truncated by the evaluation harness.
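The lowercase and digit-splitting limitations both trace back to the tokenizer's nmt_nfkc_cf normalizer, which applies NFKC normalization plus Unicode case folding before tokenization. A rough approximation in plain Python (the real rule set is SentencePiece's, which additionally rewrites NMT-specific control and whitespace characters; this sketch covers only the NFKC + casefold part):

```python
import unicodedata

def approx_nmt_nfkc_cf(text: str) -> str:
    """Rough stand-in for SentencePiece's nmt_nfkc_cf normalizer:
    NFKC normalization followed by Unicode case folding.
    (The real normalizer also handles NMT-style control/whitespace chars.)
    """
    return unicodedata.normalize("NFKC", text).casefold()

# Uppercase is folded away; compatibility characters (e.g. fullwidth digits)
# are mapped to their ASCII forms before BPE ever sees them.
print(approx_nmt_nfkc_cf("O Brasil tem ５ regiões"))  # "o brasil tem 5 regiões"
```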

Citation

```bibtex
@misc{maracatu2026,
  author    = {Anzileiro, Anderson},
  title     = {Maracatu-20M: A Brazilian Portuguese Language Model Trained from Scratch},
  year      = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/maracatu-ai/maracatu-20m}},
}
```

License

Code and weights are released under the Apache License 2.0.

The training data (Portuguese Wikipedia) is licensed under CC BY-SA 4.0 by its contributors.


Acknowledgments

  • Andrej Karpathy — nanoGPT, the pedagogical foundation for this project.
  • Maritaca AI — Sabiá paper, reference for Portuguese LLM training decisions.
  • WideLabs — Tucano paper, the primary baseline for Portuguese benchmark comparisons.
  • Brazilian AI community (LNCC, USP, Unicamp, MCTI/PBIA) for the broader ecosystem this project aims to contribute to.