ChatTLA-20B Prover v3 (LoRA)

A LoRA adapter on top of openai/gpt-oss-20b trained to write TLAPS proofs for TLA+ theorems. Prover branch of the ChatTLA project. The spec-gen branch is at EricSpencer00/chattla-20b.

Evaluation — held-out 151 theorems

Two evaluations on the same 151-row v2 holdout (theorems disjoint from training by (source_file, theorem_line)):

metric	single-shot (T=0)	best-of-8 (T=0.5)
`parse_rate` (any K)	62.25%	84.11%
`any_proved` (any K had ≥1 obligation discharged)	2/151 (1.32%)	3/151 (1.99%)
`full_proved` (any K hit gold-obligation count)	1/151	2/151
total obligations machine-discharged (across best K each row)	141	22

Three theorems machine-proved by best-of-K (all proved tier — full proofs accepted by tlapm):

theorem	obligations	tier
`TLAPS.tla:L446`	2/2	`proved`
`AddTwo.tla:L49` (LearnProofs)	18/18	`proved`
`Lock.tla:L84` (locks_auxiliary_vars)	2/2	`proved`

Single-shot's standout result: VoteProof.tla:L520 — a Paxos-style theorem where greedy decoding produced a proof body that tlapm accepted for 139/142 sub-obligations (97.9%). The same theorem at T=0.5 sampling landed parse_error for all 8 attempts — best-of-K is not strictly dominant; greedy and sampling each find different theorems.

Practical recommendation: for production, run both strategies (greedy + best-of-8) and union the results. That gives 4 unique theorems proved across 151 with 161 total obligations, vs 3 / 22 from bo-K alone.

What problem does this solve

Given a TLA+ module that contains definitions, an Init / Next / Spec, and a THEOREM (or LEMMA) statement, the model writes the proof body (hierarchical step bullets <1>1 ... <1> QED, with BY, BY DEF, OBVIOUS, USE, DEFINE, SUFFICES, ASSUME, PROVE, CASE, PICK, WITNESS, and the PTL / Zenon / SMT backends).

Training data

1,357 verified proof rows built from 553 round-tripped theorem chunks across 265 .tla files in tlaplus/examples, tlapm/library, tlapm/examples, tlapm/examples_draft, and FormaLLM/data. Two row types per theorem:

full: preamble + statement → entire proof body
piecewise: preamble + statement + steps[0..k] → step k+1

Each row was independently verified by reconstructing a synthetic .tla module and running tlapm 1.5.0 on it; only chunks that round-tripped (≥1 obligation discharged) were kept.

Training config


Base	`openai/gpt-oss-20b` (no stacking on prior ChatTLA checkpoints)
LoRA	`r=16, alpha=32, dropout=0, target_modules=all-linear`
Optimizer	AdamW, `lr=2e-5`, cosine schedule
Sequence	`max_length=3072`
Batch	`per_device=1, grad_accum=8` (effective 8)
Steps / epochs	170 / 1
Hardware	2× RTX 8000 (48 GB), naive PP
Wall time	3:47:54
Final train loss	1.668 (avg) / 1.217 (final batch)
Final mean_token_accuracy	0.712

Inference — recommended (best-of-8 union greedy)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE = "openai/gpt-oss-20b"
ADAPTER = "EricSpencer00/chattla-20b-prover-v3"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(
    BASE, attn_implementation="eager",
    torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

DEV = ("You are ChatTLA-Prover... Your job is to write the PROOF body... "
       "Output only the proof. Reasoning: medium")
USER = "Write the TLAPS proof for the final theorem in the following module.\n\n```tla\n<your module here>\n```"

prompt = tokenizer.apply_chat_template(
    [{"role": "developer", "content": DEV}, {"role": "user", "content": USER}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy candidate
with torch.no_grad():
    g = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Best-of-8 candidates
with torch.no_grad():
    s = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                       temperature=0.5, num_return_sequences=8)

# Union, dedupe, run tlapm on each, accept first that passes.

Caveats

This is a v3 first prover-branch run. Prior ChatTLA "prover" attempts trained on a corpus that was 98% TLC spec-gen data despite the name; v3 is the first iteration where the corpus is 100% proof targets and the eval measures tlapm-acceptance.
1.3–2.0% any-proved on 151 unseen theorems is a low absolute rate. Real prover models (DeepSeek-Prover, Lean Prover Pipeline) achieve 30–50% on similarly-scoped benchmarks but use 10–100× more proof-pair training data and per-step search rather than full-proof generation.
Two RL attempts (scripts/launch_prover_grpo.sh) failed to lift the rate — v1 was killed by 768-token completion truncation collapsing reward variance, v2 at T=1.0 pushed the model off-distribution into non-proof text. The bottleneck appears to be SFT data scale, not RL recipe.

License

Apache 2.0. Built on openai/gpt-oss-20b (Apache 2.0). Seed data via LUC-AI4FM/FormaLLM (MIT) and tlapm (BSD-2-Clause).

Downloads last month: 135

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for EricSpencer00/chattla-20b-prover-v3

Base model

openai/gpt-oss-20b

Adapter

(227)

this model