Instructions to use EricSpencer00/chattla-20b-prover-v3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use EricSpencer00/chattla-20b-prover-v3 with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b") model = PeftModel.from_pretrained(base_model, "EricSpencer00/chattla-20b-prover-v3") - Notebooks
- Google Colab
- Kaggle
ChatTLA-20B Prover v3 (LoRA)
A LoRA adapter on top of openai/gpt-oss-20b
trained to write TLAPS proofs for TLA+ theorems.
Prover branch of the ChatTLA project.
The spec-gen branch is at EricSpencer00/chattla-20b.
Evaluation — held-out 151 theorems
Two evaluations on the same 151-row v2 holdout (theorems disjoint from
training by (source_file, theorem_line)):
| metric | single-shot (T=0) | best-of-8 (T=0.5) |
|---|---|---|
parse_rate (any K) |
62.25% | 84.11% |
any_proved (any K had ≥1 obligation discharged) |
2/151 (1.32%) | 3/151 (1.99%) |
full_proved (any K hit gold-obligation count) |
1/151 | 2/151 |
| total obligations machine-discharged (across best K each row) | 141 | 22 |
Three theorems machine-proved by best-of-K (all proved tier — full
proofs accepted by tlapm):
| theorem | obligations | tier |
|---|---|---|
TLAPS.tla:L446 |
2/2 | proved |
AddTwo.tla:L49 (LearnProofs) |
18/18 | proved |
Lock.tla:L84 (locks_auxiliary_vars) |
2/2 | proved |
Single-shot's standout result: VoteProof.tla:L520 — a Paxos-style
theorem where greedy decoding produced a proof body that tlapm accepted
for 139/142 sub-obligations (97.9%). The same theorem at T=0.5 sampling
landed parse_error for all 8 attempts — best-of-K is not strictly
dominant; greedy and sampling each find different theorems.
Practical recommendation: for production, run both strategies (greedy + best-of-8) and union the results. That gives 4 unique theorems proved across 151 with 161 total obligations, vs 3 / 22 from bo-K alone.
What problem does this solve
Given a TLA+ module that contains definitions, an Init / Next / Spec,
and a THEOREM (or LEMMA) statement, the model writes the proof body
(hierarchical step bullets <1>1 ... <1> QED, with BY, BY DEF, OBVIOUS,
USE, DEFINE, SUFFICES, ASSUME, PROVE, CASE, PICK, WITNESS,
and the PTL / Zenon / SMT backends).
Training data
1,357 verified proof rows built from 553 round-tripped theorem chunks
across 265 .tla files in tlaplus/examples, tlapm/library,
tlapm/examples, tlapm/examples_draft, and FormaLLM/data. Two row types
per theorem:
- full: preamble + statement → entire proof body
- piecewise: preamble + statement + steps[0..k] → step k+1
Each row was independently verified by reconstructing a synthetic .tla
module and running tlapm 1.5.0 on it;
only chunks that round-tripped (≥1 obligation discharged) were kept.
Training config
| Base | openai/gpt-oss-20b (no stacking on prior ChatTLA checkpoints) |
| LoRA | r=16, alpha=32, dropout=0, target_modules=all-linear |
| Optimizer | AdamW, lr=2e-5, cosine schedule |
| Sequence | max_length=3072 |
| Batch | per_device=1, grad_accum=8 (effective 8) |
| Steps / epochs | 170 / 1 |
| Hardware | 2× RTX 8000 (48 GB), naive PP |
| Wall time | 3:47:54 |
| Final train loss | 1.668 (avg) / 1.217 (final batch) |
| Final mean_token_accuracy | 0.712 |
Inference — recommended (best-of-8 union greedy)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
BASE = "openai/gpt-oss-20b"
ADAPTER = "EricSpencer00/chattla-20b-prover-v3"
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(
BASE, attn_implementation="eager",
torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()
DEV = ("You are ChatTLA-Prover... Your job is to write the PROOF body... "
"Output only the proof. Reasoning: medium")
USER = "Write the TLAPS proof for the final theorem in the following module.\n\n```tla\n<your module here>\n```"
prompt = tokenizer.apply_chat_template(
[{"role": "developer", "content": DEV}, {"role": "user", "content": USER}],
tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy candidate
with torch.no_grad():
g = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Best-of-8 candidates
with torch.no_grad():
s = model.generate(**inputs, max_new_tokens=512, do_sample=True,
temperature=0.5, num_return_sequences=8)
# Union, dedupe, run tlapm on each, accept first that passes.
Caveats
- This is a v3 first prover-branch run. Prior ChatTLA "prover" attempts
trained on a corpus that was 98% TLC spec-gen data despite the name; v3
is the first iteration where the corpus is 100% proof targets and the
eval measures
tlapm-acceptance. - 1.3–2.0% any-proved on 151 unseen theorems is a low absolute rate. Real prover models (DeepSeek-Prover, Lean Prover Pipeline) achieve 30–50% on similarly-scoped benchmarks but use 10–100× more proof-pair training data and per-step search rather than full-proof generation.
- Two RL attempts (
scripts/launch_prover_grpo.sh) failed to lift the rate — v1 was killed by 768-token completion truncation collapsing reward variance, v2 at T=1.0 pushed the model off-distribution into non-proof text. The bottleneck appears to be SFT data scale, not RL recipe.
License
Apache 2.0. Built on openai/gpt-oss-20b (Apache 2.0).
Seed data via LUC-AI4FM/FormaLLM (MIT)
and tlapm (BSD-2-Clause).
- Downloads last month
- 135
Model tree for EricSpencer00/chattla-20b-prover-v3
Base model
openai/gpt-oss-20b