ChatTLA-20B Prover v3 (LoRA)

A LoRA adapter on top of openai/gpt-oss-20b trained to write TLAPS proofs for TLA+ theorems. Prover branch of the ChatTLA project. The spec-gen branch is at EricSpencer00/chattla-20b.

Evaluation — held-out 151 theorems

Two evaluations on the same 151-row v2 holdout (theorems disjoint from training by (source_file, theorem_line)):

metric single-shot (T=0) best-of-8 (T=0.5)
parse_rate (any K) 62.25% 84.11%
any_proved (any K had ≥1 obligation discharged) 2/151 (1.32%) 3/151 (1.99%)
full_proved (any K hit gold-obligation count) 1/151 2/151
total obligations machine-discharged (across best K each row) 141 22

Three theorems machine-proved by best-of-K (all proved tier — full proofs accepted by tlapm):

theorem obligations tier
TLAPS.tla:L446 2/2 proved
AddTwo.tla:L49 (LearnProofs) 18/18 proved
Lock.tla:L84 (locks_auxiliary_vars) 2/2 proved

Single-shot's standout result: VoteProof.tla:L520 — a Paxos-style theorem where greedy decoding produced a proof body that tlapm accepted for 139/142 sub-obligations (97.9%). The same theorem at T=0.5 sampling landed parse_error for all 8 attempts — best-of-K is not strictly dominant; greedy and sampling each find different theorems.

Practical recommendation: for production, run both strategies (greedy + best-of-8) and union the results. That gives 4 unique theorems proved across 151 with 161 total obligations, vs 3 / 22 from bo-K alone.

What problem does this solve

Given a TLA+ module that contains definitions, an Init / Next / Spec, and a THEOREM (or LEMMA) statement, the model writes the proof body (hierarchical step bullets <1>1 ... <1> QED, with BY, BY DEF, OBVIOUS, USE, DEFINE, SUFFICES, ASSUME, PROVE, CASE, PICK, WITNESS, and the PTL / Zenon / SMT backends).

Training data

1,357 verified proof rows built from 553 round-tripped theorem chunks across 265 .tla files in tlaplus/examples, tlapm/library, tlapm/examples, tlapm/examples_draft, and FormaLLM/data. Two row types per theorem:

  • full: preamble + statement → entire proof body
  • piecewise: preamble + statement + steps[0..k] → step k+1

Each row was independently verified by reconstructing a synthetic .tla module and running tlapm 1.5.0 on it; only chunks that round-tripped (≥1 obligation discharged) were kept.

Training config

Base openai/gpt-oss-20b (no stacking on prior ChatTLA checkpoints)
LoRA r=16, alpha=32, dropout=0, target_modules=all-linear
Optimizer AdamW, lr=2e-5, cosine schedule
Sequence max_length=3072
Batch per_device=1, grad_accum=8 (effective 8)
Steps / epochs 170 / 1
Hardware 2× RTX 8000 (48 GB), naive PP
Wall time 3:47:54
Final train loss 1.668 (avg) / 1.217 (final batch)
Final mean_token_accuracy 0.712

Inference — recommended (best-of-8 union greedy)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE = "openai/gpt-oss-20b"
ADAPTER = "EricSpencer00/chattla-20b-prover-v3"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(
    BASE, attn_implementation="eager",
    torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

DEV = ("You are ChatTLA-Prover... Your job is to write the PROOF body... "
       "Output only the proof. Reasoning: medium")
USER = "Write the TLAPS proof for the final theorem in the following module.\n\n```tla\n<your module here>\n```"

prompt = tokenizer.apply_chat_template(
    [{"role": "developer", "content": DEV}, {"role": "user", "content": USER}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy candidate
with torch.no_grad():
    g = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Best-of-8 candidates
with torch.no_grad():
    s = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                       temperature=0.5, num_return_sequences=8)

# Union, dedupe, run tlapm on each, accept first that passes.

Caveats

  • This is a v3 first prover-branch run. Prior ChatTLA "prover" attempts trained on a corpus that was 98% TLC spec-gen data despite the name; v3 is the first iteration where the corpus is 100% proof targets and the eval measures tlapm-acceptance.
  • 1.3–2.0% any-proved on 151 unseen theorems is a low absolute rate. Real prover models (DeepSeek-Prover, Lean Prover Pipeline) achieve 30–50% on similarly-scoped benchmarks but use 10–100× more proof-pair training data and per-step search rather than full-proof generation.
  • Two RL attempts (scripts/launch_prover_grpo.sh) failed to lift the rate — v1 was killed by 768-token completion truncation collapsing reward variance, v2 at T=1.0 pushed the model off-distribution into non-proof text. The bottleneck appears to be SFT data scale, not RL recipe.

License

Apache 2.0. Built on openai/gpt-oss-20b (Apache 2.0). Seed data via LUC-AI4FM/FormaLLM (MIT) and tlapm (BSD-2-Clause).

Downloads last month
135
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for EricSpencer00/chattla-20b-prover-v3

Adapter
(227)
this model