RavenBERT

RavenBERT is a SentenceTransformers embedding model specialized for smart-contract invariants (e.g., require(...), assert(...), if (...) revert) extracted from Solidity/Vyper sources.
It starts from web3se/SmartBERT-v2 and is contrastively fine-tuned so that cosine similarity reflects the semantic intent of guards used in transaction-reverting checks.

  • Architecture: BERT-family encoder (SmartBERT-v2) β†’ MeanPooling β†’ L2 Normalize (see the check after this list)
  • Embedding dimension: 768
  • Normalization: Enabled (unit-norm vectors; cosine ≑ dot product)
  • Intended use: clustering / semantic search / dedup / taxonomy building for short guard predicates (and optional messages)
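
In SentenceTransformers terms, the stack is visible on the loaded model; a minimal check (the model id matches Quick start below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
print(model)                                     # Transformer -> Pooling (mean) -> Normalize
print(model.get_sentence_embedding_dimension())  # 768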

Quick start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
sentences = [
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount"
]
emb = model.encode(sentences, convert_to_numpy=True, show_progress_bar=False)
# emb is L2-normalized row-wise; use cosine similarity for comparisons
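
Because the vectors are unit-norm, cosine similarity reduces to a dot product; continuing the snippet above:

sim = emb @ emb.T            # pairwise cosine similarity matrix
print(sim[0, 1], sim[0, 2])  # first guard vs. the other two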

Training summary (contrastive)

  • Base model: web3se/SmartBERT-v2
  • Objective: CosineSimilarityLoss (positives near 1.0, negatives near 0.0)
  • Pair construction: seed texts are embedded and L2-normalized; among each item's top_k=10 nearest-neighbor candidates, pairs with cosine β‰₯ 0.80 are labeled positive (at most 5 per item) and pairs with cosine ≀ 0.20 negative (see the sketch after this list)
  • Stats for this release: 1,647 unique texts β†’ 16,470 pairs (8,235 positive / 8,235 negative)
  • Hyperparams: epochs=1, batch_size=16, max_seq_len=512
  • Saved as: canonical SentenceTransformers layout (0_Transformer/, 1_Pooling/, 2_Normalize/)
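
The sketch below illustrates this recipe end to end. It assumes the base checkpoint loads as a plain Hugging Face transformer; the toy corpus, variable names, and simplified pair mining are illustrative, not the actual training script.

import numpy as np
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Module stack matching the saved layout: 0_Transformer / 1_Pooling / 2_Normalize.
word = models.Transformer("web3se/SmartBERT-v2", max_seq_length=512)
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pool, models.Normalize()])

texts = [  # toy stand-in for the 1,647 unique invariant texts
    "amountOut >= amountOutMin",
    "deadline >= block.timestamp",
    "balances[msg.sender] >= amount",
    "msg.sender == owner",
]

emb = model.encode(texts, convert_to_numpy=True)  # already unit-norm
sim = emb @ emb.T                                 # pairwise cosine similarity

TOP_K, TAU_POS, TAU_NEG, MAX_POS = 10, 0.80, 0.20, 5
examples = []
for i in range(len(texts)):
    # Nearest-neighbor candidates by cosine, excluding self.
    neighbors = [j for j in np.argsort(-sim[i]) if j != i][:TOP_K]
    n_pos = 0
    for j in neighbors:
        if sim[i, j] >= TAU_POS and n_pos < MAX_POS:
            examples.append(InputExample(texts=[texts[i], texts[j]], label=1.0))
            n_pos += 1
        elif sim[i, j] <= TAU_NEG:
            examples.append(InputExample(texts=[texts[i], texts[j]], label=0.0))

loader = DataLoader(examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)
model.save("RavenBERT")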

A more detailed methodology and evaluation appear in the RAVEN paper (semantic clustering of revert-inducing invariants).

Intended uses & limitations

Good for

  • Measuring semantic relatedness of short invariant predicates
  • Clustering guards by intent (e.g., access control, slippage, timeouts)
  • Deduplicating near-equivalent checks across contracts (see the sketch after this list)
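
A minimal dedup pass, assuming an illustrative merge cutoff of 0.95 (not a value from the paper):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
guards = [
    "balances[msg.sender] >= amount",
    "amount <= balances[msg.sender]",  # near-equivalent phrasing
    "deadline >= block.timestamp",
]
emb = model.encode(guards, convert_to_numpy=True)
sim = emb @ emb.T  # unit-norm vectors, so this is cosine similarity

THRESHOLD = 0.95  # assumed cutoff; tune on labeled pairs
kept = []
for i in range(len(guards)):
    # Keep a guard only if it is not a near-duplicate of one already kept.
    if all(sim[i, j] < THRESHOLD for j in kept):
        kept.append(i)
print([guards[i] for i in kept])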

Not ideal for

  • Long code blocks or whole-function embeddings
  • General code understanding outside invariant-style snippets
  • Non-EVM ecosystems without adaptation

Evaluation (paper)

When paired with DBSCAN on predicate-only text, RavenBERT produced compact, well-separated clusters (e.g., Silhouette β‰ˆ 0.93, S_Dbw β‰ˆ 0.043 at ~52% coverage), surfacing meaningful categories of defenses from reverted transactions. See the paper for the full protocol, ablations, and metrics.
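
A sketch of such a pipeline with scikit-learn; the predicates, eps, and min_samples below are placeholders, not the paper's data or settings:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
predicates = [
    "amountOut >= amountOutMin",
    "minReturn <= amountReturned",
    "deadline >= block.timestamp",
    "block.timestamp <= expiry",
    "msg.sender == owner",
    "hasRole(ADMIN_ROLE, msg.sender)",
]
emb = model.encode(predicates, convert_to_numpy=True)

labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(emb)

clustered = labels != -1  # DBSCAN marks unassigned points as noise (-1)
print("coverage:", clustered.mean())
if clustered.any() and len(set(labels[clustered])) > 1:
    print("silhouette:", silhouette_score(emb[clustered], labels[clustered], metric="cosine"))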

Reproducibility

  • Pair thresholds: Ο„β‚Š = 0.80, Ο„β‚‹ = 0.20
  • Normalization: L2 via sentence_transformers.models.Normalize() (see the check after this list)
  • Training log: ravenbert_training_stats.json (included in repo)
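
A quick check that the shipped Normalize module is active:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MojtabaEshghie/RavenBERT")
v = model.encode(["msg.sender == owner"], convert_to_numpy=True)
print(np.linalg.norm(v, axis=1))  # β‰ˆ 1.0: embeddings are unit-norm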

Citation

If you use RavenBERT, please cite the RAVEN paper and this model:

TBD

License

MIT
