How to use mjbommar/glaurung-small-001 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="mjbommar/glaurung-small-001")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("mjbommar/glaurung-small-001")
model = AutoModelForMaskedLM.from_pretrained("mjbommar/glaurung-small-001")
```
Upload Glaurung Small 001 - RoBERTa model for binary analysis
README.md
CHANGED
@@ -17,12 +17,14 @@ widget:

 # Glaurung Small 001

-A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis.

 ## Overview

 **Glaurung Small 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux, Alpine, Ubuntu, Debian, Rocky).

 ### Key Features
 - **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
 - **Binary-Aware**: Trained on actual executable files, not hex strings
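
To illustrate what "multi-byte tokens" means in the Key Features list, here is a toy byte-level BPE trainer. It is only a sketch under assumptions: the helper names (`most_frequent_pair`, `merge_pair`, `train_bpe`), the merge count, and the sample bytes are hypothetical, and the project's real tokenizer may differ in training data, vocabulary size, and merge rules.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    """Learn up to num_merges merges over raw bytes (IDs 0-255 are single bytes)."""
    seq = list(data)
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = 256 + step  # new multi-byte token IDs start after the byte range
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return merges, seq


# Illustrative input: the x86-64 prologue `push rbp; mov rbp, rsp`, repeated.
sample = bytes.fromhex("554889e5") * 4
merges, encoded = train_bpe(sample, num_merges=3)
print(len(sample), "bytes ->", len(encoded), "tokens")
```

With this input, three merges are enough to fold each 4-byte prologue into a single token, which is the kind of compression a BPE vocabulary trained on executables is meant to provide.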
@@ -36,16 +38,31 @@ A RoBERTa-based masked language model trained on binary executable files for sec
 - **Layers**: 12
 - **Attention Heads**: 12
 - **Vocabulary Size**: 65,536 tokens
 - **Max Position Embeddings**: 520
 - **Special Tokens**:
   - `<|start|>` (0): Beginning of sequence
-  - `<|end|>` (1): End token
   - `<|sep|>` (2): Separator/EOS
   - `<|cls|>` (3): Classification token
   - `<|pad|>` (4): Padding
   - `<|mask|>` (5): Mask token for MLM
   - `<|unk|>` (6): Unknown token
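
The special-token IDs listed above could be used in a masked-language-modeling preprocessing step roughly as follows. This is a minimal sketch, not the project's training code: the helpers `frame_and_pad` and `mask_tokens`, the 15% default masking rate, and the choice to wrap a sequence in `<|start|>` and `<|end|>` are assumptions for illustration. The label value `-100` follows the common PyTorch convention for positions ignored by the loss.

```python
import random

# Special-token IDs as listed in the README.
SPECIAL_TOKENS = {
    "<|start|>": 0,
    "<|end|>": 1,
    "<|sep|>": 2,
    "<|cls|>": 3,
    "<|pad|>": 4,
    "<|mask|>": 5,
    "<|unk|>": 6,
}
SPECIAL_IDS = set(SPECIAL_TOKENS.values())
IGNORE_INDEX = -100  # assumed "do not score this position" label


def frame_and_pad(token_ids, length):
    """Wrap token_ids in start/end tokens and right-pad to length (assumed framing)."""
    if len(token_ids) + 2 > length:
        raise ValueError("sequence too long for the requested length")
    framed = [SPECIAL_TOKENS["<|start|>"], *token_ids, SPECIAL_TOKENS["<|end|>"]]
    return framed + [SPECIAL_TOKENS["<|pad|>"]] * (length - len(framed))


def mask_tokens(ids, rate=0.15, rng=None):
    """Replace a random subset of non-special tokens with the mask ID.

    Returns (masked_ids, labels); labels hold the original ID at masked
    positions and IGNORE_INDEX everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in ids:
        if tok not in SPECIAL_IDS and rng.random() < rate:
            masked.append(SPECIAL_TOKENS["<|mask|>"])
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)
    return masked, labels


# Hypothetical token IDs standing in for tokenizer output.
example = frame_and_pad([100, 200, 300], length=8)
masked, labels = mask_tokens(example, rng=random.Random(0))
```

Special tokens are never masked here, so padding and sequence boundaries stay intact whatever the masking rate.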

 ## Installation & Loading

 ```bash
@@ -17,12 +17,14 @@ widget:

 # Glaurung Small 001

+A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis. Part of the [Glaurung](https://github.com/mjbommar/glaurung) project: a modern reverse engineering framework with first-class AI integration.

 ## Overview

 **Glaurung Small 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux, Alpine, Ubuntu, Debian, Rocky).

+This is the **small variant** (160M parameters, 12 layers) optimized for faster inference. For enhanced understanding, see [glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001) (371M parameters).
+
 ### Key Features
 - **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
 - **Binary-Aware**: Trained on actual executable files, not hex strings
@@ -36,16 +38,31 @@ A RoBERTa-based masked language model trained on binary executable files for sec
 - **Layers**: 12
 - **Attention Heads**: 12
 - **Vocabulary Size**: 65,536 tokens
+- **Tokenizer**: [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)
 - **Max Position Embeddings**: 520
 - **Special Tokens**:
   - `<|start|>` (0): Beginning of sequence
+  - `<|end|>` (1): End token
   - `<|sep|>` (2): Separator/EOS
   - `<|cls|>` (3): Classification token
   - `<|pad|>` (4): Padding
   - `<|mask|>` (5): Mask token for MLM
   - `<|unk|>` (6): Unknown token

+## Glaurung Ecosystem
+
+This model is part of the **Glaurung** project ecosystem:
+
+### 🔧 Main Project
+- **[Glaurung](https://github.com/mjbommar/glaurung)** - A modern reverse engineering framework designed to replace Ghidra with first-class AI integration throughout the analysis pipeline. Built with Rust's performance and Python's accessibility, featuring AI agents integrated at every level from format detection to decompilation.
+
+### 🤖 Model Family
+- **[glaurung-small-001](https://huggingface.co/mjbommar/glaurung-small-001)** (this model) - 160M parameters, 12 layers, faster inference
+- **[glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001)** - 371M parameters, 24 layers
+
+### 🔤 Tokenizer
+- **[binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)** - 65K vocabulary BPE tokenizer trained on multi-byte patterns
+
 ## Installation & Loading

 ```bash