@@ -17,12 +17,14 @@ widget:
 # Glaurung Small 001
-A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis.
 ## Overview
 **Glaurung Small 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux, Alpine, Ubuntu, Debian, Rocky).
 ### Key Features
 - **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
 - **Binary-Aware**: Trained on actual executable files, not hex strings
@@ -36,16 +38,31 @@ A RoBERTa-based masked language model trained on binary executable files for sec
 - **Layers**: 12
 - **Attention Heads**: 12
 - **Vocabulary Size**: 65,536 tokens
 - **Max Position Embeddings**: 520
 - **Special Tokens**:
   - `<|start|>` (0): Beginning of sequence
-  - `<|end|>` (1): End token
   - `<|sep|>` (2): Separator/EOS
   - `<|cls|>` (3): Classification token
   - `<|pad|>` (4): Padding
   - `<|mask|>` (5): Mask token for MLM
   - `<|unk|>` (6): Unknown token
 ## Installation & Loading
 ```bash

 # Glaurung Small 001
+A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis. Part of the [Glaurung](https://github.com/mjbommar/glaurung) project: a modern reverse engineering framework with first-class AI integration.
 ## Overview
 **Glaurung Small 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux, Alpine, Ubuntu, Debian, Rocky).
+This is the **small variant** (160M parameters, 12 layers), intended for faster inference. For a higher-capacity model, see [glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001) (371M parameters, 24 layers).
 ### Key Features
 - **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
 - **Binary-Aware**: Trained on actual executable files, not hex strings
 - **Layers**: 12
 - **Attention Heads**: 12
 - **Vocabulary Size**: 65,536 tokens
+- **Tokenizer**: [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)
 - **Max Position Embeddings**: 520
 - **Special Tokens**:
   - `<|start|>` (0): Beginning of sequence
+  - `<|end|>` (1): End token
   - `<|sep|>` (2): Separator/EOS
   - `<|cls|>` (3): Classification token
   - `<|pad|>` (4): Padding
   - `<|mask|>` (5): Mask token for MLM
   - `<|unk|>` (6): Unknown token
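
The special-token IDs listed above can be sanity-checked against whatever vocabulary you actually load. The following is a minimal sketch, not part of the model card: `check_special_tokens` is a hypothetical helper, and it assumes the vocabulary is available as a plain `dict` mapping token strings to integer IDs.

```python
# Special-token IDs exactly as listed in the model card.
SPECIAL_TOKENS = {
    "<|start|>": 0,
    "<|end|>": 1,
    "<|sep|>": 2,
    "<|cls|>": 3,
    "<|pad|>": 4,
    "<|mask|>": 5,
    "<|unk|>": 6,
}

# 65,536 == 2**16, so every token ID fits in an unsigned 16-bit integer.
VOCAB_SIZE = 65_536


def check_special_tokens(vocab: dict[str, int]) -> list[str]:
    """Return human-readable mismatches between `vocab` and the documented IDs.

    Hypothetical helper for illustration; an empty list means the vocabulary
    agrees with the model card.
    """
    problems = []
    # Each documented special token must be present with the documented ID.
    for token, expected_id in SPECIAL_TOKENS.items():
        actual_id = vocab.get(token)
        if actual_id != expected_id:
            problems.append(f"{token}: expected {expected_id}, got {actual_id}")
    # No token ID may fall outside the documented vocabulary range.
    for token, token_id in vocab.items():
        if not 0 <= token_id < VOCAB_SIZE:
            problems.append(f"{token}: ID {token_id} outside 0..{VOCAB_SIZE - 1}")
    return problems


if __name__ == "__main__":
    # Stand-in vocabulary; replace with the dict from your loaded tokenizer.
    for line in check_special_tokens(dict(SPECIAL_TOKENS)) or ["all special tokens match"]:
        print(line)
```

If the tokenizer is loaded through the Hugging Face `tokenizers` library, the dictionary returned by its `get_vocab()` method could be passed in directly; whether the published tokenizer repository loads that way is an assumption to verify.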
+## Glaurung Ecosystem
+This model is part of the **Glaurung** project ecosystem:
+### 🔧 Main Project
+- **[Glaurung](https://github.com/mjbommar/glaurung)** - A modern reverse engineering framework designed as a replacement for Ghidra, with first-class AI integration throughout the analysis pipeline. It combines Rust's performance with Python's accessibility, and integrates AI agents at every level, from format detection to decompilation.
+### 🤖 Model Family
+- **[glaurung-small-001](https://huggingface.co/mjbommar/glaurung-small-001)** (this model) - 160M parameters, 12 layers, faster inference
+- **[glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001)** - 371M parameters, 24 layers
+### 🔤 Tokenizer
+- **[binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)** - 65K vocabulary BPE tokenizer trained on multi-byte patterns
 ## Installation & Loading
 ```bash