mjbommar commited on
Commit
ec3a2f9
·
verified ·
1 Parent(s): 22ced0d

Upload Glaurung Small 001 - RoBERTa model for binary analysis

Browse files
Files changed (1) hide show
  1. README.md +19 -2
README.md CHANGED
@@ -17,12 +17,14 @@ widget:
17
 
18
  # Glaurung Small 001
19
 
20
- A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis.
21
 
22
  ## Overview
23
 
24
  **Glaurung Small 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux, Alpine, Ubuntu, Debian, Rocky).
25
 
 
 
26
  ### Key Features
27
  - **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
28
  - **Binary-Aware**: Trained on actual executable files, not hex strings
@@ -36,16 +38,31 @@ A RoBERTa-based masked language model trained on binary executable files for sec
36
  - **Layers**: 12
37
  - **Attention Heads**: 12
38
  - **Vocabulary Size**: 65,536 tokens
 
39
  - **Max Position Embeddings**: 520
40
  - **Special Tokens**:
41
  - `<|start|>` (0): Beginning of sequence
42
- - `<|end|>` (1): End token
43
  - `<|sep|>` (2): Separator/EOS
44
  - `<|cls|>` (3): Classification token
45
  - `<|pad|>` (4): Padding
46
  - `<|mask|>` (5): Mask token for MLM
47
  - `<|unk|>` (6): Unknown token
48
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
49
  ## Installation & Loading
50
 
51
  ```bash
 
17
 
18
  # Glaurung Small 001
19
 
20
+ A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis. Part of the [Glaurung](https://github.com/mjbommar/glaurung) project: a modern reverse engineering framework with first-class AI integration.
21
 
22
  ## Overview
23
 
24
  **Glaurung Small 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux, Alpine, Ubuntu, Debian, Rocky).
25
 
26
+ This is the **small variant** (160M parameters, 12 layers) optimized for faster inference. For enhanced understanding, see [glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001) (371M parameters).
27
+
28
  ### Key Features
29
  - **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
30
  - **Binary-Aware**: Trained on actual executable files, not hex strings
 
38
  - **Layers**: 12
39
  - **Attention Heads**: 12
40
  - **Vocabulary Size**: 65,536 tokens
41
+ - **Tokenizer**: [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)
42
  - **Max Position Embeddings**: 520
43
  - **Special Tokens**:
44
  - `<|start|>` (0): Beginning of sequence
45
+ - `<|end|>` (1): End token
46
  - `<|sep|>` (2): Separator/EOS
47
  - `<|cls|>` (3): Classification token
48
  - `<|pad|>` (4): Padding
49
  - `<|mask|>` (5): Mask token for MLM
50
  - `<|unk|>` (6): Unknown token
51
 
52
+ ## Glaurung Ecosystem
53
+
54
+ This model is part of the **Glaurung** project ecosystem:
55
+
56
+ ### 🔧 Main Project
57
+ - **[Glaurung](https://github.com/mjbommar/glaurung)** - A modern reverse engineering framework designed to replace Ghidra with first-class AI integration throughout the analysis pipeline. Built with Rust's performance and Python's accessibility, featuring AI agents integrated at every level from format detection to decompilation.
58
+
59
+ ### 🤖 Model Family
60
+ - **[glaurung-small-001](https://huggingface.co/mjbommar/glaurung-small-001)** (this model) - 160M parameters, 12 layers, faster inference
61
+ - **[glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001)** - 371M parameters, 24 layers
62
+
63
+ ### 🔤 Tokenizer
64
+ - **[binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)** - 65K vocabulary BPE tokenizer trained on multi-byte patterns
65
+
66
  ## Installation & Loading
67
 
68
  ```bash