Strand-Rust-Coder-v1: Rust Coding Model Fine-Tuned on Peer-Ranked Synthetic Data
Date: October 24, 2025
Abstract
Large Language Models (LLMs) for code generation demonstrate strong capabilities across mainstream programming languages, yet their performance on systems programming languages like Rust remains limited due to sparse training data and the language's unique ownership and type-system complexities. We present a comprehensive approach to fine-tuning the 14B-parameter Qwen2.5-Coder model ([1] Hui et al., 2024) for Rust through high-quality synthetic dataset generation powered by Fortytwo's ([42] Larin et al., 2025) swarm inference with peer-ranked consensus. Our method generates 191,008 training examples across 15 task categories, including code generation, completion, bug detection, refactoring, and optimization. Each category uses task-specific schemas that ensure proper context preservation under Rust's strict type requirements. The dataset construction leverages multiple collaborative nodes that systematically analyze real Rust codebases, create diverse examples, and cross-validate outputs for quality assurance. We evaluate our fine-tuned models on three benchmarks: our Strandset-Rust-v1, HumanEval-Rust ([2] Diverso AI Lab, 2024), and RustEvo2 ([3] Liang et al., 2025). Results show substantial improvements over baseline models: the 14B model achieves a +14% absolute improvement on our holdout set and significant gains on RustEvo2 (+13%), confirming that specialized training enhances Rust-specific capabilities without causing catastrophic forgetting of general coding proficiency, as shown by strong performance on HumanEval-Rust.
1 Introduction
Rust has emerged as one of the most influential modern programming languages for building high-performance, concurrent, and safety-critical systems. Originally designed by Mozilla and now maintained by an active open-source community, Rust combines low-level control comparable to C and C++ with strong safety guarantees enforced at compile time. Its ownership and borrowing model, lifetime system, and strict static typing collectively provide memory safety without a garbage collector—a property that makes it uniquely suited for operating systems, embedded software, WebAssembly modules, and performance-sensitive backend services ([4] Crichton, 2021).
However, the same design principles that make Rust so robust also introduce substantial challenges for automated reasoning, verification, and code generation. The ownership model enforces unique access semantics that prevent data races but require the compiler (and, by extension, any automated code generation model) to track complex aliasing and lifetime constraints. The borrow checker, which enforces these constraints, is notorious for rejecting code that appears correct at first glance yet violates deep lifetime or mutability invariants ([5] Yang et al., 2025). As a result, even small mistakes in automatically generated Rust programs can lead to compilation failures that are nontrivial to diagnose or fix.
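For illustration, the snippet below (not drawn from our dataset) shows the kind of aliasing constraint the borrow checker enforces: mutating a `Vec` while an immutable borrow of one of its elements is still live is rejected at compile time.

```rust
fn main() {
    let mut values = vec![1, 2, 3];

    // Immutable borrow of an element inside the vector.
    let first = &values[0];

    // Uncommenting the next line triggers error E0502: cannot borrow
    // `values` as mutable because it is also borrowed as immutable.
    // values.push(4);

    println!("first = {first}");

    // The immutable borrow has ended (non-lexical lifetimes), so
    // mutation is allowed again.
    values.push(4);
    println!("values = {values:?}");
}
```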
From the perspective of LLMs, Rust represents an especially demanding target. General-purpose code generation models—trained primarily on high-level, garbage-collected languages such as Python, JavaScript, and Java—often fail to internalize Rust’s unique idioms, such as zero-cost abstractions, explicit trait bounds, and pervasive use of pattern matching and iterators. As a result, while these models can generate syntactically valid Rust snippets, they frequently produce code that is non-compilable, non-idiomatic, or unsafe when integrated into larger systems ([6] Chen et al., 2025; [7] Jiang et al., 2024). Furthermore, the relatively small volume of high-quality Rust code in public repositories limits the statistical exposure of LLMs to idiomatic Rust patterns compared to more established ecosystems ([8] Joel et al., 2024).
Current large language models exhibit notable limitations when confronted with Rust’s complex type system, ownership semantics, and lifetime constraints, necessitating specialized approaches to improve AI-assisted systems programming. Understanding and addressing these deficiencies is critical not only for enhancing developer productivity but also for establishing Rust as a rigorous testbed for studying how code models learn complex static semantics and nonlocal program constraints.
Our prior investigation into self-supervised swarm inference mechanisms within trustless computational environments demonstrated that peer-ranked inference protocols generate robust quality signals across diverse programming tasks—a finding particularly relevant given Rust’s inherent complexity. This convergence of limited existing model performance and our ability to generate high-fidelity training data through Fortytwo’s swarm inference architecture provides both the motivation and methodological foundation for developing a domain-specialized Rust language model through careful fine-tuning with functionally validated, complex Rust tasks.
2 Related Work
Our work builds upon several research areas: general-purpose code generation models, Rust-specific modeling efforts, synthetic data generation for programming languages, and domain-specific fine-tuning techniques. We review each area and position our contributions within this landscape.
2.1 Code Generation Language Models
The evolution of large language models for code generation has accelerated rapidly since the introduction of Codex ([9] Chen et al., 2021), which demonstrated that GPT models fine-tuned on code could achieve impressive performance on programming benchmarks. This sparked a proliferation of specialized code models, each advancing different aspects of the field.
Open-source foundation models have democratized access to code generation capabilities. StarCoder ([10] Li et al., 2023) introduced a 15.5B parameter model trained on permissively licensed code from 86 programming languages, establishing strong baselines across multiple benchmarks. CodeLlama ([11] Rozière et al., 2024) built upon Llama 2 ([12] Touvron et al., 2023) with additional code-specific pre-training and introduced specialized variants for Python and instruction-following. WizardCoder ([13] Luo et al., 2023) demonstrated that instruction-tuning with complex code instructions could significantly improve performance, achieving state-of-the-art results on HumanEval at the time.
Commercial models have pushed the boundaries further. GPT-5 ([14] OpenAI, 2025) and Claude ([15] Anthropic, 2024) demonstrate strong code generation across languages, though their training data and methods remain proprietary. GitHub Copilot, powered by models from OpenAI, Anthropic, Google, and xAI, has demonstrated the practical value of code completion in real development workflows, with studies ([16] Mohamed et al., 2025) indicating measurable productivity gains.
Architecture innovations have explored different approaches to code modeling. CodeT5+ ([17] Wang et al., 2023) unified encoder-decoder and decoder-only architectures, showing benefits for both understanding and generation tasks. InCoder ([18] Fried et al., 2023) introduced bidirectional context through causal masking, enabling fill-in-the-middle capabilities crucial for code editing. CodeGen2 ([19] Nijkamp et al., 2023) explored multi-turn program synthesis through conversational paradigms.
However, most of these models exhibit a significant performance gap when generating Rust code compared to more common languages like Python or JavaScript. This disparity stems from both the relative scarcity of Rust in training corpora and the language’s unique challenges, which we address directly in our work.
2.2 Rust-Specific Approaches
Despite Rust’s growing importance in systems programming, dedicated efforts to improve LLM performance on Rust remain limited. The few existing works highlight both the challenges and opportunities in this domain.
Existing Rust models are scarce. Notably, Tessa-Rust-T1-7B ([20] Tesslate, 2024) represents one of the first attempts at a Rust-specialized model, fine-tuning a 7B parameter base model on Rust-specific data. However, it focuses primarily on basic code completion without addressing the broader spectrum of Rust programming tasks or the language’s unique semantic requirements.
Benchmark development for Rust has received more attention. MultiPL-E ([21] Cassano et al., 2022) extended HumanEval to 18 languages, including Rust, revealing that models consistently perform worse on Rust than on garbage-collected languages. RustEvo2 introduced a more comprehensive benchmark with 490 problems specifically designed to test Rust-specific features like ownership, borrowing, and lifetime management. These benchmarks confirm that existing models struggle with Rust’s distinctive characteristics as well as ecosystem knowledge.
Analysis of Rust challenges in the literature identifies several key difficulties. Prior studies analyzing code model failures on Rust find that ownership and borrowing violations account for over 40% of compilation errors in generated code. Models often produce syntactically valid but semantically incorrect Rust code, particularly around unsafe operations and trait implementations ([22] Wu et al., 2023).
Our work differs fundamentally from these previous efforts by taking a comprehensive, task-oriented approach. Rather than focusing solely on code completion or creating another benchmark, we systematically address 15 distinct programming tasks while ensuring that generated training data respects Rust’s strict type and ownership requirements.
2.3 Synthetic Data Generation for Code
The use of synthetic data for training code models has emerged as a powerful technique to address data scarcity and quality issues.
Self-instruction methods have proven effective for code. CodeAlpaca ([23] Chaudhary, 2023) applied self-instruction to generate 20K instruction-following examples for code. WizardCoder introduced Evol-Instruct, iteratively complexifying code instructions to create increasingly challenging training data. Magicoder ([24] Wei et al., 2023) combined self-instruction with retrieval from open-source projects, generating 75K synthetic examples that improved performance across multiple benchmarks.
Execution-based filtering adds another dimension of quality control. CodeT ([25] Chen et al., 2022) generated multiple solutions and filtered based on test execution results. AlphaCode ([26] Li et al., 2022a) used massive sampling with execution-based filtering to achieve competitive programming performance. Self-Debug ([27] Chen et al., 2023) taught models to iteratively refine code based on execution feedback.
Multi-agent approaches for synthetic data generation have shown promise. MetaGPT ([28] Hong et al., 2024) used multiple specialized agents for different software engineering tasks. ChatDev ([29] Qian et al., 2023) simulated an entire software company with collaborative agents. However, these approaches focused on high-level software development rather than training data generation.
Our swarm intelligence approach with peer-ranked consensus advances this line of work in several ways. We employ multiple collaborative nodes not just for generation but for systematic validation, ensuring that each synthetic example is both functionally correct and idiomatically appropriate for Rust. The peer-ranking mechanism acts as a quality gate, rejecting examples that violate Rust’s safety guarantees or compilation requirements.
2.4 Domain-Specific Fine-Tuning
The strategy of specializing general models for specific domains or languages has proven effective across various applications.
Language-specific fine-tuning has shown consistent benefits. PyCodeGPT ([30] Zan et al., 2022) demonstrated that Python-specific fine-tuning improved performance by 8–15% on Python benchmarks. JavaBERT ([31] De Sousa and Hasselbring, 2021) showed similar gains for Java code understanding tasks.
Task-specific approaches have explored different granularities of specialization. CodeReviewer ([32] Li et al., 2022b) is fine-tuned specifically for code review tasks, learning to identify bugs and suggest improvements. TestPilot ([33] Schäfer et al., 2023) specializes in test generation, achieving higher coverage than general models. DocGen ([34] Pimparkhede et al., 2024) focused exclusively on documentation generation, producing more accurate and comprehensive docstrings.
Instruction tuning has emerged as a particularly effective technique. CodeLlama-Instruct showed that instruction tuning significantly improves zero-shot task performance. OctoCoder ([35] Muennighoff et al., 2024) demonstrated that instruction tuning on Git commits could improve code explanation and editing capabilities.
Our approach synthesizes insights from these various specialization strategies. We combine language-specific (Rust) and task-specific (15 distinct tasks) fine-tuning with a structured instruction format. This multi-dimensional specialization, coupled with our swarm-generated training data, enables models to handle the full spectrum of Rust programming tasks while respecting the language’s unique constraints.
3 Methodology
3.1 Synthetic Dataset Construction
3.1.1 Swarm Intelligence Infrastructure
We leverage Fortytwo’s Swarm Inference — a decentralized inference architecture where multiple specialized models (nodes) collaborate to generate, evaluate, and aggregate outputs through peer-ranked consensus.
The swarm intelligence approach offers several critical advantages:
- Distributed generation: Inferences are processed in parallel across many independent nodes
- Peer-ranking: Each node evaluates and scores outputs from others, providing a robust measure of quality
- Consensus aggregation: The network converges on the best-scoring outputs through reputation-based consensus
- Diversity preservation: Independent participants contribute varied reasoning paths, reducing mode collapse and homogenization
- Decentralized quality control: Evaluation and filtering are performed collectively, eliminating centralized moderation
Our implementation employs a three-tier architecture:
- Generator nodes: Participants that produce initial examples based on task schemas, ensuring syntactic, semantic, and Rust-specific validity
- Consensus layer: Aggregates peer feedback, computes final rankings, and determines which inferences are selected for inclusion in the final output compilation
- Reputation layer: Maintains on-chain scoring of node reliability and contribution quality over time
3.1.2 Data Generation Process
The dataset construction follows a systematic pipeline leveraging real Rust crates from crates.io as source material. We processed 2,383 popular Rust crates, extracting patterns, idioms, and real-world usage examples.
Task-Specific Schema Design: We developed 15 specialized schemas that preserve Rust’s type information and ownership semantics. Each schema follows a structured format ensuring proper context preservation—critical for Rust’s strict compilation requirements. The schemas include fields for task descriptions, code content, and, crucially, a code_context field that maintains all necessary type definitions, trait implementations, and module imports.
Multi-Stage Generation Pipeline: The generation process operates through the following distinct phases:
Crate Analysis Phase
- Extract and parse source code from crates.io repositories
- Build dependency graphs and API surfaces
- Identify common patterns and idioms
- Construct a shared knowledge base for participating nodes
Distributed Generation Phase
- Nodes receive crate content and task-specific prompts
- Nodes organize into swarms to perform parallel peer-ranked inference and aggregation
- Each swarm performs peer-ranking checks for:
  - Syntactic correctness
  - Semantic validity
  - Idiomatic Rust patterns
  - Compilation potential
- Participants assign quality scores in the range [0, 1]
- Reputation-weighted consensus determines which inferences are selected for inclusion in the final output (a minimal scoring sketch follows this list)
- Each group produces 3–5 examples per crate per task category
- Examples include both code and all required context
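The reputation-weighted selection step above can be sketched as follows; the `Review` type, its field names, and the exact aggregation are illustrative assumptions rather than Fortytwo's production logic.

```rust
/// A peer review of a candidate example: the reviewer's reputation
/// and the quality score it assigned in [0, 1].
struct Review {
    reviewer_reputation: f64,
    score: f64,
}

/// Reputation-weighted consensus score for one candidate example.
/// Returns None if no reviews (or zero total reputation) are available.
fn consensus_score(reviews: &[Review]) -> Option<f64> {
    let total_rep: f64 = reviews.iter().map(|r| r.reviewer_reputation).sum();
    if total_rep <= 0.0 {
        return None;
    }
    let weighted: f64 = reviews
        .iter()
        .map(|r| r.reviewer_reputation * r.score)
        .sum();
    Some(weighted / total_rep)
}

fn main() {
    let reviews = vec![
        Review { reviewer_reputation: 0.9, score: 0.8 },
        Review { reviewer_reputation: 0.5, score: 0.6 },
        Review { reviewer_reputation: 0.7, score: 0.9 },
    ];
    // Candidates above the 0.7 consensus threshold are retained;
    // borderline 0.5-0.7 cases go to an additional review pass.
    let accept = consensus_score(&reviews).map_or(false, |s| s > 0.7);
    println!("accepted: {accept}");
}
```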
Quality Filtering and Finalization
- From 200,000 initial samples, an additional swarm validation pass retained 191,233 high-quality examples
- 225 examples were separated to form the Strandset-Rust-v1 hold-out benchmark
- Final training set: 191,008 validated examples
- Examples achieving a consensus score above 0.7 were retained
- Borderline cases (0.5–0.7) underwent additional review
3.1.3 Dataset Structure and Format
The final dataset follows a structured format optimized for fine-tuning:
| Field | Type | Description |
|---|---|---|
| crate_name | string | Source crate identifier |
| input_data | string | JSON-encoded input following the task schema |
| output_data | string | JSON-encoded output following the task schema |
| task_category | string | One of 15 task category identifiers |
Table 1: Dataset Schema Structure
Both input_data and output_data fields contain JSON-serialized task-specific schemas, enabling structured training while maintaining flexibility across different task types.
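Assuming the schema in Table 1 and the serde/serde_json crates, a single training record can be modeled as below; the inner JSON field names (task, code, code_context) are illustrative of the context-preserving schemas described in Section 3.1.2, not a verbatim dump from the dataset.

```rust
// Assumes serde (with the "derive" feature) and serde_json in Cargo.toml.
use serde::{Deserialize, Serialize};

/// One row of the training dataset, following Table 1. The inner
/// `input_data` / `output_data` strings are themselves JSON documents
/// whose shape depends on `task_category`.
#[derive(Debug, Serialize, Deserialize)]
struct DatasetRecord {
    crate_name: String,
    input_data: String,    // JSON-encoded task-specific input schema
    output_data: String,   // JSON-encoded task-specific output schema
    task_category: String, // one of the 15 task identifiers
}

fn main() -> Result<(), serde_json::Error> {
    // Hypothetical record; the inner fields illustrate how code_context
    // carries the type definitions needed for compilation.
    let row = r#"{
        "crate_name": "serde",
        "input_data": "{\"task\":\"Implement Display for Point\",\"code_context\":\"struct Point { x: i32, y: i32 }\"}",
        "output_data": "{\"code\":\"impl std::fmt::Display for Point { /* ... */ }\"}",
        "task_category": "code_generation"
    }"#;
    let record: DatasetRecord = serde_json::from_str(row)?;
    println!("{} -> {}", record.crate_name, record.task_category);
    println!("input schema: {}", record.input_data);
    println!("expected output: {}", record.output_data);
    Ok(())
}
```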
3.1.4 Quality Control and Validation
Compilation Testing: For code-generation tasks, we implemented automated compilation testing (a minimal harness sketch follows this list):
- Extract generated code and context into minimal Rust projects
- Attempt compilation with rustc
- Track compilation success rates per category
- Overall observed compilation success rate: 94.3%
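The harness referenced above can be approximated with the minimal sketch below; it assumes rustc is on PATH and type-checks a single concatenated source file, whereas examples with external crate dependencies require a full Cargo project.

```rust
use std::{fs, path::Path, process::Command};

/// Writes `context` + `code` to a single source file and checks whether
/// rustc accepts it. This approximates, but does not reproduce, the
/// project-based setup used in the actual pipeline.
fn compiles(context: &str, code: &str, scratch_dir: &Path) -> std::io::Result<bool> {
    let source = format!("{context}\n{code}\n");
    let src_path = scratch_dir.join("candidate.rs");
    fs::write(&src_path, source)?;

    let status = Command::new("rustc")
        .arg("--edition=2021")
        .arg("--crate-type=lib") // type-check without requiring a main()
        .arg("-o")
        .arg(scratch_dir.join("candidate_out"))
        .arg(&src_path)
        .status()?;

    Ok(status.success())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let ok = compiles(
        "pub struct Point { pub x: i32, pub y: i32 }",
        "pub fn norm2(p: &Point) -> i32 { p.x * p.x + p.y * p.y }",
        &dir,
    )?;
    println!("compiled: {ok}");
    Ok(())
}
```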
Semantic Validation: Beyond syntax, we validate semantic correctness through:
- Ownership rules compliance verification
- Lifetime constraint satisfaction checking
- Trait-bound completeness analysis
- Safe/unsafe boundary adherence validation
- Type inference compatibility testing
Statistical Quality Metrics and Task Distribution: We track comprehensive quality metrics across the dataset. The final training set consists of 191,008 examples. The distribution of examples across the 15 task categories is not uniform, reflecting the natural frequency of these tasks in real-world codebases.
| Metric | Value | Description |
|---|---|---|
| Total Examples | 191,008 | After filtering from 200,000 |
| Average Length | 127 tokens | Per example |
| Compilation Rate | 94.3% | Successfully compiling examples |
| Consensus Acceptance | 73.2% | Examples passing peer review |
| Coverage | 89% | Rust language features covered |
| Diversity Index | 0.82 | Token-level uniqueness measure |
Table 2: Dataset Quality Metrics
| Task Category | Count |
|---|---|
| code_generation | 17,241 |
| docstring_generation | 16,889 |
| code_explanation | 16,505 |
| comment_generation | 16,143 |
| code_summarization | 15,884 |
| function_naming | 15,776 |
| variable_naming | 15,754 |
| code_review | 15,195 |
| code_completion | 14,527 |
| code_refactoring | 14,324 |
| bug_detection | 12,765 |
| code_optimization | 12,569 |
| code_search | 3,766 |
| test_generation | 3,180 |
| api_usage_prediction | 490 |
Table 3: Final Training Set Distribution by Task Category
3.2 Model Selection and Architecture
3.2.1 Base Model Selection: Why Qwen2.5-Coder
We selected Qwen2.5-Coder as our base model family—specifically the 14B parameter variant. The 14B model offers an optimal balance between capability and computational efficiency: it achieves performance comparable to much larger models while remaining practical for deployment.
Superior Code Understanding Architecture: Qwen2.5-Coder incorporates several architectural innovations specifically designed for code:
- Grouped Query Attention (GQA): Reduces memory bandwidth requirements while maintaining quality, crucial for processing lengthy Rust code with extensive type annotations ([36] Ainslie et al., 2023).
- Extended RoPE embeddings: Native support for 128K context length, essential for understanding large Rust projects with complex module structures ([37] Su et al., 2024).
- Code-aware tokenization: Optimized tokenizer that preserves Rust syntax elements, reducing fragmentation of keywords and operators ([38] Hu et al., 2025).
- Instruction-tuned variants: Available checkpoints optimized for code-specific tasks.
Exceptional Pre-training Quality: The Qwen2.5-Coder series was pre-trained on a large-scale data corpus:
- 5.5 trillion tokens spanning high-quality code and natural language.
- Data sources include source code, text–code grounding pairs, and synthetic data.
- 92 programming languages in training data, including substantial Rust representation.
- Balanced mixture of code (60%), code-related documentation (20%), and natural language (20%).
- Curriculum learning that progresses from simple to complex code patterns.
Proven Performance Advantages: Empirical evaluations demonstrate Qwen2.5-Coder’s strengths:
- Strong performance across mainstream programming languages and on low-resource languages.
- Superior handling of long-context dependencies, critical for Rust’s ownership tracking.
- Better generalization to unseen programming patterns compared to similarly sized models.
Practical Considerations
- Open availability: Fully open weights under the Apache 2.0 license.
- Fit for specialization: While the 14B model exhibits strong baseline performance on general coding tasks, like other general-purpose models it shows typical degradation on Rust-specific challenges—motivating our Rust-focused fine-tuning.
3.2.2 Model Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Total Parameters | 14.7B | Balance of capability and efficiency |
| Hidden Size | 5,120 | Wide representations for complex semantics |
| Layers | 48 | Deep architecture for abstract reasoning |
| Attention Heads | 40 | Multi-head attention for diverse patterns |
| Context Length | 32,768 | Handles large Rust codebases |
| Vocabulary Size | 151,936 | Comprehensive token coverage |
Table 4: Qwen2.5-Coder-14B Configuration
3.3 Fine-Tuning Setup
3.3.1 Training Configuration
We employed parameter-efficient fine-tuning (PEFT) ([39] Xu et al., 2023) using the LoRA (Low-Rank Adaptation) ([40] Hu et al., 2022) method to adapt the models to Rust-specific patterns while preserving their general capabilities.
| Parameter | 14B Model | Justification |
|---|---|---|
| Method | LoRA | Efficient adaptation |
| LoRA r (rank) | 64 | Balance expressiveness/efficiency |
| LoRA alpha | 16 | Standard scaling factor |
| Learning Rate | 5e-5 | Smaller LR for stability |
| Batch Size | 128 | Fit into memory constraints |
| Epochs | 3 | Prevent overfitting |
| Optimizer | AdamW | Best for transformer fine-tuning |
| Precision | BF16 | Stability/speed trade-off |
| Loss | Completion-only | Focus on response |
Table 5: Training Hyperparameters
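For reference, LoRA keeps the pretrained weight matrix W_0 frozen and learns a low-rank update scaled by α/r; with r = 64 and α = 16 from Table 5, the effective scaling factor is 16/64 = 0.25:

```latex
W' = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)
```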
3.3.2 Training Dynamics
Training exhibited stable convergence with consistent improvements across metrics:
- Monotonic decrease in training loss across all epochs
- Minimal overfitting indicators (train–val gap < 0.1)
- Task-specific performance gains emerging after epoch 1
- Convergence achieved without loss spikes or gradient explosions
Key observations from training:
- Early learning: Basic Rust syntax mastered in the first 20% of training
- Mid-phase: Ownership and borrowing patterns emerge around 40–60%
- Late refinement: Idiomatic Rust and complex crate usage after 70%
3.3.3 Instruction Formatting
We adopted a structured three-part instruction format optimized for Rust tasks:
- Instruction Section: Clear task description and requirements
- Context Section: Necessary type definitions, imports, and dependencies
- Response Section: Expected output format and constraints
This format clearly separates concerns, improving the model’s ability to understand task requirements while maintaining access to necessary Rust-specific context.
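A hypothetical rendering of this format is shown below; the section delimiters and wording are illustrative assumptions, not the verbatim training template.

```text
### Instruction
Implement the requested function so that it compiles against the provided context.

### Context
pub struct Point { pub x: f64, pub y: f64 }

### Response
impl Point {
    /// Euclidean distance from the origin.
    pub fn norm(&self) -> f64 {
        (self.x * self.x + self.y * self.y).sqrt()
    }
}
```

Keeping all type definitions in the context section mirrors the code_context field used during dataset construction.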
3.4 Evaluation Protocol
3.4.1 Test Suites
To comprehensively evaluate our models, we selected three distinct test suites, each designed to test different facets of Rust programming proficiency.
Strandset-Rust-v1: Our custom-curated evaluation set, consisting of a diverse mix of all 15 task categories held out from our training data. This benchmark is designed to directly measure the capabilities acquired during fine-tuning and assess generalization across a wide range of practical, real-world tasks beyond simple code generation.
HumanEval-Rust: The Rust version of the canonical HumanEval benchmark, comprising 164 hand-crafted programming problems. This benchmark tests fundamental algorithmic correctness and serves as a crucial control to measure the retention of general coding abilities after specialization.
RustEvo2: A comprehensive, evolution-based test suite specifically designed to probe a deep understanding of Rust’s unique features. It contains complex refactoring and optimization tasks, rigorously testing a model’s grasp of ownership, borrowing, lifetimes, and advanced idiomatic patterns.
3.4.2 Baseline Models
To contextualize the performance of our fine-tuned model, we conducted a broad comparison against a range of open-source and proprietary state-of-the-art models.
Primary Baselines: The original, non-fine-tuned foundation models:
- Qwen2.5-Coder-14B-Instruct
- Qwen2.5-Coder-32B-Instruct
Open-Source Models:
- Qwen3 family (Qwen3-Coder-30B-A3B-Instruct, Qwen3 Max, etc.)
- DeepSeek (DeepSeek Chat, DeepSeek V3.1 Terminus)
- Meta (Llama 3.3 70B, Llama 4 Maverick)
Proprietary Models:
- OpenAI (GPT-5 Codex, GPT-4o, o1-mini)
- Anthropic (Claude Sonnet 4.5, Claude 3.7 Sonnet)
- X.AI (Grok-4, Grok-3, Grok Code Fast)
- Google (Gemini 2.5 Pro, Gemini 2.5 Flash)
All proprietary models were evaluated via the OpenRouter API with consistent settings to ensure a fair comparison.
3.4.3 Evaluation Pipeline and Metrics
We developed a robust evaluation pipeline that uses task-specific metrics to provide a nuanced assessment of model performance.
Code Execution Environment: All code-based tasks were executed in an isolated Docker environment using the Rust 1.86.0 compiler. The pipeline supports parallel execution with up to 64 workers, featuring timeout protection and resource limits to ensure stability.
Task-Specific Metrics:
- Code-Related Tasks (e.g., code_generation, bug_detection, code_refactoring): Metric: Unit Test Pass Rate.
- Non-Code Tasks (e.g., docstring_generation, code_summarization, function_naming): Metric: LLM-Based Correctness. Outputs are evaluated by a judge model (Claude Sonnet 4), which provides a binary score (0 or 1) based on semantic accuracy, completeness, and adherence to Rust conventions.
- API Usage & Code Completion Tasks: Metric: Syntax-Weighted Similarity. This score combines a check for syntactic validity with the Levenshtein text similarity ratio between the predicted and ground-truth code (a minimal sketch follows this list).
- Comment Generation Task: Metric: Compilation Success. A binary metric (0 or 1) that verifies whether the generated code with comments compiles successfully.
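The sketch below shows one plausible realization of the syntax-weighted similarity score, assuming the strsim crate and syn with its "full" feature; the product weighting and the 0.5 penalty for non-parsing predictions are illustrative choices rather than the exact combination used in our pipeline.

```rust
// Assumes strsim and syn (with features = ["full"]) in Cargo.toml.
use strsim::normalized_levenshtein;

/// One plausible syntax-weighted similarity: the normalized Levenshtein
/// ratio between prediction and reference, down-weighted when the
/// prediction does not parse as valid Rust.
fn syntax_weighted_similarity(prediction: &str, reference: &str) -> f64 {
    let text_similarity = normalized_levenshtein(prediction, reference);
    // `syn::parse_file` requires the crate's "full" feature.
    let syntax_weight = if syn::parse_file(prediction).is_ok() { 1.0 } else { 0.5 };
    syntax_weight * text_similarity
}

fn main() {
    let reference = "pub fn add(a: i32, b: i32) -> i32 { a + b }";
    let prediction = "pub fn add(x: i32, y: i32) -> i32 { x + y }";
    println!("score = {:.3}", syntax_weighted_similarity(prediction, reference));
}
```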
4 Results
This section presents the quantitative and qualitative results of our experiments, comparing the performance of our fine-tuned model, Strand-Rust-Coder-14B-v1, against its baselines and other state-of-the-art models.
4.1 Overall Performance
Our fine-tuned models demonstrated substantial performance improvements on Rust-specific benchmarks (Strandset-Rust-v1 and RustEvo2) while remaining competitive on the general-purpose HumanEval-Rust benchmark.
The results clearly show that our fine-tuned model significantly outperforms its baselines on domain-specific tests. The Strand-Rust-Coder-14B-v1 model is competitive with or surpasses much larger proprietary models like GPT-5 Codex on the Rust-specific benchmarks, demonstrating the power of targeted specialization. The slight regression on HumanEval-Rust is an expected trade-off, indicating successful adaptation to the training distribution without catastrophic forgetting.
4.2 Task Category Performance
A detailed breakdown of performance on our Strandset-Rust-v1 reveals where the models improved the most.
The most dramatic improvements occurred in complex tasks like test_generation, code_refactoring, and api_usage_prediction, suggesting our dataset successfully taught the models a nuanced understanding of Rust idioms and best practices.
Figure 2: Overall Performance (Unit Test Pass Rate@1) across Strandset-Rust-v1, HumanEval-Rust, and RustEvo2 at ckpt1200.
| Model | Strandset-Rust-v1 | HumanEval-Rust | RustEvo2 |
|---|---|---|---|
| Strand-Rust-Coder-14B-v1 (Ours) | 0.43 | 0.76 | 0.43 |
| Strand-Rust-Coder-14B-v1-ckpt3200 (Ours) | 0.50 | 0.74 | 0.42 |
| Qwen2.5-Coder-14B-Instruct (Base) | 0.29 | 0.78 | 0.30 |
| Qwen2.5-Coder-32B-Instruct (Base) | 0.29 | 0.77 | 0.31 |
| openai/gpt-5-codex | 0.47 | 0.82 | 0.28 |
| anthropic/claude-sonnet-4.5 | 0.46 | 0.88 | 0.21 |
| qwen/qwen3-coder | 0.43 | 0.82 | 0.21 |
| x-ai/grok-4 | 0.39 | 0.85 | 0.37 |
| deepseek/deepseek-v3.1-terminus | 0.37 | 0.82 | 0.33 |
| meta-llama/llama-4-maverick | 0.35 | 0.52 | 0.28 |
| google/gemini-2.5-pro | 0.28 | 0.24 | 0.22 |
Table 6: Overall Performance (Unit Test Pass Rate@1) Across Benchmarks at ckpt1200
| Task Category | Base 14B | Ours 14B |
|---|---|---|
| test_generation | 0.00 | 0.51 |
| code_refactoring | 0.04 | 0.20 |
| api_usage_prediction | 0.27 | 0.63 |
| code_summarization | 0.40 | 0.80 |
| function_naming | 0.53 | 0.80 |
| variable_naming | 0.87 | 1.00 |
| code_generation | 0.40 | 0.49 |
| bug_detection | 0.09 | 0.00 |
| code_optimization | 0.07 | 0.06 |
Table 7: Per-Category Performance on Strandset-Rust-v1 (Unit Test Pass Rate@1)
4.3 Checkpoint Analysis
Analyzing performance at different training checkpoints reveals the learning dynamics.
| Checkpoint | baseline | ckpt1200 | ckpt1600 | ckpt3200 | ckpt4452 |
|---|---|---|---|---|---|
| Unit Test Pass Rate@1 | 0.29 | 0.43 | 0.47 | 0.50 | 0.44 |
Table 8: Performance of Strand-Rust-Coder-14B-v1 on Hold Out set by Checkpoint (Strandset-Rust-v1)
Performance improved rapidly within the first epoch (up to checkpoint 1200) and peaked around checkpoint 3200 (end of the second epoch). We selected ckpt1200 as it offers the best balance between Rust domain specialization and general utility: ckpt3200 scores higher on the hold-out set but regresses slightly on HumanEval-Rust and RustEvo2 (Table 6).
4.4 Qualitative Analysis
Beyond quantitative metrics, a qualitative review of model outputs reveals a significant improvement in code quality, idiomatic correctness, and handling of Rust-specific language features.
Example 1: Correcting an Ownership Error
A common failure mode for base models is generating code that violates Rust’s borrow checker rules. When asked to write a function to find the first long string in a vector, the base model produced code that failed to compile. It attempted to return a reference to a value within a loop, a classic ownership violation that the Rust compiler strictly forbids.
In contrast, the fine-tuned model provided a correct and idiomatic solution. It used the iter().find() method chain, which is the standard approach in Rust. Furthermore, it correctly wrapped the return type in an Option, properly handling the case where no matching string is found. This demonstrated a much deeper understanding of both Rust’s safety rules and its conventions.
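A representative version of the fine-tuned model's solution, reconstructed from the description above rather than quoted verbatim, is:

```rust
/// Returns the first string longer than `min_len`, if any.
/// Borrowing the element (rather than returning it by value from inside
/// a loop) keeps ownership with the caller and satisfies the borrow checker.
fn first_long_string(strings: &[String], min_len: usize) -> Option<&String> {
    strings.iter().find(|s| s.len() > min_len)
}

fn main() {
    let words = vec!["hi".to_string(), "borrow".to_string(), "checker".to_string()];
    match first_long_string(&words, 5) {
        Some(s) => println!("found: {s}"),
        None => println!("no long string found"),
    }
}
```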
Example 2: Idiomatic Trait Usage
When tasked with implementing the Display trait for a simple struct, the base model provided a functional but verbose implementation. It used the format! macro to create a temporary string and then write that string to the formatter, which involves an unnecessary memory allocation.
The fine-tuned model, however, generated a more concise and efficient solution using the write! macro directly. This is the canonical way to implement the Display trait in Rust, as it avoids intermediate allocations and is more performant. This subtle but important difference highlights the model’s absorption of idiomatic patterns from our training data.
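Both variants can be reconstructed as follows (illustrative, not verbatim model outputs); both compile, but the second avoids the intermediate String allocation:

```rust
use std::fmt;

struct Point {
    x: i32,
    y: i32,
}

// Verbose variant in the style of the base model: builds a temporary
// String with format! before writing it to the formatter.
struct VerbosePoint(Point);

impl fmt::Display for VerbosePoint {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = format!("({}, {})", self.0.x, self.0.y);
        f.write_str(&s)
    }
}

// Idiomatic variant in the style of the fine-tuned model: writes
// directly to the formatter with write!, avoiding the extra allocation.
impl fmt::Display for Point {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}, {})", self.x, self.y)
    }
}

fn main() {
    let p = Point { x: 3, y: 4 };
    println!("{p}");                                     // idiomatic impl
    println!("{}", VerbosePoint(Point { x: 3, y: 4 }));  // verbose impl
}
```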
5 Discussion
The experimental results strongly validate our hypothesis: a large-scale, high-quality synthetic dataset generated via swarm inference can effectively specialize a general-purpose code model for a complex, low-resource language like Rust.
5.1 Analysis of Key Findings
The substantial performance gains on RustEvo2 and our holdout set are the most significant findings. The models’ success indicates they have internalized concepts of ownership, lifetimes, and the trait system. The key to this success lies in the peer-ranked accuracy and structure of our training dataset produced through Fortytwo’s Swarm Inference. The 15 distinct task categories forced the model to learn about Rust from multiple perspectives—not just how to write code, but how to debug, refactor, and explain it. This holistic approach appears more effective than training on code generation alone.
5.2 The Role of Swarm Intelligence and Peer Review
We attribute the high quality of our dataset, and consequently the models’ performance, to our swarm intelligence generation pipeline. A single LLM generating data, even a powerful one, is prone to mode collapse and repetitive errors. Our decentralized approach, where multiple nodes generate and then cross-validate examples, introduces diversity and acts as a powerful quality filter. The peer-ranked consensus mechanism was particularly effective at filtering out subtle, non-compiling code or non-idiomatic solutions that might pass automated syntax checks but would be flagged by an experienced developer. Through this ranked output aggregation, the swarm consistently converged on the most correct and idiomatic examples, resulting in a dataset of exceptional precision. This process mimics a collaborative open-source project, producing data that allowed fine-tuning to deliver performance competitive with—and in many cases superior to—much larger centralized models.
5.3 Implications and Limitations
Our work has several practical implications. First, it provides a clear roadmap for improving LLM support for other niche or complex programming languages that are underrepresented in standard training corpora. Second, the fine-tuned Strand-Rust-Coder models can serve as powerful developer assistants, accelerating development and reducing the steep learning curve associated with Rust.
However, we acknowledge several limitations. Our approach relies on the quality of the initial seed crates; a biased or low-quality selection could propagate issues into the synthetic dataset. Furthermore, while our model is significantly improved, it can still struggle with extremely complex lifetime annotations or highly abstract generic code involving higher-kinded types. Finally, this study was limited to the Qwen2.5-Coder family, and further research is needed to determine whether these gains transfer to other model architectures.
6 Conclusion
In this paper, we presented a comprehensive approach to enhancing the capabilities of LLMs for Rust programming. By generating a large-scale, high-quality synthetic dataset of 191,008 examples across 15 tasks using a novel swarm intelligence pipeline with peer-ranked consensus, we successfully fine-tuned the Qwen2.5-Coder 14B model.
Our evaluation demonstrates that this domain-specialization strategy yields substantial improvements, with our model achieving an absolute performance gain of up to 14% on Rust-specific benchmarks. The fine-tuned models exhibit a deeper understanding of Rust’s unique ownership, borrowing, and type systems, producing code that is not only correct but also idiomatic. This work significantly closes the performance gap between LLMs on mainstream languages and their capabilities on Rust, paving the way for more effective AI-assisted systems programming. Future work will explore applying this methodology to other specialized domains.
References
[1] Qwen2.5-Coder Technical Report. Hui, B. et al. (2024). arXiv:2409.12186. https://doi.org/10.48550/arXiv.2409.12186
[2] HumanEval-Rust Dataset. Diverso AI Lab (2024). https://huggingface.co/datasets/diversoailab/humaneval-rust
[3] RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation. Liang, L. et al. (2025). arXiv:2503.16922. https://doi.org/10.48550/arXiv.2503.16922
[4] The Usability of Ownership. Crichton, W. (2021). arXiv:2011.06171. https://doi.org/10.48550/arXiv.2011.06171
[5] AutoVerus: Automated Proof Generation for Rust Code. Yang, C. et al. (2025, Oct). Proc. ACM Program. Lang., 9(OOPSLA2), 3454–3482. https://doi.org/10.1145/3763174
[6] A Survey on Evaluating Large Language Models in Code Generation Tasks. Chen, L. et al. (2025). arXiv:2408.16498. https://doi.org/10.48550/arXiv.2408.16498
[7] A Survey on Large Language Models for Code Generation. Jiang, J. et al. (2024). arXiv:2406.00515. https://doi.org/10.48550/arXiv.2406.00515
[8] A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages. Joel, S.; Wu, J.; Fard, F. (2024). ACM Transactions on Software Engineering and Methodology.
[9] Evaluating Large Language Models Trained on Code. Chen, M. et al. (2021). arXiv:2107.03374. https://doi.org/10.48550/arXiv.2107.03374
[10] StarCoder: may the source be with you! Li, R. et al. (2023). arXiv:2305.06161. https://doi.org/10.48550/arXiv.2305.06161
[11] Code Llama: Open Foundation Models for Code. Rozière, B. et al. (2024). arXiv:2308.12950. https://doi.org/10.48550/arXiv.2308.12950
[12] Llama 2: Open Foundation and Fine-Tuned Chat Models. Touvron, H. et al. (2023). arXiv:2307.09288. https://doi.org/10.48550/arXiv.2307.09288
[13] WizardCoder: Empowering Code LLMs with Evol-Instruct. Luo, Z. et al. (2023). arXiv:2306.08568. https://doi.org/10.48550/arXiv.2306.08568
[14] GPT-5 System Card. OpenAI (2025). https://cdn.openai.com/gpt-5-system-card.pdf
[15] The Claude 3 Model Family. Anthropic (2024). https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
[16] The Impact of LLM-Assistants on Software Developer Productivity. Mohamed, A. et al. (2025). arXiv:2507.03156. https://doi.org/10.48550/arXiv.2507.03156
[17] CodeT5+: Open Code LLMs for Code Understanding and Generation. Wang, Y. et al. (2023). arXiv:2305.07922. https://doi.org/10.48550/arXiv.2305.07922
[18] InCoder: A Generative Model for Code Infilling and Synthesis. Fried, D. et al. (2023). arXiv:2204.05999. https://doi.org/10.48550/arXiv.2204.05999
[19] CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. Nijkamp, E. et al. (2023). arXiv:2305.02309. https://doi.org/10.48550/arXiv.2305.02309
[20] Tessa-Rust-T1. Tesslate (2024). https://huggingface.co/Tesslate/Tessa-Rust-T1-7B
[21] MultiPL-E. Cassano, F. et al. (2022). arXiv:2208.08227. https://doi.org/10.48550/arXiv.2208.08227
[22] RustGen: An Augmentation Approach for Generating Compilable Rust Code with Large Language Models. Wu, X. et al. (2023). OpenReview: https://openreview.net/forum?id=y9A0vJ5vuM
[23] Code Alpaca: An Instruction-following Llama Model for Code Generation. Chaudhary, S. (2023). https://github.com/sahil280114/codealpaca
[24] Magicoder: Empowering Code Generation with OSS-Instruct. Wei, Y. et al. (2023). arXiv:2312.02120 (ICML 2024). https://doi.org/10.48550/arXiv.2312.02120
[25] CodeT: Code Generation with Generated Tests. Chen, B. et al. (2022). arXiv:2207.10397. https://doi.org/10.48550/arXiv.2207.10397
[26] Competition-Level Code Generation with AlphaCode. Li, Y. et al. (2022). Science, 378(6624), 1092–1097. https://doi.org/10.1126/science.abq1158
[27] Teaching Large Language Models to Self-Debug. Chen, X. et al. (2023). arXiv:2304.05128. https://doi.org/10.48550/arXiv.2304.05128
[28] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. Hong, S. et al. (2024). arXiv:2308.00352. https://doi.org/10.48550/arXiv.2308.00352
[29] ChatDev: Communicative Agents for Software Development. Qian, C. et al. (2023). arXiv:2307.07924 (ACL 2024). https://doi.org/10.48550/arXiv.2307.07924
[30] When Language Model Meets Private Library. Zan, D. et al. (2022). arXiv:2210.17236. https://doi.org/10.48550/arXiv.2210.17236
[31] JavaBERT: Training a Transformer-Based Model for the Java Programming Language. De Sousa, N. T., & Hasselbring, W. (2021). In 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), pages 90–95. IEEE.
[32] Automating Code Review Activities by Large-scale Pre-training. Li, Z. et al. (2022). In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022), pages 1035–1047. Association for Computing Machinery. https://doi.org/10.1145/3540250.3549081
[33] An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. Schäfer, M., Nadi, S., Eghbali, A., & Tip, F. (2023). IEEE Transactions on Software Engineering, 50(1), 85–105.
[34] DocCGen: Document-based Controlled Code Generation. Pimparkhede, S. et al. (2024). arXiv:2406.11925v2. https://doi.org/10.48550/arXiv.2406.11925
[35] OctoPack: Instruction Tuning Code Large Language Models. Muennighoff, N. et al. (2024). arXiv:2308.07124. https://doi.org/10.48550/arXiv.2308.07124
[36] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Ainslie, J. et al. (2023). arXiv:2305.13245. https://doi.org/10.48550/arXiv.2305.13245
[37] RoFormer: Enhanced Transformer with Rotary Position Embedding. Su, J. et al. (2024). Neurocomputing, 568, 127063.
[38] Optimizing Token Consumption in LLMs. Hu, J. et al. (2025). arXiv:2504.15989v2. https://doi.org/10.48550/arXiv.2504.15989
[39] Parameter-Efficient Fine-Tuning Methods. Xu, L. et al. (2023). arXiv:2312.12148. https://doi.org/10.48550/arXiv.2312.12148
[40] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J. et al. (2022). ICLR 2022.
[42] Fortytwo: Swarm Inference with Peer-Ranked Consensus. Larin, V. et al. (2025). arXiv:2510.24801. https://doi.org/10.48550/arXiv.2510.24801


