Instructions to use ericflo/qwen3-0.6b-summarizer with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use ericflo/qwen3-0.6b-summarizer with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="ericflo/qwen3-0.6b-summarizer",
    filename="qwen3-0.6b-summarizer-f16.gguf",
)

text = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."

# create_chat_completion expects a list of role/content messages, not a bare string
llm.create_chat_completion(
    messages=[{"role": "user", "content": text}]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use ericflo/qwen3-0.6b-summarizer with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ericflo/qwen3-0.6b-summarizer:F16
# Run inference directly in the terminal:
llama-cli -hf ericflo/qwen3-0.6b-summarizer:F16
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ericflo/qwen3-0.6b-summarizer:F16
# Run inference directly in the terminal:
llama-cli -hf ericflo/qwen3-0.6b-summarizer:F16
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ericflo/qwen3-0.6b-summarizer:F16
# Run inference directly in the terminal:
./llama-cli -hf ericflo/qwen3-0.6b-summarizer:F16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ericflo/qwen3-0.6b-summarizer:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ericflo/qwen3-0.6b-summarizer:F16
Use Docker
docker model run hf.co/ericflo/qwen3-0.6b-summarizer:F16
- LM Studio
- Jan
- Ollama
How to use ericflo/qwen3-0.6b-summarizer with Ollama:
ollama run hf.co/ericflo/qwen3-0.6b-summarizer:F16
- Unsloth Studio
How to use ericflo/qwen3-0.6b-summarizer with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ericflo/qwen3-0.6b-summarizer to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ericflo/qwen3-0.6b-summarizer to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ericflo/qwen3-0.6b-summarizer to start chatting
- Pi
How to use ericflo/qwen3-0.6b-summarizer with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ericflo/qwen3-0.6b-summarizer:F16
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "ericflo/qwen3-0.6b-summarizer:F16" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use ericflo/qwen3-0.6b-summarizer with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf ericflo/qwen3-0.6b-summarizer:F16
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ericflo/qwen3-0.6b-summarizer:F16
Run Hermes
hermes
- Docker Model Runner
How to use ericflo/qwen3-0.6b-summarizer with Docker Model Runner:
docker model run hf.co/ericflo/qwen3-0.6b-summarizer:F16
- Lemonade
How to use ericflo/qwen3-0.6b-summarizer with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull ericflo/qwen3-0.6b-summarizer:F16
Run and chat with the model
lemonade run user.qwen3-0.6b-summarizer-F16
List all available models
lemonade list
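Several of the options above start llama-server as a local OpenAI-compatible server. The following is a minimal sketch of calling it from Python with only the standard library. It assumes the server listens on http://localhost:8080 (the address used in the Pi configuration above) and exposes the `/v1/completions` route. The helper names `build_payload` and `summarize` are hypothetical, and the prompt text is the template this model card gives under Prompt Format.

```python
import json
import urllib.request

# Assumed address of a running llama-server; adjust if you changed host or port.
SERVER_URL = "http://localhost:8080/v1/completions"


def build_payload(text: str, max_tokens: int = 80, temperature: float = 0.3) -> dict:
    """Build an OpenAI-style completion request using the model's prompt template."""
    prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": ["\n"],  # the model is meant to emit a single line
    }


def summarize(text: str, url: str = SERVER_URL) -> str:
    """POST the request to the server and return the generated summary text."""
    data = json.dumps(build_payload(text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"].strip()
```

With the server running, `summarize("Your text to summarize here...")` should return one line of text. The plain completions route is used here rather than chat completions so that the prompt reaches the model in the same shape it was trained on.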
Qwen3-0.6B Summarizer
A one-sentence summarizer fine-tuned from Qwen3-0.6B using LoRA distillation. Feed it any text, get back a concise one-sentence summary.
The model was trained by distilling 6,720 summaries generated by Gemini 3 Flash Preview into Qwen3-0.6B. It learns to compress markdown text (chat logs, task descriptions, bug reports, planning notes) into clear, information-dense one-liners.
Example
Input: "Eric wants to reorganize how Cloud Eric handles project planning loops.
Currently the planning task runs every 30 minutes and creates sub-tasks,
but it often creates duplicates because it does not check what tasks are
already running or what PRs are already open. The fix should add a dedup
check that reviews pending tasks and recent GitHub PRs before creating
anything new."
Output: "Cloud Eric planning bug fix - current task creates duplicates because
it lacks a dedup check for pending tasks and open PRs."
375 characters in, 124 out: 67% compression while preserving the root cause and fix.
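The compression figure follows from the two character counts. A small sketch of the arithmetic, where `compression_percent` is a hypothetical helper name:

```python
def compression_percent(chars_in: int, chars_out: int) -> float:
    """Percentage by which the summary is shorter than the input."""
    if chars_in <= 0:
        raise ValueError("chars_in must be positive")
    return (1 - chars_out / chars_in) * 100


# Character counts taken from the example above.
print(round(compression_percent(375, 124)))  # 67
```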
Quick Start
With llama-cpp-python (CPU, no GPU needed)
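from llama_cpp import Llama

# Load the Q8_0 GGUF from the current directory.
# n_ctx is the context window in tokens (prompt plus output); raise it if inputs approach the ~2,000-character limit.
llm = Llama("qwen3-0.6b-summarizer-q8_0.gguf", n_ctx=512, n_threads=8, verbose=False)

text = "Your text to summarize here..."
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
# Stop at the first newline or end-of-turn token so only one sentence is returned
output = llm(prompt, max_tokens=80, temperature=0.3, stop=["\n", "<|im_end|>"])
print(output["choices"][0]["text"])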
With llama.cpp CLI
./llama-cli -m qwen3-0.6b-summarizer-q8_0.gguf \
-p "Summarize in one sentence:\nYour text here\n\nSummary:" \
-n 80 --temp 0.3
With transformers (GPU)
Apply the LoRA weights to the base model:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model and tokenizer, and move the model to the GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Load and apply LoRA weights (see lora_weights/best_distill.pt)
# Or use the pre-merged GGUF files directly with llama-cpp-python
# Note: until the LoRA weights are applied, this runs the unmodified base model

text = "Your text here"
prompt = f"Summarize in one sentence:\n{text}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect
output = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.3)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
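The layout of lora_weights/best_distill.pt is not described here, so the sketch below does not load it. It only illustrates, on a toy matrix, what applying or merging a LoRA adapter usually means: the frozen weight W is replaced by W + (alpha / rank) * B A. The scaling convention, the helper names, and all numeric values are assumptions for illustration; the repository's training/merge_and_export.py is the authoritative version.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (made-up values, not real weights).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (out, in)
A = [[1.0, 2.0]]              # LoRA down-projection, shape (rank, in)
B = [[0.5], [0.25]]           # LoRA up-projection, shape (out, rank)

merged = merge_lora(W, A, B, rank=1, alpha=2)
print(merged)  # [[2.0, 2.0], [0.5, 2.0]]
```

Under this convention, the rank=16 and alpha=32 reported under Training Details would give a scale factor of 2.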
Files
| File | Size | Description |
|---|---|---|
| qwen3-0.6b-summarizer-q8_0.gguf | 610 MB | Recommended. Q8_0 quantized, best speed/quality tradeoff. |
| qwen3-0.6b-summarizer-f16.gguf | 1.1 GB | Full F16 precision. Slightly better quality, slower. |
| lora_weights/best_distill.pt | 8.8 MB | Raw LoRA weights (PyTorch). Apply to base Qwen3-0.6B. |
| training/training_metrics.json | 789 KB | Full training metrics (per-step loss, LR, grad norms). |
| training/training_charts.png | 157 KB | Training loss curves visualization. |
| training/gpu_distill.py | 18 KB | Training script (for reproduction). |
| training/merge_and_export.py | 10 KB | LoRA merge + GGUF export script. |
Performance
| Metric | Value |
|---|---|
| Inference speed (CPU, 8 threads) | 3-5 seconds per summary (~7 tok/s) |
| Inference speed (GPU) | <0.5 seconds per summary |
| Model load time | 0.6s (Q8_0) |
| Average output length | ~30 tokens |
| Max recommended input | ~2,000 characters |
| Validation loss | 1.136 |
Training Details
Method
LoRA distillation from Gemini 3 Flash Preview outputs. The base Qwen3-0.6B is frozen; only LoRA adapters (rank=16, alpha=32) on q_proj and v_proj across all 28 attention layers are trained.
- Training data: 6,720 (text, summary) pairs. Text is markdown content from a personal knowledge management system (chat logs, task descriptions, project notes, bug reports). Summaries were generated by Gemini 3 Flash Preview.
- Split: 6,048 train / 672 validation
- Optimizer: AdamW, lr=2e-4, weight_decay=0.01, cosine schedule
- Mixed precision: float16 with GradScaler
- Batch size: 8
- Epochs: 5 (best at epoch 3)
- Hardware: NVIDIA L4 (24GB) on RunPod
- Training time: 31 minutes
- Training cost: ~$0.20
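As a rough sanity check on the setup above, the number of trainable LoRA parameters and optimizer steps can be estimated from the stated rank, layer count, split, and batch size. The projection shapes below are assumptions about Qwen3-0.6B (hidden size 1024, 16 query heads and 8 key/value heads with head dimension 128) and should be checked against the base model's config.json. Float32 storage and the absence of gradient accumulation are also assumptions.

```python
# Assumed Qwen3-0.6B shapes; verify against the base model's config.json.
HIDDEN_SIZE = 1024   # input width of q_proj and v_proj
Q_PROJ_OUT = 2048    # 16 attention heads x head_dim 128
V_PROJ_OUT = 1024    # 8 key/value heads x head_dim 128
NUM_LAYERS = 28      # from the Method paragraph
RANK = 16            # from the Method paragraph


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = lora_params(HIDDEN_SIZE, Q_PROJ_OUT, RANK) + lora_params(HIDDEN_SIZE, V_PROJ_OUT, RANK)
total_params = per_layer * NUM_LAYERS
size_mib = total_params * 4 / 2**20  # 4 bytes per parameter if stored as float32

steps_per_epoch = 6048 // 8          # training pairs / batch size
total_steps = steps_per_epoch * 5    # 5 epochs

print(total_params, size_mib, steps_per_epoch, total_steps)
# 2293760 8.75 756 3780
```

Under these assumptions the adapter holds about 2.3M parameters, and the estimated size is in the same range as the 8.8 MB listed for lora_weights/best_distill.pt.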
Prompt Format
The model was trained with this exact prompt template:
Summarize in one sentence:
{text}

Summary:
Use this format for best results. The model outputs a single sentence and stops.
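A small helper can keep the template in one place. This is a sketch, not part of the released code: `build_prompt` and `clean_summary` are hypothetical names, and trimming to 2,000 characters simply mirrors the "Max recommended input" figure from the Performance table.

```python
PROMPT_TEMPLATE = "Summarize in one sentence:\n{text}\n\nSummary:"
MAX_INPUT_CHARS = 2000  # "Max recommended input" from the Performance table


def build_prompt(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Trim the input to the recommended length and fill the template."""
    trimmed = text.strip()[:max_chars]
    return PROMPT_TEMPLATE.format(text=trimmed)


def clean_summary(raw: str) -> str:
    """Keep only the first line of the completion, without surrounding spaces."""
    stripped = raw.strip()
    return stripped.splitlines()[0].strip() if stripped else ""
```

The result of `build_prompt(text)` can be passed as the prompt in the Quick Start examples, and `clean_summary` applied to the returned text.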
Training Curves
Best validation loss of 1.136 at epoch 3. Mild overfitting begins at epoch 4.
Why Distillation?
We tried five approaches before settling on distillation. The table shows two of them alongside the final model:
| Approach | Val Loss | Quality |
|---|---|---|
| Prefix tuning (embedding → soft tokens) | 1.15-1.24 | Hallucinated entity names |
| LoRA + embedding projection | 1.14-1.16 | Better but still imprecise on details |
| Text distillation (this model) | 1.14 | Near-verbatim reproduction |
The key insight: embedding vectors don't encode specific details (PR numbers, app names, exact error messages). By training directly on raw text, the model can see and reproduce those details. The distillation approach produces summaries that are nearly indistinguishable from the Gemini originals.
Sample Generations
From the validation set (unseen during training):
| Reference (Gemini) | Generated (This Model) |
|---|---|
| Batch deployment - merged seven branches into clouderic including FolkReel query fixes, desktop/mobi... | Batch deployment - merged seven branches for clouderic including folkreel query field fixes, desktop... |
| FolkReel planning cycle - reviewed project state and strategic focus on AI-led interview elicitation... | FolkReel planning cycle - reviewed project state to prioritize interview elicitation and the iterati... |
| Bug fix for Claude configuration error - adding symlink-on-startup logic to the Go binary to ensure... | Claude configuration fix - adding symlink-on-startup logic to Go binary and running tests to resolv... |
| WebUI Next planning cycle - enforcing a tight 10-minute loop to address subpar product quality... | WebUI Next planning cycle - enforcing a short cadence to address poor visual design and interaction... |
Limitations
- Domain-specific: Trained on software engineering/devops content. Will work on general text but style is tuned for technical summaries.
- Single sentence: Always outputs one sentence. Not suitable for multi-paragraph summarization.
- English only: Trained exclusively on English text.
- Max input ~2K chars: Longer texts get truncated. For very long documents, consider chunking.
- No thinking/reasoning: This is a distilled model; it pattern-matches rather than reasons about content.
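For the chunking suggested in the input-length limitation above, one possible approach is to split on blank lines and pack paragraphs into chunks of at most about 2,000 characters. The function below is a sketch under that assumption; `chunk_text` is a hypothetical helper, and splitting on paragraph breaks is one reasonable choice rather than something the training setup prescribes.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split any single paragraph that is itself too long.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        # Pack the paragraph into the current chunk if it still fits.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be summarized separately, which yields one sentence per chunk. Whether a second pass over the joined sentences gives a useful overall summary has not been evaluated here.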
License
Apache 2.0 (same as the base Qwen3-0.6B model).
Acknowledgments
- Qwen3-0.6B by Alibaba Cloud: the base model
- Gemini 3 Flash Preview by Google: generated the training summaries
- llama.cpp: GGUF format and CPU inference
- RunPod: GPU training infrastructure
- Built as part of the Cloud Eric project
