Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-4bit-MLX

A high-performance 4-bit MLX quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, optimized for Apple Silicon (M-series chips) to run deep, agentic-level reasoning locally.

The original BF16 weights are 55.6 GB. This conversion reduces the footprint to 14 GB, making it runnable on any Mac with 24 GB or more of unified memory, with headroom to spare for large context windows.
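As a sanity check, the reported 4.501 bits/weight (see the benchmarks below) is consistent with the 14 GB figure. This is a rough estimate assuming a 27B parameter count; exact numbers depend on which tensors stay unquantized:

```python
params = 27e9            # approximate weight count for a 27B dense model
bits_per_weight = 4.501  # effective rate reported for this quantization

size_bytes = params * bits_per_weight / 8
print(f"{size_bytes / 2**30:.1f} GiB")  # ~14.1 GiB, matching the 14 GB on-disk size
```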


🧠 Why This Model?

Most local LLMs are "reactive": they start generating a response before they've fully mapped out the logic. This model is deliberative.

Distilled from state-of-the-art Claude 4.6 Opus reasoning trajectories, it uses an advanced Chain-of-Thought (CoT) scaffold. Before providing its final answer, it enters an internal <think> state where it:

  • Deconstructs complex, multi-layered prompts into manageable sub-tasks
  • Simulates different solution paths and self-corrects logic errors before you see them
  • Reduces redundancy by adopting Claude's structured thinking pattern rather than the looping often seen in base reasoning models

This makes it the premier choice for technical planning, complex logic puzzles, and high-stakes decision support on Apple hardware.


πŸ“Š Performance Benchmarks

Tested on Apple M4 Pro, 64 GB · mlx-lm 0.30.7 · macOS 15

| Metric | Result |
| --- | --- |
| Model load time | 2.4 seconds |
| Prompt ingestion | 86.5 tokens/sec |
| Generation speed | 15.7 tokens/sec |
| Peak RAM usage | 15.6 GB |
| Bit rate | 4.501 bits/weight |
| Final size | 14 GB (3 shards) |
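The throughput figures translate into simple wall-clock estimates. A quick back-of-the-envelope calculation (hypothetical workload sizes, assuming the M4 Pro numbers above):

```python
prompt_speed = 86.5  # tokens/sec, prompt ingestion
gen_speed = 15.7     # tokens/sec, generation

# e.g. a 1,000-token prompt with a 500-token reasoned answer
prefill = 1000 / prompt_speed  # ~11.6 s to ingest the prompt
decode = 500 / gen_speed       # ~31.8 s to generate the answer
print(f"total ≈ {prefill + decode:.0f} s")
```

Reasoning traces routinely run to thousands of tokens, so generation speed dominates end-to-end latency for this model.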

πŸ’» System Requirements

| Component | Requirement |
| --- | --- |
| Hardware | Apple Silicon Mac (M1, M2, M3, M4 or later) |
| Minimum RAM | 24 GB unified memory |
| Recommended RAM | 32 GB+ (headroom for long-context reasoning) |
| OS | macOS 13.5 or later |
| Python | 3.10+ (Homebrew Python 3.12 recommended) |

πŸš€ Quick Start

1. Install the MLX library

pip install mlx-lm

2. Run in your terminal

python -m mlx_lm.chat \
  --model BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit
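For one-shot, scriptable runs instead of an interactive session, mlx-lm also ships a generate entry point. The flags below follow recent mlx-lm releases and may differ slightly by version:

```shell
python -m mlx_lm.generate \
  --model BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit \
  --prompt "Summarize the trade-offs of 4-bit quantization." \
  --max-tokens 512
```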

3. Python integration (recommended approach)

Use apply_chat_template with enable_thinking=True. This is the idiomatic way to trigger the reasoning mode; no manual prompt construction needed.

from mlx_lm import load, generate

model, tokenizer = load("BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit")

messages = [
    {
        "role": "user",
        "content": (
            "A farmer needs to cross a river with a wolf, a goat, and a cabbage. "
            "The boat can only hold the farmer and one other item. "
            "If left alone, the wolf eats the goat, and the goat eats the cabbage. "
            "How can he get everything across safely?"
        ),
    }
]

# enable_thinking=True inserts the <think> prefix automatically
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048, verbose=True)
print(response)

Tip: Set enable_thinking=False for fast, direct answers without the reasoning block, useful for simple lookups or latency-sensitive chat. Depending on the chat-template version, omitting the flag may still default to thinking enabled, so set it explicitly.

Manual prefill style (advanced)

If you want to see exactly what happens under the hood, you can construct the ChatML prompt manually and pre-fill the <think> token yourself. This mirrors what apply_chat_template produces; note that the example below also adds a system prompt, so the two prompts are equivalent in effect rather than byte-identical.

prompt = (
    "<|im_start|>system\n"
    "You are a highly analytical assistant.\n"
    "<|im_end|>\n"
    "<|im_start|>user\n"
    "A farmer needs to cross a river with a wolf, a goat, and a cabbage. "
    "The boat can only hold the farmer and one other item. "
    "If left alone, the wolf eats the goat, and the goat eats the cabbage. "
    "How can he get everything across safely?\n"
    "<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<think>\n"
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048, verbose=True)
print(response)

βš™οΈ Quantization Details

| Property | Value |
| --- | --- |
| Method | 4-bit group-wise quantization |
| Precision | Mixed (embeddings/heads kept at higher precision for stability) |
| Tooling | mlx-lm.convert |
| Base model | Qwen 3.5 27B (dense architecture) |
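The fractional 4.501 bits/weight follows from group-wise quantization overhead. Assuming mlx-lm's default group size of 64 with an fp16 scale and bias stored per group (an assumption based on MLX's affine quantization scheme; the remaining 0.001 comes from tensors kept at higher precision):

```python
q_bits = 4
group_size = 64  # mlx-lm default
scale_bits = 16  # fp16 scale per group of weights
bias_bits = 16   # fp16 bias per group of weights

effective_bpw = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bpw)  # 4.5 -- close to the measured 4.501 bits/weight
```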

Reproduce this quantization

python -m mlx_lm.convert \
  --hf-path Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --mlx-path ./Qwen3.5-27B-Claude-4.6-MLX \
  --quantize \
  --q-bits 4

🧩 Stripping the Reasoning Block

The <think> block is invaluable for verifying logic, but you may want to strip it for a cleaner UI:

import re

def strip_thinking(text: str) -> str:
    """Remove the internal <think> process, returning only the final answer."""
    return re.sub(r'<think>.*?</think>\s*', '', text, flags=re.DOTALL).strip()
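A variant (a hypothetical helper, not part of mlx-lm) that keeps the reasoning for logging or debugging instead of discarding it:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    m = re.search(r"<think>(.*?)</think>\s*", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

sample = "<think>\nTake the goat first; it is the only safe opener.\n</think>\nTake the goat across first."
reasoning, answer = split_thinking(sample)
print(answer)  # -> Take the goat across first.
```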

πŸ† Model Comparison

| Model | Size | Reasoning style | Hardware target |
| --- | --- | --- | --- |
| This model (27B) | 14 GB | Claude 4.6 distilled | 24 GB+ Macs |
| Qwen3.5-9B | ~5 GB | Fast / intuitive | Base 8-16 GB Macs |
| Qwen3.5-72B | ~42 GB | Deep / exhaustive | 64 GB+ Ultra/Max Macs |

πŸ™ Acknowledgements

  • Core weights: Alibaba Qwen Team (Qwen 3.5 27B)
  • Reasoning SFT: Jackrong for the Claude 4.6 Opus distillation work
  • Inference engine: Apple MLX Team for making high-speed local inference possible on Apple Silicon