---
license: mit
pipeline_tag: text-generation
library_name: mlx
base_model: moonshotai/Kimi-Linear-48B-A3B-Instruct
tags:
  - mlx
  - quantization
  - mxfp4
  - mixture-of-experts
---

Kimi-Linear-48B-A3B-Instruct · MXFP4 · 32-Group (MLX)

The moonshotai/Kimi-Linear-48B-A3B-Instruct base model, quantized with mlx-lm 0.28.4 to MXFP4 at group size 32 (with selective 8-bit gates for MoE routing stability).
The weights and auxiliary files in this folder run on the MLX runtime on Apple Silicon.

Korean | English

Model Summary

  • Architecture: KimiLinear MoE decoder-only transformer (27 layers, hidden size 2304, 32 attention heads, 256 experts, 8 experts/token) as defined in config.json (see the inspection snippet after this list).
  • Context length: Configured for ~1M-token windows via linear attention blocks (effective window depends on runtime memory and the KV budget you set with --max-kv-size).
  • Tokenizer: tiktoken-based BPE (tokenizer_config.json, tiktoken.model) with special tokens defined inside the tokenizer files instead of hard-coded IDs.
  • Chat template: See chat_template.jinja for the multi-turn schema that mirrors the official Kimi tool-call format.
  • License: MIT, matching the upstream moonshotai/Kimi-Linear-48B-A3B-Instruct release.
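
To confirm these figures against the shipped files, the sketch below reads config.json. The key names are assumptions based on common Hugging Face config conventions, so treat a missing key as a prompt to inspect the file directly.

import json

# Run from inside the model folder.
with open("config.json") as f:
    cfg = json.load(f)

# Key names are assumed from common HF conventions; check config.json for
# the authoritative field names used by the Kimi Linear config class.
for key in ("num_hidden_layers", "hidden_size", "num_attention_heads"):
    print(key, "=", cfg.get(key, "<not stored under this name>"))
print("quantization mode =", cfg.get("quantization_config", {}).get("mode"))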

Quantization Details

  • Tooling: mlx-lm (>=0.28.4) via python3 -m mlx_lm.convert -q targeting MXFP4 weights.
  • Format: MXFP4 4-bit packing with group size 32 across the major linear layers (the MXFP4 mode requires group size 32).
  • Exceptions: Mixture-of-experts gate projections remain at 8-bit / group size 64 for routing stability, as recorded in quantization_config.
  • Shard layout: 5× model-0000n-of-00005.safetensors plus model.safetensors.index.json for streaming loads.
  • Memory: Expect roughly 26–29 GB of Apple Silicon unified memory for the weights; KV cache usage scales with context length.

Quantization snippet (abbreviated from config.json):

"quantization_config": {
  "group_size": 32,
  "bits": 4,
  "mode": "mxfp4",
  "model.layers.1.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.2.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.3.mlp.gate": {"group_size": 64, "bits": 8},
  "...": "additional gate entries continue through layer 26"
}
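
The per-module overrides can be enumerated directly from that structure. A minimal sketch, relying only on the quantization_config layout shown above:

import json

with open("config.json") as f:
    qcfg = json.load(f)["quantization_config"]

print("default:", {k: qcfg.get(k) for k in ("mode", "bits", "group_size")})

# Per-module overrides appear as extra keys mapping module paths to dicts,
# e.g. the MoE gate projections kept at 8-bit / group size 64.
for name, spec in sorted(qcfg.items()):
    if isinstance(spec, dict):
        print(f"{name}: {spec['bits']}-bit, group size {spec['group_size']}")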

Files Included

  • config.json, generation_config.json, configuration_kimi.py: Hugging Face config files plus the custom configuration class.
  • model-0000*-of-00005.safetensors, model.safetensors.index.json: Quantized MXFP4 shards and the weight index.
  • modeling_kimi.py: Custom model definition (KimiLinearForCausalLM).
  • tokenizer_config.json, special_tokens_map.json, tiktoken.model, tokenization_kimi.py: Tokenizer assets.
  • chat_template.jinja: HF chat template for AutoTokenizer.apply_chat_template.
  • README.md, README-kr.md: English/Korean model cards.

Intended Use & Limitations

  • Use cases: Multilingual assistant/chat, tool invocation, long-context retrieval-augmented generation on Apple Silicon hardware.
  • Not for: Decisions requiring guarantees (medical, legal, financial) without human review; handling unfiltered harmful instructions.
  • Safety: Mirrors the base model’s safety profile—apply additional filtering and RLHF layers if deploying to end users.
  • Security: modeling_kimi.py defines custom modules, so you must use --trust-remote-code with the CLI and trust_remote_code=True with the Python API. Handle sensitive data in an offline, isolated environment.

How to Use (MLX)

  1. Install MLX tooling (macOS 13.6+ on Apple Silicon):

    pip install -U mlx-lm  # or: pip install -U "git+https://github.com/ml-explore/mlx-lm.git@main"
    # Offline cache only:
    # HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...
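
     A quick sanity check before loading a ~27 GB model: confirm the Metal device and the installed versions are visible (assumes mx.default_device() and mlx_lm.__version__ exist, as they do in recent releases).

    python3 -c "import mlx.core as mx; print(mx.default_device())"
    python3 -c "import mlx_lm; print(mlx_lm.__version__)"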
    
  2. Chat CLI (template + stop rules auto-applied):

    mlx_lm.chat \
      --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
      --trust-remote-code \
      --max-tokens 512 --temperature 0.7 --top-p 0.9
    

    For ≥256K contexts, consider --max-kv-size 262144 (and scale further as needed).
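
     For one-shot, non-interactive runs, mlx_lm.generate accepts the same model path. The flags below mirror recent mlx-lm releases; confirm them against mlx_lm.generate --help for your installed version.

    mlx_lm.generate \
      --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
      --trust-remote-code \
      --prompt "Summarize the Kimi Linear attention design in three sentences." \
      --max-tokens 256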

  3. Programmatic usage:

    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler, make_logits_processors
    
    model, tok = load(
        "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
        trust_remote_code=True,
    )
    
    messages = [{"role": "user", "content": "Briefly summarize the Kimi Linear architecture."}]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True)
    
    sampler = make_sampler(temperature=0.7, top_p=0.9)
    procs = make_logits_processors(repetition_penalty=1.1, repetition_context_size=64)
    
    print(
        generate(
            model,
            tok,
            prompt,
            max_tokens=512,
            sampler=sampler,
            logits_processors=procs,
        )
    )
    

    (Set HF_HUB_OFFLINE=1 and/or TRANSFORMERS_OFFLINE=1 before running if you need hub-less operation.)
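
     For incremental output, stream_generate can replace generate. This is a minimal sketch assuming the streaming API in recent mlx-lm releases, where each yielded response exposes a .text field:

    from mlx_lm import stream_generate
    
    # Reuses model, tok, prompt, and sampler from the example above.
    for response in stream_generate(model, tok, prompt, max_tokens=512, sampler=sampler):
        print(response.text, end="", flush=True)
    print()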

Conversion Notes

  • Source checkpoint: moonshotai/Kimi-Linear-48B-A3B-Instruct (synced 2025-11-07 UTC).
  • Conversion command actually executed:
    python3 -m mlx_lm.convert \
      --hf-path moonshotai/Kimi-Linear-48B-A3B-Instruct \
      --q-bits 4 -q \
      --q-group-size 32 \
      --mlx-path Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
    
  • Post-quantization integrity verified via mlx_lm.chat sanity prompts and safetensors checksum inspection.

Integrity & Verification

To verify shard integrity locally, record the checksums once and re-check them after any download or transfer:

cd /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
shasum -a 256 model-*.safetensors > SHA256SUMS   # record checksums once
shasum -a 256 -c SHA256SUMS                      # re-check after any transfer
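
You can also cross-check that every tensor in the index resolves to a shard that is actually present. A small sketch, assuming only the standard safetensors index layout:

import json, os

with open("model.safetensors.index.json") as f:
    index = json.load(f)

# weight_map maps each tensor name to the shard file that stores it.
shards = sorted(set(index["weight_map"].values()))
missing = [s for s in shards if not os.path.exists(s)]
print(f"{len(index['weight_map'])} tensors across {len(shards)} shards")
print("missing shards:", missing or "none")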

Additional Tips

  • Set --trust-remote-code whenever loading this repository through MLX CLIs to ensure the custom layers in modeling_kimi.py are registered.
  • Leverage prompt caching (mlx_lm.cache_prompt) plus --max-kv-size for reliable ≥1M-token experiments without exhausting unified memory.
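
For example, a cached-prompt flow might look like the sketch below. The flag names (notably --prompt-cache-file) follow recent mlx-lm releases and should be confirmed with mlx_lm.cache_prompt --help on your version.

# 1) Pre-fill and save the KV cache for a long context once.
mlx_lm.cache_prompt \
  --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
  --trust-remote-code \
  --max-kv-size 262144 \
  --prompt "$(cat long_context.txt)" \
  --prompt-cache-file kimi_context.safetensors

# 2) Answer follow-up questions against the cached context without re-processing it.
mlx_lm.generate \
  --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
  --trust-remote-code \
  --prompt-cache-file kimi_context.safetensors \
  --prompt "Using the cached document, list its three main findings." \
  --max-tokens 256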

Acknowledgments

  • Moonshot AI — Thank you for releasing the Kimi family and openly documenting the Kimi Linear architecture. Reference: Moonshot AI GitHub, Kimi Linear repo, and the official technical report.
  • Apple Machine Learning Research — Deep gratitude for continuously evolving MLX / MLX-LM so Apple Silicon users can keep learning and iterating. See MLX and MLX-LM.
  • MLX Community on Hugging Face — Thanks for sharing MLX-ready weights and examples at lightning speed; they directly inspired this conversion flow. See mlx-community.

Citation

If you use this model, please cite both Moonshot AI and this quantized release in your research or product documentation.