---
license: mit
pipeline_tag: text-generation
library_name: mlx
base_model: moonshotai/Kimi-Linear-48B-A3B-Instruct
tags:
  - mlx
  - quantization
  - mxfp4
  - mixture-of-experts
---

Kimi-Linear-48B-A3B-Instruct · MXFP4 · 32-Group (MLX)

The moonshotai/Kimi-Linear-48B-A3B-Instruct base model, quantized with mlx-lm 0.28.4 to MXFP4 at group size 32 (with selective 8-bit gates for MoE routing stability).
The weights and auxiliary files in this folder run on the MLX runtime on Apple Silicon.

Korean | English

Model Summary

  • Architecture: KimiLinear MoE decoder-only transformer (27 layers, hidden size 2304, 32 attention heads, 256 experts, 8 experts/token) as defined in config.json (see the inspection snippet after this list).
  • Context length: Configured for ~1M-token windows via linear attention blocks (effective window depends on runtime memory and the KV budget you set with --max-kv-size).
  • Tokenizer: tiktoken-based BPE (tokenizer_config.json, tiktoken.model) with special tokens defined inside the tokenizer files instead of hard-coded IDs.
  • Chat template: See chat_template.jinja for the multi-turn schema that mirrors the official Kimi tool-call format.
  • License: MIT, matching the upstream moonshotai/Kimi-Linear-48B-A3B-Instruct release.
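
To confirm these figures against the shipped files, the sketch below reads config.json. The key names are assumptions based on common Hugging Face config conventions, so treat a missing key as a prompt to inspect the file directly.

import json

# Run from inside the model folder.
with open("config.json") as f:
    cfg = json.load(f)

# Key names are assumed from common HF conventions; check config.json for
# the authoritative field names used by the Kimi Linear config class.
for key in ("num_hidden_layers", "hidden_size", "num_attention_heads"):
    print(key, "=", cfg.get(key, "<not stored under this name>"))
print("quantization mode =", cfg.get("quantization_config", {}).get("mode"))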

Quantization Details

  • Tooling: mlx-lm (>=0.28.4) via python3 -m mlx_lm.convert -q targeting MXFP4 weights.
  • Format: MXFP4 4-bit packing with group size 32 across the major linear layers (the MXFP4 mode requires group size 32).
  • Exceptions: Mixture-of-experts gate projections remain at 8-bit / group size 64 for routing stability, as recorded in quantization_config.
  • Shard layout: 5× model-0000n-of-00005.safetensors plus model.safetensors.index.json for streaming loads.
  • Memory: Expect roughly 26–29 GB of Apple Silicon unified memory for the weights; KV cache usage scales with context length.

Quantization snippet (abbreviated from config.json):

"quantization_config": {
  "group_size": 32,
  "bits": 4,
  "mode": "mxfp4",
  "model.layers.1.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.2.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.3.mlp.gate": {"group_size": 64, "bits": 8},
  "...": "additional gate entries continue through layer 26"
}
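
The per-module overrides can be enumerated directly from that structure. A minimal sketch, relying only on the quantization_config layout shown above:

import json

with open("config.json") as f:
    qcfg = json.load(f)["quantization_config"]

print("default:", {k: qcfg.get(k) for k in ("mode", "bits", "group_size")})

# Per-module overrides appear as extra keys mapping module paths to dicts,
# e.g. the MoE gate projections kept at 8-bit / group size 64.
for name, spec in sorted(qcfg.items()):
    if isinstance(spec, dict):
        print(f"{name}: {spec['bits']}-bit, group size {spec['group_size']}")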

Files Included

  • config.json, generation_config.json, configuration_kimi.py: Hugging Face config files plus the custom configuration class.
  • model-0000*-of-00005.safetensors, model.safetensors.index.json: Quantized MXFP4 shards and the weight index.
  • modeling_kimi.py: Custom model definition (KimiLinearForCausalLM).
  • tokenizer_config.json, special_tokens_map.json, tiktoken.model, tokenization_kimi.py: Tokenizer assets.
  • chat_template.jinja: HF chat template for AutoTokenizer.apply_chat_template.
  • README.md, README-kr.md: English/Korean model cards.

Intended Use & Limitations

  • Use cases: Multilingual assistant/chat, tool invocation, long-context retrieval-augmented generation on Apple Silicon hardware.
  • Not for: Decisions requiring guarantees (medical, legal, financial) without human review; handling unfiltered harmful instructions.
  • Safety: Mirrors the base model’s safety profile—apply additional filtering and RLHF layers if deploying to end users.
  • Security: modeling_kimi.py defines custom modules, so you must use --trust-remote-code with the CLI and trust_remote_code=True with the Python API. Handle sensitive data in an offline, isolated environment.

How to Use (MLX)

  1. Install MLX tooling (macOS 13.6+ on Apple Silicon):

    pip install -U mlx-lm  # or: pip install -U "git+https://github.com/ml-explore/mlx-lm.git@main"
    # Offline cache only:
    # HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...
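
     A quick sanity check before loading a ~27 GB model: confirm the Metal device and the installed versions are visible (assumes mx.default_device() and mlx_lm.__version__ exist, as they do in recent releases).

    python3 -c "import mlx.core as mx; print(mx.default_device())"
    python3 -c "import mlx_lm; print(mlx_lm.__version__)"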
    
  2. Chat CLI (template + stop rules auto-applied):

    mlx_lm.chat \
      --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
      --trust-remote-code \
      --max-tokens 512 --temperature 0.7 --top-p 0.9
    

    For ≥256K contexts, consider --max-kv-size 262144 (and scale further as needed).
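
     For one-shot, non-interactive runs, mlx_lm.generate accepts the same model path. The flags below mirror recent mlx-lm releases; confirm them against mlx_lm.generate --help for your installed version.

    mlx_lm.generate \
      --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
      --trust-remote-code \
      --prompt "Summarize the Kimi Linear attention design in three sentences." \
      --max-tokens 256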

  3. Programmatic usage:

    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler, make_logits_processors
    
    model, tok = load(
        "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
        trust_remote_code=True,
    )
    
    messages = [{"role": "user", "content": "Briefly summarize the Kimi Linear architecture."}]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True)
    
    sampler = make_sampler(temperature=0.7, top_p=0.9)
    procs = make_logits_processors(repetition_penalty=1.1, repetition_context_size=64)
    
    print(
        generate(
            model,
            tok,
            prompt,
            max_tokens=512,
            sampler=sampler,
            logits_processors=procs,
        )
    )
    

    (Set HF_HUB_OFFLINE=1 and/or TRANSFORMERS_OFFLINE=1 before running if you need hub-less operation.)
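
     For incremental output, stream_generate can replace generate. This is a minimal sketch assuming the streaming API in recent mlx-lm releases, where each yielded response exposes a .text field:

    from mlx_lm import stream_generate
    
    # Reuses model, tok, prompt, and sampler from the example above.
    for response in stream_generate(model, tok, prompt, max_tokens=512, sampler=sampler):
        print(response.text, end="", flush=True)
    print()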

Conversion Notes

  • Source checkpoint: moonshotai/Kimi-Linear-48B-A3B-Instruct (synced 2025-11-07 UTC).
  • Conversion command actually executed:
    python3 -m mlx_lm.convert \
      --hf-path moonshotai/Kimi-Linear-48B-A3B-Instruct \
      --q-bits 4 -q \
      --q-group-size 32 \
      --mlx-path Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
    
  • Post-quantization integrity verified via mlx_lm.chat sanity prompts and safetensors checksum inspection.

Integrity & Verification

To verify shard integrity locally, record the checksums once and re-check them after any download or transfer:

cd /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
shasum -a 256 model-*.safetensors > SHA256SUMS   # record checksums once
shasum -a 256 -c SHA256SUMS                      # re-check after any transfer
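
You can also cross-check that every tensor in the index resolves to a shard that is actually present. A small sketch, assuming only the standard safetensors index layout:

import json, os

with open("model.safetensors.index.json") as f:
    index = json.load(f)

# weight_map maps each tensor name to the shard file that stores it.
shards = sorted(set(index["weight_map"].values()))
missing = [s for s in shards if not os.path.exists(s)]
print(f"{len(index['weight_map'])} tensors across {len(shards)} shards")
print("missing shards:", missing or "none")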

Additional Tips

  • Set --trust-remote-code whenever loading this repository through MLX CLIs to ensure the custom layers in modeling_kimi.py are registered.
  • Leverage prompt caching (mlx_lm.cache_prompt) plus --max-kv-size for reliable ≥1M-token experiments without exhausting unified memory.
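
For example, a cached-prompt flow might look like the sketch below. The flag names (notably --prompt-cache-file) follow recent mlx-lm releases and should be confirmed with mlx_lm.cache_prompt --help on your version.

# 1) Pre-fill and save the KV cache for a long context once.
mlx_lm.cache_prompt \
  --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
  --trust-remote-code \
  --max-kv-size 262144 \
  --prompt "$(cat long_context.txt)" \
  --prompt-cache-file kimi_context.safetensors

# 2) Answer follow-up questions against the cached context without re-processing it.
mlx_lm.generate \
  --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
  --trust-remote-code \
  --prompt-cache-file kimi_context.safetensors \
  --prompt "Using the cached document, list its three main findings." \
  --max-tokens 256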

Acknowledgments

  • Moonshot AI — Thank you for releasing the Kimi family and openly documenting the Kimi Linear architecture. Reference: Moonshot AI GitHub, Kimi Linear repo, and the official technical report.
  • Apple Machine Learning Research — Deep gratitude for continuously evolving MLX / MLX-LM so Apple Silicon users can keep learning and iterating. See MLX and MLX-LM.
  • MLX Community on Hugging Face — Thanks for sharing MLX-ready weights and examples at lightning speed; they directly inspired this conversion flow. See mlx-community.

Citation

If you use this model, please cite both Moonshot AI and this quantized release in your research or product documentation.