---
license: other
license_name: kimi-k2.5
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model:
- moonshotai/Kimi-K2.5
- Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B
tags:
- moe
- expert-pruning
- reap
- deepseek
- kimi
- prism
- int4
- compressed-tensors
- consumer-gpu
- rtx-3090
library_name: transformers
pipeline_tag: text-generation
---
# Kimi-K2.5-PRISM-REAP-72

> **Keep temperature at 0.**

An 81% REAP expert-pruned version of moonshotai/Kimi-K2.5, further pruned from the PRISM-REAP 192-expert variant. Designed to fit on 8x RTX 3090 (24 GB) consumer GPUs.
| Property | Value |
|----------|-------|
| Architecture | KimiK25 (DeepSeekV3 backbone, MLA attention) |
| Total Parameters | ~200B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 72 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60, layer 0 is dense) |
| Hidden Size | 7168 |
| Attention | MLA (kv_lora_rank=512, q_lora_rank=1536) |
| Quantization | W4A16 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 122 GB (down from 289 GB / 555 GB original) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Vision | Supported (inherited from Kimi-K2.5) |
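As a back-of-envelope check on the ~200B total, the routed experts dominate the parameter count. A sketch assuming a DeepSeek-V3-style expert FFN with `moe_intermediate_size = 2048` (an assumption; this card does not state the expert FFN width):

```python
hidden = 7168
moe_layers = 60            # layers 1-60; layer 0 is dense
experts_per_layer = 72
moe_intermediate = 2048    # ASSUMPTION: DeepSeek-V3-style expert FFN width

# Each expert FFN has gate, up, and down projections.
params_per_expert = 3 * hidden * moe_intermediate
routed_total = experts_per_layer * moe_layers * params_per_expert

print(routed_total / 1e9)  # → 190.25362944
```

Adding attention, embeddings, the shared experts, and the LM head brings this in line with the ~200B figure above.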
## Why 72 Experts?
72 was chosen because:
- Divisible by 8: Clean sharding across 8 GPUs for TP/EP
- ~122 GB total: Fits in 8x 24GB with room for KV cache
- ~15 GB/GPU weight footprint with Expert Parallelism, leaving ~7 GB for KV cache and overhead
- Retains the top 72 most salient experts per layer from the original 384
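The headroom claim can be checked with simple arithmetic (a sketch; the card's ~7 GB figure additionally budgets runtime and activation overhead):

```python
disk_gb = 122        # W4A16 checkpoint size from the table above
n_gpus = 8
vram_per_gpu = 24    # RTX 3090

weights_per_gpu = disk_gb / n_gpus           # even split under expert parallelism
headroom = vram_per_gpu - weights_per_gpu

print(weights_per_gpu, headroom)  # → 15.25 8.75
```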
## Performance (8x RTX 3090, 155 W, vLLM 0.15.1)
| Metric | Value |
|--------|-------|
| Single request | 33.4 tok/s |
| 2 concurrent | 52.5 tok/s |
| 4 concurrent | 86.2 tok/s |
| 8 concurrent | 145.5 tok/s |
| TTFT | 0.08s |
| Max context | 57,344 tokens |
| Vision | Working |
## Recommended vLLM Launch (8x RTX 3090)

```bash
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve 0xsero/Kimi-K2.5-PRISM-REAP-72 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 57344 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching
```
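Once up, the server exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal request sketch, assuming that default port:

```python
import json
import urllib.request

# Hypothetical client request against the server started above,
# on the vLLM default port 8000.
payload = {
    "model": "0xsero/Kimi-K2.5-PRISM-REAP-72",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,   # this card recommends temperature 0
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```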
## Transformers Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xsero/Kimi-K2.5-PRISM-REAP-72",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    thinking=False,
)
inputs = inputs.to(model.device)

# Greedy decoding -- equivalent to temperature 0, as recommended above.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Pruning Details

This model was created in a two-stage process:

- Stage 1 (Ex0bit): REAP pruning of the original 384 experts to 192 experts, using saliency scores from 512 calibration samples on allenai/tulu-3-sft-mixture
- Stage 2 (this model): further pruning from 192 to 72 experts using the same REAP saliency scores, targeting consumer GPU deployment
## Key Technical Details

- Per-layer top-72 selection: the 72 most salient experts are retained independently in each layer
- Gate weight slicing: router gate weights sliced from `[192, 7168]` to `[72, 7168]`, and `e_score_correction_bias` from `[192]` to `[72]`
- Contiguous expert remapping: expert indices remapped to 0-71 in each layer
- All non-expert weights preserved: attention (MLA), shared expert, embeddings, and LM head unchanged
- Saliency ordering verified: in every layer, `min(retained_saliency) > max(pruned_saliency)`, confirming that exactly the top 72 were selected
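The slicing and remapping steps above can be sketched in a few lines (a toy illustration with random stand-in tensors; the real pipeline operates on the checkpoint's per-layer router weights and REAP saliency scores):

```python
import numpy as np

n_old, n_new, hidden = 192, 72, 7168

# Stand-ins: the real run uses router-weighted activation norms from
# calibration data as saliency, and the checkpoint's actual gate weights.
rng = np.random.default_rng(0)
saliency = rng.random(n_old)
gate_w = rng.standard_normal((n_old, hidden)).astype(np.float32)
bias = rng.standard_normal(n_old).astype(np.float32)  # e_score_correction_bias

# Keep the 72 most salient experts; sorting the kept indices preserves their
# original relative order before remapping them contiguously to 0..71.
keep = np.sort(np.argsort(saliency)[-n_new:])
gate_w_pruned = gate_w[keep]   # [72, 7168]
bias_pruned = bias[keep]       # [72]

# The card's verification step: every retained expert outscores every pruned one.
assert saliency[keep].min() > np.delete(saliency, keep).max()
```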
## What is REAP?

REAP (Cerebras Research, 2025) is a one-shot expert pruning method for MoE models. Each expert j is scored by its router-weighted mean output norm:

    S_j = (1 / |X_j|) * SUM_{x in X_j} [ g_j(x) * ||f_j(x)||_2 ]

where X_j is the set of calibration tokens routed to expert j, g_j(x) is the normalized gate weight, and ||f_j(x)||_2 is the L2 norm of expert j's output.
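The score is straightforward to implement; a minimal sketch for a single expert (function name and shapes are illustrative, not from the REAP codebase):

```python
import numpy as np

def reap_saliency(gates, expert_outputs):
    """S_j = (1/|X_j|) * sum over x in X_j of g_j(x) * ||f_j(x)||_2.

    gates:          [T] normalized router weights for expert j on the
                    tokens routed to it (the set X_j)
    expert_outputs: [T, d] expert j's outputs on those same tokens
    """
    norms = np.linalg.norm(expert_outputs, axis=-1)  # ||f_j(x)||_2 per token
    return float(np.mean(gates * norms))

# Unit-norm outputs with gate weight 1.0 give a saliency of exactly 1.0.
print(reap_saliency(np.ones(4), np.eye(4)))  # → 1.0
```

Experts with the lowest scores contribute the least router-weighted signal and are the ones pruned.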
## What is PRISM?
The base model was treated with the PRISM-LITE pipeline, softening over-refusal and bias behaviors while preserving model quality.
## Optimization Notes
- Expert Parallelism (EP) is critical on PCIe GPUs -- reduces per-GPU model memory from ~23 GB to ~17 GB
- TRITON_MLA is the only MLA backend available on Ampere (CC 8.6)
- FP8 KV cache is not supported with MLA on Ampere; MLA's built-in KV compression (kv_lora_rank=512) already provides ~14x efficiency vs standard MHA
- Uniform GPU power limits prevent synchronization stalls in TP/EP configurations
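One way the ~14x figure falls out (an interpretation; the card does not show the derivation): a naive MHA-style cache stores K and V at the full hidden width, while MLA caches a single 512-dim latent from which both are reconstructed:

```python
hidden_size = 7168   # full K/V width a naive MHA-style cache would store
kv_lora_rank = 512   # MLA's compressed KV latent (from the table above)

print(hidden_size / kv_lora_rank)  # → 14.0
```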
Citation
@article{reap2025,
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author={Cerebras Research},
journal={arXiv preprint arXiv:2510.13999},
year={2025}
}
## Acknowledgments

- moonshotai/Kimi-K2.5 -- base model
- Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B -- 192-expert intermediate
- Cerebras REAP -- pruning method
- PRISM -- over-refusal and bias mitigation

Author: 0xSero



