---
license: mit
tags:
- glm
- moe
- reap
- nvfp4
- sglang
- blackwell
library_name: sglang
base_model:
- zai-org/GLM-5.1
pipeline_tag: text-generation
---

# GLM-5.1-478B-A42B-REAP-NVFP4
NVFP4 quantization of zai-org/GLM-5.1, further REAP-pruned from 256 → 160 routed experts per MoE layer. Tuned to run at 200,000-token context on a 4× 96 GB Blackwell workstation.
| Spec | Value |
|---|---|
| Total params | 478.4B |
| Activated / token | 42.7B |
| Routed experts / MoE layer | 160 (was 256 in base) |
| Active experts / token | 8 routed + 1 shared |
| Layers | 78 (3 dense + 75 MoE) + 1 MTP / NEXTN |
| Hidden size | 6144 |
| Attention | MLA-DSA, 64 heads |
| Max position | 202,752 |
| Quantization | NVFP4, group_size=16 (modelopt_fp4) |
| On-disk size | 285 GB (85 shards) |
| License | MIT (inherited from GLM-5.1) |
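The on-disk size is roughly what the NVFP4 layout predicts. A back-of-the-envelope sketch, assuming the standard NVFP4 format (E2M1 4-bit weights plus one fp8 E4M3 scale per 16-element block); the gap to 285 GB is the tensors kept in higher precision (embeddings, norms, routers) plus global scales:

```python
# NVFP4: 4-bit weights + 8-bit scale per 16-element block
# -> effective 4 + 8/16 = 4.5 bits per quantized parameter.
params = 478.4e9
bits_per_param = 4 + 8 / 16

quantized_gb = params * bits_per_param / 8 / 1e9
print(f"{quantized_gb:.0f} GB")  # 269 GB

# The ~285 GB on-disk figure adds higher-precision tensors
# (embeddings, norms, router weights) and global scales on top.
```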
## Measured performance
Single-user, batch size 1, decode tok/s at various prompt lengths on our reference rig:
| Context | tok/s |
|---|---|
| 256 | 46.5 |
| 4 k | 41.8 |
| 16 k | 38.6 |
| 150 k | 22.4 |
Under live mixed traffic (1,495 decode samples):
| Context range | p50 tok/s |
|---|---|
| < 1 k | 42.7 |
| 1 – 8 k | 44.3 |
| 8 – 32 k | 36.3 |
| 32 – 100 k | 27.7 |
Per-rank VRAM at 202,752 ctx: weights 77.2 GB, KV pool 11.3 GB (270 k tokens), CUDA graphs 0.3 GB, ~5 GB free.
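These per-rank numbers sum up close to the 96 GB card capacity. A quick arithmetic check (the ~5 GB figure is the reported headroom, not a measured allocation; the small remainder covers runtime allocations):

```python
# Per-rank VRAM budget at 202,752-token context, from the figures above.
weights_gb = 77.2
kv_pool_gb = 11.3       # holds ~270k tokens at fp8_e4m3
cuda_graphs_gb = 0.3
free_gb = 5.0           # approximate reported headroom

used_gb = weights_gb + kv_pool_gb + cuda_graphs_gb
print(f"{used_gb:.1f} GB used, {used_gb + free_gb:.1f} GB accounted for per 96 GB rank")
```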
## Quick start (4× 96 GB Blackwell)

```bash
# 1. Download the weights
hf download 0xSero/GLM-5.1-478B-A42B-REAP-NVFP4 --local-dir ./GLM-5.1-478B-A42B-REAP-NVFP4

# 2. Install the pinned inference stack (see "Exact versions" below)
python3.12 -m venv venv && source venv/bin/activate
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3

# 3. Apply the required NSA-disable patch (see "Required sglang patch" below)

# 4. Launch
./launch.sh   # see full script below
```
## Reference rig
- 4× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96 GB, compute capability 12.0 (sm_120)
- NVIDIA driver 580.126.18, CUDA 12.9 userspace
- Ubuntu / Pop!_OS 22.04, Python 3.12
This is the configuration the tuning targets. The same recipe also works on 4× B200 (sm_100), on 8× Hopper (sm_90) with more aggressive quantization, and on other Blackwell configurations — see the hardware compatibility matrix at the bottom of this page.
## Exact versions (pinned from the running venv)

Everything below is reproducible from:

```bash
pip install "sglang[all]==0.5.10.post1" flashinfer-python==0.6.7.post3 flashinfer-cubin==0.6.7.post3
```

The resolver pulls in the whole stack at these versions:
```text
sglang                   0.5.10.post1
torch                    2.9.1+cu129
triton                   3.5.1
transformers             5.3.0
tokenizers               0.22.2
safetensors              0.8.0rc0
numpy                    2.4.4
flashinfer-python        0.6.7.post3
flashinfer-cubin         0.6.7.post3
nvidia-cutlass-dsl       4.5.0.dev0
nvidia-cublas-cu12       12.9.1.4
nvidia-cudnn-cu12        9.10.2.21
nvidia-nccl-cu12         2.27.5
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
nvidia-nvjitlink-cu12    12.9.86
nvidia-nvshmem-cu12      3.3.20
```
Verify:

```bash
python -c "import sglang, torch, flashinfer; print(sglang.__version__, torch.__version__, flashinfer.__version__)"
# 0.5.10.post1 2.9.1+cu129 0.6.7.post3
```
## Required sglang patch (SM120 only)

GLM-5.1's config advertises `GlmMoeDsaForCausalLM`, which sglang routes through DeepSeek Sparse Attention (NSA) by default. Every NSA backend in sglang 0.5.10.post1 is built for sm_90a / sm_100f only and fails at launch on sm_120. Route GLM-5.1 through the stable dense-MLA path by excluding it from the NSA architecture list.

Edit `<venv>/lib/python3.12/site-packages/sglang/srt/configs/model_config.py`, function `is_deepseek_nsa()`:
```python
def is_deepseek_nsa(config) -> bool:
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    # Keep GLM-5 on dense MLA until sm_120 NSA kernels ship.
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )
```
(Only the `architectures` list changes — `GlmMoeDsaForCausalLM` is removed.)

After the patch, sglang auto-picks triton for attention on sm_120. Confirm in the startup log: `attention_backend='triton'`.
On sm_90 (Hopper) and sm_100 (B200) this patch is not needed — the native NSA kernels work. Skip to the launch section.
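The patched predicate can be sanity-checked in isolation before launching. A minimal sketch with stub config dicts (the `index_topk` value is illustrative):

```python
def is_deepseek_nsa(config) -> bool:
    # Patched version: GlmMoeDsaForCausalLM removed from the list.
    architectures = (
        config.get("architectures") if isinstance(config, dict)
        else getattr(config, "architectures", None)
    )
    index_topk = (
        config.get("index_topk") if isinstance(config, dict)
        else getattr(config, "index_topk", None)
    )
    return (
        architectures is not None
        and architectures[0] in [
            "DeepseekV3ForCausalLM",
            "DeepseekV32ForCausalLM",
            "DeepseekV3ForCausalLMNextN",
            "MistralLarge3ForCausalLM",
            "PixtralForConditionalGeneration",
        ]
        and index_topk is not None
    )

# GLM-5.1 now routes to dense MLA even though its config ships index_topk:
print(is_deepseek_nsa({"architectures": ["GlmMoeDsaForCausalLM"], "index_topk": 2048}))  # False
# DeepSeek-V3.2 still takes the NSA path:
print(is_deepseek_nsa({"architectures": ["DeepseekV32ForCausalLM"], "index_topk": 2048}))  # True
```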
## Launch

```bash
#!/usr/bin/env bash
set -euo pipefail

MODEL=/path/to/GLM-5.1-478B-A42B-REAP-NVFP4
VENV=/path/to/sglang-venv

# Route NCCL over PCIe (no NVLink on workstation Blackwell)
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1,2,3   # four Blackwell GPUs

# DeepGEMM has no sm_120 kernels; keep it off.
export SGLANG_ENABLE_JIT_DEEPGEMM=0
export SGLANG_ENABLE_DEEP_GEMM=0
export SGLANG_DISABLE_DEEP_GEMM=1

export SGLANG_SET_CPU_AFFINITY=1
export SGLANG_ENABLE_SPEC_V2=True
export FLASHINFER_DISABLE_VERSION_CHECK=1

# NCCL tuning for PCIe-only (no IB, no NVLink)
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=0
export NCCL_P2P_LEVEL=PIX
export NCCL_SHM_DISABLE=0
export NCCL_BUFFSIZE=4194304
export NCCL_MIN_NCHANNELS=8
export NCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
export NCCL_CUMEM_HOST_ENABLE=0
export TORCH_NCCL_BLOCKING_WAIT=1
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

export OMP_NUM_THREADS=8
export SAFETENSORS_FAST_GPU=1
export NVIDIA_TF32_OVERRIDE=1

exec "$VENV/bin/python" -m sglang.launch_server \
  --model-path "$MODEL" \
  --served-model-name GLM-5.1-478B-A42B-REAP-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --context-length 202752 \
  --max-running-requests 1 \
  --mem-fraction-static 0.94 \
  --chunked-prefill-size 4096 \
  --page-size 128 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --triton-attention-num-kv-splits 64 \
  --moe-runner-backend cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 4 \
  --pre-warm-nccl \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --chat-template "$MODEL/chat_template.jinja" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 16}' \
  --watchdog-timeout 1800
```
On sm_90 / sm_100 you will want `--attention-backend flashinfer` and `--fp4-gemm-backend b12x` instead — see the 555B sibling card for that recipe.
## Key flag decisions (why these specific values)

Each of these values was chosen by measurement on the reference rig, not inherited from defaults.
- `--triton-attention-num-kv-splits 64` — the biggest single win. The default is 8. At bs=1 decode on sm_120, raising the kv-split count gave:
| Context | splits=8 | splits=64 |
|---|---|---|
| 4 k | 39.7 | 41.8 |
| 16 k | 26.4 | 38.6 |
| 150 k | 5.2 | 22.4 |
Coherence verified across arithmetic, factual recall, needle-in-haystack @ 32 k and @ 100 k, and 11-turn chat.
- `--mem-fraction-static 0.94` — decode is kernel-bound at bs=1, not memory-bound. 0.94 vs 0.97 gives identical tok/s and leaves ~5 GB/rank of headroom for graph recapture and prefill scratch.
- `--kv-cache-dtype fp8_e4m3` — halves KV memory vs bf16. Required to fit 202k context in the budget.
- `--attention-backend` is intentionally omitted — sglang auto-selects triton on sm_120 for this architecture after the NSA patch. Flashinfer attention is skipped because it requires PCIe P2P atomics that are not available on the workstation board.
- `--page-size 128` — the non-MTP default. Drop to 64 only when enabling speculative decode.
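The fp8 requirement can be sanity-checked from the measured pool figures above. A back-of-the-envelope sketch — per-token bytes are derived from the reported 11.3 GB pool holding ~270k tokens per rank, not from the model config:

```python
# Per-rank KV pool measured at launch with fp8_e4m3.
pool_bytes = 11.3e9
fp8_capacity_tokens = 270_000

bytes_per_token_fp8 = pool_bytes / fp8_capacity_tokens
print(f"{bytes_per_token_fp8 / 1024:.1f} KiB/token at fp8")

# bf16 KV doubles per-token bytes, halving capacity in the same pool:
bf16_capacity_tokens = fp8_capacity_tokens // 2

# fp8 clears the 202,752-token context target; bf16 does not.
assert fp8_capacity_tokens > 202_752 > bf16_capacity_tokens
```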
## MTP / NEXTN speculative decode (optional)
The checkpoint includes an MTP head for layer 78, stitched from the original 256-expert source using the layer-77 REAP keep-map as a proxy.
| | Without MTP (this page's default) | With MTP |
|---|---|---|
| Decode tok/s (short) | ~46 | ~90 (1.93×) |
| Max context | 202,752 | ~65,536 |
| KV dtype | fp8_e4m3 | bf16 (required by NEXTN) |
| Page size | 128 | 64 (required by NEXTN) |
MTP is opt-in because the workstation target is long context, not peak short-prompt throughput. Enable it with:

```bash
# Replace three lines in the launch script:
  --context-length 65536 \
  --page-size 64 \
  --kv-cache-dtype auto \
# and add:
  --speculative-algorithm NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-attention-mode decode \
  --speculative-moe-runner-backend cutlass
```
Also drop `--mem-fraction-static` to 0.88 — the draft worker adds ~5 GB/rank.
## Sampling recommendations

General chat / reasoning:

```
temperature=0.5 top_p=0.95 frequency_penalty=0.3 repetition_penalty=1.05
```

Strict-answer (MCQ, tool-use benchmarks):

```
temperature=0.0 repetition_penalty=1.05
```

Keep `repetition_penalty=1.05` everywhere. Pure greedy decoding with no penalty can loop on pathological low-entropy prompts (e.g., repeated filler tokens).
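Against the OpenAI-compatible endpoint the launch script exposes on port 8000, the general-chat settings map onto a request like the following. This is a sketch: the prompt is illustrative, the URL assumes the launch script above, and `repetition_penalty` is passed as an sglang extension to the OpenAI schema (verify against your sglang version):

```python
import json

# Recommended general-chat sampling settings as a
# /v1/chat/completions payload for the server on localhost:8000.
payload = {
    "model": "GLM-5.1-478B-A42B-REAP-NVFP4",
    "messages": [{"role": "user", "content": "Summarize the REAP pruning idea."}],
    "temperature": 0.5,
    "top_p": 0.95,
    "frequency_penalty": 0.3,
    "repetition_penalty": 1.05,  # sglang-specific sampling extension
    "max_tokens": 512,
}

# Send with e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```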
## Lineage & license

```text
zai-org/GLM-5.1 (official, 744B bf16, 256 experts, MIT)
│
├── community NVFP4 quantization via NVIDIA Model Optimizer
│     (e.g. lukealonso/GLM-5.1-NVFP4, ~434 GB, 256 experts)
│
├── Local REAP pass 1: 256 → 192 experts
│     0xSero/GLM-5.1-555B-A14B-REAP-NVFP4
│
└── Local REAP pass 2: 192 → 160 experts
      0xSero/GLM-5.1-478B-A42B-REAP-NVFP4   ← this model
```
Both REAP passes were done locally using pooled token-weighted observations from:
Prune scripts and MTP-stitch script are in the repo tree.
License: MIT, inherited from zai-org/GLM-5.1.
Citation (REAP method):

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv},
}
```
<!-- GLM51_FAMILY_COMPAT_START -->
## GLM-5.1 REAP Family — Hardware Compatibility

All variants in this family are REAP-pruned (arXiv:2510.13999) descendants of zai-org/GLM-5.1 (original: 744B params, 256 experts/MoE layer, 40B activated/token). Pick a variant based on your GPU architecture and available VRAM.
### Quick picker
| You have | Use |
|---|---|
| 8× H100/H200 80GB (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 or GLM-5.1-555B-A14B-REAP-NVFP4 (NVFP4 on Hopper via modelopt_fp4 + triton path) |
| 4× RTX PRO 6000 Blackwell Workstation 96GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 (further-pruned 160-expert, 200k ctx) — this is the Blackwell Workstation reference config |
| 4× B200 180GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or GLM-5.1-555B-A14B-REAP-NVFP4 |
| 8× B200 / Blackwell datacenter | GLM-5.1-555B-A14B-REAP-NVFP4 (192-expert, upstream's reference config with flashinfer + b12x backends) |
| 8× A100 80GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP (BF16) or -GPTQ-W4A16 |
| CPU / Apple Silicon / consumer GPU with llama.cpp | GLM-5.1-555B-A14B-REAP-GGUF or GLM-5.1-444B-A14B-REAP-GGUF |
### Full family
| Variant | Format | Size | Experts/layer | Activated/token | Min VRAM (TP) | Inference engine | Best on |
|---|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14B | 8× 141 GB (H200) | sglang / vllm | Hopper |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14B | 8× 114 GB | sglang / vllm | Ampere / Hopper |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 (4-bit) | ~320 GB | 192 | ~14B | 4× 80 GB (B200), 8× 48 GB | sglang --quantization modelopt_fp4 | Blackwell (native); Hopper (triton path) |
| GLM-5.1-478B-A42B-REAP-NVFP4 | NVFP4 (4-bit) | ~285 GB | 160 | ~42B | 4× 80 GB Blackwell | sglang --quantization modelopt_fp4 | 4× RTX PRO 6000 Blackwell @ 200k ctx |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14B | 4× 80 GB | vllm / sglang --quantization gptq_marlin | Hopper (best), works on Ampere |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14B | Varies by quant | llama.cpp | CPU / Apple / consumer CUDA |
### Notes

- **NVFP4 on Hopper (H100/H200):** supported from sglang 25.10 / 0.5.10+ (NVIDIA SGLang release notes); native Blackwell tensor-core FP4 still gives better throughput.
- **NVFP4 on B200 / Blackwell datacenter (sm_100):** use flashinfer attention + `b12x` or flashinfer MoE backends — this is the recipe in the original 555B-A14B-REAP-NVFP4 card.
- **NVFP4 on Blackwell Workstation (sm_120):** use `--attention-backend triton` (not flashinfer — PCIe P2P atomics are unavailable on the workstation board), `--moe-runner-backend cutlass`, and `--fp4-gemm-backend flashinfer_cudnn`. See the GLM-5.1-478B-A42B-REAP-NVFP4 card for the full 200k-ctx replication guide.
- **GPTQ-W4A16 vs NVFP4:** same bit depth, different hardware path. NVFP4 has native Blackwell support and per-16 fp8 scales; GPTQ is group-quantized int4 with broader engine support.
- **REAP expert-count variants (555B/444B):** different expert-retention ratios from the same base; 555B keeps more experts (higher quality ceiling), 444B trades quality for ~20% less VRAM.
- **Why 478B-A42B-REAP-NVFP4 is different:** it is double-pruned (256 → 192 → 160 experts) and optimized for a specific Blackwell Workstation 4× 96 GB target at 200k context. The A42B suffix reflects measured activated params/token on the 160-expert MoE, not the REAP branding convention of the sibling variants.
### Pointer to active inference recipe
See GLM-5.1-478B-A42B-REAP-NVFP4 README for the full Blackwell Workstation replication guide (exact software pins, NSA patch, launch flags, measured 200k-ctx perf, sampling recommendations). Most of the sglang flags carry over to other NVFP4 variants on other hardware.
### Citation

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv},
}
```
<!-- GLM51_FAMILY_COMPAT_END -->