---
base_model:
- MiniMaxAI/MiniMax-M2.7
library_name: transformers
license: other
pipeline_tag: text-generation
language:
- en
- zh
tags:
- minimax
- nvfp4
- 4-bit
- quantized
- compressed-tensors
- vllm
- DGX-Spark
- GB10
- MoE
- agentic
- tool-use
- code
---

# MiniMax-M2.7-NVFP4-GB10-AC
Agentic + Coder recalibration of MiniMax-M2.7 NVFP4-GB10. Same architecture and quantization scheme as saricles/MiniMax-M2.7-NVFP4-GB10, but calibrated on a 7-dataset mix targeted at agentic tool-use and code-generation workloads instead of general chat. The two are parallel variants of the same quant approach — sibling releases, not a version chain.
Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=8) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 141.05 GB on disk across 29 shards.
## Why -AC? Why re-calibrate?
Post-training NVFP4 quantization depends on a calibration dataset to set per-layer activation scales (amax values). A 4-bit float format has 16 representable values — calibration determines how the full BF16 activation range at each layer is mapped to those 16 bins.
If calibration data doesn't match the target workload, real-world activations outside the calibrated range get clipped → quality loss on those inputs.
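To make the clipping effect concrete, here is a minimal pure-Python sketch of 4-bit float fake-quantization. It is not the modelopt implementation — real NVFP4 also applies per-16-element FP8 group scales, which this toy omits — but the E2M1 code points and the role of amax are as described above.

```python
# NVFP4's E2M1 element format has 16 code points: +/- these 8 magnitudes.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize(x: float, amax: float) -> float:
    """Clip x to [-amax, amax], then round to the nearest code point."""
    scale = amax / 6.0                      # map amax onto the largest code
    clipped = max(-amax, min(amax, x))      # out-of-range values saturate
    mag = min(FP4_GRID, key=lambda g: abs(abs(clipped) / scale - g))
    return (mag if clipped >= 0 else -mag) * scale

print(fake_quantize(2.9, amax=6.0))  # well-calibrated: -> 3.0 (small rounding)
print(fake_quantize(2.9, amax=1.0))  # mis-calibrated: -> 1.0 (saturated)
```

The second call shows the failure mode: an amax learned from unrepresentative data saturates real activations, which is exactly what workload-matched calibration avoids.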
- **NVFP4-GB10** — calibrated on HuggingFaceH4/ultrachat_200k (general multi-turn English chat, 64 samples)
- **NVFP4-GB10-AC** (this repo) — calibrated on a 7-dataset agentic + coder mix (896 samples queued, 888 after length filtering)
The -AC calibration mix is designed to align activation scales with the workloads the model will actually serve when deployed in agent frameworks like OpenClaw, Aider, or Claude Code-style assistants.
## Calibration mix
128 samples per dataset, 49,152 (48K) max sequence length:
| Dataset | Samples | Domain |
|---|---:|---|
| theblackcat102/evol-codealpaca-v1 | 128 | Code generation |
| Salesforce/xlam-function-calling-60k | 128 | Tool calling / function invocation |
| open-r1/Mixture-of-Thoughts (code) | 128 | Code reasoning |
| open-r1/Mixture-of-Thoughts (math) | 128 | Mathematical reasoning |
| open-r1/Mixture-of-Thoughts (science) | 128 | Scientific reasoning |
| SWE-bench/SWE-smith-trajectories (tool split) | 128 | Software-engineering agent trajectories |
| HuggingFaceH4/ultrachat_200k (train_sft) | 128 | General multi-turn chat coverage |
| Total queued | 896 | — |
| Tokenized (post length-filter) | 888 | 8 dropped as too short after tokenization |
The 7th dataset (ultrachat_200k) is intentional: without a general-chat anchor, calibration would bias exclusively toward code/tool/math distributions and degrade plain conversational quality. The mix preserves chat capability while shifting activation scales toward the agentic/coder workloads this quant is built for.
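The queue-then-filter flow can be sketched in a few lines. Dataset names match the table above; the tokenizer and the length floor are stand-ins (the real run tokenized with the model's tokenizer, and the exact floor is not published here):

```python
SAMPLES_PER_DATASET = 128
MIN_TOKENS = 8  # hypothetical floor; the real filter just drops "too short"

MIX = [
    "theblackcat102/evol-codealpaca-v1",
    "Salesforce/xlam-function-calling-60k",
    "open-r1/Mixture-of-Thoughts (code)",
    "open-r1/Mixture-of-Thoughts (math)",
    "open-r1/Mixture-of-Thoughts (science)",
    "SWE-bench/SWE-smith-trajectories (tool split)",
    "HuggingFaceH4/ultrachat_200k (train_sft)",
]

def toy_tokenize(text):
    return text.split()  # stand-in for the real tokenizer

def build_queue(records_by_dataset):
    queued = []
    for name in MIX:                  # 128 samples per dataset -> 896 queued
        queued.extend(records_by_dataset[name][:SAMPLES_PER_DATASET])
    kept = []
    for record in queued:             # post-tokenization length filter
        toks = toy_tokenize(record)
        if len(toks) >= MIN_TOKENS:
            kept.append(toks)
    return kept

print(SAMPLES_PER_DATASET * len(MIX))  # -> 896 queued
```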
## Model Details
| | |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 256 experts, top-K=8) |
| Total Parameters | 230B |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Hidden Size | 3,072 |
| Vocab Size | 200,064 |
| Max Position Embeddings | 196,608 (192K context) |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 141.05 GB across 29 shards |
| Deployment | 2× DGX Spark (does not fit in a single 128 GB Spark) |
| License | Non-commercial, inherited from MiniMaxAI/MiniMax-M2.7. See Use & License. |
## Quantization Details
- Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.29.0)
- Transformers: 4.57.6 (with a `Conv1D` compatibility shim for the post-4.57 module relocation)
- Scheme: `mtq.NVFP4_DEFAULT_CFG` (algorithm=max, group_size=16) + GB10-tuned disable list applied post-calibration
- Calibration: 7-dataset agentic + coder mix (see table above), 896 samples queued / 888 tokenized @ 49,152 max-seq
- Ignore list (kept in BF16, from the published `hf_quant_config.json`):
  - `lm_head`, `*embed_tokens*`
  - `*block_sparse_moe.gate` — MoE router gate (not per-expert gates)
  - `*model.layers.0.*` — first transformer block
  - `*model.layers.61.*` — last transformer block
- Quantizer counts: 143,967 `TensorQuantizer` modules inserted, 51,327 disabled via the ignore list, 92,640 active during calibration
- GB10 specialization: `self_attn` stays quantized (the standard NVFP4 reference configuration keeps attention in BF16) — the GB10 ignore list only covers the items listed above
- Calibration run: Hugging Face Jobs, 8× NVIDIA A100 80 GB, ~10 hours wall-clock, single phase (no wall-clock cap hit, no deferred samples, no OOMs)
- Starvation check: 0 starved experts at end of calibration (every active quantizer received enough token traffic to produce a valid amax)
- Recipe script: `quantize-ac-protected.py` — full three-phase recipe with OOM-defer protection, amax-only checkpointing, and inline export
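The wildcard ignore patterns behave roughly like shell-style globs over module names. A hedged sketch of the selection logic (illustrative `fnmatch` semantics, not modelopt's actual matcher):

```python
from fnmatch import fnmatchcase

# Patterns from the ignore list above (modules kept in BF16).
IGNORE = [
    "lm_head",
    "*embed_tokens*",
    "*block_sparse_moe.gate",
    "*model.layers.0.*",
    "*model.layers.61.*",
]

def is_ignored(module_name: str) -> bool:
    """True if this module's quantizer would be disabled."""
    return any(fnmatchcase(module_name, pat) for pat in IGNORE)

print(is_ignored("model.layers.0.self_attn.q_proj"))       # -> True  (first block)
print(is_ignored("model.layers.5.block_sparse_moe.gate"))  # -> True  (router gate)
print(is_ignored("model.layers.5.self_attn.q_proj"))       # -> False (quantized)
```

Note how `*block_sparse_moe.gate` catches the router gate in every layer without touching per-expert projections, and `*model.layers.0.*` matches only layer 0, not layer 10.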
## Running on 2× DGX Spark (Tensor Parallel)
At 141.05 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It runs with tensor-parallel-size=2 across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is eugr/spark-vllm-docker.
Quick start: `run_vllm.sh` is a ready-to-run wrapper — it exports the tuned environment variables and invokes `vllm serve` with the working flag set.

Full deployment reference: DEPLOYMENT.md documents the two deployment profiles we tested, measured numbers, and known hardware/framework quirks specific to GB10 (SM 12.1) and multi-node Ray TP.

The short version: on GB10 the fastest NVFP4 MoE path is the Marlin backend (`VLLM_NVFP4_GEMM_BACKEND=marlin`, `VLLM_USE_FLASHINFER_MOE_FP4=0`), and if your workload is agentic (tool-calling, code generation, repeated-token-heavy) you should additionally enable ngram speculative decoding. See DEPLOYMENT.md for the full rationale and benchmark data.
### Client-side tips

Every client that calls this endpoint should set `max_tokens` ≥ 16384. The OpenAI SDK's default of 4096 will silently truncate tool-call JSON mid-string, which shows up as "the model forgot how to use tools" but is actually just a clipped response. Bump it.
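For illustration, a minimal stdlib request with the override in place (the URL and model name are placeholders for your deployment):

```python
import json
import urllib.request

payload = {
    "model": "saricles/MiniMax-M2.7-NVFP4-GB10-AC",
    "messages": [{"role": "user", "content": "List the repo's TODOs."}],
    # Override the common 4096 default; truncated tool-call JSON looks like
    # broken tool use but is really just a clipped response.
    "max_tokens": 16384,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment against a live server
```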
## When to choose -AC vs NVFP4-GB10

- Use **-AC** for: agent frameworks (OpenClaw, Aider, Claude Code-style), tool-calling workloads, code-generation assistants, multi-turn reasoning over code/math.
- Use **NVFP4-GB10** for: general chat applications, and scenarios where the calibration-dataset provenance must match the published NVFP4-GB10 benchmarks exactly.

Both variants are mechanically compatible (same vLLM invocation, same compressed-tensors format). Only the per-layer NVFP4 activation scales differ — size on disk, architecture, ignore list, and deployment are unchanged.
## Performance

Measured on 2× NVIDIA DGX Spark (GB10, SM 12.1), TP=2 via Ray over the QSFP56 RoCE link, SoC firmware 2.148.24, `--gpu-memory-utilization 0.88`. vLLM 0.19.1rc1.dev241 via the eugr/spark-vllm-docker nightly image, with the tuned Marlin config above (`VLLM_NVFP4_GEMM_BACKEND=marlin`, `VLLM_USE_FLASHINFER_MOE_FP4=0`). Benchmarked 2026-04-19 with llama-benchy v0.3.3. The two deployment profiles are documented in DEPLOYMENT.md. Numbers below are observed on this rig; your mileage depends on build, image, and workload.
### Profile 1 — Throughput-stable (Marlin NVFP4 MoE, no speculation)
Benchmarked with llama-benchy v0.3.3, 3 runs per config, warm model, single client.
| Prompt (tok) | Gen (tok) | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---:|---:|---:|---:|---:|
| 512 | 128 | 1,128 | 35.44 | 454 |
| 512 | 256 | 1,248 | 35.86 | 410 |
| 1024 | 128 | 2,049 | 35.03 | 500 |
| 1024 | 256 | 2,132 | 34.50 | 480 |
| 4096 | 128 | 2,817 | 33.76 | 1,454 |
| 4096 | 256 | 3,314 | 33.45 | 1,236 |
API latency: 1.50 ms. Peak decode: 35.86 tok/s.
### Profile 2 — Agentic (Marlin NVFP4 MoE + ngram speculative decoding)
Measured on a 12-prompt agent-flavored set (code generation, tool calls, short chat) — not a standard benchmark; it approximates real agent-framework traffic. Same hardware, same sampling, only the serving config differs.
| Metric | Throughput-stable profile | Agentic profile |
|---|---:|---:|
| Average decode across 12 prompts (tok/s) | 25.20 | 36.44 |
| Peak decode (tok/s) | 35.86 | 48.34 (code-04: async-pattern) |
| Total wall-clock for full prompt set (s) | 250.8 | 162.7 |
| Wall-clock speedup (Agentic vs Throughput-stable) | — | 1.54× |
Per-task wall-clock highlights (DEPLOYMENT.md has the full breakdown): code-02 (MBPP-style) 2.13× faster; code-04 (async pattern) 1.90× faster; chat-03 (creative writing) 2.06× faster; tool-04 (don't-call-tool trap) 1.96× faster.
Why the two profiles differ: ngram speculative decoding wins big when responses contain repeated tokens (tool names, file paths, variable names, JSON keys reappearing) — which agent/code workloads have abundantly. On synthetic benchmarks with low token repetition (like llama-benchy's generated prompts), ngram's overhead slightly exceeds its savings and decode regresses. DEPLOYMENT.md documents this tradeoff and when to pick each profile.
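The mechanism is easy to sketch (a toy drafter, not vLLM's implementation): the drafter finds the most recent earlier occurrence of the current n-gram and proposes the tokens that followed it; the model then verifies the whole draft in one batched step, so correct guesses are nearly free.

```python
def ngram_propose(tokens, n=2, k=3):
    """Propose up to k draft tokens by replaying what followed the most
    recent earlier occurrence of the last n tokens."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):  # scan right to left
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no repeat found -> nothing to speculate

# Code-like contexts repeat identifiers, so drafts often verify:
ctx = "def load ( path ) : return json . load ( open ( path".split()
print(ngram_propose(ctx))  # -> [')', ':', 'return']
```

On low-repetition text the scan finds no match (or the draft is rejected), and the bookkeeping becomes pure overhead — which is why the synthetic benchmark regresses slightly under this profile.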
### Qualitative — agentic behavior vs NVFP4-GB10 sibling
Same prompt set, same sampling, compared against the general-chat-calibrated sibling variant:
| Task | NVFP4-GB10 tokens | NVFP4-GB10-AC tokens | Wall-clock speedup (AC) |
|---|---:|---:|---:|
| "Answer directly; don't call the provided tool" (trap) | 718 | 44 | 14.7× |
| Multi-step meeting booking (3 tools) | 385 | 81 | 4.6× |
| Weather (single tool) | 73 | 51 | 2.5× |
| Parallel stock prices (parallel tool calls) | 176 | 121 | 1.4× |
AC is measurably more decisive on tool-use tasks — it emits cleaner, shorter tool calls and, crucially, doesn't over-invoke tools when direct answers suffice. Raw throughput (decode/prefill) is within noise of NVFP4-GB10 as expected — quant format is identical, only activation scales differ. The meaningful delta between AC and GB10 is qualitative on agentic tasks, not numeric on raw throughput.
## Notes

- See DEPLOYMENT.md for the full environment, flags, caveats, and why Marlin is the right MoE backend on SM 12.1 GB10 today.
- Standardized benchmark results (HumanEval, BFCL, MT-Bench, WildClawBench) will appear in forthcoming evaluation runs.
## Recommended Sampling Parameters

Per MiniMax documentation:

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```
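Of these, min_p is the least familiar knob. Under the common definition (sketched below; illustrative, not any engine's exact code), it drops tokens whose probability is below min_p times the top token's probability, so the cutoff scales with model confidence:

```python
def min_p_filter(probs, min_p=0.01):
    """Keep tokens with p >= min_p * p_max; renormalization omitted."""
    cutoff = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= cutoff}

# Confident head: cutoff = 0.01 * 0.60 = 0.006, so the 0.004 tail is cut.
probs = {"the": 0.60, "a": 0.30, "zx": 0.004}
print(sorted(min_p_filter(probs)))  # -> ['a', 'the']
```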
## Target Hardware
Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.
### If you only have one DGX Spark
At 141.05 GB this model does not fit in a single Spark's 128 GB unified memory — it requires 2× Spark with tensor parallelism. If you have only one Spark, consider the REAP-pruned variant: saricles/MiniMax-M2.7-REAP-172B-A10B-NVFP4-GB10 (98.9 GB, single-node deployment).
## Use & License
This derivative inherits the license terms of the base model, MiniMaxAI/MiniMax-M2.7. The full license text is distributed in the LICENSE file in this repo.
Permitted free uses (from §5 of the base license): personal use — including self-hosted deployment for coding, development of applications, agents, tools, integrations, research, experimentation, or other personal purposes; use by non-profit organizations, academic institutions, and researchers for non-commercial research or educational purposes; and modifications for the uses above.
Commercial use requires authorization directly from MiniMax. If you intend to use this model (or any derivative) for commercial purposes — including offering products/services to third parties for a fee, commercial-product APIs, or commercial deployment — you must:
- Obtain prior written authorization from MiniMax by emailing api@minimax.io with the subject line "M2.7 licensing", and
- Prominently display "Built with MiniMax M2.7" on the related website, user interface, blog post, about page, or product documentation.
Prohibited uses (from the license appendix) — by using this model you agree not to use it to: generate or disseminate content prohibited by applicable laws, support any military purpose, exploit or harm minors, generate harmful misinformation intended to deceive, or promote discrimination or hate speech.
This quantization pipeline and the recipe script in this repo (quantize-ac-protected.py) are released under the same terms as the base model, as a derivative work.
## Reproducibility
Full recipe script: `quantize-ac-protected.py`

The script implements a three-phase protected calibration pipeline:

- **Phase A** — Calibration with per-sample OOM defer, amax-only checkpoints every N samples (60 MB each, versus ~460 GB per checkpoint if saving full state), optional two-phase bucket commit with sha256 markers, and a wall-clock watchdog (soft + hard exit). Inline export at the end on successful completion.
- **Phase B (fallback)** — Resume from the latest good checkpoint, process deferred samples on a larger-memory GPU flavor, rescue starved experts, export.
- **Phase C (recovery only)** — Re-export from a saved checkpoint if Phases A/B completed calibration but crashed during export.
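The amax-only checkpoint trick is simple to sketch: the only calibration state a resumed run needs is the per-quantizer amax statistics, which are tiny next to the weights. (Key naming below is illustrative, not the script's actual checkpoint format.)

```python
import pickle

def extract_amax(state):
    """Keep only calibration amax entries; drop weight tensors."""
    return {k: v for k, v in state.items() if k.endswith("._amax")}

full_state = {
    "layers.1.mlp.weight": [0.0] * 100_000,  # stand-in for a big tensor
    "layers.1.mlp.input_quantizer._amax": 4.75,
    "layers.1.self_attn.q_proj.input_quantizer._amax": 2.25,
}

ckpt = pickle.dumps(extract_amax(full_state))  # small checkpoint
full = pickle.dumps(full_state)                # vs. full state
print(len(ckpt) < len(full))  # -> True
```

The same size asymmetry is why 14 checkpoints cost ~840 MB total here instead of terabytes.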
Env vars consumed by the recipe:

- `PHASE` = A | B | C
- `INPUT_DIR` — path to the BF16 source model
- `OUTPUT_DIR` — export target (Phase A inline export + Phase B/C export)
- `TARGET_REPO_ID` — HF Hub repo to publish the quantized model to
- `BUCKET_REPO_ID` — HF Hub dataset repo used as a workspace for checkpoints (optional; remove Phase B/C if you don't want a bucket)
- `BUCKET_PREFIX` — path prefix inside the bucket repo
- `NUM_CALIB_PER_DS` (default 128)
- `MAX_SEQ` (default 49152)
- `CKPT_EVERY` (default 50)
- `WALLCLOCK_BUDGET_S` (default 21600 = 6 h; Phase A exits cleanly before the cap)
- `STARVED_EXPERT_PCT_ABORT` (default 1.0%)
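A sketch of how a recipe script typically consumes these, with defaults matching the list above (illustrative, not `quantize-ac-protected.py` verbatim):

```python
import os

PHASE = os.environ.get("PHASE", "A")
NUM_CALIB_PER_DS = int(os.environ.get("NUM_CALIB_PER_DS", "128"))
MAX_SEQ = int(os.environ.get("MAX_SEQ", "49152"))
CKPT_EVERY = int(os.environ.get("CKPT_EVERY", "50"))
WALLCLOCK_BUDGET_S = int(os.environ.get("WALLCLOCK_BUDGET_S", "21600"))
STARVED_EXPERT_PCT_ABORT = float(os.environ.get("STARVED_EXPERT_PCT_ABORT", "1.0"))

if PHASE not in ("A", "B", "C"):
    raise SystemExit(f"unknown PHASE {PHASE!r}")
```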
Run for this release:

- Job: HF Jobs `a100x8` flavor, single Phase A invocation
- Duration: ~10 hours wall-clock (01:40 UTC start → 11:41 UTC Phase A DONE → inline export → publish)
- Outcome: `status=complete-published`, `deferred=0`, `starved=0`
- 14 amax-only checkpoints written during calibration (60 MB each); checkpoint 14 is the final post-rescue state